VinayBhatt
Product and Topic Expert
You might already know that you can build complex machine learning scenarios as part of your data value journey with SAP Data Intelligence. If not, have a look at the blog posts by andreas.forster and ingo.peter on how to create your first ML scenario in SAP Data Intelligence using Python and R:

SAP Data Intelligence: Create your first ML Scenario

SAP Data Intelligence: Create your first ML Scenario with R


In this blog post you will learn how to start an R kernel inside a notebook in SAP Data Intelligence Cloud to perform data exploration and free-style data science, and to test R code before using it in the R Operator available in the Pipeline Modeler.

Although an R kernel is not yet officially available in the SAP Data Intelligence Cloud notebook (as of version 2010.29.9), this blog post will guide you through installing one with conda from the Python 3 kernel, and will also present a way to load data sets stored in the DI Data Lake using the R library reticulate.

Launch a Notebook inside Your ML Scenario


Launch a Notebook session inside your ML Scenario:


Select the Python 3 kernel:


 

Install R Kernel


Use below "Conda" command to install the R-Kernel :

Note: the -y option performs a silent install; without it, you are prompted to enter "y"/"yes" before the installation begins.
!conda install -y -c r r-irkernel


All required and dependent packages are installed, and the final summary looks as follows:
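
As a quick optional check, you can list the installed Jupyter kernel specs to confirm that the R kernel was registered:

!jupyter kernelspec list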


 

Install Required R Packages


Before you begin using the newly installed R kernel, remember that the base R installation won't include every package you need. For example, the R code snippet below requires sqldf, dplyr, RODBC, and several other packages:
library("sqldf")
library("dplyr")
library("cluster")
library("tidyverse")
library("reshape2")
library("excel.link")
library("RODBC")
library("randomForest")
library("car")


Let's first learn how to install such R packages.

Execute the command below in the Python 3 kernel notebook session (the same one in which we installed the R kernel) to install dplyr; this is equivalent to running install.packages("dplyr") in the R kernel:
!conda install -y r-dplyr


Do the same for the other required packages:
!conda install -y r-reshape2

!conda install -y r-rodbc
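
Once you launch the R kernel (next section), a quick way to confirm that a package is available is to load it and print its version, for example:

# Run in the R kernel: load the package and print its version
library(dplyr)
packageVersion("dplyr")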


Launch a New Notebook with R Kernel



Select the newly installed R kernel:


Or use the Launcher to create a new notebook with the R kernel:


Let's follow the classic iris dataset example to try out the R kernel:


Load the Data:
library(datasets)
data(iris)
summary(iris)

# Upper-case the column names so they match the filters used below
names(iris) <- toupper(names(iris))
library(dplyr)



A basic filter:



setosa <- filter(iris, SPECIES == "setosa")
head(setosa)

sepalLength5 <- filter(iris, SPECIES == "setosa", SEPAL.LENGTH > 5)
tail(sepalLength5)
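
The same dplyr verbs can be chained for quick aggregations; here is an illustrative sketch using the upper-cased column names from above:

# Average sepal length per species, using dplyr's pipe
iris %>%
  group_by(SPECIES) %>%
  summarise(MEAN_SEPAL_LENGTH = mean(SEPAL.LENGTH))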

Loading data from a local CSV (a file uploaded to the JupyterLab session):



SFO landings Data Reference: https://data.sfgov.org/Transportation/Air-Traffic-Landings-Statistics/fpux-q53t
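
For example, the uploaded file can be read directly with base R. A minimal sketch, assuming the file sits in the notebook's working directory under the name Air_Traffic_Landings_Statistics.csv (the same file name used in the Data Lake example below):

# Read the uploaded CSV from the notebook's working directory
landings_local <- read.csv("Air_Traffic_Landings_Statistics.csv", header = TRUE)
head(landings_local)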


Loading Data from SAP DI Data Lake


We will make use of the R library reticulate (library(reticulate)) to access artifacts stored in the DI Data Lake through a Python script. Again, this is a workaround that lets us access the shared artifacts.

First, install the R reticulate library from the Python 3 kernel with conda:
!conda install -y r-reticulate


Create a new text file named load_data.py containing the Python script below, which imports the sapdi and pandas packages and defines a function that reads a file from the DI Data Lake:
import pandas as pd
import sapdi

# Read a CSV stored in the DI Data Lake and return it as a pandas DataFrame
def read_FILE_DL():
    # Open the workspace and data collection that contain the uploaded file
    ws = sapdi.get_workspace(name='Air_Traffic_SFO')
    dc = ws.get_datacollection(name='LANDINGS')
    with dc.open('Air_Traffic_Landings_Statistics.csv').get_reader() as reader:
        landings = pd.read_csv(reader)
    return landings





Now open the notebook with the R kernel and restart the kernel before running the commands below.

Use the commands below to load the data:
library(reticulate)
# Load the Python helper and call it from R
source_python("load_data.py")
landings <- read_FILE_DL()
landings
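
A couple of quick sanity checks confirm that the data frame arrived intact:

# Inspect the first rows and the overall dimensions
head(landings)
dim(landings)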




Using the "MARATHON" dataset example


Refer to Ingo's blog post, SAP Data Intelligence: Create your first ML Scenario, for the dataset:
df_train <- read.csv(file="RunningTimes.txt", header=TRUE, sep=";")
head(df_train)

  ID HALFMARATHON_MINUTES MARATHON_MINUTES
1  1                   73              149
2  2                   74              154
3  3                   78              158
4  4                   73              165
5  5                   74              172
6  6                   84              173

str(df_train)

'data.frame': 117 obs. of 3 variables:
 $ ID                  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ HALFMARATHON_MINUTES: int  73 74 78 73 74 84 85 86 89 88 ...
 $ MARATHON_MINUTES    : int  149 154 158 165 172 173 176 177 177 177 ...


lm.fit<-lm(MARATHON_MINUTES~HALFMARATHON_MINUTES,data=df_train)
str(lm.fit)
List of 12
$ coefficients : Named num [1:2] -6.01 2.25
..- attr(*, "names")= chr [1:2] "(Intercept)" "HALFMARATHON_MINUTES"
$ residuals : Named num [1:117] -9.18 -6.43 -11.43 6.82 11.57 ...
..- attr(*, "names")= chr [1:117] "1" "2" "3" "4" ...
$ effects : Named num [1:117] -2361.82 294.44 -9.94 8.49 13.21 ...
..- attr(*, "names")= chr [1:117] "(Intercept)" "HALFMARATHON_MINUTES" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:117] 158 160 169 158 160 ...
..- attr(*, "names")= chr [1:117] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:117, 1:2] -10.8167 0.0925 0.0925 0.0925 0.0925 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:117] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "HALFMARATHON_MINUTES"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.09 1.18
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
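
With the model fitted, the coefficients above (intercept ≈ -6.01, slope ≈ 2.25) can be used for a forecast; for example, for a runner with a 90-minute half marathon:

# Predict the marathon time (in minutes) from a half-marathon time
predict(lm.fit, newdata = data.frame(HALFMARATHON_MINUTES = 90))
# roughly -6.01 + 2.25 * 90, i.e. about 196 minutes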





Summary


Projects often call for quickly testing R code snippets before using them in data pipelines, so I hope the workaround above for getting a working R kernel in a notebook within SAP Data Intelligence proves helpful.

I would also reiterate that the solution above is only a workaround for testing R code snippets in a notebook in SAP DI. An R SDK for SAP DI is not yet available (unlike the Python SDK), so artifacts stored in the SAP DI Data Lake cannot be referenced directly from the R kernel (as is done with the sapdi Python library), among other limitations.

Please follow the tag SAP Data Intelligence to get notified about the latest updates, or visit the SAP Community page for SAP Data Intelligence to keep learning: https://community.sap.com/topics/data-intelligence
5 Comments
AndreasForster
Product and Topic Expert
That's excellent tin1912, great to know that this is possible!

I tried it out and got the marathon time forecast working with R in the DI Notebook.
Dokon
Advisor
It's a useful and nice blog post. Thank you, Vinay.
felixteched
Explorer
Hello Vinay, that's really good to know, but after executing the steps I always get the following error, even after restarting the kernel or creating a new notebook from scratch. Any idea?

 

The kernel for 057596fe-f23e-44bc-aac1-185a427834bd/notebooks/10_Loading_Datasets.ipynb appears to have died. It will restart automatically.

 

Thanks
VinayBhatt
Product and Topic Expert
Hi,

Thanks for your comment. You may try restarting the JupyterLab application from the "System Management" application:

 


Thanks!
felixteched
Explorer
Hi, thanks for the answer. I have tried it several times, but after restarting JupyterLab the R kernel disappears and I have to execute the installation steps again. The output of conda info is as follows:
     active environment : None
       user config file : /home/labuser/.condarc
 populated config files : /opt/conda/.condarc
          conda version : 4.7.10
    conda-build version : not installed
         python version : 3.7.3.final.0
       virtual packages :
       base environment : /opt/conda (writable)
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /opt/conda/pkgs
                          /home/labuser/.conda/pkgs
       envs directories : /opt/conda/envs
                          /home/labuser/.conda/envs
               platform : linux-64
             user-agent : conda/4.7.10 requests/2.23.0 CPython/3.7.3 Linux/5.4.0-5-cloud-amd64 sles/15 glibc/2.26
                UID:GID : 1000:100
             netrc file : None
           offline mode : False


Best Regards,