Installing Python packages from tarball/zip files ...

stojanm · ‎09-27-2019

To make sure a Python script executes properly, you need a custom environment that specifies the required modules and packages. Pip, the tool that installs packages from the Python Package Index (PyPI), simplifies this process significantly. In addition to its standard functionality, it can also install from archive and wheel files.

In this blog post, I will demonstrate how to create a Docker file in SAP Data Intelligence (SAP DI) for a Python environment that includes packages located in archive files. As an example, we are going to use the hana_ml package. If you want to learn more about this package and how to download it, please consult this blog post. Additional installation details with HANA Express can be found here.

Uploading the hana_ml archive into the DI System

To install from a local archive (tarball/zip) file during the Docker definition step, the archive first needs to be uploaded into DI. We are going to look at the following two options to achieve this:

Through the DI System Management

Using the Modeller Repository directly

With the first method, we open DI System Management from the DI Launchpad and proceed to the Files tab. Then, we navigate to the path files->vflow->dockerfiles and choose a location for the new Docker file:

With a click on Import File from the menu (top right side), the file can be located on the hard drive and selected:

After a refresh of My Workspace section (first button on the left) you should be able to see the newly uploaded file in your selected path.

The procedure for uploading the archive file directly from the Modeller is similar. In the Modeller, select the Repository tab and use the Import File menu to upload the archive file into the destination folder, as shown below:

Remark: Since the import function is currently used to import solutions into DI (packed as tar.gz archives), it automatically unpacks all provided archive files. To prevent that, simply rename your archive file by removing the extension (e.g. from hana_ml-1.0.5.tar.gz to hana_ml-1.0.5). Once it is uploaded, it is up to you whether you rename it back or leave it as it is.
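The rename described in the remark can also be scripted on your local machine before the upload. The following is only a minimal sketch: the helper names upload_name and make_upload_copy are hypothetical, and the list of suffixes is an assumption about which file names the import dialog treats as archives.

```python
import os
import shutil

# Assumption: these are the suffixes the DI import dialog treats as archives.
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".zip")


def upload_name(filename):
    """Return filename without a trailing archive suffix, if it has one."""
    for suffix in ARCHIVE_SUFFIXES:
        if filename.lower().endswith(suffix):
            return filename[: -len(suffix)]
    return filename


def make_upload_copy(path):
    """Copy the archive next to the original under its suffix-free name."""
    directory, filename = os.path.split(path)
    target = os.path.join(directory, upload_name(filename))
    if target != path:
        # The original file is kept; only the copy is meant for the upload.
        shutil.copyfile(path, target)
    return target
```

Calling make_upload_copy("hana_ml-1.0.5.tar.gz") leaves the original untouched and creates a copy named hana_ml-1.0.5 that can be selected in the Import File dialog.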

Creating and building the Docker file

The process of building a Docker file in SAP DI has been described extensively in this excellent blog post. To avoid repeating those steps, I will continue directly with the new file definition, which is simplified and aims to only demonstrate the required lines of code:

FROM $com.sap.python27

COPY hana_ml-1.0.5 hana_ml.tar.gz

RUN pip install hana_ml.tar.gz

The first line specifies the inheritance path for our Docker file. With the copy command, we specify that the uploaded local archive file needs to be copied into the Docker container (a rename is also taking place, for easier reference). In the next step, pip is used to install the local file into the container environment.
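Because a failed pip install only shows up during the Docker build, it can save a build cycle to check the archive locally first. The sketch below is an illustration, not part of the DI tooling: the function name looks_like_sdist is hypothetical, and it assumes the usual source distribution layout of a single top-level folder containing setup.py or pyproject.toml. Archives whose member names carry a "./" prefix would need extra handling.

```python
import tarfile


def looks_like_sdist(archive_path):
    """Return True if the tarball has one top-level folder with a build file.

    The compression is detected from the file contents, so the check also
    works for an archive whose extension was removed for the upload.
    """
    with tarfile.open(archive_path, "r:*") as tar:
        names = tar.getnames()
    top_levels = {name.split("/", 1)[0] for name in names if name}
    if len(top_levels) != 1:
        return False
    root = top_levels.pop()
    build_files = (root + "/setup.py", root + "/pyproject.toml")
    return any(name in build_files for name in names)
```

A result of False does not prove that pip would reject the file; it only signals that the archive deviates from the layout assumed here and deserves a closer look.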

Finally, the tags of the Docker file need to be updated, so that it can be used in custom operators or groups during the pipeline (graph) creation process. The tags I have selected here are:

The first three tags are internal requirements of DI, since the Docker file inherits from Python2.7. The last one was chosen by me to refer to this Docker environment. It will be used in the next step, during the pipeline creation process, to tag the Python operators, which require this environment to ensure the successful execution of the corresponding Python scripts.

Specifying the new Docker environment in a pipeline

The last step in the process is to link the newly created Docker environment with the pipelines, using the tags. The example pipeline created here uses OpenAPI to expose an APL model stored in HANA for scoring. The idea is that the user calls the API endpoint with the data to be scored, and the Python operator uses hana_ml to load the model from the repository and apply it to the new data set. Finally, the results are sent back to the requester and, at the same time, shown in Wiretap for debugging purposes.

The contents of the Python operator are as follows:

from hana_ml import dataframe
from hana_ml.algorithms.apl.classification import AutoClassifier

def on_input(data):
    # create a data frame from the input message (skipped)
    ...
    # create connection context
    conn = dataframe.ConnectionContext(address='xx.xx.xx.xx', port='00000', user='MYUSER',
                                       password='MYPASS', encrypt='true',
                                       sslValidateCertificate='false')

    # load the model from the HANA repository
    model2 = AutoClassifier(conn_context=conn)
    model2.load_model(schema_name='MY_SCHEMA_REPO', table_name='DI__APL_MODEL')

    # apply the model
    applyout2 = model2.predict(data.body)
    outmsg = applyout2.collect()

    # prepare the output message from the dataframe (skipped)
    ...
    api.send("output", outmsg)

api.set_port_callback("input", on_input)

Note that the Python operator was added to a group called Hana ML. This allows us to specify the tag selected during the previous step (hanaMLdocker) in the Group Settings, and thus to link the Docker environment with this operator:

Now the Python script stored in the operator will be executed in the right Docker container, which provides the environment we defined during the Docker file creation process above. This ensures that the package hana_ml can be imported and that the script executes successfully.
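If you want to confirm that the operator really runs in the tagged environment, a small import check can help during debugging. This is a minimal sketch with a hypothetical helper name, package_status; it only relies on the standard library and reports whether a module can be imported, plus its __version__ attribute if the module defines one.

```python
import importlib


def package_status(module_name):
    """Try to import module_name and report the outcome as a short string."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        return "missing: %s" % exc
    # Not every package defines __version__, so fall back to a placeholder.
    version = getattr(module, "__version__", "unknown")
    return "ok: %s %s" % (module_name, version)
```

Inside the operator, the string returned by package_status("hana_ml") could be sent to an output port with api.send and inspected in Wiretap. A result starting with "missing" would suggest that the group tag does not match the Docker file.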

Note: If you want to use the Configuration Manager to handle the HANA connection (instead of hardcoding the credentials in the Python script), please check out this great blog post.

I hope this is helpful to anyone trying to use pip with local archive files. By the way, the same procedure also applies to preparing a custom R environment and using custom-made R packages.

Thanks,

Stojan

former_member89766 · ‎09-29-2019

Great post Stojan. Thanks for putting this together with the specific details!

former_member584732 · ‎09-30-2019

Thanks Stojan - the level of detail is very helpful!

henrique_pinto · ‎10-02-2019

Hi Stojan,

good example on how to install non-pypi python libs.

I also particularly liked your approach on how to expose the model inference task as a RESTful API with the OpenAPI Servlow operator. You should explore that more in another blog.

One comment I’d make, though, is that while it’s fine for prototyping, in productive deployments you probably don’t want to have the .whl files directly uploaded to the vrep, since it will be copied for each user and you don’t have any control on versions etc. I’d recommend using a local nexus repo if you have libs that are not in pypi or if you don’t have internet access from your DH cluster. This blog by Remi explains that concept:
https://blogs.sap.com/2019/08/15/using-sap-data-hub-without-internet-access/

Cheers,
Henrique.

stojanm · ‎10-04-2019

Thanks for your comment, Henrique and for recommending Remi's great blog post. Very interesting to see his approach on how to build your own offline repo, which I agree will be more suitable for a productive environment.

Thanks,

--Stojan

bhaswanth · ‎03-21-2022

Hi,

Can you please help me install the notebook-hana-connector library in a Python operator in the Modeler, i.e. can you provide a Docker file in which I can install the notebook-hana-connector package?

stojanm · ‎03-21-2022

Hi Bhaswanth,

thanks for your comment. The notebook_hana_connector module is available in the Jupyter environment in SAP Data Intelligence and can be used without additional installation. The advantage of this module is that it allows you to access connection objects defined in DI programmatically.

If you want to access connection objects in Python Operators, the recommended way is to create a custom operator and define managed connections as described in this blog post. Once this is done you can access the connection objects the same way as via the notebook_hana_connector, e.g.:

hanaConn = api.config.hanaConnection['connectionProperties']

conn = hana_ml.dataframe.ConnectionContext(hanaConn['host'], hanaConn['port'], hanaConn['user'], hanaConn['password'])

You will find more details on how to define config.hanaConnection in the blog post's section "Custom Operator – Script".
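If a managed connection is incomplete, the ConnectionContext call above fails with a bare KeyError on the first missing field. A small helper can make that failure easier to read. This is only a sketch: the function name connection_args is hypothetical, and the four keys are simply the ones used in the snippet above, not a complete description of what connectionProperties contains.

```python
# Keys taken from the snippet above; a real connection may hold more fields.
REQUIRED_KEYS = ("host", "port", "user", "password")


def connection_args(connection_properties):
    """Return (host, port, user, password) or raise if a key is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in connection_properties]
    if missing:
        raise KeyError("connectionProperties lacks: " + ", ".join(missing))
    return tuple(connection_properties[key] for key in REQUIRED_KEYS)
```

With this helper, the connection could be created as hana_ml.dataframe.ConnectionContext(*connection_args(hanaConn)).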

Please let me know if this helps. Thanks!

bhaswanth · ‎04-20-2022

Hi stojanm ,

I could make use of notebook_hana_connector package in Python Operator by using the base image (FROM $com.sap.sles.jupyter) mentioned by andreas.forster in his blog

https://blogs.sap.com/2021/04/08/hands-on-tutorial-script-and-deploy-python-with-the-new-jupyter-ope....

I also tried the approach you suggested. However, in our case we cannot explicitly connect using the hana_ml or hdbcli packages, I think for the following reason:

The HANA_DB connection setup in Connection Management uses SAP Cloud Connector in Gateway parameter.

So we might need to pass the gateway details when creating the connection object.

Could you please validate my understanding?

Thanks,

Bhaswanth

bhaswanth · ‎04-20-2022

Hi Stojan,

I could use notebook_hana_connector by using jupyter base image as mentioned in Andreas Forster's blog : https://blogs.sap.com/2021/04/08/hands-on-tutorial-script-and-deploy-python-with-the-new-jupyter-ope...

I tried the approach you recommended but I guess we cannot get conn object from the HANA_DB using hana_ml or hdbcli in my case as the HANA_DB connection setup uses SAP Cloud Connector in Gateway parameter.

We might also need to provide gateway details when creating the connection object.

Please validate my understanding here?

Thanks,

Bhaswanth