BJarkowski
Active Contributor
The use cases for Hadoop and SAP HANA are different, but in some cases it is recommended or even required to connect both worlds. Hadoop clusters operate on very large datasets that consist of unstructured data, while SAP HANA is focused on analytical processing. An ETL process that replicates the Hadoop data is not always a good solution, as it could dramatically increase the database memory requirements. But smart usage of both technologies gives you the benefit of lower storage cost and high-speed in-memory processing power at the same time.

There are a few ways to expose data coming from a Hadoop cluster to the SAP HANA database. Some time ago I wrote two blog posts where I described how to use SAP Vora and the Spark Extension to query data from Azure HDInsight and Azure Databricks in SAP HANA. But SAP Vora is not required. SAP HANA can be directly connected to the Hadoop cluster using just ODBC drivers or the HANA Spark Controller. Today I present the first approach, but don't worry, a blog about the Spark Controller is on the way!

In the previous episode I already presented how to deploy an Azure Databricks cluster, so this time I jump directly to the configuration part. If you don't have a cluster yet, I'd recommend reading Part 17 first.

ODBC DRIVERS

I use unixODBC as the driver manager. It was not installed by default on my server, so I used YaST to add the missing component:
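If you prefer the command line over YaST, the same component can be installed with zypper (a quick sketch, assuming SLES and the default package name):
zypper install unixODBC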



The Simba ODBC drivers that I'm going to use today are available on the Databricks website. To access the download link you'll have to provide your details, like your name and e-mail address. After around 10 minutes I had a new e-mail in my inbox.



The SAP HANA database runs in a Linux environment. I downloaded the 64-bit version and extracted the archive.
unzip SimbaSparkODBC-2.6.4.1004-LinuxRPM-64bit.zip
cd SimbaSparkODBC-2.6.4.1004-LinuxRPM-64bit/
dir



The RPM installation is also straightforward:
rpm -ivh simbaspark-2.6.4.1004-1.x86_64.rpm



By default the package is extracted to the /opt/simba/spark directory:



The driver library libsparkodbc_sb64.so is located in the /opt/simba/spark/lib/64 directory:

CONFIGURE ODBC DRIVERS

To configure the drivers we'll need some input from the Databricks cluster. It exposes the ODBC protocol, and the connection details are available under Clusters -> Configuration -> ODBC:



To establish the connection we'll also require a personal access token, which can be generated in the Databricks dashboard. Click on User Settings under the profile icon in the top right corner:



Click on Generate New Token and choose a validity period. The token will be displayed on the screen – remember to copy it as it is not possible to display it again.



The ODBC configuration relies on two text files that should be kept in the home directory of the <sid>adm user. The first one can be copied from the driver location and stores the driver settings. There is no need to change anything inside.
su - <sid>adm
cd $HOME
cp /opt/simba/spark/lib/64/simba.sparkodbc.ini .simba.sparkodbc.ini



The second file keeps the connection information. Create the .odbc.ini file with the following content:
[Databricks]
Driver=/opt/simba/spark/lib/64/libsparkodbc_sb64.so
Server=<Server Hostname from the Databricks ODBC settings>
HOST=<Server Hostname from the Databricks ODBC settings>
PORT=443
SparkServerType=3
Schema=default
ThriftTransport=2
SSL=1
AuthMech=3
HTTPPath=<HTTP Path from the Databricks ODBC settings>



unixODBC uses environment variables to locate its configuration files. The easiest way to set them is to create a .customer.sh file in the home directory, which is automatically executed when the database user logs in:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/simba/spark/lib/64
export ODBCINI=$HOME/.odbc.ini
export SIMBASPARKINI=$HOME/.simba.sparkodbc.ini



Log out and log in again to check the current unixODBC configuration. The user data sources should point to the .odbc.ini file we created:
odbcinst -j



TEST THE CONNECTION

I think this is a good moment to stop for a while and check whether the connection to Databricks works. Sometimes unixODBC can be tricky, so I want to ensure I configured it correctly before I do anything in SAP HANA. But to test the connection we need some data in the Databricks tables.

There are two ways to create a table in Azure Databricks. You can either convert a DataFrame into a table or use the user interface to import a new file through the browser.

In the previous post I presented how to import data residing in the Azure Data Lake into a DataFrame. We can just add one line of code to save it as a table:
// read the raw CSV file from the mounted storage and let Spark infer the column types
val df = spark.read.option("inferSchema", true).csv("dbfs:/mnt/DemoDatasets/airports.dat")
// display the first rows to verify the import
df.show()
// persist the DataFrame as a table
df.write.saveAsTable("Airports_DF")
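If you want to verify that the table was created, a quick optional check in the same notebook could look like this (the table name matches the one used above):
spark.sql("select count(*) from Airports_DF").show()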



The alternative way is also very easy. In the Databricks dashboard click on Data in the left menu and select Add Data. On the screen below you can also see the table I just created using the script:



The dataset I’m using is available here:
https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat

Now you can just drag and drop the file into the browser window and adjust parameters like the table name or table headers:



Now that we have some data, we can query the Databricks cluster using the ODBC drivers:
isql <DSN> <username> <password>
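With a Databricks personal access token, the username is the literal word token and the password is the token value itself. Assuming the DSN name Databricks from the .odbc.ini above and a placeholder token, the call could look like this:
isql Databricks token dapi1234567890abcdef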



If you can see the SQL command prompt, it means the connection was established successfully and we can run a sample query:
select airport_name, city from airports_dat limit 10



The connection is working fine and we can continue with the HANA configuration.

CREATE A VIRTUAL TABLE IN SAP HANA

The goal of today's post is to access the Databricks cluster from the SAP HANA database. We have already verified that the connection is working properly, so the last task is to create a new Remote Source and reference the Databricks table as a Virtual Table.

Create a new Remote Source and provide the ODBC DSN together with user credentials:
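If you prefer SQL over the graphical wizard, the Remote Source can also be created with a statement along these lines. Treat it as a sketch: the source name is just an example, the sparksql adapter name and parameters may differ between HANA revisions, and the password is the Databricks personal access token.
CREATE REMOTE SOURCE "DATABRICKS" ADAPTER "sparksql" CONFIGURATION 'DSN=Databricks' WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=token;password=<personal access token>';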



Now you can expand the Remote Sources to see the available tables in Azure Databricks:



Create a virtual table:
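The same step can be scripted in SQL. The schema and table names below are only examples, and the "<NULL>" part stands for the database level, which is not used here:
CREATE VIRTUAL TABLE "MYSCHEMA"."V_AIRPORTS" AT "DATABRICKS"."<NULL>"."default"."airports_dat";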



Now you can query it as if it were part of the SAP HANA database.
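For example, a simple federated query against the virtual table (using the example names from the SQL sketch above):
SELECT airport_name, city FROM "MYSCHEMA"."V_AIRPORTS" LIMIT 10;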

15 Comments
Thanks for sharing this - very timely (as you know :))

 

Are you able to demonstrate how this works if you establish your data source in Azure Databricks using Delta instead of a standard Parquet approach? I believe it should be quite similar but yield much better performance and efficiency.

 

Also, do you have any guidance on how complex a data model / data sourcing you can support from your source data, and the influence it may have on the cost of running Azure Databricks? E.g. would this be good for running your Spark cluster 24x7, or more for caching your data into a BW for your reporting and then using something like Databricks Delta to append the updates (either batch or streaming) into the BW/4HANA stack?
BJarkowski
Active Contributor
Hello Chris,

I ran a test and I was able to expose the Delta tables using the ODBC drivers.

The challenge I see in your approach is that it's very close to a regular ETL process. Smart Data Access is designed to work with online data sources to provide (near) real-time data without the need to copy them.

I'm not an expert in BW, so I'm not really sure what options you have to work with the cache, but if you want to keep the data in BW and update it with changes, then what's the advantage of this solution over a regular ETL?

To access the data online you'd require the Databricks cluster to be running. But if you could use a set of small VMs, it doesn't necessarily have to be expensive, especially when compared to the price of a HANA VM with additional memory. I'd also check the current availability of the Databricks cluster - maybe it's already running most of the time. If you're able to analyze at what times users access the data, maybe you could schedule an automatic start and stop based on time?
Hello,

Thanks for this wonderful blog. Extremely detailed and easy to follow.

Can you also share some insight on how we can connect to Azure Databricks from an on-premise HANA DB?

Are there any options to connect via a proxy host or via a gateway system like SAP Cloud Connector or Azure Data Management Gateway?

Thanks and Regards

Shubra

 

Great article! Any thoughts on when you are planning to release the article on Spark Controller integration with Databricks?

BJarkowski
Active Contributor
Hi,

probably in a couple of weeks. I'll try to write it in October. The next one I plan to release is about connecting from Databricks to HANA using JDBC.

 

 
Great article,  Bartosz!

If we use a standalone DP Agent on a Windows server, I assume we just need to install the Windows ODBC driver and configure the DSN there, right?

Also, have you compared the performance between ODBC and the Spark Controller?

Which connection type is recommended?

 

 

Regards,

Aaron
mayank_gupta01
Advisor
Hi Bartosz,

Have you written any blog on connecting Databricks to HANA using JDBC?
BJarkowski
Active Contributor
Hi,

there are a couple of different scenarios I covered, so have a look at my blogs:

bartosz.jarkowski4#content:blogposts

For example:

https://blogs.sap.com/2019/07/18/your-sap-on-azure-part-19-expose-azure-databricks-to-sap-hana-using...

https://blogs.sap.com/2019/10/24/your-sap-on-azure-part-22-read-sap-hana-data-from-azure-databricks/
Hi Bartosz,

thx for your great blog.

Did you also consider using the Databricks JDBC driver on top of the SDI Camel Adapter? If yes, was there any knock-out criterion for why you didn't try that one?

Best regards Stefan
steffen_mezger
Explorer
Thank you for this detailed article.

Can you confirm that the connection still works with the latest drivers and HANA version?

SimbaSparkODBC-2.6.26.1045-LinuxRPM-64bit.zip (SLES Linux)
SAP HANA 2.0 SPS05

In my installation, the indexserver crashes when I try to expand the Remote Sources to see the available tables.
BJarkowski
Active Contributor
Hi,

I no longer have the lab ready, so I can't test it quickly. But have you seen this SAP note?

2418485 - HANA indexserver crashes from Simba ODBC driver


There are a couple of other notes you could check as well.
Hi Steffen,

Can you please share an update if you found a solution to your problem?

Thank you very much!
steffen_mezger
Explorer
Hi Thorsten, there was no way with SDA.
We got it working only with SDI:

  1. Register Camel Adapter

  2. Install the Java drivers in SDI Camel for Databricks (see the Databricks JDBC manual)

  3. Edit jdbc-spring-config.xml and remove line “<property name="defaultAutoCommit" value="false" />”

  4. Edit configfile-jdbc.properties and set delimident=BACKTICK

  5. Restart SDI, Create new Remote Datasource with this Adapter to Databricks.


Regards, Steffen
abhisheksingh2
Explorer
Hi Steffen,

I've followed all the steps outlined above but I'm getting this error:

abhisheksingh2
Explorer
Followed instructions from

https://stackoverflow.com/questions/72318767/cvc-complex-type-3-2-2-attribute-setheader-is-not-allow...

That error has gone away, but now I'm getting another error:

 

[86183]{224366}[84/-1] 2023-11-06 15:27:57.654035 e FedTrace DPAdapterAccess.cpp(02413) : DPAdapterAccess::Connect: failed with error: Failed to create route route6 at: >>> Choice[[When[simple{${header.DpCommand} == 'browse' or ${header.DpCommand} == 'import'} -> [SetHeader[DpMetadataBrowser, simple{jdbc}], To[dpmetadata:dataSource]]], When[simple{${header.DpCommand} == 'query' or ${header.DpCommand} == 'update'} -> [Choice[[When[simple{${header.DpCommand} == 'query'} -> [To[jdbc:dataSource?useHeadersAsParameters=true&resetAutoCommit=false&outputType=StreamList], SetHeader[DpResultType, simple{map}]]], When[simple{${header.DpCommand} == 'update'} -> [To[jdbc:dataSource?useHeadersAsParameters=true&resetAutoCommit=true], SetHeader[DpUpdateCount, simple{${in.header.CamelJdbcUpdateCount}}]]]]], To[dpresult:dpresult]]]] Otherwise[[Stop]]] <<< in route: Route(route6)[From[direct:sda] -> [Choice[[When[simple{${hea... because of Unexpected token o at location 32
${header.DpCommand} == 'browse' or ${header.DpCommand} == 'import'

 

 