werner_daehn
Active Contributor
A frequently asked question is how to ingest data available in Kafka into Hana, e.g. for analytics. Should the process be driven from Hana or from Kafka?

Hana reads from Kafka


SAP Hana has a wonderful data federation feature that makes remote data appear in Hana as if it were a regular table. This feature is the technical foundation of the Data Fabric (a minimal sketch follows the list below). The main advantages are:

  1. Supports all styles of integration: virtual data model, batch data integration, batch CDC integration, realtime transactionally consistent integration

  2. Everything is maintained in Hana via user-friendly UIs

  3. It is allowed with an SAP Hana runtime license
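
To make the federation idea concrete, here is a minimal sketch using SAP's hdbcli Python driver. It assumes a remote source has already been configured in Hana, and all object names are invented placeholders, so treat it as an illustration rather than the exact setup.

    # Minimal sketch: expose a remote table as a virtual table in Hana and query it.
    # Assumes a remote source "MY_REMOTE_SRC" is already configured; all object
    # names are illustrative placeholders, and the four-part remote name
    # ("source"."database"."schema"."object") depends on the adapter used.
    from hdbcli import dbapi

    conn = dbapi.connect(address="hana.example.com", port=39015,
                         user="MYUSER", password="secret")
    cur = conn.cursor()

    # The virtual table behaves like a local table but forwards every query
    # to the remote system.
    cur.execute(
        'CREATE VIRTUAL TABLE "MYSCHEMA"."V_SALES_ORDERS" '
        'AT "MY_REMOTE_SRC"."<NULL>"."SRC_SCHEMA"."SALES_ORDERS"'
    )

    cur.execute('SELECT COUNT(*) FROM "MYSCHEMA"."V_SALES_ORDERS"')
    print(cur.fetchone())
    conn.close()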


That sounds brilliant and usually it is, but not for a streaming data provider like Kafka.

Things to consider:

  • Querying the data on demand for the virtual data model will obviously only be as fast as the source can provide that data, and Kafka is not a database. Reading all available data from a stream that constantly changes is surprisingly difficult (see the sketch after this list). Not to mention that Kafka typically retains only the change data of the last week. That is what it is meant for: distributing change data with low latency. Querying all data on demand is the exact opposite of that.

  • The same applies to batch data integration, except that the queries are simpler and executed only once a day, so the additional overhead does not matter. What remains a problem is that Kafka does not contain all the data of a "table".

  • Batch CDC is the sweet spot of Kafka. As Kafka contains the changes of the last week, reading the changes e.g. once a day into BW is a perfect use case.

  • For realtime data integration I am torn. One argument is that Kafka is built for low-latency realtime streaming, so it should be a perfect fit. But it does not support transactional consistency, neither within a table (ordering is guaranteed only within one Kafka topic/partition) nor across multiple tables. (Note: Kafka has a feature called transactional producers/consumers, but that addresses a different use case.) Supporting both of these requirements would demand functionality that hinders high performance. So you would be connecting a high-speed, messy data stream with a database built for perfectly transactional realtime integration. That does not fit well. Obviously this argument only applies if multiple tables with relationships to each other are ingested. If the task is to get data from a single table in whatever order, it is less of a problem. Still, not a perfect match.

  • Data in Kafka can be nested, and Hana cannot cope with that. While Kafka itself is format-agnostic, most data streams use JSON or Avro (= binary-encoded JSON with schema support), and both can be nested. For example, a message might contain a sales order with an array of its line items. How is that put into a relational database and still kept fast to query? Difficult, if not impossible.
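
To illustrate the first point above: "querying" a Kafka topic on demand essentially means re-reading it from the earliest retained offset every single time. A minimal sketch with the confluent_kafka Python client (broker, group id and topic name are invented placeholders):

    # Minimal sketch: the only way to "query all data" of a topic is to re-read it
    # from the earliest retained offset. Broker, group id and topic are placeholders.
    import json
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "kafka.example.com:9092",
        "group.id": "adhoc-query",
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,
    })
    consumer.subscribe(["sales.orders"])

    rows = []
    while True:
        msg = consumer.poll(1.0)
        if msg is None:        # nothing arrived within the timeout -> assume caught up
            break
        if msg.error():
            continue
        rows.append(json.loads(msg.value()))
    consumer.close()

    # "rows" holds only what is still inside the retention window, not the full
    # history of the table - and the topic kept moving while we were reading.
    print(len(rows), "records read")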


 

Kafka writes into Hana


Kafka supports producers and consumers of streaming data, and Kafka Connect is the product that provides such connectors for various sources and targets out of the box. Guess what: a typical consumer writes the data into a relational database. SAP even provides a Hana consumer!

Hana on-premise, SAP Datasphere, SAP Hana Cloud, BW, ... all provide JDBC connectivity, so loading an existing table, whether created via SQL or via ABAP, is no problem. This greatly simplifies the process, performs better, and is easier to maintain.
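
In its simplest form this direction is just a consumer that inserts every flat record into an existing Hana table. Kafka Connect packages exactly this pattern and adds offset management, batching and fault tolerance; the hand-written sketch below (with invented topic, table and column names) only shows the core idea.

    # Minimal hand-rolled sink: consume flat JSON records and insert them into an
    # existing Hana table. All connection details, topic, table and column names
    # are invented placeholders.
    import json
    from confluent_kafka import Consumer
    from hdbcli import dbapi

    conn = dbapi.connect(address="hana.example.com", port=39015,
                         user="MYUSER", password="secret")
    cur = conn.cursor()

    consumer = Consumer({
        "bootstrap.servers": "kafka.example.com:9092",
        "group.id": "hana-sink",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["sales.orders"])

    insert_sql = ('INSERT INTO "MYSCHEMA"."SALES_ORDERS" '
                  '(ORDER_ID, CUSTOMER, AMOUNT) VALUES (?, ?, ?)')

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        rec = json.loads(msg.value())
        cur.execute(insert_sql, (rec["order_id"], rec["customer"], rec["amount"]))
        conn.commit()  # committing per record keeps the sketch simple; batch in real life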

The limitations of Kafka mentioned above still apply. For example, as Kafka itself does not provide strong transactional consistency, neither does this method. Achieving that would require additional information plus special producers and consumers.

Things to consider:

  • Most Kafka consumers for relational databases require flat data, including the SAP-provided connector. A consumer for nested data would need to load multiple tables simultaneously and know the primary key. For example, for the structure sales order with an array of line items, it would load the sales order table and the line item table and use the order number as the link between the two (see the sketch after this list). Doable, in fact I have built a demonstrator for that.

  • Creating a table via SQL is definitely not allowed with an SAP Hana runtime license; for Datasphere it is. But tables can always be created via ABAP and then loaded via SQL. This is allowed by SAP for certain scenarios but remains a bit of a gray area. The only safe approaches are a Hana enterprise license or a written exemption from SAP, so please spend some time on that with your legal team. If the table is created via ABAP, loading will be more difficult, as all columns are not-null and the data must be transformed inside Kafka Connect. Doable, but no fun.
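
The demonstrator itself is not public, but the flattening idea behind the first bullet can be sketched in a few lines: split a nested order message into one header row and several item rows that share the order number as the key (all field names are invented for illustration).

    # Minimal sketch: flatten a nested order message into two relational row sets
    # linked by the order number. All field names are invented.
    import json

    message = json.loads("""
    {
      "order_number": 4711,
      "customer": "ACME",
      "items": [
        {"line": 1, "material": "M-01", "quantity": 5},
        {"line": 2, "material": "M-02", "quantity": 2}
      ]
    }
    """)

    # The header row goes into the sales order table ...
    order_row = {k: v for k, v in message.items() if k != "items"}

    # ... and each array element becomes one row in the line item table,
    # carrying the order number as the link between the two tables.
    item_rows = [{"order_number": message["order_number"], **item}
                 for item in message["items"]]

    print(order_row)
    print(item_rows)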


I am curious to see what follows from the Confluent partnership. A connector inside Hana? A Kafka Connect solution, but more user friendly? For Databricks it was the former; for Kafka I hope for the latter.

 
1 Comment
pbaumann
Active Contributor
Thank you wdaehn for your insights. I directly added your blog to my recent blog about the partnership: SAP Datasphere & Partnerships – Confluent

I'm also curious to see what happens next here...