Technology Blogs by SAP
Sefan_Linders
Product and Topic Expert

Introduction

At the time of writing, there is no dedicated connection tile for Snowflake in SAP Datasphere. An alternative you can use today is the Generic JDBC connection, which can leverage the Snowflake JDBC driver. Since I received several questions on how to configure this and how to overcome certain limitations, I'd like to share my view on the best way to set it up. This setup supports both federation and batch replication through remote tables. Data Flows are not natively supported, but you can use the created remote tables as a source in a Data Flow.

This blog covers the connection from SAP Datasphere, but since the underlying framework for the connection is SAP Smart Data Integration, a similar configuration can be made on SAP HANA Cloud, although the user interface will differ.

Requirements

Besides the obvious need for an SAP Datasphere tenant and a Snowflake account, you will need the following:

  • An installed SAP Data Provisioning Agent (version 2.6.1.0 or later); the installation itself is not covered in this blog;
  • Admin rights on Snowflake to create a user and assign privileges.

How to set up

Prepare the CamelJDBC adapter and XML files

Most of the steps are described in or linked from the SAP Help pages, but in summary the following steps need to be followed:

  • Upload the Snowflake JDBC driver to folder <DPAgent_root>/camel/lib;
  • After uploading the driver, restart the DP Agent for the driver to load;
  • Check if the CamelJDBC config part in <DPAgent_root>/camel/adapters.xml is uncommented (see the SAP Help documentation for details);
  • Make sure CAP_LIMIT is part of the capabilities in file <DPAgent_root>/camel/adapters.xml, otherwise TOP or LIMIT operations are not pushed down;
  • Make sure the Generic JDBC adapter is activated in SAP Datasphere;
  • If you made any changes in the config files or upgraded the agent, make sure to refresh the capabilities in SAP Datasphere.
  • Make sure you have the right BACKTICK setting: in <DPAgent_root>/camel, check the file configfile-jdbc.properties. It should say delimident=true, not delimident=BACKTICK. If other Camel clients require the BACKTICK setting, they should reside in a separate DP Agent installation.
  • Edit the file <DPAgent_root>/camel/sample-jdbc-dialect.xml and change the data type mapping for source type "NUMBER". Make sure the file has the two entries below for that source type. This is necessary because in Snowflake, any integer-like value (e.g. TINYINT or BIGINT) is defined as NUMBER(38,0) in the source dictionary. To avoid values not fitting into the default mapped HANA INTEGER data type, the mapping should be changed to BIGINT (a query to check the source definitions follows after the mapping entries).

<Mapping srcType="NUMBER" length="" precision="" scale="[1,38]" hanaType="DECIMAL" />
<Mapping srcType="NUMBER" length="" precision="" scale="[-38,0]" hanaType="BIGINT" />
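
You can verify this behaviour in the Snowflake dictionary itself. The query below is only an illustration (SNOWFLAKE_SAMPLE_DATA, TPCH_SF1 and ORDERS are example objects, so adjust the names to your own); integer-like columns will typically be reported as NUMBER with precision 38 and scale 0:

-- Check how Snowflake reports integer-like columns in its dictionary
-- (example objects; replace with your own database, schema and table)
select column_name, data_type, numeric_precision, numeric_scale
from SNOWFLAKE_SAMPLE_DATA.INFORMATION_SCHEMA.COLUMNS
where table_schema = 'TPCH_SF1'
  and table_name = 'ORDERS'
  and data_type = 'NUMBER';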

Create a user in Snowflake

Create a dedicated user in Snowflake to log on with from SAP Datasphere. I created a user with the SQL statements shown after this list, but there is obviously some freedom in how you want to do this. Please do note that:

  • This user requires operate/usage privileges on a warehouse or you will run into errors;
  • I have chosen not to assign a default database and a default warehouse to the user, and instead provide those parameters in the connection string of the SAP Datasphere connection. You have to do one or the other, either assign them to the user directly or add them to the connection string, or you will run into errors;
  • I recommend assigning at most one database to a user. Assigning more databases can become confusing, as SAP Datasphere will list all schemas of all databases in one go, without differentiating between databases, and only the default database can actually be queried. If you have more than one database to connect to, create one connection with a dedicated user for each database.

create role dwc_role;
grant operate on warehouse sample_wh_xs to role dwc_role;
grant usage on warehouse sample_wh_xs to role dwc_role;
create user dwc_generic password='<PASSWORD>' must_change_password = false;
grant role dwc_role to user dwc_generic;
alter user dwc_generic set default_role = dwc_role;
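
The statements above only cover the warehouse. Depending on which database you want to expose, the role also needs read access. The grants below are a sketch rather than a definitive list: the shared sample database is handled with IMPORTED PRIVILEGES, and MY_DB is a placeholder for a regular database of your own.

-- Shared sample database (as used in the connection string later in this blog)
grant imported privileges on database snowflake_sample_data to role dwc_role;
-- For a regular database of your own (MY_DB is a placeholder), grants along these lines are needed
grant usage on database my_db to role dwc_role;
grant usage on all schemas in database my_db to role dwc_role;
grant select on all tables in database my_db to role dwc_role;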

Create the connection in SAP Datasphere

You need to create a DWC connection for each Snowflake database you want to connect to. The reason is that one Snowflake connection can give you access to multiple databases, but SAP Datasphere does not use the database metadata. This can result in multiple issues:

  • Seeing multiple schemas with the same name, not knowing to which database they belong;
  • Not being able to fetch data from any database other than the default database that has been set for the user or configured in the connection string.

In SAP Datasphere, create a new connection and choose the "Generic JDBC" connection tile. The configuration should look similar to the screenshot below.

Figure 1: Configuration in Generic JDBC connection tile in SAP Datasphere


JDBC driver class: net.snowflake.client.jdbc.SnowflakeDriver

JDBC connection string: jdbc:snowflake://<account>.snowflakecomputing.com?db=SNOWFLAKE_SAMPLE_DATA&warehouse=sample_wh_xs

As you can see, I have set a database and a warehouse, which will be used as defaults in the JDBC session. Alternatively, you can leave these out and assign a default database and warehouse when creating the user in Snowflake. After completing the connection wizard, the setup is complete and you should be able to start using Snowflake as a source for modeling.
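
If you prefer to set the defaults on the Snowflake side instead of in the connection string, a sketch of the corresponding statements could look like the following (reusing the user and warehouse created earlier; DEFAULT_NAMESPACE sets the default database):

-- Assign defaults to the connection user instead of passing them in the JDBC URL
alter user dwc_generic set default_warehouse = sample_wh_xs;
alter user dwc_generic set default_namespace = snowflake_sample_data;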

Troubleshooting

Below, the most common issues and their solutions are listed. When you run into something unexpected with this connection type and the error shown in the user interface is unclear or absent, check the Data Provisioning Agent log files. This can be done either from SAP Datasphere directly, or by checking the framework.trc files in the <DPAgent_root>/log directory on the Data Provisioning Agent instance.

Data preview fetches all data from source

During data preview, it can happen that all data is fetched from the source. You would notice this because the data preview takes very long, or because the SQL statement in the remote query monitor or in the Snowflake query history shows no TOP or LIMIT clause being applied. In that case:

  • Check that your Data Provisioning Agent is up to date and run the latest version; the feature to push down TOP and LIMIT operations was added in version 2.6.1.0;
  • After upgrading, check the capabilities as explained in the paragraph "Prepare the CamelJDBC adapter and XML files", restart the agent, and refresh the adapter capabilities in SAP Datasphere (you can verify the result with the query shown after this list);
  • You might also have to re-save your connection in SAP Datasphere by opening the connection settings, re-entering your credentials and saving the connection. Sometimes this is needed in addition to refreshing the adapter capabilities for this remote source.
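
If you want to verify whether CAP_LIMIT has actually reached the underlying database, a query along these lines, run with a database analysis user, should return one row (the same check also appears in the comments below):

select * from adapter_capabilities
where adapter_name = 'CamelJdbcAdapter' and capability_name = 'CAP_LIMIT'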

DWC does not list databases and shows the schemas of all databases in one unorganized list

Below is a screenshot showing duplicate schema names in SAP Datasphere.

Figure 2: Snowflake schema representation in SAP Datasphere source hierarchy

The reason for the duplicate schemas showing up is that these schemas are present in multiple Snowflake databases. However, the database metadata is not used to present the source system hierarchy, and therefore the schemas look like duplicates.

To partially overcome this representation issue, the Snowflake database user should be restricted in the number of databases it is authorised for. This also means that if you want to access multiple Snowflake databases, you should create separate connections and use a separate Snowflake user for each connection.

However, the two databases DEMO_DB and UTIL_DB are assigned to any user, and both have an INFORMATION_SCHEMA and a PUBLIC schema. Therefore, these will always show up as duplicates.
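
To check what a connection user is actually authorised for, you can log on to Snowflake as that user and run, for example, the statements below; ideally the list of databases contains little more than the one you intend to expose:

-- databases visible to the current role
show databases;
-- privileges granted to the role used by the SAP Datasphere connection
show grants to role dwc_role;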

DWC data preview error: “Schema X does not exist or not authorized”

When trying to access a schema other than the user's default schema, the remote table can still be created, but data preview fails and the following error is listed in the agent logs:

2022-05-19 12:36:44,419 [ERROR] [1ffdd62b-2fac-4883-9841-54281957b4cc52674] DefaultErrorHandler | CamelLogger.log - Failed delivery for (MessageId: ID-ip-172-31-0-19-1652960718543-0-3 on ExchangeId: ID-ip-x-x-0-3). Exhausted after delivery attempt: 1 caught: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error:
Schema 'SAMPLE_DB.TPCH_SF1' does not exist or not authorized.

The reason for this is that SAP Datasphere is not aware of, or ignores, the database parameter. To overcome this issue, create an individual connection for each Snowflake database that you want to access, with a parameter to set the default database. This can be achieved using a database connection parameter in the connection settings, as shown in the paragraph Create the connection in SAP Datasphere.
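
For the database from the error above, for example, that would mean a dedicated connection whose connection string sets db=SAMPLE_DB, along the lines of the following (account and warehouse values being your own):

jdbc:snowflake://<account>.snowflakecomputing.com?db=SAMPLE_DB&warehouse=sample_wh_xs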

Data preview returns 0 records, or throws the error "an error occurred while loading metadata"

It can happen (as it happened to me) that data fetching and data preview initially work, and then at some point just run into an error. When this happened to me, the error in the agent logs was as follows:

2022-05-23 08:22:08,550 [ERROR] [15ad44f9-9977-4e42-b317-b045f7fa4cc474829] DefaultErrorHandler | CamelLogger.log [] - Failed delivery for (MessageId: EDAED96EEC4A7FE-0000000000000002 on ExchangeId: EDAED96EEC4A7FE-0000000000000002). Exhausted after delivery attempt: 1 caught: net.snowflake.client.jdbc.SnowflakeSQLException: No active warehouse selected in the current session. Select an active warehouse with the 'use warehouse' command.

It turns out that I had not set an active warehouse for the user, a setting that might have changed after I stopped working actively in the Snowflake admin tool and a new session was initiated without a default warehouse assigned.

The solution is to assign a default warehouse to the user. This can be done either in Snowflake user management, or by defining the default warehouse in the connection parameters, as shown in the example in the connection configuration paragraph. When you make changes like these, you might have to disconnect the session from the Snowflake admin tool to force a new session with the new settings.

Data preview error: NumberFormatException

In most cases, this is because you are loading values larger than an integer into a HANA INTEGER field. Check the paragraph Prepare the CamelJDBC adapter and XML files on how to change the mapping for source data type NUMBER(38,0). You might also have another data type mapping issue and need to adjust a different source-to-target mapping. Check the agent's framework.trc logs for more clues; usually the erroneous value is listed, from which you can deduce which source or target data type is the culprit.

Cannot load JDBC driver class

The following error might appear in the DP Agent framework.trc file:

Failed to query schemas. org.apache.commons.dbcp.SQLNestedException: Cannot load JDBC driver class 'net.snowflake.client.jdbc.SnowflakeDriver'

In that case:

  • Check if the driver is really the right Snowflake JDBC driver (in my case the file name has the pattern "snowflake-jdbc-<version>.jar") and if permissions allow the DP Agent user to read it;
  • Check if you uploaded it into the right folder (<DPAgent_root>/camel/lib);
  • Make sure you restarted the DP Agent after uploading the driver.

No specific error, but the connection doesn't work

If the connection validation throws a non-specific error such as "internal error: Cannot get remote source objects [...]" and the DP Agent logs do not show any more specific errors, then double-check the hostname of your Snowflake instance. If there are errors in the hostname, this might not be found in the DP Agent log files.

Conclusion

Hopefully this blog helped you set up or troubleshoot your connection from SAP Datasphere to Snowflake using the Generic JDBC connection tile. If you have any experiences to add, please leave them in the comments.

19 Comments
marke_
Explorer
You didn't mention the capabilities.xml file in “Prepare the CamelJDBC adapter and XML files” (or at least I can't find it). How to set Cap_*?
Sefan_Linders
Product and Topic Expert
Capabilities are defined in <DPAgent_root>/camel/adapters.xml. I see I made an error in the troubleshooting section where I pointed to the wrong xml file. That's corrected now.
former_member446513
Discoverer
Can we add CAP_UNION as well? I have added it to the adapter capabilities and it is reflected at HANA level as well, but the query behavior is not changed; the union is not getting pushed down to Snowflake in this case.

 

Any leads?
Sefan_Linders
Product and Topic Expert
If you run the query below, you will see that UNION is not supported in the default adapter capabilities for CamelJDBC. Adding it in the adapters file usually does not change this behavior and can even lead to errors.
select * from adapter_capabilities
where adapter_name = 'CamelJdbcAdapter' and UPPER(description) like '%UNION%'
marcoroth
Discoverer
Hi Stefan,

In the part "How to set up" you wrote that the capability CAP_LIMIT should be set in the adapters.xml file. But actually, in the "Tag Area" CamelJdbcAdapter this capability is not listed.

What's your recommendation? Do I have to add this capability manually, or is this not an option because the capability isn't part of the current release (we use 2.6.3.4)?

Many thanks for your feedback

Best regards,

Marco
Sefan_Linders
Product and Topic Expert

Not sure what you mean by "Tag Area", but I assume you are looking at the capabilities tag. You are on the right release, so you can just open the adapters.xml file and add a line with the CAP_LIMIT capability. This would look as follows; if I look at version 2.6.4.3 in an unedited file, that capability would sit at line 311, although it doesn't matter where in that list you add it:

[...]
<Capabilities>
CAP_LIMIT,
CAP_AND_DIFFERENT_COLUMNS,
CAP_TRUNCATE_TABLE,
[...]
After saving the file, refresh the capabilities in SAP Datasphere. You can check in SAP Datasphere if the capability has successfully been updated in the database by running the following query using a DBA user:
select * from adapter_capabilities
where adapter_name = 'CamelJdbcAdapter' and CAPABILITY_NAME = 'CAP_LIMIT'​
Joel_B
Explorer
Hi Stefan,

Thanks for the blog post.  I was able to follow along and use DP Agent to deploy the Snowflake connections.  The problem I am now facing is that the connector:

  • Data flows are not supported

  • Replication flows are not supported

  • Remote Tables are enabled (making this the only option I see available)


The result of the above is that the only method I can see to utilize the tables is to replicate table snapshots.  Is there something I might be missing?  Is there any option other than replication that can be configured?

Thank you!
Sefan_Linders
Product and Topic Expert
Data flows and replication flows currently indeed don’t support generic JDBC. But besides table snapshots and federation, you can also leverage data flows by reading from a virtual table that you create before creating the data flow. The virtual table is then just a source table for the data flow.
Joel_B
Explorer
Hi Sefan,

Thank you so much for the feedback and suggestions!  I am very new to DS and working on our first POC, meaning I am still trying to piece together a proper architecture design.  I was hoping that SDI would bring us functionality for Real-Time Replication or replication via delta.  Instead, Snowflake running over Generic JDBC does not provide the capabilities available to other Connection types.

If I understand correctly, the only options for Snowflake and JDBC-based drivers are:

  • Snapshot replication of the whole Remote Table

  • Use Data flows to import the Remote table data to a DS Table

    • Use Batches in a data flow to limit memory usage and transfer sizes

    • Then, use views to logically partition the data such that dataflows can replicate just the changed data.  Though, this design is completely manual.




In my case, the Snowflake data is massive and cannot be replicated as whole table snapshots.  Do you have any suggestions for delta replication other than to use a Data Flow and a view for manual partitioning?

Thank you!

Joel
Sefan_Linders
Product and Topic Expert
You could also look into the view partitioning which allows you to lock partitions, see this help page. The locked partitions won't be re-loaded upon snapshotting the view. You would then obviously need a partition strategy where you know that certain partitions remain unchanged. Do you have one or more columns in the source data that tell you if data has changed, such as a creation or changed date?
sruggiero
Newcomer
I am getting something similar. Sefan, do you have any insight to this?

 

Also, my goal is to use Datasphere to "Virtualize" Snowflake data. I assume I would just use a SQL View?



former_member446513
Discoverer
Hello,

 

I'm getting below error in hana when we try to read data from snowflake:


I have added the solution already but still failed:



 

Data in the source is like :


Source data type is number (21,7) and target datatype is decimal (21,7).

Any idea how to fix this?

Thanks for the support in adv.

Br//Rajesh G
Sefan_Linders
Product and Topic Expert

Can you try to replace the entries provided in the blog with the following?

<Mapping srcType="NUMBER" length="" precision="[0,38]" scale="[1,38]" hanaType="DECIMAL" /> 		
<Mapping srcType="NUMBER" length="" precision="[0,38]" scale="[0,0]" hanaType="BIGINT" />
<Mapping srcType="DECIMAL" length="" precision="[0,38]" scale="[1,38]" hanaType="DECIMAL" />
<Mapping srcType="DECIMAL" length="" precision="[0,38]" scale="[0,0]" hanaType="BIGINT" />
<Mapping srcType="INT" length="" precision="" scale="" hanaType="INTEGER" />
<Mapping srcType="INTEGER" length="" precision="" scale="" hanaType="INTEGER" />
<Mapping srcType="BIGINT" length="" precision="" scale="" hanaType="BIGINT" />
Joel_B
Explorer
Any thoughts on how to handle Snowflake UUID types?  Thanks in large part to your blog post, we are now implementing a Snowflake to DS system.  Thank you for helping us get here!  The one issue we are really struggling with is that the Camel driver converts the Snowflake native type UUID to VARCHAR(16777216).  As it happens, our Snowflake data source uses UUID a great deal.  The result, of course, is that remote tables are now massively inefficient and too large.  Also, Datasphere treats all these fields as LargeString, which is also unacceptable.  The best workaround we have come up with is to change jdbc-dialect.xml such that:
<Mapping srcType="VARCHAR" length="" precision="" scale="" hanaType="VARCHAR" />

The above truncates the VARCHAR down to 5000, which then maps in DS to String(5000).  Do you happen to have any other suggestions on how to handle this issue? Thank you!
Sefan_Linders
Product and Topic Expert
If the connection validation throws a non-specific error such as “internal error: Cannot get remote source objects […]” and the DP Agent logs do not show any more specific errors, then double-check the hostname of your Snowflake instance. If there are errors in the hostname, this might not be found in the DP Agent log files.
Sefan_Linders
Product and Topic Expert
With what data type have you defined your UUID column in Snowflake? I cannot find a "UUID" data type in Snowflake. In fact, the Snowflake documentation describes that their UUID generator function produces a string-based UUID with a length of 36, so I would expect a proper UUID column to be defined as a text-based column with exactly that length.

In your case, it might be that the UUID column is not defined with a specific length. Then, Snowflake defines it with max length 16,777,216, which is correctly read by the DP Agent.

So, run a check on your column definition in Snowflake:

select TABLE_NAME, COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH from <DATABASE>.INFORMATION_SCHEMA.COLUMNS where table_name = '<TABLE_NAME>'
srinivasareddym
Explorer
Hello Sefan,


Thank you for a nice post. It helps.

I would like to have some clarification on CAP_LIMIT. I added the capability in adapters.xml and can see it at table level as well.
However, when I run SELECT TOP/LIMIT, the push-down is not happening: the query brings all the records available in the source
and finally displays the TOP * records. I am using DP Agent version 2.5.5.2 with HANA 2.0 SPS04.

Please let me know if I am missing anything here.

Thank you
Sefan_Linders
Product and Topic Expert
As stated in the blog, you need minimum version 2.6.1.0. Any version older than that does not push down TOP/LIMIT.
mcalucin
Participant

Hi Stefan,

Thanks for the very informative blog.  Using this blog, I managed to get my BW/4HANA system working with Snowflake via SDI - DP Agent.  It's been working ever since I read this blog almost two years ago.

However, something new has come up.   

When I try to create a virtual table from a SNOWFLAKE view, I get the following error:
SAP DBTech JDBC: [403]: internal error: Failed to create virtual table: duplicate column name: BASE_UOM: line 1 col 1038

I've upgraded to DP Agent 2.7.3.2 -- windows version with snowflake-jdbc-3.13.26.jar, and I'm still having the same problem.

 

No errors are showing in the framework.trc.   

What additional things should I look for to troubleshoot this issue?

Regards,

Mel Calucin