For so many of my customers, SAP Data Services is a relatively new tool. A typical DS project is mainly focused on ensuring the solution works for the business and is launched on time. Unfortunately, many of these projects fail to utilize some of the built-in features of Data Services to help simplify how code is managed in the solution. This is an architecture gap that adds hidden costs to owning and operating the solution.
In this article, I outline the framework for managing Data Services code that I have taught to dozens of the largest customers in the Americas. Ideally, you should implement this during the blueprint phase so that you provide your developers with the tools and processes to create better code the first time. However, you can still benefit from this framework even if you are approaching the testing phase of a Go-Live.
The elements of this framework include:

- Secure central repositories for version control
- Substitution parameters and substitution parameter configurations
- Datastore configurations
- Datastore aliases for table owner names
- System configurations
In Data Services, a "central repository" is a different type of repository used only for version control of objects. This is comparable to version control systems like CTS+, Visual SourceSafe, Apache Subversion, BusinessObjects LCM, etc. You can check in new code, check it out to work on it, check in new versions, and get copies of specific versions.
Many customers do not use central repositories. Instead, they create their code in a local repository, export the code to an ".atl" file, and import it into the test or production local repository. You can save backups of the .atl file and keep track of them in a number of ways...even Apple Time Machine and Dropbox can keep track of multiple versions of a file through time. However, this is likely not a scalable or trustworthy solution for enterprise IT.
If you want to learn how to work with a central repository, the Data Services Tutorial Chapter 12 "Multi-user Development" does a fantastic job at demonstrating all the techniques. The "Using a Central Repo" Wiki Page also captures some of the basic techniques. But neither will tell you why, or discuss how you should set up your landscape or processes.
[Note: There are two different types of central repositories: non-secure and secure. Secure central repositories allow only specific users permissions on specific objects and provide an audit trail of who changed which objects. Non-secure central repositories lack these features. Due to this gap, I never recommend the use of non-secure central repositories. In this article, whenever I refer to a central repository, I am talking about secure central repositories. Chapters 23-24 in the Data Services Designer Guide discuss these differences.]
This is how I recommend you configure your secure central repositories.
Remember, central repositories are only for version control, storing your code, and helping you migrate it. You never run batch jobs or launch real-time services from a central repo -- only from a local repo.
The tiered approach looks like this:
The repositories themselves are just database schemas -- you can put them in any supported database. Check the Product Availability Matrix for Data Services to see which databases are supported. However, I recommend grouping them together within the same physical database for each tier. For instance:
Substitution Parameters are such a handy feature, but I seldom see them used to their full potential! If you know C++, they are similar to compiler directives. They are static values that never change during code execution (so we don't call them variables). They are called "substitution" parameters because their values get substituted into the code by the optimizer when you run the job. They can thus change the run-time behavior of your code.
Often I see many programmers use a script block at the beginning of a job to set global variable values. These global variables are then used to control the logic or mappings later in the job. However, in 90% of these cases the global variables NEVER CHANGE during runtime. So now you have several problems in your code:
Substitution parameters fix all of these global variable shortcomings. They are defined for an entire repository, not per individual job. Their values are controlled at the repository level, so you don't have to include scripts to set them. They cannot change during run-time, so there is no risk of their being modified erroneously. Lastly, they don't have just one default value -- you can set up multiple substitution parameter configurations for your repository, giving you multiple different sets of run-time values.
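To make the mechanics concrete, here is a minimal sketch in plain Python (not Data Services script) of what the optimizer effectively does: before the job runs, every substitution parameter token is replaced with the static value from the chosen configuration. The parameter names, values, and configuration names below are invented for illustration.

```python
# Hypothetical substitution parameter configurations for one repository.
# Values are static: they are swapped into the code before execution
# and cannot change while the job runs.
substitution_parameter_configs = {
    "Dev":  {"$$FilePath": "/data/dev/files",  "$$Region": "US"},
    "Prod": {"$$FilePath": "/data/prod/files", "$$Region": "US"},
}

def substitute(code, config_name):
    """Replace every $$Parameter token with its value from the
    selected configuration -- a model of what the optimizer does."""
    resolved = code
    for name, value in substitution_parameter_configs[config_name].items():
        resolved = resolved.replace(name, value)
    return resolved

job_code = "read_file('[$$FilePath]/employees.csv')"
print(substitute(job_code, "Dev"))
# read_file('[/data/dev/files]/employees.csv')
```

Note the contrast with a global variable set in a script block: here there is no script to maintain, and nothing in the job can overwrite the value mid-run.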
Here are some common uses for substitution parameters:
Substitution Parameter Configurations are helpful because they let you set multiple different sets of substitution parameters. You can use this to set up multiple configurations for:
Substitution Parameters are not objects that can be checked into a central repository, since they aren't actually code objects. As such, there is a specific way to move them between local repositories. You must export them to an ATL file and import them into another local repository. Please refer to the example below:
This is an additional step to include in your migration plans from Dev -> Test -> Production. However, it is a relatively quick procedure for an administrator.
Datastore mistake 1: In many customer environments, I log into a local repository and see several datastores named similarly ("HANA_TARGET_DEV", "HANA_TARGET_QA", and "HANA_TARGET_PROD"). Or maybe I see many SAP datastores named after their SIDs ("BWD", "BWQ", "BWP"). If you make this mistake, you need to go through the following unnecessary steps:
Datastore mistake 2: Since this gets to be so time-consuming, many developers realize that they can just reuse one datastore from dev to test to production. So you see a datastore named "HANA_TARGET_DEV" or "BWD" in a production local repository. In this case, the administrators just explain how they change the hostname, username, and password of the datastore when they move it to test or production. Though this sounds simple, you still run the risk that you must change more than just username/password. In the case of an SAP ECC source datastore, are the transport file paths the same between your different ECC sources?
The solution to both of these mistakes? Datastore configurations.
Datastore configurations are very powerful. They allow you to have a single datastore that can connect to multiple different sources. They work very similar to substitution parameter configurations: at run-time, the optimizer selects a single configuration, and this connection information is used for the entire execution of the job and cannot be modified. You set them up in the datastore editor...the Data Services Wiki shows a good example.
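The idea can be sketched in plain Python (not Data Services script): one logical datastore carries several named configurations, and the optimizer binds exactly one of them for the whole job execution. The datastore name, hosts, and user names below are invented for illustration.

```python
# One logical datastore with multiple configurations -- the code always
# references "HANA_TARGET"; only the active configuration changes.
datastore = {
    "name": "HANA_TARGET",
    "configurations": {
        "Dev":  {"host": "hanadev01",  "user": "DS_DEV"},
        "Test": {"host": "hanatest01", "user": "DS_TEST"},
        "Prod": {"host": "hanaprod01", "user": "DS_PROD"},
    },
    "default": "Dev",
}

def connection_for(ds, active=None):
    """The optimizer picks exactly one configuration at run-time;
    if none is selected, the datastore's default configuration is used."""
    return ds["configurations"][active or ds["default"]]

print(connection_for(datastore, "Prod"))
# {'host': 'hanaprod01', 'user': 'DS_PROD'}
```

Because the dataflows reference only the logical datastore name, nothing in the code has to change when the job moves from Dev to Test to Production.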
I would strongly urge you to avoid the two mistakes above by starting your project with the following principles:
For advanced users, you can even use datastore configurations to move from one database platform to another without having to re-develop all your code.
If you are using database sources or targets, these tables always have an owner name or schema name (e.g. "SCOTT"."TIGER"). In the Data Services Designer interface, these owner names exist but are not usually very obvious to the user.
This is usually a problem that manifests itself when you migrate from Dev to Test or Test to Production. Let's say you developed your dataflow and used a source table named "SQLDEVUSR1"."EMPLOYEES". The username "SQLDEVUSR1" is the table owner. You also set up a second datastore configuration for the Test environment, and the username is "SQLTESTUSR5". When you run the job and set the Test datastore to be default, the job crashes at this dataflow with a "TABLE NOT FOUND" error. Why? It connected to the database specified in the Test datastore configuration as username "SQLTESTUSR5" and tried to find a table named "SQLDEVUSR1"."EMPLOYEES". This is a design problem, not a Data Services error.
Instead, you need to tell Data Services how to interpret the name "SQLDEVUSR1" differently depending on which datastore configuration is active. There is a feature called "Aliases" in each database datastore that lets you control this!
You can create one or more aliases in each database datastore to automatically replace the table owner name defined in the dataflow with the table owner name of your choice. At runtime, the optimizer does a search and replace through the code for any objects from that datastore and maps an alias named "SQLDEVUSR1" to be "SQLTESTUSR5".
Here's another example:
This is a little-known feature, but it saves you a ton of time if you have many developers who connected to various sandbox databases when developing the code. You can simply set up multiple aliases to search for various (and possibly incorrect) owner names and map them to what their "real" owner names should be within your official Dev/Test/Production datastore configurations.
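The alias mechanism described above can be modeled in a few lines of plain Python (not Data Services script): per datastore configuration, a mapping of design-time owner names to the "real" owner names, applied as a search-and-replace when the job runs. The owner and table names follow the example above; the "Prod" entries are invented for illustration.

```python
# Hypothetical alias mappings: design-time owner name -> owner name
# for the active datastore configuration.
aliases = {
    "Test": {"SQLDEVUSR1": "SQLTESTUSR5"},
    "Prod": {"SQLDEVUSR1": "SQLPRODUSR9"},
}

def resolve_table(owner, table, active_config):
    """Rewrite the fully qualified table name the way the optimizer
    does: unknown owners pass through unchanged."""
    mapping = aliases.get(active_config, {})
    return '"{}"."{}"'.format(mapping.get(owner, owner), table)

print(resolve_table("SQLDEVUSR1", "EMPLOYEES", "Test"))
# "SQLTESTUSR5"."EMPLOYEES"
```

This is also why multiple aliases help clean up after developers who used assorted sandbox owners: each stray owner name gets its own entry in the mapping.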
At this point, you have done the following:
Your setup might look like this:
The final step is to create system configurations. These are combinations of datastore configurations and substitution parameter configurations that let you set up job execution profiles that can be quickly and easily selected at run-time. The optimizer then uses only that combination of configurations for the execution of the entire job. If you have never defined a system configuration in a local repository, you will not see any option for it when you run a job. Once you define system configurations, however, you will see a convenient drop-down box that shows the names of your various system configurations:
If we use the example above with the 3 datastores with 3 different configurations and the 6 different substitution parameter configurations, you can now create system configurations as combinations of these. Here is how you might set up your system configurations:
After this, when you run or schedule a job, you would see a drop-down with your 6 different system configuration names:
To be honest, this isn't a very good example. Why would you want your production local repository to be able to easily execute jobs in your Dev landscape? Instead, you should set up system configurations that correspond only to the substitution parameter configurations and datastore configurations you actually intend to use when you run the jobs. So in this example, you would probably configure your production local repository with only the "ProdInit" and "ProdDelta" system configurations, so you never make the mistake of selecting one of the Dev or Test configs.
What if you don't select a system configuration at run-time? Each datastore has a "default" datastore configuration, and each repository also has a default substitution parameter configuration. If no system configuration is selected, the optimizer uses the default datastore configuration for each datastore and the default substitution parameter configuration for that repository.
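Putting the pieces together, here is a minimal Python sketch (again, not Data Services script) of a system configuration: a named bundle of one datastore configuration per datastore plus one substitution parameter configuration, with the fallback-to-defaults behavior described above. All configuration names besides "ProdInit" and "ProdDelta" are invented.

```python
# Hypothetical system configurations for a production local repository.
system_configurations = {
    "ProdInit":  {"datastore_config": "Prod", "subst_config": "Prod_Initial"},
    "ProdDelta": {"datastore_config": "Prod", "subst_config": "Prod_Delta"},
}

# Defaults used when no system configuration is selected at run-time.
DEFAULT_DATASTORE_CONFIG = "Dev"
DEFAULT_SUBST_CONFIG = "Dev_Initial"

def execution_profile(system_config=None):
    """Return (datastore configuration, substitution parameter
    configuration) for a job run, falling back to the defaults when no
    system configuration is chosen."""
    if system_config is None:
        return (DEFAULT_DATASTORE_CONFIG, DEFAULT_SUBST_CONFIG)
    chosen = system_configurations[system_config]
    return (chosen["datastore_config"], chosen["subst_config"])

print(execution_profile("ProdDelta"))   # ('Prod', 'Prod_Delta')
print(execution_profile())              # ('Dev', 'Dev_Initial')
```

Limiting the dictionary to "ProdInit" and "ProdDelta" mirrors the advice above: the production repository simply cannot be pointed at Dev or Test by accident.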
Similar to substitution parameter configurations, system configurations cannot be checked into a central repository. They can be migrated the same way shown above for substitution parameters, by exporting them to an ATL file. However, this is probably not necessary -- system configurations are very quick to define, and you probably only want to create them for the environment you intend to run in (e.g. "ProdInit" and "ProdDelta").
Let me know if this framework makes sense. If you see strange errors, search the KBase or file an SAP Support Message under component EIM-DS.