boobboo
Contributor

The idea for this blog started about a year ago at a customer site where we were having periodic slowdowns. The system was not configured in Solution Manager for performance data collection, and we could only visualise one hour of performance statistics via ST03N before we affected system performance ourselves. We had a performance problem and limited ways to access the data needed to troubleshoot it, which got me thinking about the most effective way to do initial performance troubleshooting.

This blog will detail a way to perform a solid initial analysis on an SAP system, using simple tools and visualisations to find where your performance issue is located. As this is quick and dirty, I am going to make a few assumptions:

1. There is no Solution Manager or equivalent receiving any performance data for the SAP instance having performance issues

2. The issues are on an ABAP stack; the approach is database independent.

3. The collection, analysis and visualisation must be completed within 24 hours, or it cannot be considered quick.

One of the most powerful tools in a system administrator's hands on an ABAP stack is the transaction /SDF/MON.

[Image: SDF_MON1_Setup_screen - /SDF/MON selection screen]

I won't describe each of the selections in detail; in general the default selections are useful for an initial data set. The main gap is the lack of visibility of database metrics, which can be a blind spot for any analysis. The picture below shows the output I received from a test run of the selections above.

[Image: SDF_MON2_Output_Screen - /SDF/MON output screen]

This will provide the raw data for our analysis. Although I could use data from ST03N, one of the biggest problems with that data is that different scales are used for the same data type. For example, time data in ST03N can be in seconds for some measures and milliseconds for others, making analysis difficult.

Once the data has been collected, the next phase is to analyse it. This is easily the most time-consuming part of the process; my good friend andrew.fox calls it the iceberg challenge, as most of the detailed work is under the surface. Below you can see a table of the metrics output by /SDF/MON, together with the additional metrics that aid the calculations, for example the number of free work processes and the amount of used memory (one way to derive these columns is sketched after the screenshots).

/SDF/MON metrics: Date, Time, Server Name, Act. WPs, Dia. WPs, RFC WPs, CPU Usr, CPU Sys, CPU Idle, Paging in, Paging out, Free Mem., EM allocated, EM attached, EM global, Heap Memory, Paging Mem, Roll Mem, Dia., Upd., Enq., Logins, Sessions, Pri.

Additional data: Number of WPs, Physical Mem, Free WPs

[Image: SDF_MON1_Setup_screen2]

Or, in a more readable version:

[Image: SDF_MON1_Setup_screen3]
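If you prefer scripting this step to doing it by hand, the derived columns can be added as soon as the snapshot is exported. The sketch below assumes the /SDF/MON output has been saved as a CSV with the column names from the table above; the file name and the two formulas (free work processes and used memory) are my own illustrative assumptions, not something /SDF/MON produces directly.

```python
import pandas as pd

# Minimal sketch: load an exported /SDF/MON snapshot and derive the extra
# columns used later in the analysis. The file name, the exact column
# names and the two formulas are assumptions for illustration only.
df = pd.read_csv("sdf_mon_export.csv")

# Free work processes: configured work processes minus those currently active.
df["Free WPs"] = df["Number of WPs"] - df["Act. WPs"]

# Used memory: physical memory minus the free memory reported in the snapshot.
df["Used Memory"] = df["Physical Mem"] - df["Free Mem."]

print(df[["Date", "Time", "Server Name", "Free WPs", "Used Memory"]].head())
```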

After a number of attempts to find a way to represent and visualise the data meaningfully, so that any issues would be easy to show, I found that aggregating abstracted data was best at helping me to see specific effects.

The grouping below (also written out as a simple mapping after the table) shows the metrics which most affect each area of study. For example, having no free work processes directly affects User performance, but it does not affect Server performance. Similarly, paging directly affects Server performance, which raises CPU utilisation and indirectly affects User and Application performance.

User: Free WP, Used Memory, CPU Idle, Sessions

Server: CPU User, CPU Sys, Paging, Used Memory

Application: Free WP, Sessions, Act WP, Logins
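To make the grouping reusable later, it can be written down once as a plain mapping. This is a sketch only: the metric names follow the earlier sketch rather than the exact /SDF/MON column headings, and the single "Paging" entry is taken here as paging in plus paging out.

```python
# User / Server / Application grouping from the table above, as a plain
# mapping. Names follow the earlier sketch; the post's single "Paging"
# entry is interpreted as paging in and paging out.
METRIC_GROUPS = {
    "User": ["Free WPs", "Used Memory", "CPU Idle", "Sessions"],
    "Server": ["CPU Usr", "CPU Sys", "Paging in", "Paging out", "Used Memory"],
    "Application": ["Free WPs", "Sessions", "Act. WPs", "Logins"],
}
```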

Before grouping the metrics I had to abstract the values to give a consistent scale from good to bad. To do this I found the difference between the minimum and maximum values for each metric, then divided that range into quartiles (2-4, 4-6, 6-8, 8-10). As shown below, I now had ranges to which specific and consistent values could be applied (one way to compute this is sketched after the screenshots).

[Image: SDF_MON1_Setup_screen4]

[Image: SDF_MON1_Setup_screen5]
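For anyone who would rather script the abstraction than build it in Excel, here is a minimal sketch of one reading of the scheme: take the observed min-max range of a metric, split it into four equal bands, and give each band a fixed score. The band scores 2, 4, 6 and 8 are my interpretation of the 2-4 / 4-6 / 6-8 / 8-10 labels above, and for metrics where a low value is the bad case (CPU Idle, Free WPs) the direction would need inverting first.

```python
import numpy as np
import pandas as pd

def quartile_scores(series: pd.Series) -> pd.Series:
    """Abstract a raw metric onto the common good-to-bad scale.

    One reading of the scheme in the post: split the observed min-max
    range into four equal bands and assign each band a fixed score
    (2 = good, 8 = bad). Metrics where low values are bad would need
    the direction inverted before scoring.
    """
    lo, hi = series.min(), series.max()
    # Internal band edges at 25%, 50% and 75% of the observed range.
    edges = [lo + (hi - lo) * q for q in (0.25, 0.5, 0.75)]
    bands = np.digitize(series, edges)  # 0..3 = 1st..4th quartile
    return pd.Series(np.array([2, 4, 6, 8])[bands], index=series.index)
```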

Using Excel's ability to filter data, shown below, I was able to add a column beside each metric and quickly mass-populate it with the correct quartile value. For example, if the Used Memory on the system was 40954821, the value assigned to that entry is 4, as it sits in the 2nd quartile, between 28272242 and 56544484 (this example is reproduced in the sketch after the screenshots).

[Images: SDF_MON1_excel_filter1, SDF_MON1_excel_filter2]
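Continuing the sketch above, the worked example can be reproduced directly. Only the two boundaries 28272242 and 56544484 are quoted in the text; the minimum and maximum readings below are hypothetical values chosen to produce those boundaries.

```python
import pandas as pd

# Hypothetical Used Memory readings; only the quartile boundaries
# 28272242 and 56544484 are quoted in the post.
used_memory = pd.Series([0, 40954821, 60000000, 113088968])
print(quartile_scores(used_memory).tolist())  # [2, 4, 6, 8] - 40954821 scores 4
```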

The table below shows how the abstraction produces values which are now on a scale from good (2) to bad (>8); this makes graphing and visualising much easier.

[Image: SDF_MON1_excel_output1]

The data cleansing and abstraction was by far the longest part of this piece of work, mostly because I was exploring new areas of data analysis and also because I had to be very clear about what I wanted to visualise. The visualisations were done pretty quickly, as I had consulted with some friends on the best way to visualise the data.

My initial desire was to use SAP Lumira and have an animation showing the peaks and troughs of the data in a visually stunning way, but this was not to be: Lumira does not support time hierarchies as detailed as seconds, and the desktop version is limited to 10,000 data points. So, in order to get the data visualised quickly, I just used Excel line graphs (shown below, with a scripted equivalent sketched after the graph), although these are not dynamic and are quite noisy.

[Image: SDF_MON1_excel_graph1]
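As a rough equivalent of the Excel charts, the sketch below continues the earlier ones: every grouped metric is converted to its quartile score, the scores are summed per group (summing rather than averaging is my assumption), and the three series are plotted over time.

```python
import matplotlib.pyplot as plt

# Score every metric that appears in a group, then aggregate per group.
all_metrics = sorted({m for names in METRIC_GROUPS.values() for m in names})
scores = df[all_metrics].apply(quartile_scores)

for group, names in METRIC_GROUPS.items():
    df[group] = scores[names].sum(axis=1)

# Three aggregated lines over time, much like the Excel graph above.
ax = df.plot(x="Time", y=["User", "Server", "Application"])
ax.set_ylabel("Aggregated score (higher = worse)")
plt.show()
```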

From the graph above you can see very easily that Application performance is generally good and Server performance is stable, but the User performance experience is the worst of the three. This would then direct me to look at the components which are aggregated to produce the User totals and determine where the issue is located.

This analysis, although it has a number of steps, has yielded specific, easily understood data in a relatively short space of time. It has certainly given me more insight than trying to consolidate ST03N data into a meaningful data set.

The next stage of the analysis is to use a dynamic graphing tool such as the D3 JavaScript library (Data-Driven Documents) with a Google Spreadsheet JSON feed. This would enable me to plug any abstracted performance dataset into a Google Spreadsheet, connect the JSON feed to a D3 application and have a beautiful, dynamic visualisation which I can pinch and zoom in and out of - but that is another blog for another day.

8 Comments
willi_eimler
Contributor
0 Kudos

Hi Chris,

good article and good idea! MacGyver would be proud of you :wink: .

Very easy and effective way!

Best regards

Willi Eimler

boobboo
Contributor
0 Kudos

Thanks, it was a lot of fun to do and I learnt a lot along the way about the lack of proper analysis of server performance and load capacity within SAP environments.

It is quite scary, and probably the subject of another series of blogs

Thanks

Chris

Former Member
0 Kudos

Hi Chris,

very good and helpful idea!

But I'm a little bit confused; I don't understand how you abstract the data so that you can represent it all in the same graphic.

You said:

To do this I found the difference between the minimum and maximum values, then divided this into quartiles (2-4, 4-6, 6-8, 8-10).

Where do you find the minimum and maximum values for each piece of data? And the minimum and maximum of what? Could you please help me understand better?

Thank you in advance!

Regards,

Mark

boobboo
Contributor
0 Kudos

Mark,

One of my biggest annoyances has been that most performance measures use different scales, so there is a built-in bias and a need for knowledgeable people to interpret the data (not saying they are not needed, though). Because of the different scales, for example jumping from milliseconds to seconds when looking at transaction performance in ST03N, it is very difficult to visualise the data in an effective way.

By abstracting the data to a common scale, good to bad (2 - 10), I am able to measure each metric objectively. So, in answer to your question: for each measure produced by /SDF/MON, I used Excel to find the min and max values in the range and found the difference between them. This gave the true range of values, which I then divided into quartiles, giving the true quarters of the range - not from 0, but from the actual measures captured in the system. Once I have the quartiles, each range is given a value from 2 (good) to 10 (bad); this abstracts the captured value to a common value which can easily be visualised.

It is not a perfect process, and like all aggregations it sometimes loses the nuances and context of things, but as a guide to where to look for performance problems I have found it to be very accurate in directing my efforts. I have used it a number of times on live performance data.

Hope this helps

Chris

Former Member
0 Kudos

Hi Chris,

thank you for your answer.

From this view you can quickly identify which category of data is more "bad" than the others but, to my understanding, you cannot say whether a system in general has good or bad performance, because the four categories (from good to bad) are based on the range of the data extracted. This means that if a system has huge performance problems, this view doesn't highlight it, right? (This is not to point out what your method cannot do, but only to understand the best way to use it!)

Thank you

Regards,

Mark

boobboo
Contributor
0 Kudos

Mark,

Agreed: if the system is struggling in a lot of areas and constantly has poor performance, then there are much better ways to attack the issue, and they are pretty simple - not painless, but it is usually pretty easy to recommend a course of action. For example, constant high I/O points to the storage subsystems, so there is little point in looking at the RAM of the server, etc.

This example is directed at the intermittent issue, which is far more common, and it will draw out those figures. As I said in my earlier comment, just because I am aggregating the data does not mean there should be no understanding of it. In fact, it has reaffirmed for me the idea that, in order to produce truly meaningful insights into data, you should know what the data is and what it is used for. As a technical guy, I understand the metrics and the data that is captured, so I can see when things are obviously bad in specific areas or across the board.

My advice is: collect the data, as much data as you can without affecting server performance, and put it into a tool, a spreadsheet or a database. Look at it to see if anything obvious jumps out at you; if not, then apply the method discussed in the blog post.

Thanks

Chris

martin_E
Active Contributor
0 Kudos

Hi Chris,

Excellent piece of work  :smile:


Mark Galanda wrote:

"From this view you can quickly identify which category of data is more 'bad' than the others but, to my understanding, you cannot say whether a system in general has good or bad performance, because the four categories (from good to bad) are based on the range of the data extracted."
Based on the description, I think the situation Chris describes was one of those "everything starts going slow and we don't know why" situations, compounded by a lack of time to respond and a lack of reporting data and tools. A quick response would have been "buy more hardware", but we don't run the IT department, the business does, and we owe it to them to provide more than an educated guess. For example, I don't know what the final solution was, but it is possible (based on the findings) that the answer was just changing the WP mix - not even requiring an outage, compared with the disruption caused by replacing or adding servers (virtual or otherwise).

In short, performance is as much a perception issue as it is one of resource management. You can have a poorly tuned SAP system that the end user is happy with (for example, excessive physical resources mask poor parameters, or even just that they don't know better). Against that, knowing when you've reached the point of diminishing returns from existing resources (i.e. the disruption to your data collection and your users is not worth the possible return from more fiddling) is part of the art of tuning.

Just as importantly, providing better information on the causes of performance issues gives the business a better perception of your abilities, including whether they will consider you "just a technician" or a trusted part of the business.

hth

boobboo
Contributor
0 Kudos

Martin,

Great clarification 🙂

Thanks

Chris
