K-Means Outlier Detection Algorithm ABAP Implement...

former_member260034 · ‎02-03-2022

Introduction:

Machine learning algorithms such as linear regression, logistic regression, and decision trees are very popular topics today. This article focuses on anomaly detection with the k-means algorithm by identifying outliers.

The k-means algorithm can be used in a wide range of scenarios, from image compression to system monitoring. Identifying anomalies in daily CPU utilization trends is one sample use case. Traditional system monitoring applications use fixed threshold values to identify a bottleneck in the system, but for today's requirements, monitoring utilization against fixed thresholds is not enough. A monitoring system should learn resource utilization trends with ML algorithms and raise an alert when an anomaly occurs.

K-means is a suitable solution for detecting anomalies in large data sets. Similarity analysis is one way to detect anomalies in a dataset. In the example, two groups of cars form two different clusters. Each cluster has its own members in its immediate vicinity, that is, a group of closely similar elements. An element that is close to the center of a cluster becomes a member of that group. An element far from the center indicates an anomaly for that cluster. Elements that lie outside the clusters, with low similarity to any of them, are evaluated as anomalies.

The k-means algorithm uses the Euclidean distance to measure how far a tuple is from a centroid. The calculation is repeated until the cluster assignment of each tuple is stable. Figure 1 shows a visual sample of k-means:

Figure 1
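
For reference, the Euclidean distance between a tuple x and a centroid c, each with m attributes, is usually written as shown below. The symbols x, c, m and d are chosen here for illustration and do not appear in the class itself.

```latex
d(x, c) = \sqrt{\sum_{j=1}^{m} \left( x_j - c_j \right)^2}
```

With a single attribute, as in the test program later in this article, this reduces to the absolute difference between the value and the centroid.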

The k-means algorithm performs the following steps:

1- Assign each tuple to a randomly selected cluster

2- Compute the centroid of each cluster

3- Loop until there is no improvement or you reach the maxcount variable

4-    Assign each tuple to the most appropriate cluster

5-    Update cluster centroid

6- End loop

7- Return cluster information
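
The steps above can be sketched as a small standalone report. This is a minimal illustration under stated assumptions, not the class presented in this article: it handles a single numeric attribute, uses local types instead of dictionary objects, assigns the initial clusters round robin instead of randomly so that runs are repeatable, and uses the plain cluster mean as the center. The report name and all type and variable names are hypothetical.

```abap
REPORT zkmeans_steps_sketch.

" Illustrative only: one numeric attribute per tuple, no dictionary types.
TYPES: BEGIN OF ty_tuple,
         value   TYPE f,
         cluster TYPE i,   " 1-based cluster index
       END OF ty_tuple,
       tt_tuples TYPE STANDARD TABLE OF ty_tuple WITH EMPTY KEY,
       BEGIN OF ty_cluster,
         total   TYPE f,   " sum of the member values
         members TYPE i,   " number of members
         center  TYPE f,   " mean of the member values
       END OF ty_cluster,
       tt_clusters TYPE STANDARD TABLE OF ty_cluster WITH EMPTY KEY.

CONSTANTS: c_numclusters TYPE i VALUE 2,
           c_maxcount    TYPE i VALUE 30.

DATA: tuples   TYPE tt_tuples,
      clusters TYPE tt_clusters,
      changed  TYPE abap_bool VALUE abap_true,
      ct       TYPE i,
      best     TYPE i,
      dist     TYPE f,
      mindist  TYPE f.

FIELD-SYMBOLS: <tuple>   TYPE ty_tuple,
               <cluster> TYPE ty_cluster.

tuples = VALUE #( ( value = 100 ) ( value = 99 ) ( value = 98 )
                  ( value = 70 )  ( value = 95 ) ( value = 77 ) ).

" Step 1: initial assignment (round robin here, random in the article)
LOOP AT tuples ASSIGNING <tuple>.
  <tuple>-cluster = ( sy-tabix - 1 ) MOD c_numclusters + 1.
ENDLOOP.

" Step 3: loop until nothing changes or the iteration cap is reached
WHILE changed = abap_true AND ct < c_maxcount.
  ct = ct + 1.
  changed = abap_false.

  " Steps 2 and 5: recompute the center of each cluster
  CLEAR clusters.
  DO c_numclusters TIMES.
    APPEND INITIAL LINE TO clusters.
  ENDDO.
  LOOP AT tuples ASSIGNING <tuple>.
    ASSIGN clusters[ <tuple>-cluster ] TO <cluster>.
    <cluster>-total   = <cluster>-total + <tuple>-value.
    <cluster>-members = <cluster>-members + 1.
  ENDLOOP.
  LOOP AT clusters ASSIGNING <cluster> WHERE members > 0.
    <cluster>-center = <cluster>-total / <cluster>-members.
  ENDLOOP.

  " Step 4: move each tuple to the cluster with the nearest center
  LOOP AT tuples ASSIGNING <tuple>.
    best    = <tuple>-cluster.
    mindist = abs( <tuple>-value - clusters[ best ]-center ).
    LOOP AT clusters ASSIGNING <cluster> WHERE members > 0.
      dist = abs( <tuple>-value - <cluster>-center ).
      IF dist < mindist.
        mindist = dist.
        best    = sy-tabix.   " row index of the closer cluster
      ENDIF.
    ENDLOOP.
    IF best <> <tuple>-cluster.
      <tuple>-cluster = best.
      changed = abap_true.
    ENDIF.
  ENDLOOP.
ENDWHILE.

" Step 7: return (here: print) the cluster of each tuple
LOOP AT tuples ASSIGNING <tuple>.
  WRITE: / <tuple>-value, <tuple>-cluster.
ENDLOOP.
```

The sketch recomputes the centers at the top of every pass, which covers both the initial computation (step 2) and the update after each reassignment (step 5).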

HANA has its own k-means implementation in the PAL library. Any application running on a HANA system can use this algorithm easily.

This article focuses on an ABAP implementation of k-means outlier detection that does not use the HANA PAL library. The class has two main public methods: “cluster” and “outlier”. The attribute values passed to the cluster method are provided and assigned dynamically, so a dataset can have as many attributes as the problem needs.

Dictionary objects:

Before coding, data types need to be defined at ABAP dictionary level.

Table types:

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

Structures:

Figure 9

Figure 10

Figure 11

Figure 12

Figure 13

Test Program:

The dictionary objects shown in figures 2 to 13 must be created manually. The test program is listed below:

REPORT zkmeans_test.



TYPES: tt_atts    TYPE STANDARD TABLE OF string WITH EMPTY KEY,

       tt_rawdata TYPE TABLE OF zks_col_kmeans_rawdata WITH EMPTY KEY.



DATA: clustering    TYPE REF TO zks_col_kmeans_clustering_tt,

      outlier       TYPE REF TO zks_col_kmeans_outlier_tt,

      numattributes TYPE i,

      numclusters   TYPE i,

      maxcount      TYPE i,

      lr_kmeans     TYPE REF TO zcl_kmeans,

      dataset       TYPE tt_rawdata.



DATA(attributes) = VALUE tt_atts(

                    ( `SampleAttribute1` )

                   ).



dataset = VALUE #( ( table = VALUE #( ( value = 100 ) ) )

                   ( table = VALUE #( ( value = 99 ) ) )

                   ( table = VALUE #( ( value = 98 ) ) )

                   ( table = VALUE #( ( value = 70 ) ) )

                   ( table = VALUE #( ( value = 95 ) ) )

                   ( table = VALUE #( ( value = 77 ) ) )

                   ( table = VALUE #( ( value = 100 ) ) )

                   ( table = VALUE #( ( value = 97 ) ) )

                   ( table = VALUE #( ( value = 99 ) ) )

                  ).



numattributes = lines( attributes ).

numclusters = 2.

maxcount = 30.



CREATE OBJECT lr_kmeans.



clustering = lr_kmeans->cluster( 

                      rawdata = dataset

                      numclusters = numclusters

                      numattributes = numattributes

                      maxcount = maxcount ).



outlier = lr_kmeans->outlier( 

                      rawdata = dataset 

                      clustering = clustering 

                      numclusters = numclusters 

                      cluster = 0 ).

In the example above, a single attribute with a few data points is passed to the cluster method. The test program uses only two clusters and inspects a single one (cluster = 0) to keep it easy to understand; these values can be increased as requirements dictate. The max count value is set to 30 in the test program. It caps the number of iterations: a higher value gives the algorithm more room to converge, and the loop ends earlier once no tuple changes its cluster. To identify the outlier values of a cluster, the outlier method is called after the cluster method. Once the outlier method has identified those values, they can be dropped from the provided dataset to filter out noise.
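
As a sketch of that last step, the lines below could be appended to ZKMEANS_TEST to list the outliers and build a filtered copy of the dataset. They rely on the variables of the test program above and assume that the line type behind ZKS_COL_KMEANS_OUTLIER_TT has the same TABLE component as the raw data rows, which the outlier method suggests but the figures are needed to confirm. The names ls_outlier, ls_attribute and filtered are hypothetical.

```abap
" Continuation of ZKMEANS_TEST (sketch, not part of the original program).

" Print every attribute value of every tuple reported as an outlier.
LOOP AT outlier->* INTO DATA(ls_outlier).
  LOOP AT ls_outlier-table INTO DATA(ls_attribute).
    WRITE: / 'Outlier value:', ls_attribute-value.
  ENDLOOP.
ENDLOOP.

" Work on a copy so that the original dataset stays available.
DATA(filtered) = dataset.

" Remove every row whose attribute table equals that of an outlier.
" Rows with identical values are removed together.
LOOP AT outlier->* INTO ls_outlier.
  DELETE filtered WHERE table = ls_outlier-table.
ENDLOOP.

WRITE: / 'Rows kept:', lines( filtered ).
```

Filtering on a copy keeps the original measurements intact, which is useful when the clustering is repeated with other parameters.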

Class Implementation:

The ABAP class ZCL_KMEANS has two public methods, cluster and outlier. The other methods are utility functions that help execute the k-means algorithm. The method signatures and implementations are listed below with their descriptions:

Name	CLUSTER
Description	Assign cluster node id to items in dataset
Level	Instance Method
Visibility	Public

Parameter	Type	Pass By Value	Typing Method	Associated Type
RAWDATA	Importing		Type	ZKS_COL_KMEANS_RAWDATA_TT
NUMCLUSTERS	Importing		Type	INT4
NUMATTRIBUTES	Importing		Type	INT4
MAXCOUNT	Importing		Type	INT4
CLUSTERING	Returning	X	Type Ref To	ZKS_COL_KMEANS_CLUSTERING_TT

  METHOD cluster.



    DATA: changed   TYPE boolean VALUE abap_true,

          ct        TYPE i VALUE 0,

          numtuples TYPE i VALUE 0,

          means     TYPE REF TO zks_col_kmeans_means_tt,

          centroids TYPE REF TO zks_col_kmeans_centroids_tt.



    FIELD-SYMBOLS: <lr_rawdata> TYPE zks_col_kmeans_rawdata_tt.



    numtuples = lines( rawdata ).



    clustering = initclustering( numtuples = numtuples

                              numclusters = numclusters

                              randomseed = 0 ).



    means = allocatemeans( numclusters = numclusters

                        numattributes = numattributes ).



    centroids = allocatecentroids( numclusters = numclusters

                               numattributes = numattributes ).



    ASSIGN rawdata TO <lr_rawdata>.



    updatemeans( rawdata = <lr_rawdata>

                clustering = clustering

                means = means ).



    updatecentroids( rawdata = rawdata

                      clustering = clustering

                      means = means 

                      centroids = centroids ).



    WHILE changed = abap_true AND ct < maxcount.

      ct = ct + 1.

      changed = assign( rawdata = rawdata

                        clustering = clustering

                        centroids = centroids ).



      updatemeans( rawdata = rawdata

                   clustering = clustering

                   means = means ).



      updatecentroids( rawdata = rawdata

                       clustering = clustering

                       means = means

                       centroids = centroids ).

    ENDWHILE.



  ENDMETHOD.

Name	OUTLIER
Description	Identify and return anomaly data in provided dataset
Level	Instance Method
Visibility	Public

Parameter	Type	Pass By Value	Typing Method	Associated Type
RAWDATA	Importing		Type	ZKS_COL_KMEANS_RAWDATA_TT
CLUSTERING	Importing		Type Ref To	ZKS_COL_KMEANS_CLUSTERING_TT
NUMCLUSTERS	Importing		Type	INT4
CLUSTER	Importing		Type	INT4
OUTLIER	Returning	X	Type Ref To	ZKS_COL_KMEANS_OUTLIER_TT

  METHOD outlier.



    DATA: numattributes   TYPE i VALUE 0,

          means           TYPE REF TO zks_col_kmeans_means_tt,

          centroids       TYPE REF TO zks_col_kmeans_centroids_tt,

          total_dist      TYPE i,

          rawdatalength   TYPE i,

          i               TYPE i VALUE 1,

          c               TYPE i,

          dist            TYPE i,

          wa_outlier      LIKE LINE OF outlier->*.



    numattributes = lines( rawdata[ 1 ]-table ).



    FIELD-SYMBOLS: <lr_rawdata> TYPE zks_col_kmeans_rawdata_tt.



    CREATE DATA outlier.



    means = allocatemeans( numclusters = numclusters

                        numattributes = numattributes ).



    centroids = allocatecentroids( numclusters = numclusters

                               numattributes = numattributes ).



    ASSIGN rawdata TO <lr_rawdata>.



    updatemeans( rawdata = <lr_rawdata>

                clustering = clustering

                means = means ).



    updatecentroids( rawdata = rawdata

                      clustering = clustering

                      means = means

                      centroids = centroids ).



    rawdatalength = lines( rawdata ) + 1.



    WHILE i NE rawdatalength.

      c = clustering->*[ i ]-id.

      IF c EQ cluster.

        dist = distance( tuple = rawdata[ i ]-table vector =

                       centroids->*[ cluster + 1 ]-table ).

        total_dist = total_dist + dist.

      ENDIF.

      i = i + 1.

    ENDWHILE.



     total_dist = total_dist / rawDataLength.



    i = 1.

    WHILE i NE rawdatalength.

      c = clustering->*[ i ]-id.

      IF c EQ cluster.

        dist = distance( tuple = rawdata[ i ]-table vector =

                       centroids->*[ cluster + 1 ]-table ).

        IF dist > total_dist.

          wa_outlier-table = rawdata[ i ]-table.

          APPEND wa_outlier TO outlier->*.

        ENDIF.

      ENDIF.

      i = i + 1.

    ENDWHILE.



  ENDMETHOD.
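
Read as a formula, the rule implemented above looks as follows. Here n is the number of tuples in the dataset, C is the set of tuples assigned to the inspected cluster, c is that cluster's centroid and d is the DISTANCE method; these symbols are introduced only for illustration. The divisor is n + 1 because the code divides by rawdatalength, which is set to lines( rawdata ) + 1. All values are held in integer variables, so the results are rounded.

```latex
\bar{d} = \frac{1}{n + 1} \sum_{i \in C} d(x_i, c),
\qquad
x_i \text{ is reported as an outlier} \iff i \in C \ \text{and}\ d(x_i, c) > \bar{d}
```

A reader comment at the end of this post proposes dividing by the number of tuples in the cluster, |C|, instead. That variant would make the threshold the mean distance of the cluster members to their centroid.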

Name	INITCLUSTERING
Description	Create and initialize the clustering internal table: each entry receives a randomized cluster id and the value "0" in its data field. Every row of the dataset is assigned to a cluster id through this table
Level	Instance Method
Visibility	Private

Parameter	Type	Pass By Value	Typing Method	Associated Type
NUMTUPLES	Importing		Type	INT4
NUMCLUSTERS	Importing		Type	INT4
RANDOMSEED	Importing		Type	INT4
CLUSTERING	Returning	X	Type Ref To	ZKS_COL_KMEANS_CLUSTERING_TT

  METHOD initclustering.

    FIELD-SYMBOLS: <clustering> TYPE zks_col_kmeans_clustering_tt.



    DATA: lv_random        TYPE REF TO cl_abap_random_int,

          lv_wa_clustering LIKE LINE OF <clustering>,

          lv_counter       TYPE i.



    lv_random = cl_abap_random_int=>create( seed = CONV i( sy-uzeit )

                                      min  = 0

                                      max = numclusters - 1 ).



    CREATE DATA clustering.

    ASSIGN clustering->* TO <clustering>.



    lv_counter = 0.

    WHILE lv_counter NE numclusters.

      lv_wa_clustering-id = lv_counter.

      APPEND lv_wa_clustering TO <clustering>.

      lv_counter = lv_counter + 1.

    ENDWHILE.



    WHILE lv_counter LE numtuples - 1.

      lv_wa_clustering-id = lv_random->get_next( ).

      APPEND lv_wa_clustering TO <clustering>.

      lv_counter = lv_counter + 1.

    ENDWHILE.

  ENDMETHOD.

Name	ALLOCATECENTROIDS
Description	Create and initialize centroid value internal table
Level	Instance Method
Visibility	Private

Parameter	Type	Pass By Value	Typing Method	Associated Type
NUMATTRIBUTES	Importing		Type	INT4
NUMCLUSTERS	Importing		Type	INT4
CENTROIDS	Returning	X	Type Ref To	ZKS_COL_KMEANS_CENTROIDS_TT

  METHOD allocatecentroids.



    DATA: lv_wa_centroids      LIKE LINE OF centroids->*,

          lr_attributes        TYPE REF TO zks_col_kmeans_attr_table_tt,

          wa_attributes        LIKE LINE OF lr_attributes->*,

          lv_cluster_counter   TYPE i VALUE 0,

          lv_attribute_counter TYPE i VALUE 0.



    CREATE DATA centroids.



    WHILE lv_cluster_counter NE numclusters.



      CREATE DATA lr_attributes.

      WHILE lv_attribute_counter NE numattributes.

        APPEND wa_attributes TO lr_attributes->*.

        lv_attribute_counter = lv_attribute_counter + 1.

      ENDWHILE.

      MOVE lr_attributes->* TO lv_wa_centroids-table.



      APPEND lv_wa_centroids TO centroids->*.

      lv_cluster_counter = lv_cluster_counter + 1.

      lv_attribute_counter = 0.

    ENDWHILE.



  ENDMETHOD.

Name	ALLOCATEMEANS
Description	Create and initialize mean value internal table
Level	Instance Method
Visibility	Private

Parameter	Type	Pass By Value	Typing Method	Associated Type
NUMATTRIBUTES	Importing		Type	INT4
NUMCLUSTERS	Importing		Type	INT4
MEANS	Returning	X	Type Ref To	ZKS_COL_KMEANS_MEANS_TT

  METHOD allocatemeans.

    DATA: lv_wa_means          LIKE LINE OF means->*,

          lr_attributes        TYPE REF TO zks_col_kmeans_attr_table_tt,

          wa_attributes        LIKE LINE OF lr_attributes->*,

          lv_cluster_counter   TYPE i VALUE 0,

          lv_attribute_counter TYPE i VALUE 0.



    CREATE DATA means.



    WHILE lv_cluster_counter NE numclusters.



      CREATE DATA lr_attributes.

      WHILE lv_attribute_counter NE numattributes.

        APPEND wa_attributes TO lr_attributes->*.

        lv_attribute_counter = lv_attribute_counter + 1.

      ENDWHILE.

      MOVE lr_attributes->* TO lv_wa_means-table.



      APPEND lv_wa_means TO means->*.

      lv_cluster_counter = lv_cluster_counter + 1.

      lv_attribute_counter = 0.

    ENDWHILE.



  ENDMETHOD.

Name	UPDATEMEANS
Description	Update mean values
Level	Instance Method
Visibility	Private

Parameter	Type	Pass By Value	Typing Method	Associated Type
RAWDATA	Importing		Type	ZKS_COL_KMEANS_RAWDATA_TT
CLUSTERING	Importing		Type Ref To	ZKS_COL_KMEANS_CLUSTERING_TT
MEANS	Importing	X	Type Ref To	ZKS_COL_KMEANS_MEANS_TT

  METHOD updatemeans.



    DATA: numclusters   TYPE i,

          wa_means      LIKE LINE OF means->*,

          lr_attributes TYPE REF TO zks_col_kmeans_attr_table_tt,

          wa_attributes LIKE LINE OF lr_attributes->*.



    FIELD-SYMBOLS: <fs> TYPE zks_col_kmeans_attr_table_tt.



    numclusters = lines( means->* ).



    LOOP AT means->* INTO wa_means.

      ASSIGN wa_means-table TO <fs>.



      LOOP AT <fs> INTO wa_attributes.

        wa_attributes-value = 0.

        MODIFY <fs> FROM wa_attributes.

      ENDLOOP.



      MODIFY means->* FROM wa_means.

    ENDLOOP.



    DATA: clustercounts    TYPE STANDARD TABLE OF i WITH EMPTY KEY,

          countnumclusters TYPE i,

          rawdatalength    TYPE i VALUE 0,

          i                TYPE i VALUE 1, " cluster array index start with 1

          j                TYPE i VALUE 1, " means array index start with 1

          k                TYPE i VALUE 1, " means array index start with 1

          l                TYPE i VALUE 1, " means array index start with 1

          cluster          TYPE i,

          meanslength      TYPE i VALUE 0,

          meansattributeslength   TYPE i VALUE 0,

          rawdataattributeslength TYPE i VALUE 0.



    WHILE numclusters NE countnumclusters.

      APPEND 0 TO clustercounts.

      countnumclusters = countnumclusters + 1.

    ENDWHILE.



    rawdatalength = lines( rawdata ) + 1.



    WHILE i NE rawdatalength.

      cluster = clustering->*[ i ]-id + 1.

      clustercounts[ cluster ] = clustercounts[ cluster ] + 1.



      j = 1.

      rawdataattributeslength = lines( rawdata[ i ]-table ) + 1.

      WHILE j NE rawdataattributeslength.

        means->*[ cluster ]-table[ j ]-value =

           means->*[ cluster ]-table[ j ]-value +

           rawdata[ i ]-table[ j ]-value.

        j = j + 1.

      ENDWHILE.



      i = i + 1.

    ENDWHILE.



    meanslength = lines( means->* ) + 1.



    WHILE k NE meanslength.

      meansattributeslength = lines( means->*[ k ]-table ) + 1.



      l = 1.

      WHILE l NE meansattributeslength.

        means->*[ k ]-table[ l ]-value =

          means->*[ k ]-table[ l ]-value / clustercounts[ k ].

        l = l + 1.

      ENDWHILE.



      k = k + 1.

    ENDWHILE.



  ENDMETHOD.

Name	UPDATECENTROIDS
Description	Update centroid values
Level	Instance Method
Visibility	Private

Parameter	Type	Pass By Value	Typing Method	Associated Type
RAWDATA	Importing		Type	ZKS_COL_KMEANS_RAWDATA_TT
CLUSTERING	Importing		Type Ref To	ZKS_COL_KMEANS_CLUSTERING_TT
MEANS	Importing		Type Ref To	ZKS_COL_KMEANS_MEANS_TT
CENTROIDS	Importing		Type Ref To	ZKS_COL_KMEANS_CENTROIDS_TT

  METHOD updatecentroids.



    DATA: centroidslenght TYPE i VALUE 0,

          k               TYPE i VALUE 1, " means array index start with 1

          centroid        TYPE REF TO zks_col_kmeans_centroids_tt.



    centroidslenght = lines( centroids->* ) + 1.



    WHILE k NE centroidslenght.

      centroid = computecentroid( rawdata = rawdata

                                  clustering = clustering

                                  cluster = k - 1

                                  means = means ).

      centroids->*[ k ]-table = centroid->*[ 1 ]-table.

      k = k + 1.

    ENDWHILE.



  ENDMETHOD.

Name	ASSIGN
Description	Assign each tuple to the cluster with the closest centroid and report whether any assignment changed
Level	Instance Method
Visibility	Private

Parameter	Type	Pass By Value	Typing Method	Associated Type
RAWDATA	Importing		Type	ZKS_COL_KMEANS_RAWDATA_TT
CLUSTERING	Importing		Type Ref To	ZKS_COL_KMEANS_CLUSTERING_TT
CENTROIDS	Importing		Type Ref To	ZKS_COL_KMEANS_CENTROIDS_TT
CHANGED	Returning	X	Type	ABAP_BOOL

  METHOD assign.



    DATA: numclusters      TYPE i VALUE 0,

          distances        TYPE zks_col_kmeans_distances_tt,

          wa_distances     LIKE LINE OF distances,

          distancescounter TYPE i VALUE 0,

          rawdatalength    TYPE i VALUE 0,

          i                TYPE i VALUE 1,

          k                TYPE i VALUE 1,

          newcluster       TYPE i VALUE 0.



    numclusters = lines( centroids->* ).



    WHILE distancescounter NE numclusters.

      APPEND wa_distances TO distances.

      distancescounter = distancescounter + 1.

    ENDWHILE.



    rawdatalength = lines( rawdata ) + 1.

    WHILE i NE rawdatalength.

      k = 1.

      WHILE k NE numclusters + 1.

        distances[ k ] = distance( tuple = rawdata[ i ]-table

                                   vector = centroids->*[ k ]-table ).

        k = k + 1.

      ENDWHILE.



      newcluster = minindex( distances = distances ).

      IF newcluster <> clustering->*[ i ]-id.

        changed = abap_true.

        clustering->*[ i ]-id = newcluster.

      ENDIF. " else no change

      i = i + 1.

    ENDWHILE.



  ENDMETHOD.

Name	DISTANCE
Description	Measure distance from data to centroid by using Euclid calculation
Level	Instance Method
Visibility	Private

Parameter	Type	Pass By Value	Typing Method	Associated Type
TUPLE	Importing		Type	ZKS_COL_KMEANS_ATTR_TABLE_TT
VECTOR	Importing		Type	ZKS_COL_KMEANS_ATTR_TABLE_TT
RETVAL	Returning		Type	INT4

  METHOD distance.



    DATA: sumsquareddiffs TYPE i VALUE 0,

          tuplelength     TYPE i,

          j               TYPE i VALUE 1,

          power           TYPE i VALUE 0.



    tuplelength = lines( tuple ) + 1.



    WHILE j NE tuplelength.

      power = ipow( base = tuple[ j ]-value - vector[ j ]-value exp = 2 ).

      sumsquareddiffs = sumsquareddiffs + power.

      j = j + 1.

    ENDWHILE.

    retval = sqrt( sumsquareddiffs ).

  ENDMETHOD.
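
The method above keeps the sum of squares and the return value in integer fields, so the square root is rounded to a whole number when it is assigned. A minimal sketch of the same calculation in floating point is shown below. It assumes plain tables of type f instead of the dictionary types of this article, and the report name and all variable names are hypothetical.

```abap
REPORT zkmeans_distance_sketch.

" Illustrative only: attribute values are held in simple tables of floats.
TYPES tt_values TYPE STANDARD TABLE OF f WITH EMPTY KEY.

DATA: tuple  TYPE tt_values,
      vector TYPE tt_values,
      sum_sq TYPE f,
      diff   TYPE f,
      dist   TYPE f.

tuple  = VALUE #( ( 65 ) ( 220 ) ).   " a data row with two attributes
vector = VALUE #( ( 68 ) ( 224 ) ).   " a centroid with two attributes

" Sum the squared differences attribute by attribute.
LOOP AT tuple INTO DATA(tuple_value).
  diff   = tuple_value - vector[ sy-tabix ].
  sum_sq = sum_sq + diff * diff.
ENDLOOP.

" The square root of the sum is the Euclidean distance.
dist = sqrt( sum_sq ).

WRITE: / 'Distance:', dist.
```

Keeping the distance as a float matters mostly when attribute values are close together, because rounded distances can tie and hide which centroid is really the nearest.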

Name	MININDEX
Description	Cluster id of the cluster that has centroid closest to the tuple
Level	Instance Method
Visibility	Private

Parameter	Type	Pass By Value	Typing Method	Associated Type
DISTANCES	Importing		Type	ZKS_COL_KMEANS_DISTANCES_TT
NEWCLUSTER	Returning		Type	INT4

  METHOD minindex.

    DATA: indexofmin      TYPE i VALUE 0,

          smalldist       TYPE i VALUE 0,

          k               TYPE i VALUE 1,

          distanceslength TYPE i VALUE 0.



    distanceslength = lines( distances ) + 1.

    smalldist = distances[ 1 ].

    WHILE k NE distanceslength.

      IF distances[ k ] < smalldist.

        smalldist = distances[ k ].

        indexofmin = k - 1.

      ENDIF.

      k = k + 1.

    ENDWHILE.



    newcluster = indexofmin.



  ENDMETHOD.

Name	COMPUTECENTROID
Description	Determine centroid values
Level	Instance Method
Visibility	Private

Parameter	Type	Pass By Value	Typing Method	Associated Type
RAWDATA	Importing		Type	ZKS_COL_KMEANS_RAWDATA_TT
CLUSTERING	Importing		Type Ref To	ZKS_COL_KMEANS_CLUSTERING_TT
CLUSTER	Importing		Type	INT4
MEANS	Importing		Type Ref To	ZKS_COL_KMEANS_MEANS_TT
CENTROID	Returning	X	Type Ref To	ZKS_COL_KMEANS_CENTROIDS_TT

  METHOD computecentroid.



    DATA: numattributeslength TYPE i,

          rawdatalength       TYPE i,

          i                   TYPE i VALUE 1,

          j                   TYPE i VALUE 1,

          mindist             TYPE f VALUE '1.7976931348623157E+308',

          attributecounter    TYPE i,

          wa_centroid         LIKE LINE OF centroid->*,

          centroidlength      TYPE i,

          attributes          TYPE REF TO zks_col_kmeans_attr_table_tt,

          wa_attributes       LIKE LINE OF attributes->*.



    numattributeslength = lines( means->*[ 1 ]-table ).

    rawdatalength = lines( rawdata ) + 1.



    CREATE DATA attributes.

    WHILE attributecounter NE numattributeslength.

      APPEND wa_attributes TO attributes->*.

      attributecounter = attributecounter + 1.

    ENDWHILE.



    CREATE DATA centroid.

    wa_centroid-table = attributes->*.

    APPEND wa_centroid TO centroid->*.



    WHILE i NE rawdatalength.

      DATA: c TYPE i.

      c = clustering->*[ i ]-id.

      IF c EQ cluster.

        DATA currdist TYPE f.

        currdist = distance( tuple = rawdata[ i ]-table

                             vector = means->*[ cluster + 1 ]-table ).

        IF currdist < mindist.

          mindist = currdist.

          centroidlength = lines( centroid->*[ 1 ]-table ) + 1.

          j = 1.

          WHILE j NE centroidlength.

            centroid->*[ 1 ]-table[ j ]-value = 

               rawdata[ i ]-table[ j ]-value.

            j = j + 1.

          ENDWHILE.

        ENDIF.

      ENDIF.

      i = i + 1.

    ENDWHILE.



  ENDMETHOD.

Asymptotic Time Complexity:

The asymptotic time complexity of each method in the k-means class is listed below:

Method	Value
CLUSTER	n^2
INITCLUSTERING	n
ALLOCATECENTROIDS	n^2
ALLOCATEMEANS	n^2
UPDATEMEANS	n^2
UPDATECENTROIDS	n
ASSIGN	n^2
OUTLIER	n^2
DISTANCE	n
MININDEX	n
COMPUTECENTROID	n^2

Summary:

This is an ABAP implementation of the k-means algorithm. The code can certainly be optimized further, and functionality can be added, removed, or merged to fit the problem at hand.

The transport request files are available on GitHub. Import the transport request into your sandbox or development system.

Reference:

Data Clustering - Detecting Abnormal Data Using k-Means Clustering

About Author:

Please find my personal bio orkun.gedik4#about;

I have been working at KoçSistem as SAP Basis technology executive team leader and principal SAP technical architect since 2013. With 26 years of experience in ABAP development and 21 years in SAP Basis, I have worked during my consulting engagements on various technologies such as cloud, robotic programming, ML, mobile application development, application server programming, and many other fields and platforms. I have been studying for an MSc in the Computer Science department at Gazi University since 2021.

SAP Community Network topic leader @

2011/2012 - Databases and OS Platforms

2012/2013 - SAP On Oracle

pbaumann · ‎02-03-2022

Hi orkun.gedik4!

I think this is somehow really cool. Well written and structured.

Just to understand your intention: as you have written, PAL is available on HANA systems, and you should be able to use it here, e.g. via AMDP. Why write it in ABAP when ABAP on AnyDB is becoming less and less relevant?

abo · ‎02-03-2022

Cool! Have you considered using abapGit and publishing the code, for example, also on GitHub?

former_member260034 · ‎02-03-2022

Hi pbaumann08

First of all, I really appreciate your comments. Thank you so much.

I was expecting such a comment, and you are the first one. I have several intentions:

If I am not wrong, PAL needs to be licensed to be used officially. Someone may need only the k-means outlier algorithm without having a PAL license.
There are still many customers running on non-HANA environments. They can use this implementation if they need it.
It is open source. ABAP developers and people who are interested in the k-means algorithm can debug it and understand the calculations, for study purposes.
You are exactly right. Such calculations can be pushed down to the DB level as you mentioned, but as a Basis consultant I can say: do not underestimate the power of the application server. It can still do some good calculations.

I hope that it is clear. Thank you again.

Orkun Gedik

former_member260034 · ‎02-03-2022

Thank you abo

Very good idea! I will do that.

Best regards,

Orkun Gedik

abo · ‎02-03-2022

You're welcome. That way people can download the whole code, structures, test code, everything, it's developer nirvana basically 😄

pbaumann · ‎02-03-2022

You are right, licensing is sometimes a topic. Typically, as you are on the application layer, it should be safe if you stay there. But always ask your SAP sales contact first 🙂

Good intentions, thank you for your contribution!

Peter

nomssi · ‎02-06-2022

Hello Orkun,

thanks a lot for taking the time to share.

Note: I am assuming this definition is creating a CHAR and an INTEGER, while you want two integers.

DATA lv_cluster_counter, lv_attribute_counter  TYPE i VALUE 0.

best regards,

JNN

former_member260034 · ‎02-06-2022

Hello jacques.nomssi,

You are correct. I missed this point and have fixed it following your note.

Thank you and best regards,

Orkun Gedik

nomssi · ‎02-06-2022

Hello Orkun,

can you please add the definition of the KMEANS_CLUSTERING structure?

I am trying to test with the data from your reference

DATA(attributes) = VALUE tt_atts(

                    ( `Height` )

                    ( `Weight` )

                   ).

dataset = VALUE #( ( table = VALUE #( ( value = 65 ) ( value = 220 ) ) )

                   ( table = VALUE #( ( value = 73 ) ( value = 160 ) ) )

                   ( table = VALUE #( ( value = 59 ) ( value = 110 ) ) )

                   ( table = VALUE #( ( value = 61 ) ( value = 120 ) ) )

                   ( table = VALUE #( ( value = 75 ) ( value = 150 ) ) )

                   ( table = VALUE #( ( value = 67 ) ( value = 240 ) ) )

                   ( table = VALUE #( ( value = 68 ) ( value = 230 ) ) )

                   ( table = VALUE #( ( value = 70 ) ( value = 220 ) ) )

                   ( table = VALUE #( ( value = 76 ) ( value = 130 ) ) )

                   ( table = VALUE #( ( value = 66 ) ( value = 210 ) ) )

                   ( table = VALUE #( ( value = 77 ) ( value = 190 ) ) )

                   ( table = VALUE #( ( value = 75 ) ( value = 180 ) ) )

                   ( table = VALUE #( ( value = 74 ) ( value = 170 ) ) )

                   ( table = VALUE #( ( value = 70 ) ( value = 210 ) ) )

                   ( table = VALUE #( ( value = 61 ) ( value = 110 ) ) )

                   ( table = VALUE #( ( value = 58 ) ( value = 100 ) ) )

                   ( table = VALUE #( ( value = 66 ) ( value = 230 ) ) )

                   ( table = VALUE #( ( value = 59 ) ( value = 120 ) ) )

                   ( table = VALUE #( ( value = 68 ) ( value = 210 ) ) )

                   ( table = VALUE #( ( value = 61 ) ( value = 130 ) ) )

                  ).



numattributes = lines( attributes ).

numclusters = 3.

maxcount = 30.

but I have not figured out yet whether my code is correct. I think an output like in the demo would be useful. Could you please share the expected results for unit testing?

best regards,

JNN

former_member260034 · ‎02-06-2022

Hi jacques.nomssi,

Your code looks correct. For the values in your sample, the expected outlier result with 3 clusters and 30 iterations is listed below. You can optimize the outlier method or modify it to fit your needs.

( value = 65 ), ( value = 220 )

( value = 67 ), ( value = 240 )

( value = 68 ), ( value = 230 )

( value = 66 ), ( value = 210 )

( value = 70 ), ( value = 210 )

( value = 66 ), ( value = 230 )

( value = 68 ), ( value = 210 )

You can add as many fields as you need to your dataset, as you did. Additionally, you can import the transport requests from the GitHub link given under the Summary header of the article, which saves you from creating all the structures, the class, and the sample program manually.

Please feel free to ask, if you have further questions. Thank you.

Best regards,

Orkun Gedik

pbaumann · ‎02-06-2022

Maybe interesting regarding your points 2 and 3.

Typically, every SAP NetWeaver-based ABAP system includes a BW system, even if it is often not used. BW includes a Data Mining Workbench (Tx RSDMWB) and an Analysis Process Designer (Tx RSANWB), which offer some classical algorithms such as k-means for clustering.

If interested, just check if your system includes function group RS_DME_CLUSTERING_KMEANS or if transactions mentioned are working.

As this modeling is InfoObject-based, it may not be so easy to handle for non-BW people, but it may be interesting for developers to see how SAP implemented it.

former_member260034 · ‎02-06-2022

Sure, very good approach. But it is just a k-means implementation: RS_DME_CLUSTERING_KMEANS does not provide outliers. It can be enhanced, though.

Best regards,

Orkun Gedik

former_member185511 · ‎02-07-2022

Hi orkun.gedik4, great article and effort, first of all. I have used the PAL library for many scenarios, and it comes with the HANA runtime license; you just need to install it and can use it straight away. The only limitation is that implementing your own PAL procedures (with some tools) is only allowed under a developer license. For non-HANA environments I think the above is the easiest solution. I even did a POC deploying a Docker image to SAP Cloud Foundry for some statistical calculations via Python :D.
Of course anything in ABAP would be better as long as performance is not a concern.

nomssi · ‎02-09-2022

Thank you Orkun, I got it to work now.

I am probably missing something, but your outlier computation of a mean distance

  total_dist = total_dist / rawDataLength.

seems incorrect to me. Shouldn't it be

  total_dist = total_dist / ClusterDataLength.

With ClusterDataLength being the number of data points in the cluster?

best regards,

JNN

former_member260034 · ‎02-09-2022

Hi Nomssi,

total_dist = total_dist / rawDataLength is correct, because rawDataLength stores the total number of tuples. For example, in your example its value is 20.

But you can modify it and apply your solution as needed, as I mentioned.

Best regards,

Orkun Gedik