Hi,

This is the third post on this subject; if you are interested, you can take a look at the previous posts:

I'm going to test the performance of different techniques, first from a READ perspective and then from an INSERT/DELETE perspective. I'm using a developer version with no SAP modules, and I will use very simple internal tables for the tests.

I'm running SAP on a virtual machine and found the performance not very consistent, so I ran each test 100 times and will use the average time for the results.

The read techniques we are about to see are listed below (a minimal code sketch follows the list):

  • BINARY SEARCH - Both tables will be STANDARD TABLES, the head table needs to be sorted with a SORT statement, and I will be reading entries with BINARY SEARCH. This option can read only one entry, so my loop will go from the positions table to the head table.

  • LOOP WHERE SORTED - Positions table will be a SORTED TABLE; I'm going to access the table with a partial key to get the benefit of the WHERE optimization.

  • LOOP HASHED - Head table will be HASHED, so I will be looping over the positions and looking up the entry in the head table. The reason for this is that a hashed table has to be accessed with its full key, and from head to positions I would only have a partial key (I would be missing the position number).

  • LOOP WHERE SECONDARY SORTED - Positions table is a STANDARD TABLE, but it has a SECONDARY SORTED INDEX that I will be using for the loop.

  • LOOP SECONDARY HASHED - Head table is a STANDARD TABLE, but it has a SECONDARY HASHED INDEX that I will be using for the loop; the loop goes from positions to head as in the hashed table test.

  • PARALLEL CURSOR WITH READ - For this option I do a binary search to find the index of the first position that matches the head entry, and I start looping from that index until I find a different key. Both tables are STANDARD TABLES and the positions table needs to be sorted with a SORT statement.

  • PARALLEL CURSOR WITHOUT READ - For this option I will be using STANDARD TABLES; both need to be sorted with a SORT statement. Then I can assume the first head entry matches the first position entry, and that the positions index where I find a different key matches the next head entry, so instead of doing a READ TABLE for each head entry I just keep the last index read.
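
To make the descriptions above more concrete, here is a minimal sketch of these techniques in plain ABAP (the secondary key variants are sketched after the next list). This is my simplified illustration, not the exact YJRS_ABAP_003 code: the structures and the field names head_id / pos_id are assumptions.

" Simplified sketch, not the exact test code; field names are assumptions.
TYPES: BEGIN OF ty_head,
         head_id TYPE i,
       END OF ty_head,
       BEGIN OF ty_pos,
         head_id TYPE i,
         pos_id  TYPE i,
       END OF ty_pos.
DATA: lt_head_std    TYPE STANDARD TABLE OF ty_head WITH NON-UNIQUE KEY head_id,
      lt_head_hashed TYPE HASHED TABLE OF ty_head WITH UNIQUE KEY head_id,
      lt_pos_std     TYPE STANDARD TABLE OF ty_pos WITH NON-UNIQUE KEY head_id pos_id,
      lt_pos_sorted  TYPE SORTED TABLE OF ty_pos WITH UNIQUE KEY head_id pos_id.

" BINARY SEARCH: loop over the positions, read the head entry with BINARY SEARCH
" (the head table is sorted by key beforehand).
SORT lt_head_std BY head_id.
LOOP AT lt_pos_std ASSIGNING FIELD-SYMBOL(<ls_pos>).
  READ TABLE lt_head_std WITH KEY head_id = <ls_pos>-head_id
       BINARY SEARCH ASSIGNING FIELD-SYMBOL(<ls_head>).
ENDLOOP.

" LOOP WHERE SORTED: loop over the heads, nested loop on the sorted positions
" table with a partial key in the WHERE condition.
LOOP AT lt_head_std ASSIGNING <ls_head>.
  LOOP AT lt_pos_sorted ASSIGNING <ls_pos> WHERE head_id = <ls_head>-head_id.
    " ... process <ls_pos> ...
  ENDLOOP.
ENDLOOP.

" LOOP HASHED: loop over the positions, single read on the hashed head table
" using its full key.
LOOP AT lt_pos_std ASSIGNING <ls_pos>.
  READ TABLE lt_head_hashed WITH TABLE KEY head_id = <ls_pos>-head_id
       ASSIGNING <ls_head>.
ENDLOOP.

" PARALLEL CURSOR WITH READ: one BINARY SEARCH per head entry to find the first
" matching position, then loop by index until the key changes (positions sorted).
SORT lt_pos_std BY head_id pos_id.
LOOP AT lt_head_std ASSIGNING <ls_head>.
  READ TABLE lt_pos_std WITH KEY head_id = <ls_head>-head_id
       BINARY SEARCH TRANSPORTING NO FIELDS.
  IF sy-subrc = 0.
    LOOP AT lt_pos_std ASSIGNING <ls_pos> FROM sy-tabix.
      IF <ls_pos>-head_id <> <ls_head>-head_id.
        EXIT.
      ENDIF.
      " ... process <ls_pos> ...
    ENDLOOP.
  ENDIF.
ENDLOOP.

" PARALLEL CURSOR WITHOUT READ: no READ TABLE at all; remember the index where
" the previous head's positions ended and continue from there (both tables sorted).
DATA(lv_from) = 1.
LOOP AT lt_head_std ASSIGNING <ls_head>.
  LOOP AT lt_pos_std ASSIGNING <ls_pos> FROM lv_from.
    IF <ls_pos>-head_id <> <ls_head>-head_id.
      EXIT.
    ENDIF.
    lv_from = sy-tabix + 1.
    " ... process <ls_pos> ...
  ENDLOOP.
ENDLOOP.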


Some of the tests were already seen in the first post; the new options we are about to test are:

  • Added tests for hashed tables. I'm not going to explain why these tables are fast, because I don't know much about the algorithm; they are usually recommended for programs that mostly need to read data, and not so much when the program needs to insert/delete entries to/from the internal table.

  • Added tests to check the performance of STANDARD TABLES using secondary indices; we will see whether secondary indices perform as well as primary ones (see the declaration sketch after this list).

  • I got some information about parallel cursors and added tests for the better performing option, parallel cursor without read.
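
Here is the matching sketch for the secondary key variants, reusing the ty_head / ty_pos types from the sketch above. Again, the key and field names are assumptions; the real declarations are in YJRS_ABAP_003.

" Simplified sketch of the secondary key declarations and accesses; names are assumptions.
DATA: lt_pos_sec  TYPE STANDARD TABLE OF ty_pos
                  WITH NON-UNIQUE KEY head_id pos_id
                  WITH UNIQUE SORTED KEY sec_sorted COMPONENTS head_id pos_id,
      lt_head_sec TYPE STANDARD TABLE OF ty_head
                  WITH NON-UNIQUE KEY head_id
                  WITH UNIQUE HASHED KEY sec_hashed COMPONENTS head_id.

" LOOP WHERE SECONDARY SORTED: nested loop on the positions using the secondary
" sorted key with a partial key in the WHERE condition.
LOOP AT lt_head_sec ASSIGNING FIELD-SYMBOL(<ls_head_sec>).
  LOOP AT lt_pos_sec ASSIGNING FIELD-SYMBOL(<ls_pos_sec>)
       USING KEY sec_sorted
       WHERE head_id = <ls_head_sec>-head_id.
    " ... process <ls_pos_sec> ...
  ENDLOOP.
ENDLOOP.

" LOOP SECONDARY HASHED: loop over the positions, read the head entry by its
" full secondary hashed key.
LOOP AT lt_pos_sec ASSIGNING <ls_pos_sec>.
  READ TABLE lt_head_sec
       WITH TABLE KEY sec_hashed COMPONENTS head_id = <ls_pos_sec>-head_id
       ASSIGNING <ls_head_sec>.
ENDLOOP.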


READ TESTS


For the read tests I will not add the sort time to the BINARY SEARCH and PARALLEL CURSOR techniques, because sorted and hashed tables also have some performance overhead when they are populated.

You can see the exact code tested in the GitHub repository at the end of the post, program YJRS_ABAP_003.

First test

  • Seed: 1

  • Head entries: 50000

  • Positions per head: 1

  • Executions: 100


Results:

  • Total loops done: 50.000

  • STANDARD TABLES - SORT TIME - 29.922 ms


 

  • BINARY SEARCH - 47.903 ms

  • LOOP WHERE SORTED - 116.885 ms

  • LOOP HASHED - 31.899 ms

  • LOOP WHERE SECONDARY SORTED - 120.095 ms

  • LOOP SECONDARY HASHED - 31.572 ms

  • PARALLEL CURSOR WITH READ - 69.499 ms

  • PARALLEL CURSOR WITHOUT READ - 24.603 ms


For this test PARALLEL CURSOR WITHOUT READ is the best performing option, and the SORTED table and index are not performing well, not even close to BINARY SEARCH.

Second test

  • Seed: 1

  • Head entries: 50000

  • Positions per head: 5

  • Executions: 100


Results:

  • Total loops done: 250.000

  • STANDARD TABLES - SORT TIME - 72.149 ms


 

  • BINARY SEARCH - 207.195 ms

  • LOOP WHERE SORTED - 152.801 ms

  • LOOP HASHED - 125.800 ms

  • LOOP WHERE SECONDARY SORTED - 161.138 ms

  • LOOP SECONDARY HASHED - 126.957 ms

  • PARALLEL CURSOR WITH READ - 90.607 ms

  • PARALLEL CURSOR WITHOUT READ - 46.742 ms


Once again PARALLEL CURSOR WITHOUT READ performs best; the HASHED table and index are not as close to PARALLEL CURSOR as in the first test, and BINARY SEARCH starts to struggle.

Third test

  • Seed: 1

  • Head entries: 50000

  • Positions per head: Random number between 1 and 10

  • Executions: 100


Results:

  • Total loops done: 275.164

  • STANDARD TABLES - SORT TIME - 40.195 ms


 

  • BINARY SEARCH - 226.745 ms

  • LOOP WHERE SORTED - 157.758 ms

  • LOOP HASHED - 139.410 ms

  • LOOP WHERE SECONDARY SORTED - 165.673 ms

  • LOOP SECONDARY HASHED - 143.887 ms

  • PARALLEL CURSOR WITH READ - 92.912 ms

  • PARALLEL CURSOR WITHOUT READ - 48.403 ms


When the number of positions is randomized we get results very close to the second test.

Conclusions

Secondary indices seem to perform very close to primary ones.

For my test data it seems that, if we only have one position per head, parallel cursor without read, the hashed table and index, and READ TABLE with BINARY SEARCH are all good options, but as the number of positions grows, parallel cursor becomes better and binary search is no longer an option. This conclusion cannot be extrapolated to every case; I recommend performing your own tests if you need to tweak a program. You can download the YJRS_ABAP_003 program from the GitHub repository you will find at the end of the post and modify it to fit your needs.

INSERT/DELETE TESTS


Now we have some techniques for reading, but some of these techniques could have a performance impact if your program needs to insert or delete lots of records, so to get a better judgement of when to apply each one, we are going to test the time differences between STANDARD, SORTED and HASHED tables.

When inserting data into a STANDARD TABLE I will be using APPEND, as it is the statement recommended by SAP for this table kind. For SORTED and HASHED TABLES I will be using INSERT. A SORTED TABLE could be populated with APPEND, but only if the line you are appending is really the last one in sort order; for HASHED TABLES APPEND is not possible.

For the STANDARD TABLE DELETE I will be using a BINARY SEARCH to find the entry I want to delete, and then I will delete it by index; the other table types are optimized to find the entry you want to delete as fast as possible.
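
As a quick illustration, this is roughly what the compared statements look like. It is a simplified sketch reusing the ty_pos type from the read sketch above, with assumed names; the real code is in YJRS_ABAP_004.

" Simplified sketch of the fill and delete statements compared in these tests.
DATA: lt_ins_std    TYPE STANDARD TABLE OF ty_pos WITH NON-UNIQUE KEY head_id pos_id,
      lt_ins_sorted TYPE SORTED TABLE OF ty_pos WITH UNIQUE KEY head_id pos_id,
      lt_ins_hashed TYPE HASHED TABLE OF ty_pos WITH UNIQUE KEY head_id pos_id,
      ls_new        TYPE ty_pos.

" Fill: APPEND for the standard table, INSERT for the sorted and hashed tables.
APPEND ls_new TO lt_ins_std.
INSERT ls_new INTO TABLE lt_ins_sorted.
INSERT ls_new INTO TABLE lt_ins_hashed.

" Delete from the standard table: locate the row with BINARY SEARCH (the table
" was sorted beforehand), then delete it by index.
SORT lt_ins_std BY head_id pos_id.
READ TABLE lt_ins_std WITH KEY head_id = ls_new-head_id pos_id = ls_new-pos_id
     BINARY SEARCH TRANSPORTING NO FIELDS.
IF sy-subrc = 0.
  DELETE lt_ins_std INDEX sy-tabix.
ENDIF.

" Delete from the sorted/hashed tables via the table key; the kernel optimizes the lookup.
DELETE TABLE lt_ins_sorted WITH TABLE KEY head_id = ls_new-head_id pos_id = ls_new-pos_id.
DELETE TABLE lt_ins_hashed WITH TABLE KEY head_id = ls_new-head_id pos_id = ls_new-pos_id.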

For the test I will be using a table pre-filled with 25000 entries; I will add 10000 more and then delete those 10000 entries. I'm having some issues when deleting from secondary index tables, they don't seem to be doing any optimization, so I will be running these tests just 20 times instead of 100 to avoid a timeout.

  • Seed: 1

  • Head entries: 5000

  • Positions per head: 5

  • Total entries: 25000

  • Executions: 20


Results for INSERT one by one:

  • APPEND STANDARD TABLE - 1.133 ms

  • INSERT SORTED TABLE - 3.477 ms

  • INSERT HASHED TABLE - 3.085 ms

  • INSERT SECONDARY SORTED TABLE - 4.285 ms

  • INSERT SECONDARY HASHED TABLE - 3.641 ms


Results for INSERT all entries in one operation:

  • APPEND STANDARD TABLE - 306 ms

  • INSERT SORTED TABLE - 1.121 ms

  • INSERT HASHED TABLE - 2.082 ms

  • INSERT SECONDARY SORTED TABLE - 6.814 ms

  • INSERT SECONDARY HASHED TABLE - 6.229 ms


Results for DELETE:

  • DELETE STANDARD TABLE - 13.009 ms

  • DELETE SORTED TABLE - 30.916 ms

  • DELETE HASHED TABLE - 25.566 ms

  • DELETE SECONDARY SORTED TABLE - 2.074.537 ms

  • DELETE SECONDARY HASHED TABLE - 2.156.181 ms


Conclusions


No surprises here for the best option: STANDARD TABLE performs better as it has no overhead in this kind of operation. I'm having issues with the SECONDARY INDEX even when specifying the key it must use; maybe I did something wrong, so if someone finds any mistake in the code I will be grateful to know.

In the comments, Sandra talks about the lazy update technique used by the implementation of secondary indexes. If I have understood the documentation well, INSERTs into secondary index tables do not trigger the index update immediately but on the next usage; this allows multiple INSERT calls to be optimized, as the table is not re-sorted on each INSERT.
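
A tiny sketch of the idea, with an assumed non-unique secondary sorted key (which is where the lazy update applies, as Sandra notes below; my tests use unique keys), reusing the ty_pos type from the read sketch:

" Sketch only: the secondary index is not rebuilt by each INSERT but on the next
" access that uses the key, which is why Sandra's test class below reads via the
" secondary key once right after filling the table.
DATA: lt_lazy TYPE STANDARD TABLE OF ty_pos
              WITH NON-UNIQUE KEY head_id pos_id
              WITH NON-UNIQUE SORTED KEY sec_lazy COMPONENTS head_id,
      ls_row  TYPE ty_pos.

INSERT ls_row INTO TABLE lt_lazy.    " secondary index is not updated here
READ TABLE lt_lazy INDEX 1 USING KEY sec_lazy
     TRANSPORTING NO FIELDS.         " first use of the key triggers the index update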

Sandra also found the reason for the DELETE performance on secondary index tables; as specified in the documentation, there is no optimization for these operations:
deleting a row from a standard table using a secondary key cannot be optimized because the index entry to be deleted must be searched linearly.

Final Conclusions


As seen in the examples, READ technique performance will vary depending on different factors such as the table data. I can't give a rule for perfect performance; you must try to know as much as possible about the data you will have, and then you can find the best option by doing some testing.

If your program will perform INSERT/DELETE operations intensively you may be forced to use a non-optimal READ technique; instead, you should choose the technique that produces the least overhead in those operations.

In some cases you will need to think about memory too; in such cases, avoid hashed tables and secondary indexes.

Personal opinion on BINARY SEARCH addition


In my opinion the BINARY SEARCH addition has just one real problem:

  • If you forget to sort your table it could be hard to detect, making this issue likely to reach the productive system.


I also want to share this link, The War on BINARY SEARCH, where you can read a different opinion so you can make your own choice.

Many people think the BINARY SEARCH addition is useless, so let's imagine that in our first test case I used the recommended sorted table, but the volume is so big that I'm having performance issues; let's imagine memory is struggling too. If you only read people who are against binary search, they will tell you to go for a hashed table to improve performance, and if we only think about performance, hashed tables are better than binary search. But you must know that hashed tables need more memory than standard and sorted ones, so if I'm having memory issues, hashed tables will put even more stress on memory. For the first test case, binary search doubles performance without any impact on memory. It was not so difficult to imagine a use case for binary search; you just need to know the pros and cons of each statement you use.

EDIT: As Sandra mentioned in the comments, the loop is not exactly the same in the READ BINARY SEARCH and LOOP SORTED tests; I didn't remember that when writing this point, so you can't expect such a big increase without this difference. In the first post, with the same loop, binary search consistently performed better by a small margin; if you have a case like this, what is written here will apply. Remember the title of this point: personal opinion 🙂

EDIT 2: To compare the same LOOPs I must compare LOOP WHERE SORTED with PARALLEL CURSOR WITH READ, which in my particular case performs better in all 3 tests and provides a performance increase of about 40% in test 1. As explained in Sandra's comment, sorted tables should always work at least as fast as binary search, and my argument should always be wrong, but as seen in the tests the theory is not working as expected. I tagged this point as personal opinion because it's not based on theory but on the numbers I really get. I could recommend following every guideline, but I'd rather recommend testing if performance really matters, and in my personal opinion this is the important point of the post: do not blindly follow guidelines, understand them, and if they are not working well enough for your case, try something different.

You can find full code at this github repository: https://github.com/jordirosa/ABAP4Newbies

The repository report programs for these examples are YJRS_ABAP_003 and YJRS_ABAP_004.

Regards

EDIT: I changed the read technique names to match the tests for an easier understanding of the test results, added a mention of the lazy update technique, and updated the personal opinion section where I was wrongly comparing the BINARY SEARCH loop with the LOOP WHERE SORTED one.
11 Comments
Sandra_Rossi
Active Contributor

There are lots of things to verify in your post, so I have probably skipped a few things.

READ TESTS

I’m somewhat confused by these tests because you don’t compare the same things.

  1. Your tests should not be compared all together, there are three distinct groups (see below the exact list): those which do a LOOP and a nested READ TABLE, those which do a LOOP and a nested LOOP, and the “parallel cursor”. I understand that you wanted to make the same kind of grouping via the parameter “positions per head” with the special value “1”, but that’s incorrect. Keep things simple, and you lower the risk of getting wrong results.
  2. Some tests do the outer loop based on the table of heads, and other tests do it on the table of positions. I saw that because you reported some differences between LOOP WHERE SORTED and PARALLEL CURSOR WITH READ, but they should be theoretically identical.
  3. Minor remark: you differentiate results based on slight performance differences, while they should be considered the same. Like you noticed, if you run the same code twice you never get the same time because the CPU is always doing some things in parallel and even the measurements are themselves not exact… You always indicate one “champion” (in green) but you'd often better indicate two champions. Same thing for the “losers” (in red).

I advise you to revisit the organization of your chapter because the results may confuse a lot of ABAP beginners.

Group 1:

  • BINARY SEARCH
  • LOOP HASHED (the word “LOOP” is confusing, or rename BINARY SEARCH by prefixing it with LOOP too!)
  • LOOP SECONDARY HASHED (same remark for “LOOP”)

Group 2:

  • LOOP WHERE SORTED
  • LOOP WHERE SECONDARY SORTED
  • PARALLEL CURSOR WITH READ (it’s not the algorithm of “parallel cursor”; I’d prefer “BINARY SEARCH & LOOP”)

Group 3:

  • PARALLEL CURSOR WITHOUT READ (in fact, it’s just “parallel cursor”)

 

INSERT/DELETE TESTS

 

Nothing to criticize here, except that I’m really surprised by the SO BAD performance of DELETE using the secondary key (whether it’s sorted or hashed), compared to the one using the primary key. I wonder if it’s related to the handling of the “delayed update”. I’ll try to investigate that.

Note that you would get a “false good” result if you had used a secondary non-unique sorted key instead of unique, because there would be a “lazy update” (see ABAP documentation), in fact it’s a delayed update of the index which happens at the next read.

 

BINARY SEARCH

 

The argument about the lower memory needed by the standard index compared to the hashed index is correct, but in that case just go for the sorted index.

The argument that BINARY SEARCH is better than using a sorted key for performance is ALWAYS wrong, because it's the same performance as I explained.
According to me, the only good argument to use BINARY SEARCH is when the context FORCES you to have a standard table (ALV, etc.). That was the comment from Matthew in part two of your blog series, if I remember well.
Hi,

Thanks for the comment, I will try to review some points.

I updated the personal opinion section because I didn’t remember I had used different loops for the read and sorted tests when writing the last point.

I will keep parallel cursor with read as a name because in some places you can find this differentiation, and I think it’s nice to know what they are talking about just from the name.

The green and red “champions” are based on the average of 100 runs for the particular test, and on thousands of runs while writing the program, with the same average results within a very close percentage. For each particular case I think the green option will be the better one on average; times vary slightly for each batch of 100 runs, but in each one the green option was the better one. I don’t think it’s really important, though, as it applies to my particular case; I just colored them to make it easy to see the best and the worst times.

I will read about lazy update to try to complete the info.

Thanks again for helping me to improve the post.
Sandra_Rossi
Active Contributor

I understand why it takes so much time for DELETE SECONDARY SORTED TABLE and DELETE SECONDARY HASHED TABLE: it’s not your program, it’s how the kernel works (ABAP Doc – Using Secondary Keys):

  • The standard scenario for sensibly using secondary table keys relates to a very large internal table that is created in the memory and whose content is changed very infrequently.
  • “deleting a row from a standard table using a secondary key cannot be optimized because the index entry to be deleted must be searched linearly.”
Thank you, I'm not sure the post is still mine, half of the info was provided by your comments haha
Sandra_Rossi
Active Contributor
One more remark after you have updated your post, concerning the lazy update "INSERTs on secondary index tables": more precisely, it concerns only non-unique secondary (sorted) indexes. In your tests, you have only unique indexes (in my original comment, I was talking about that just for information).

About the conclusion for READ, "binary search is no longer an option", I still don't agree at all ?

 
About the conclusion for READ, “binary search is no longer an option”, I still don’t agree at all ?

I was totally sure about that 😄

If you don't mind, I would like to know if you disagree with the numbers, if you think I'm not doing the right comparison yet, or if you just disagree with using the addition even if there are cases where we can get such a performance increase.
Sandra_Rossi
Active Contributor
Yes I disagree, there are two reasons why they are incorrect:

  1. Your tests should not be compared all together, there are three distinct groups
    etc.

  2. Some tests do the outer loop based on the table of heads, and other tests do it on the table of positions.
    etc.


I'm sure BINARY SEARCH and equivalent read using a sorted key are identical. When I saw that the loops were different (reason number 2), I immediately supposed it was the cause of the differences.
The conclusion is just about two tests:

  • LOOP WHERE SORTED

    • From head to positions

    • Belongs to group 2 of your first comment

    • Positions table is sorted table

    • LOOP filtering partial key



  • PARALLEL CURSOR WITH READ

    • From head to positions

    • Belongs to group 2 of your first comment

    • Positions table is standard table (sorted with SORT)

    • READ TABLE with BINARY SEARCH, then loop by index.




I really don't see how your points 1 and 2 apply to these test cases. If you think that same groups and same loops are the way to do a fair comparison, then the numbers should be right, and manually coding the binary search performs 40% faster for that particular case.

I'm not trying to be right; I'm interested in learning whether there really is some reason other than different opinions, but I can't see it yet. Maybe if you could write the binary search equivalent code you think I should be executing, I would understand.
Sandra_Rossi
Active Contributor
My only concern, if I understand well what you say, you're saying BINARY SEARCH is better than sorted table/sorted key, and I'm saying there is no difference.

But using BINARY SEARCH has the drawback that sometimes the SORT isn't done correctly before, so, to avoid problems, don't use BINARY SEARCH (except if an existing program forces you to use a standard table); use a sorted table or secondary sorted key, or even a hashed table/key if possible/worth it. Moreover, the option in the debugger to alert if the standard table is correctly sorted (that you have shown in your second blog post) is not a good reason to continue using BINARY SEARCH.

Now, I hope you don't think/say that BINARY SEARCH has a better performance than a sorted key. Otherwise please tell me what is wrong in my very simple test (I ran it 4 different times because, as you know, measures are imprecise):



Code:
CLASS ltc_main DEFINITION
  FOR TESTING
  DURATION SHORT
  RISK LEVEL HARMLESS.

  PUBLIC SECTION.
    CLASS-METHODS class_constructor.

  PRIVATE SECTION.
    " NF = not found (read -> sy-subrc = 4)
    METHODS read_binary_search_nf    FOR TESTING.
    METHODS read_secondary_sorted_nf FOR TESTING.
    METHODS read_sorted_nf           FOR TESTING.
    " F = found (read -> sy-subrc = 0)
    METHODS read_binary_search_f    FOR TESTING.
    METHODS read_secondary_sorted_f FOR TESTING.
    METHODS read_sorted_f           FOR TESTING.

    CLASS-DATA sflight_s TYPE STANDARD TABLE OF sflight.
    CLASS-DATA std TYPE STANDARD TABLE OF sflight
      WITH NON-UNIQUE SORTED KEY by_carr_conn COMPONENTS carrid connid.
    CLASS-DATA sorted TYPE SORTED TABLE OF sflight
      WITH NON-UNIQUE KEY carrid connid.
ENDCLASS.

CLASS ltc_main IMPLEMENTATION.
  METHOD class_constructor.
    SELECT * FROM sflight INTO TABLE sflight_s.
    std = sflight_s.
    DO 10 TIMES.
      INSERT LINES OF std INTO TABLE std.
    ENDDO.
    SORT std BY carrid connid.
    " build secondary non-unique index (to avoid "lazy update" effect to be counted in performance measure)
    READ TABLE std INDEX 1 USING KEY by_carr_conn TRANSPORTING NO FIELDS.
    sorted = std.
  ENDMETHOD.

  METHOD read_binary_search_nf.
    DO 10000 TIMES.
      LOOP AT sflight_s ASSIGNING FIELD-SYMBOL(<line>).
        DATA(connid) = <line>-connid + 4000.
        READ TABLE std BINARY SEARCH
             WITH KEY carrid = <line>-carrid connid = connid
             ASSIGNING FIELD-SYMBOL(<line2>).
      ENDLOOP.
    ENDDO.
  ENDMETHOD.

  METHOD read_secondary_sorted_nf.
    DO 10000 TIMES.
      LOOP AT sflight_s ASSIGNING FIELD-SYMBOL(<line>).
        DATA(connid) = <line>-connid + 4000.
        ASSIGN std[ KEY by_carr_conn COMPONENTS carrid = <line>-carrid connid = connid ]
               TO FIELD-SYMBOL(<line2>).
      ENDLOOP.
    ENDDO.
  ENDMETHOD.

  METHOD read_sorted_nf.
    DO 10000 TIMES.
      LOOP AT sflight_s ASSIGNING FIELD-SYMBOL(<line>).
        DATA(connid) = <line>-connid + 4000.
        ASSIGN sorted[ carrid = <line>-carrid connid = connid ] TO FIELD-SYMBOL(<line2>).
      ENDLOOP.
    ENDDO.
  ENDMETHOD.

  METHOD read_binary_search_f.
    DO 10000 TIMES.
      LOOP AT sflight_s ASSIGNING FIELD-SYMBOL(<line>).
        READ TABLE std BINARY SEARCH
             WITH KEY carrid = <line>-carrid connid = <line>-connid
             ASSIGNING FIELD-SYMBOL(<line2>).
      ENDLOOP.
    ENDDO.
  ENDMETHOD.

  METHOD read_secondary_sorted_f.
    DO 10000 TIMES.
      LOOP AT sflight_s ASSIGNING FIELD-SYMBOL(<line>).
        ASSIGN std[ KEY by_carr_conn COMPONENTS carrid = <line>-carrid connid = <line>-connid ]
               TO FIELD-SYMBOL(<line2>).
      ENDLOOP.
    ENDDO.
  ENDMETHOD.

  METHOD read_sorted_f.
    DO 10000 TIMES.
      LOOP AT sflight_s ASSIGNING FIELD-SYMBOL(<line>).
        ASSIGN sorted[ carrid = <line>-carrid connid = <line>-connid ] TO FIELD-SYMBOL(<line2>).
      ENDLOOP.
    ENDDO.
  ENDMETHOD.
ENDCLASS.
Hi,

I'm saying that sometimes I get better performance manually coding the binary search and loop by index than letting the sorted table do it automatically for me, and I'm saying that not as a rule but as an exception, and the number is sometimes big enough not to ignore it as an option.

I also think the code of my comparison is the more accurate one to compare against the sorted table, as the documentation says:
The starting point for the editing of the table is determined by a binary search using the subconditions that cover the table key in full or in part. From the starting point onwards, the table is only processed for as long as these subconditions are met.

This is exactly what I'm doing with the code in the test case I called PARALLEL CURSOR WITH READ: I search for the entry with binary search, then I loop until the subconditions are not met.

It's beyond argument that using binary search by default over a sorted table is not the way to go, so I agree with you about which is the right guideline, but I think it is not an unbreakable guideline when I got 40% better performance coding it manually.

I agree with the problem that you can forget to SORT, but you can also forget to check for an empty table before a FOR ALL ENTRIES SELECT, or forget the addition RESPECTING BLANKS when concatenating strings, etc., and I don't see people stop using those when they need them. If my tests gave me the same, or at least close, performance, I would not see binary search as an option, but in this case I got a 40% performance boost coding manually the same optimization ABAP is supposed to be doing for sorted tables, so for my test case I really think binary search is an option.

Thanks for the code and for taking the time to help me learn. My English is awful and maybe I'm not explaining my viewpoint well enough: I'm not saying binary search is always better than a sorted table, but binary search could sometimes be a useful option that many people don't even consider.
Sandra_Rossi
Active Contributor
Well, I tried my best to demonstrate with simple and verifiable code and measures.

Just a last word which supports the general orientation of SAP to not use standard tables when keys are used: SAP recommends not using standard tables any more with COLLECT, although there's no performance difference. I think one good reason is that it may lead to errors if not used correctly. Instead, they recommend COLLECT with sorted tables with unique keys, or hashed tables.