
Recently we were swarmed by flaky tests in one of our projects, and the team had to deal with many of them. Often this job falls to young developers who do not have much experience troubleshooting unstable tests and therefore need some guidance in the process. Rather than repeating the principles of troubleshooting flaky tests to each of them, I decided to write this blog post in the hope of "teaching them to fish".

Some Foundations

Let's start with some basics of what a good test is and then see how that applies to unstable tests. There is an old piece of software development wisdom: good tests are FIRST. That is, they are Fast, Isolated, Repeatable/Readable, Self-verifying and Timely. When it comes to unstable tests, it is usually the principles of Test Isolation and Repeatability that get compromised.

Where to Start

I recommend starting at the assertion that is failing. I presume the test is Self-verifying and has clear, unambiguous assertions that verify the correctness of the behavior. Otherwise, the test would not be flaky.

Make a note of the state being verified and start analyzing the production code that manipulates that state. Usually, I place debug log output in the code that deals with the state verified by the test. Then I run the unstable test(s) and analyze the log output for differences between successful and failed runs (a diff tool helps).

If you know the process under test very well and have a clear understanding of what should be happening, you can skip analyzing the success output: just looking at the log output of the failing run and making sure it matches your mental picture may be enough.

Where to place the logging statements and what exactly to include in the log output is an art. It is mastered with experience, but for now, a few pieces of advice (illustrated by the sketch after this list):
  1. Avoid unnecessary noise in the output.
    The debug output may be triggered from many places besides the code being tested, so either place the log statements under conditions that already exist in the production code or add new conditions around the log statements just for the sake of the troubleshooting (much like conditional breakpoints). This is the case, for example, when testing a listener that is hit by many events other than those expected in the test.
  2. Make sure to include the state that fails verification, as well as the state of the conditions around the state modifications in the production code.
    For example, if a job state is being tested and the job state depends on some records selected from the database, print out those records.
  3. If the state manipulation happens under a condition, add output at the very beginning of the method so you know it was called. Otherwise, if the condition is false, there will be no output in the log, and you may be misled into believing the method was not called at all.
  4. If you observe an unexpected concurrent call and don't know which class makes it, log a stack trace, i.e. a new Exception(), in that method invocation.
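To make the advice above concrete, here is a minimal sketch of what such targeted trace output might look like, assuming an SLF4J logger and a hypothetical listener; the class, event type and method names are illustrative, not taken from any real project code:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical listener instrumented for flaky-test troubleshooting.
public class OrderStatusListener {

    private static final Logger LOG = LoggerFactory.getLogger(OrderStatusListener.class);

    public void onEvent(String eventType, String orderId) {
        // (1) + (3): log at the very beginning of the method, but only for the events the
        // test cares about, so that unrelated events do not flood the output.
        if (LOG.isDebugEnabled() && "ORDER_STATUS_CHANGED".equals(eventType)) {
            LOG.debug("onEvent called: type={}, orderId={}", eventType, orderId);
            // (4): if an unexpected concurrent call is suspected, log the invocation's stack trace.
            LOG.debug("invocation origin", new Exception("stack trace for troubleshooting"));
        }

        if ("ORDER_STATUS_CHANGED".equals(eventType)) {
            // (2): include the state that the failing assertion verifies.
            LOG.debug("order state before update: {}", loadOrderState(orderId));
            updateOrderState(orderId);
        }
    }

    // These stand in for the real production logic.
    private String loadOrderState(String orderId) { return "NEW"; }
    private void updateOrderState(String orderId) { }
}
```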
Based on the log output analysis, you should build a hypothesis about what is wrong. The next step is to verify it. If it is confirmed, you've found the problem; if not, go back to strategically placing the debug log statements and tracing.
NOTE: You may place breakpoints and debug instead of tracing, but debugging does not work well for finding concurrency problems because "observation changes the outcome". However, once you have a hypothesis, debugging can be used to pause the threads at the right places and to control the order in which they are released.

Hypothesis

Based on our experience, the hypothesis is built around one of the following problems, corresponding to the good-test principles.

Test Isolation

Usually, problems related to test isolation manifest themselves through a temporal dependency between tests, i.e. the tests pass when executed in one order but fail when executed in a different order. A test split is a special case of this problem, when two or more tests are executed at the same time; parallel test execution may introduce both isolation and concurrency problems.

If you find that the test run order matters, start looking for state shared between different test classes/methods (a minimal sketch of the static-field case follows this list). The shared state may be:
  • a static field
  • a shared service, e.g. when one test changes the service's configuration parameters or replaces it with a mock; this is what Spring's @DirtiesContext annotation exists for
  • application configuration properties
  • the file system
  • the database.
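As a minimal illustration of the first kind of shared state, here is a hedged sketch (the class and flag are hypothetical) of a static field that makes the result of one test depend on which other tests ran before it, together with the explicit reset that the first rule below demands:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.Test;

// Hypothetical static holder shared by every test in the same JVM.
class FeatureFlags {
    private static String exportFormat = "csv";
    static String getExportFormat() { return exportFormat; }
    static void setExportFormat(String format) { exportFormat = format; }
}

class ExportFormatTest {

    @Test
    void exportUsesCsvByDefault() {
        // Passes or fails depending on whether another test changed the static flag first.
        assertEquals("csv", FeatureFlags.getExportFormat());
    }

    @Test
    void exportCanBeSwitchedToJson() {
        FeatureFlags.setExportFormat("json");
        assertEquals("json", FeatureFlags.getExportFormat());
    }

    @AfterEach
    void resetSharedState() {
        // Rule 1 below: explicitly restore the original state after every test.
        FeatureFlags.setExportFormat("csv");
    }
}
```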

It helps to find the other tests that the flaky test is secretly affected by. If you've placed the trace output as recommended above, that should not be a problem. We also have a few rules that help deal with this kind of problem:

  1. All setup manipulations must be done explicitly in the test and must be reset back to the original state after the test(s) execute. That is true not only for the database and file system, but also for static fields, services and configuration.
  2. Although Writing and Executing Tests recommends extending ServicelayerTransactionalTest, we are very careful about following that advice or using @Transactional tests because, besides hiding transaction management bugs in production code, they also do not clean up data created in threads other than the test thread. Transactional tests are an excellent way to provide isolation at the data level, but make sure the production code does not declare its own transactions and does not create data in parallel threads.
  3. For cleaning data and resetting state we use many Rules/Extensions. In commerce, there is ItemCreationListener, used by HybrisJUnit4Test, which is the base class for all JUnit integration tests. It is a clever solution to the data cleaning problem that works even for data created in other threads.
  4. We use a convention to include the test name in any setup data we create. For example, if we need a Catalog for the test, its ID will include a short name of the test and, if needed, an additional qualifier. That way, when tracing the problem and printing out an unexpected Catalog in the log output, we know exactly which test left it behind or is running in parallel because of the test split.
  5. We use WHERE clauses for the database record assertions made in the test. That is, if the production code creates some products, we do not select all products from the database, but only those we expect to have, for example only the products in the test-specific catalog we've created (see the sketch after this list).
  6. We avoid using project sample data in tests and create all data unique to the test. We found that with a different test split in CI, it is possible for two different tests to manipulate the same project sample data concurrently. For the same reason, we avoid relying on essential data. We do have tests for the essential data, because it must be tested too, but the only solution we have found so far is to place all tests for the essential data into a single class.
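Here is a hedged sketch of rules 4 and 5 combined; the in-memory map stands in for the real database, and all names are hypothetical, but the technique is the same with a SQL WHERE clause or FlexibleSearch query:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import org.junit.jupiter.api.Test;

class ProductImportTest {

    // Rule 4: the test name is part of every piece of setup data the test creates,
    // so an unexpected record in the log output can be traced back to its origin.
    private static final String TEST_CATALOG_ID = "ProductImportTest_catalog";

    // product code -> catalog id; stands in for the real database shared by all tests
    private static final Map<String, String> PRODUCTS = new ConcurrentHashMap<>();

    @Test
    void importCreatesTwoProducts() {
        // data left behind or created in parallel by some other test
        PRODUCTS.put("leftoverProduct", "OtherTest_catalog");

        // the code under test, inlined here for brevity
        PRODUCTS.put("shirt", TEST_CATALOG_ID);
        PRODUCTS.put("shoes", TEST_CATALOG_ID);

        // Rule 5: assert only on records belonging to this test's catalog
        // (the equivalent of a WHERE clause), never on "all products".
        List<String> codesInTestCatalog = PRODUCTS.entrySet().stream()
                .filter(entry -> TEST_CATALOG_ID.equals(entry.getValue()))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        assertEquals(2, codesInTestCatalog.size());
    }
}
```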

Repeatable Tests

Violation of this principle leads to flaky tests that are much harder to troubleshoot. If you've excluded a Test Isolation problem in your research, then it is probably one of the following.

Randomness in the test

This means the test uses some random values, often for variables or database record fields that must be provided but make no difference to the test, e.g. a random product code. To deal with this problem we have a convention to never use random values in a test. Instead, we explicitly put in a value that identifies this piece of data as irrelevant, for example "does not matter", "irrelevant" or "not used by test". We may also include the test name in those values, as described above (see the sketch below).
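A small, hypothetical illustration of this convention (the calculator and its API are invented for the example):

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// The calculator and its API are invented for the example.
class DiscountCalculator {
    double apply(String productCode, double price, double discountRate) {
        return price * (1.0 - discountRate);
    }
}

class DiscountCalculatorTest {

    // The product code is required by the API but has no effect on the verified behavior,
    // so it spells out its own irrelevance (and origin) instead of being randomized.
    private static final String IRRELEVANT_PRODUCT_CODE = "DiscountCalculatorTest_irrelevantProduct";

    @Test
    void tenPercentDiscountIsApplied() {
        double discounted = new DiscountCalculator().apply(IRRELEVANT_PRODUCT_CODE, 100.0, 0.10);
        assertEquals(90.0, discounted, 0.001);
    }
}
```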

Another source of randomness is time-based conditions in the production code, so that whether the test passes depends on the time/date when it is executed. To deal with this problem we sometimes had to add more tracing log output to the code and leave the test unfixed until it failed again in CI, at which point we had more output to analyze.

Over-specification

This is the case when big bodies of text are compared "letter to letter" against an expected sample. Usually this kind of assertion is made on request/response bodies, generated files, etc. If the order of elements in the generated body is slightly different, the comparison fails, even though the generated text is still valid.
A solution is to parse the generated text and then evaluate it using XPath, JsonPath or regular expressions, taking into account that the order may differ (see the sketch below).
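As a hedged sketch of that approach, the following test parses a generated JSON body with Jackson and asserts on individual fields instead of comparing the whole string; the response content and names are illustrative only:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.api.Test;

class OrderResponseTest {

    @Test
    void responseContainsOrderIdAndBothItems() throws Exception {
        // In a real test this string would come from the code under test.
        String response = "{\"orderId\":\"o-1\",\"items\":[\"shoes\",\"shirt\"]}";

        JsonNode root = new ObjectMapper().readTree(response);

        // Assert on the parsed structure, so element order and formatting do not matter.
        assertEquals("o-1", root.path("orderId").asText());
        assertEquals(2, root.path("items").size());
        assertTrue(containsTextValue(root.path("items"), "shoes"));
        assertTrue(containsTextValue(root.path("items"), "shirt"));
    }

    private boolean containsTextValue(JsonNode arrayNode, String value) {
        for (JsonNode element : arrayNode) {
            if (value.equals(element.asText())) {
                return true;
            }
        }
        return false;
    }
}
```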

Order in Collections

Sometimes tests expect elements to be in a certain order for collections that are not ordered by definition, i.e. sets, maps, etc. Depending on where the test runs, the order of elements in the collection being asserted may differ. Be mindful whenever you assert something in a certain order and adjust the assert statements accordingly (see the sketch below).
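A minimal sketch, with invented names, of asserting membership rather than order:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import java.util.Set;
import org.junit.jupiter.api.Test;

class AssignedRolesTest {

    @Test
    void adminGetsBothRoles() {
        // Imagine this comes from the code under test, backed by a HashSet internally,
        // so the iteration order is not guaranteed.
        List<String> roles = List.of("ROLE_USER", "ROLE_ADMIN");

        // Comparing as sets (or using a matcher such as containsInAnyOrder) keeps the
        // assertion valid no matter how the elements happen to be iterated.
        assertEquals(Set.of("ROLE_ADMIN", "ROLE_USER"), Set.copyOf(roles));
    }
}
```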

Concurrency bugs

Let's start with a fact: these tests are harder to write because the test flow must be synchronized with the background processing. Either the test thread is blocked until something happens in the background (this may be done using latches, barriers, semaphores, etc., but essentially it comes down to Object.wait(timeout) under the hood), or the test thread polls the state and waits until it changes. The latter is usually done with Java Futures, or with utilities such as Awaitility for JUnit or PollingConditions.eventually() for Spock (see the sketch below).
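As a hedged sketch of the polling approach, the following test uses Awaitility to wait for a simulated background job; the job itself is faked with a CompletableFuture, so the names and timings are illustrative only:

```java
import static org.awaitility.Awaitility.await;
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;
import org.junit.jupiter.api.Test;

class BackgroundJobTest {

    @Test
    void jobEventuallyFinishes() {
        AtomicReference<String> jobState = new AtomicReference<>("RUNNING");

        // Stands in for the real asynchronous production code.
        CompletableFuture.runAsync(() -> jobState.set("FINISHED"));

        // Poll the state instead of sleeping for a fixed time; fail after a bounded wait.
        await().atMost(Duration.ofSeconds(5))
               .untilAsserted(() -> assertEquals("FINISHED", jobState.get()));
    }
}
```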
Less experienced developers tend to simply increase the wait/timeout for the condition in the hope that it fixes the problem. Indeed, sometimes that is exactly the problem, but a timeout increase should always be a red flag and should be carefully evaluated. First, it is often not the cause of the failure, and the root problem is code that is not thread safe. Second, it may indicate a performance degradation.

The general recommendation is to use the tracing technique to search for race conditions around the state failing the assertion. As you go through the code and look for places where debug output should be added, pause and think about what will happen when this particular code is executed by multiple threads. If you see a potential race condition, add log output to confirm your hunch. Here are some ideas to consider:

  • Make sure the asynchronous process has completely finished before making the assertion and leaving the test method. There may be more happening besides the assertion in the eventual condition: once the condition is evaluated the test is done, but the background processing may continue. It may fire more events, keep running a cron job, and so on. All that activity may spill into the next test method and affect its results.
  • Pay attention to places where a condition check is done separately from the action that uses it. That is, the production code selects some data, evaluates it and then does something based on the evaluation result.
    For example, we query webhook configurations to decide whether a business event should be processed, and start processing if one is found. By the time we get to sending the event, a configuration might have been deleted and the event can no longer be sent; or a new configuration might have been created, and the event should have been sent to more destinations. Checking the configuration state and acting upon it should be done atomically. Often the solution is the old object-oriented wisdom: "Tell, don't ask". Examples of such atomic operations in Java are AtomicInteger.getAndIncrement() or Map.getOrDefault() (a small sketch follows this list).
  • Some developers think database operations are not subject to concurrency bugs because they are protected by the database transaction isolation level. In reality, that is not always true. There may be bugs around the transaction boundaries in the code, and the code may see uncommitted ("dirty") changes.
  • Pay attention to common state, i.e. anything potentially shared by more than one thread: class and static fields, and sometimes parameters (mutable non-primitives). You don't have to worry about "protecting" local variables inside methods. Consider this when deciding what to include in the debug output and how much code should be placed inside a synchronized block.
  • Once you've found the problem, adding a Thread.sleep() in the right place will often turn a flaky test into a consistently failing one, which helps confirm the fix. Just don't forget to remove that sleep before the final PR.
  • Make sure to log all caught exceptions. Sometimes an unexpected "random" exception kills a production thread and affects the expected result.
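To illustrate the separated check-and-act problem mentioned above, here is a hedged sketch (with invented names) of a racy counter and its "tell, don't ask" replacement; it uses ConcurrentHashMap.computeIfAbsent() as the atomic operation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class DeliveryCounter {

    private final Map<String, AtomicInteger> deliveriesPerDestination = new ConcurrentHashMap<>();

    // Racy: between containsKey() and put() another thread may insert a counter for the
    // same destination, and one of the two counters is then silently replaced. This is
    // the check done separately from the action.
    void recordDeliveryRacy(String destination) {
        if (!deliveriesPerDestination.containsKey(destination)) {
            deliveriesPerDestination.put(destination, new AtomicInteger());
        }
        deliveriesPerDestination.get(destination).incrementAndGet();
    }

    // "Tell, don't ask": let the map create the entry atomically, then increment atomically.
    void recordDelivery(String destination) {
        deliveriesPerDestination
                .computeIfAbsent(destination, key -> new AtomicInteger())
                .incrementAndGet();
    }
}
```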

Conclusion

The topic of unstable tests is complex and hard to convey. I'm sure I've forgotten many reasons why a test was failing and how we dealt with them, so any comments with more ideas would be greatly appreciated. I still hope this blog has value and helps less experienced developers perform a good analysis of a flaky test.