“Don’t worry, this test only breaks the build sometimes”
It seems inevitable in a non-trivial development project that a test will occasionally flicker: it fails on one build, then passes on the next. We may assume that a test which passes 90%+ of the time is testing code that works 100% of the time in practice. That assumption may in fact be wishful thinking.
Let’s set aside the possibility that the test is finding a real bug… though I’m not sure you should. Let’s have a look at the general problem of failing tests.
Why is it so bad?
Let’s say your build is not incredibly quick – a build pipeline might be 20 minutes in total. A flicker in any test can lose you an average of 10 minutes, as you have to restart that build – and that’s assuming you notice immediately.
Let’s say your flickering test goes wrong 1% of the time. You probably won’t notice it much and life will go on, except on the rare occasions where you get unlucky (perhaps owing to a busy build server) and lose most of an hour as a result of it.
But flickering tests tend to come in flocks. Each flickering test you have brings its own chance of triggering. So how likely is a failed build?
Let’s say there are 100 tests, each with a 1% chance of failing for no reason. The chance of the build failing is not 100% but about two-thirds: the build only passes if every test passes, which happens with probability 0.99¹⁰⁰ ≈ 0.37, so it fails with probability ≈ 0.63. In essence the probabilities stack up. In practice there won’t be 100 potentially flickering tests in your build, but each one’s probability may be well above 1%.
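The arithmetic above can be sketched in a few lines (the function name is mine, for illustration):

```python
def build_failure_probability(n: int, p: float) -> float:
    """Chance a build fails when n independent tests each flicker with probability p."""
    # The build passes only if every test passes, i.e. with probability (1 - p)^n.
    return 1 - (1 - p) ** n

# 100 tests, each failing 1% of the time for no reason:
print(f"{build_failure_probability(100, 0.01):.0%}")  # roughly 63%
```

Note how quickly this grows: even 20 such tests already give you about an 18% chance of a spurious red build.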
The longer you leave flickering tests without diagnosing and fixing their root causes, the worse your build gets, to the point where you’re chasing a working build for a couple of hours or more each week.
And this is still assuming that the test isn’t pointing out a real bug!!!
Why do they happen?
Not all tests flicker because they’re wrong. Sometimes the environment really is the issue. Common root causes for test flickering seem to be:
- Order of test execution – indicating tests were set up relying on the order, or erroneously not setting themselves up and getting lucky
- Availability of real-world resources – such tests are not unit tests and are inherently more flaky – their failure may indicate a lack of resilience in the real world
- Random input data – sometimes the random input data can represent an incoherent scenario – e.g. searching assuming unique values when the random generator produced duplicates
- Timing – we could enter a world of pain trying to perfect concurrent tests. These issues come down to:
  - Race conditions – where things don’t happen in the right order
  - Propagation timeout – where you just need to wait a bit longer for things to resolve themselves, but the test timeout isn’t configured to be long enough
  - Poor sleep – where a test that’s retrying doesn’t sleep long enough to let the code under test do the thing it’s waiting for, thus timing out
  - Poor design for testability – where you have to test a complex concurrent system and hope for the best, because there’s no way to get into the code under test to test certain scenarios
  - Impatience – where a test has not been set up to wait long enough on a build server – a bit like propagation timeout, but worse, because there’s been no consideration of timeout
- Lack of resource management – where automatic resource clean-up fights the test because the test isn’t holding things open long enough
- Bleed of state between tests – essentially related to test order, but more fundamentally flawed – some shared values are simply not being separated by test, so concurrency or order is always going to produce odd effects
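To make the last cause concrete, here is a minimal, hypothetical sketch (pytest-style; all names are mine) of state bleeding between tests through a shared module-level value, and the conventional fix of resetting it before each test:

```python
# Shared mutable module-level state: the root cause of the bleed.
cache = {}

def register_user(name):
    # Code under test: refuses duplicate registrations.
    if name in cache:
        raise ValueError("duplicate user")
    cache[name] = True

def test_register_alice():
    register_user("alice")  # passes when run first

def test_register_alice_rejects_duplicate_signup():
    # Passes in isolation, but fails if test_register_alice ran first,
    # because "alice" bled into the shared cache from the other test.
    register_user("alice")

# The fix: reset shared state before every test.
# pytest calls setup_function automatically before each test in the module.
def setup_function(function):
    cache.clear()
```

With the `setup_function` reset in place, the tests pass in any order and under parallel collection; without it, the outcome depends entirely on execution order – the classic flicker.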
While some tests may just incomprehensibly flicker, tolerating flickering tests is like weighting the dice against yourself. The root cause of a flickering test is often a poor piece of test or code design, and a fixed flickering test is often easier to read and more convincing in terms of what it proves about the code under test.
Take a nearly-zero-tolerance approach to flickering tests and you won’t have to ask, “shall we try building it again and see if that works…?”