"Our tests are flaky. We're shopping for a better test runner."
No, you're not looking at a runner problem. You're looking at a mirror.
In 20 years of being handed "flaky" suites, I've almost never found the flakiness living in the tooling. It lived in the system under test — and the team had quietly agreed not to look at it. A test that passes on Tuesday and fails on Wednesday isn't broken. It's often the most honest thing in the building.
Here's the heresy: flakiness is non-determinism, and you built it in. Shared state nobody resets. A sleep() standing in for a real wait. Tests that depend on the order they happen to run in, on the wall clock, or on a network call to something held together with hope. Martin Fowler made the point years ago — a non-deterministic test is worse than no test, because it trains the whole team to shrug at a red build.
Swapping the runner to fix that is like rotating the tyres because the engine knocks. New kit, same noise, lighter wallet.
What actually fixes it:
- Treat each flaky test as a bug report about your design, not a nuisance to retry until green.
- Kill shared mutable state between tests — fresh fixtures, nothing left over from the last run.
- Replace every sleep() with an explicit wait on the real condition.
- Make each test own its data, so they stop fighting each other over the same row.
- Quarantine the offender loudly — with an owner and a ticket, not the "muted temporarily since 2023" graveyard.
A "retry up to 3 times" flag isn't a fix. It's a way to launder a known defect into a green tick and call it a pipeline.
Flaky tests aren't lying to you. They're the one part of the system still telling the truth.
What's the flaky test your team keeps re-running instead of reading?
into your process?
A QA Health Check audit finds the gaps in 1–2 weeks. From £1,200.