Flaky tests are a design problem, not a tooling problem

"Our tests are flaky. We're shopping for a better test runner."

No, you're not looking at a runner problem. You're looking at a mirror.

In 20 years of being handed "flaky" suites, I've almost never found the flakiness living in the tooling. It lived in the system under test — and the team had quietly agreed not to look at it. A test that passes on Tuesday and fails on Wednesday isn't broken. It's often the most honest thing in the building.

Here's the heresy: flakiness is non-determinism, and you built it in. Shared state nobody resets. A sleep() standing in for a real wait. Tests that depend on the order they happen to run in, on the wall clock, or on a network call to something held together with hope. Martin Fowler made the point back in 2011, in "Eradicating Non-Determinism in Tests": a non-deterministic test is worse than no test, because it trains the whole team to shrug at a red build.
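
To make the wall-clock case concrete, here is a minimal Python sketch. The names is_expired, utc_now and fixed_clock are invented for illustration and come from no real codebase. The only point is that code which reads the clock itself gives a test nothing to hold still, while code that accepts a clock can be pinned.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


def utc_now() -> datetime:
    """Real wall clock; the default outside of tests."""
    return datetime.now(timezone.utc)


def is_expired(expires_at: datetime, now: Callable[[], datetime] = utc_now) -> bool:
    """Return True once the supplied clock has reached the expiry instant."""
    return now() >= expires_at


# Hypothetical fixed instant: the test controls time instead of racing it.
FIXED = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)


def fixed_clock() -> datetime:
    return FIXED


def test_expiry_is_deterministic() -> None:
    # Same answer on Tuesday, Wednesday, and at 23:59:59 on New Year's Eve.
    assert is_expired(FIXED - timedelta(seconds=1), now=fixed_clock)
    assert not is_expired(FIXED + timedelta(seconds=1), now=fixed_clock)


test_expiry_is_deterministic()
```

The design change is small: the clock becomes a parameter with a sensible default, so production code is untouched and the test stops depending on when it runs.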

Swapping the runner to fix that is like rotating the tyres because the engine knocks. New kit, same noise, lighter wallet.

What actually fixes it:

Treat each flaky test as a bug report about your design, not a nuisance to retry until green.
Kill shared mutable state between tests — fresh fixtures, nothing left over from the last run.
Replace every sleep() with an explicit wait on the real condition.
Make each test own its data, so they stop fighting each other over the same row.
Quarantine the offender loudly — with an owner and a ticket, not the "muted temporarily since 2023" graveyard.
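
A rough sketch of three of those fixes in one place, in Python. Everything named here (FakeJobStore, wait_until, the job IDs) is a hypothetical stand-in, and your system and test framework will look different. It assumes a background job that finishes at some unknown moment, which is exactly where a sleep() usually sneaks in.

```python
import threading
import time
import uuid
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.01) -> None:
    """Poll `condition` until it is true; raise if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)  # poll interval only, not a guess at how long the work takes


class FakeJobStore:
    """Hypothetical in-memory stand-in for the system under test."""

    def __init__(self) -> None:
        self._done = set()
        self._lock = threading.Lock()

    def submit(self, job_id: str) -> None:
        # Complete the job on a background thread after a short delay.
        def finish() -> None:
            time.sleep(0.05)
            with self._lock:
                self._done.add(job_id)

        threading.Thread(target=finish, daemon=True).start()

    def is_done(self, job_id: str) -> bool:
        with self._lock:
            return job_id in self._done


def test_job_completes() -> None:
    store = FakeJobStore()          # fresh fixture: nothing left over from another test
    job_id = f"job-{uuid.uuid4()}"  # the test owns its data: no shared row to fight over
    store.submit(job_id)
    wait_until(lambda: store.is_done(job_id), timeout=2.0)  # wait on the real condition
    assert store.is_done(job_id)


test_job_completes()
```

The wait returns as soon as the condition holds and fails with a clear timeout when it never does. A bare sleep() gives you neither.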

A "retry up to 3 times" flag isn't a fix. It's a way to launder a known defect into a green tick and call it a pipeline.
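
To see the laundering in miniature, here is a hedged Python sketch. The classify helper is hypothetical and is not the behaviour of any particular runner's retry flag. It assumes that if retries are switched on at all, a pass that needed a retry gets reported as what it is rather than as a green tick.

```python
from typing import Callable


def classify(test: Callable[[], None], attempts: int = 3) -> str:
    """Run `test` up to `attempts` times and report what actually happened."""
    failures = 0
    for _ in range(attempts):
        try:
            test()
        except AssertionError:
            failures += 1
            continue
        # A pass after earlier failures is a flake, not a clean pass.
        return "passed" if failures == 0 else "flaky"
    return "failed"


# Hypothetical test that fails on its first run and passes afterwards.
runs = {"count": 0}


def fails_first_time() -> None:
    runs["count"] += 1
    assert runs["count"] > 1


print(classify(fails_first_time))
```

A plain retry flag would call that run green. Reporting it as flaky keeps the defect visible until someone reads it.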

Flaky tests aren't lying to you. They're the one part of the system still telling the truth.

What's the flaky test your team keeps re-running instead of reading?
