Your coverage score is lying to you. Not deliberately: it is doing exactly what you asked. You wanted a number, and you got one. The trouble starts when you treat that number as a meaningful signal about release confidence.
A test suite with 87% line coverage and a test suite that gives you genuine confidence in a release are two completely different things. They can coexist. In most teams I have audited, they don't.
A number without context is just noise
Coverage tells you which lines were executed during a test run. Nothing more. It says nothing about whether those lines were executed with the inputs that matter, whether assertions checked anything meaningful, or whether the code behaves correctly at the boundaries that actually break in production.
I have reviewed test suites sitting at 90% coverage with zero tests touching error handling. The happy path was exhaustively covered. The code paths that execute when a third-party API returns a 429, when a database write times out, or when a file comes through with unexpected encoding — untested, unexecuted, invisible to the coverage report.
Those are the paths that fail on a Friday at 4pm.
The question a Head of Testing should be asking is not "how do we get to 100% coverage?" It is: do we know what will break this system, and will our tests tell us before a user does?
Three ways coverage misleads you
Assertion density. A test that executes twenty lines and asserts nothing about the outcome still counts toward coverage. It runs. It passes. It lifts your percentage. It proves nothing about correctness. I have found production test suites where a quarter of the tests contained no assertions at all — side effects of copy-paste development and no peer review on test code.
The boundary problem. Coverage tells you a function was called. It does not tell you whether you called it with the edge cases that matter. A loop that processes N items might be tested with N=3 in every test, giving 100% line coverage, and never tested with N=0, N=1, or N=10,000. Those are the inputs that break systems under real conditions.
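A small illustration of the boundary problem, using a hypothetical `average_latency` function. A single test with three items executes every line, so a coverage report would show the function as fully covered, yet the N=0 case raises an exception that no test has ever seen.

```python
def average_latency(samples):
    """Mean of a list of latency samples in milliseconds (hypothetical example)."""
    total = 0
    for value in samples:
        total += value
    return total / len(samples)


def test_average_three_items():
    # Executes every line of average_latency: 100% line coverage for it.
    assert average_latency([10, 20, 30]) == 20


def boundary_results():
    """Run the same function at the sizes the N=3 test never touches."""
    results = {}
    for label, samples in {"N=0": [], "N=1": [5], "N=3": [10, 20, 30]}.items():
        try:
            results[label] = average_latency(samples)
        except ZeroDivisionError:
            results[label] = "ZeroDivisionError"
    return results


test_average_three_items()
print(boundary_results())
# {'N=0': 'ZeroDivisionError', 'N=1': 5.0, 'N=3': 20.0}
```

The coverage number is identical before and after the boundary cases are added. The knowledge about the system is not.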
The integration gap. Line coverage is typically measured within a single process. Every integration point — the database connection, the external API call, the message queue, the file system — is either mocked out or ignored entirely. Your coverage percentage climbs while an entire class of failure modes remains invisible.
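The sketch below shows how a mocked integration point can satisfy coverage while testing only the author's assumptions. `fetch_profile`, `RateLimited`, and the `client.get` interface are hypothetical names chosen for illustration; the mocking uses Python's standard `unittest.mock`.

```python
from unittest.mock import Mock


class RateLimited(Exception):
    """Raised when the upstream service asks us to back off (hypothetical)."""


def fetch_profile(client, user_id):
    """Hypothetical code under test; `client` stands in for any HTTP client."""
    response = client.get(f"/users/{user_id}")
    if response.status_code == 429:
        raise RateLimited("upstream asked us to back off")
    return response.json()


def test_happy_path_with_mock():
    # Passes whether or not the real service ever returns this shape.
    client = Mock()
    client.get.return_value = Mock(
        status_code=200, json=Mock(return_value={"id": 7})
    )
    assert fetch_profile(client, 7) == {"id": 7}


def rate_limit_is_surfaced():
    # Exercises the failure branch, but still against a stub, not the real API.
    client = Mock()
    client.get.return_value = Mock(status_code=429)
    try:
        fetch_profile(client, 7)
    except RateLimited:
        return True
    return False


test_happy_path_with_mock()
print(rate_limit_is_surfaced())  # True
```

Both checks lift line coverage, and the second one is clearly better than nothing. Neither tells you how the real service signals rate limiting, which is why at least some tests need to run against a real or containerised dependency.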
What AI test generation is doing to this
Large language models are genuinely good at generating test code that executes. They pattern-match against existing tests, extrapolate test cases from function signatures, and produce syntactically correct assertions at speed. Give a well-prompted LLM a function and it will return a reasonable happy-path test within seconds.
The problem is that coverage optimisation is exactly what they are good at, and it is the wrong thing to optimise for.
I have seen teams at 95% coverage with test suites that have never run against a real database. Every external call mocked. Every error response stubbed. Every authentication flow bypassed at the test level. The code is exercised. The system is not tested.
When a QA engineer writes tests, they bring heuristic knowledge about what breaks — malformed input, race conditions, permission boundary cases, third-party failure modes, the weird thing that happened last quarter. When an LLM generates tests, it brings statistical pattern-matching. Both produce tests. Only one of them systematically thinks about what might go wrong.
AI test generation will raise your coverage number faster than any tooling investment you have made. It will not raise your release confidence at the same rate. In teams that do not understand the difference, it makes the gap between the two invisible, which is a worse outcome than a lower coverage score and an honest signal.
What to measure instead
Replace coverage as your primary metric with two questions:
- Does our test suite find defects before they reach production?
- When it fails, do those failures accurately predict incidents?
The first question is answered by tracking your defect escape rate: defects found in production as a share of all defects found, including those caught in the build pipeline. If that rate is not falling over time, your test suite is not doing its job, regardless of the coverage number.
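As a rough sketch, the escape rate can be computed per release or per quarter and watched as a trend. The definition used here (escaped defects divided by all defects found) is one common convention and an assumption on my part, and the quarterly figures are invented for illustration.

```python
def defect_escape_rate(found_in_production, caught_in_pipeline):
    """Share of all known defects that escaped to production.

    Assumed definition: escaped / (escaped + caught). Returns 0.0 when
    no defects were recorded at all.
    """
    total = found_in_production + caught_in_pipeline
    if total == 0:
        return 0.0
    return found_in_production / total


# Hypothetical counts: (found in production, caught in pipeline) per quarter.
quarters = {"Q1": (12, 48), "Q2": (9, 51), "Q3": (9, 81)}

for name, (escaped, caught) in quarters.items():
    print(name, round(defect_escape_rate(escaped, caught), 2))
# Q1 0.2
# Q2 0.15
# Q3 0.1
```

The absolute value matters less than the direction. A flat or rising rate alongside a rising coverage percentage is the warning sign.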
The second question requires looking at your CI run history and categorising failures. How many of your recent test failures were actual defect catches? How many were environment instability, flaky network calls, or race conditions in test infrastructure? A test suite that cries wolf most of the time trains your team to ignore failures — which is a more dangerous outcome than low coverage.
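A minimal sketch of that categorisation, assuming each failed CI run has been labelled by hand during triage. The label names and the sample data are hypothetical; the point is the ratio of real defect catches to total failures.

```python
from collections import Counter

# Hypothetical triage labels for the most recent failed CI runs.
failures = [
    "defect", "flaky_network", "environment", "defect",
    "test_race", "environment", "flaky_network", "environment",
]


def failure_signal_ratio(labels):
    """Fraction of failures that were genuine defect catches."""
    counts = Counter(labels)
    total = sum(counts.values())
    return counts["defect"] / total if total else 0.0


print(Counter(failures).most_common(1))  # [('environment', 3)]
print(failure_signal_ratio(failures))    # 0.25
```

In this invented sample, three out of four red builds say nothing about the product, which is exactly the pattern that teaches a team to re-run and ignore.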
Coverage still has a role, but it is a narrow one: risk-weighted coverage in the areas that matter most. Not a single headline number across a whole codebase, but tracked, deliberate coverage of the code paths that carry the highest risk — the payment flow, the data migration logic, the authentication boundary. That is a meaningful audit. "We are at 87% across the board" is not.
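To make the contrast concrete, here is a sketch that compares a flat average with a risk-weighted one. The module names, coverage figures, risk scores (1 to 5), and the 80% floor are all hypothetical, and the flat average treats modules equally rather than by line count, which is a simplification.

```python
# Hypothetical per-module line coverage and a team-assigned risk score (1-5).
modules = {
    "payments":   {"coverage": 0.62, "risk": 5},
    "auth":       {"coverage": 0.70, "risk": 5},
    "migrations": {"coverage": 0.55, "risk": 4},
    "admin_ui":   {"coverage": 0.97, "risk": 1},
    "reporting":  {"coverage": 0.95, "risk": 1},
}


def flat_average(mods):
    """Unweighted mean coverage across modules."""
    return sum(m["coverage"] for m in mods.values()) / len(mods)


def risk_weighted_coverage(mods):
    """Mean coverage weighted by each module's risk score."""
    total_risk = sum(m["risk"] for m in mods.values())
    return sum(m["coverage"] * m["risk"] for m in mods.values()) / total_risk


def high_risk_gaps(mods, min_risk=4, floor=0.80):
    """High-risk modules whose coverage sits below the chosen floor."""
    return sorted(
        name for name, m in mods.items()
        if m["risk"] >= min_risk and m["coverage"] < floor
    )


print(round(flat_average(modules), 3))
print(round(risk_weighted_coverage(modules), 2))
print(high_risk_gaps(modules))  # ['auth', 'migrations', 'payments']
```

The weighted figure drops because the well-covered modules are the low-risk ones. The list of high-risk gaps is the part worth acting on.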
The senior QA conversation
The question is not "how do we get to 100% coverage?" It is: do we know what will break this system, and will our tests tell us before a user does?
If you cannot answer that second question with confidence, the coverage score is a distraction. You have built a system for measuring activity, and you are mistaking it for a measure of confidence.
Audit your most recent production incident. Go back to the test suite. Was there a test that should have caught it? If yes, why didn't it fire? If no, why was that test never written? That investigation is worth more than three months of watching a coverage percentage climb.
Coverage will tell you what you tested. It will not tell you whether you tested the right things. That distinction is the entire job.