The Three-Score Problem
66.67% / 91.04% / 100%
The same mutation testing run can yield three legitimate scores. Each is in active use across the industry. None of them is wrong. The gap between them is where mutation reports become hard to read honestly.
A mutation testing run produces a single headline number, but that number is the result of methodological choices the report rarely explains. Three different ratios can be computed from the same telemetry file, and the difference between them is not noise. It is a measure of how much per-mutant judgement the reporter chose to apply, and how much they chose to leave to the tooling.
The exercise below is one such run. It produced raw 66.67%, detected 91.04%, and adjusted 100% from a single execution against a single module. None of these numbers is wrong. Each requires a different amount of justification before it is fit to publish, and the difference matters most when the report is being read by someone who did not produce it.
§1 The dataset
The dataset comes from a run against fintx-accounting, an abandoned open-source Java accounting project, executed in April 2026. Module under test: org.fintx.accounting.service.impl.AccountServiceImpl — 377 lines, cyclomatic complexity 34, ten deterministic seams. Public domain; last commit in 2024.
The original Java was transformed into a Python behavioural port. The test suite, mutation testing, and certificate all cover that port rather than the compiled JVM artefacts.
The substrate produced from this run, all reproducible from the certificate hash in the footer below: a BlastRadiusManifest.json mapping the module's branches, dependencies, and seams; a 43-test Python suite generated against the port; a MutationTelemetry.json showing 279 mutants generated, 186 killed, 68 suspicious, 25 survived, 0 timeouts; and a signed CryptographicParityCertificate.json. That is the input. Now the scores.
§2 Three scores from one telemetry file
Mutation testing introduces small, syntactic changes — mutants — into the source code, then runs the test suite against each one. If the suite catches the change, the mutant is "killed." If the suite still passes, the mutant "survives." The mutation score is the ratio of killed to total mutants. Survival can mean two distinct things: the tests fail to verify the affected behaviour, or the mutation produced no observable change at all. Distinguishing those two cases is most of the work of reading mutation reports honestly, and most of what the rest of this note is about.
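The kill/survive distinction can be shown in miniature. The function below is a hypothetical stand-in, not code from the module under test:

```python
# A minimal illustration of killing a mutant, using a hypothetical
# posting function (not code from the module under test).
def apply_fee(balance: float, fee: float) -> float:
    return balance - fee

# A typical mutant flips the arithmetic operator:
def apply_fee_mutant(balance: float, fee: float) -> float:
    return balance + fee  # mutated: '-' became '+'

# A test that pins the behaviour kills the mutant: it passes on the
# original and fails on the mutated code.
assert apply_fee(100.0, 2.5) == 97.5
assert apply_fee_mutant(100.0, 2.5) == 102.5  # observable change: mutant is killable
```

A mutant that produced no observable change at either call site would survive this assertion no matter how thorough the test is, which is the second kind of survival described above.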
The controversy starts at the score. Three different ratios can be computed from the same mutmut run, and each is in active use across the industry. None of them is mathematically wrong.
Raw kill rate counts only unambiguous kills. It is the defensible floor. The number cannot be inflated by methodology choices, which is exactly its value: any reviewer of any mutation report can compute it without trusting the reporter's judgement.
Detected rate counts mutmut's suspicious cohort as kills. mutmut flags a mutant as suspicious based purely on execution time: when the test run for the mutated module takes significantly longer than the unmutated baseline but doesn't hit the hard timeout. The intuition is that a slow run is probably one that's struggling to fail. The heuristic is operational, not semantic. Most mutation tooling treats suspicious as detected by default. The convention silently transfers an ambiguity from the test suite to the reporting framework. Reading detected as a coverage figure requires the reporter to have inspected the suspicious cohort and confirmed that none of them are genuine misses dressed up as slow tests.
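The shape of the heuristic can be sketched as follows. The threshold logic is illustrative only; mutmut's actual constants and comparison differ:

```python
# Sketch of the timing-based "suspicious" heuristic described above.
# The 2x-baseline threshold is illustrative, not mutmut's actual constant.
def classify_timing(duration: float, baseline: float, hard_timeout: float) -> str:
    if duration >= hard_timeout:
        return "timeout"        # hard limit hit: counted separately
    if duration > 2 * baseline:
        return "suspicious"     # much slower than the unmutated baseline
    return "normal"

print(classify_timing(duration=9.0, baseline=2.0, hard_timeout=30.0))  # suspicious
```

Note that nothing in this classification inspects what the tests asserted; it is purely a statement about wall-clock time, which is why promoting "suspicious" to "killed" is a judgement call rather than a measurement.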
Adjusted rate excludes the surviving mutants from the denominator on the grounds that they are equivalent — semantically identical to the original code under all execution paths the system reaches. This is the strongest claim the three numbers can carry, because it removes data rather than adding it. If even one survivor is misclassified as equivalent, the adjusted rate is overstating coverage.
Each successive score therefore depends on a different amount of human judgement. Raw is automatic. Detected requires a defence of the runner's tooling decisions. Adjusted requires a defence of the reporter's own. Mutation reports in the wild routinely quote the highest available number with no methodology disclosed. The next two sections walk through what that disclosure should actually contain, using the surviving mutants from this dataset as the worked example.
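The arithmetic behind the three headline numbers is reproducible directly from the counts in MutationTelemetry.json:

```python
# The three scores computed from this run's telemetry counts
# (279 generated, 186 killed, 68 suspicious, 25 survived, 0 timeouts).
killed, suspicious, survived, total = 186, 68, 25, 279

raw = killed / total                               # unambiguous kills only
detected = (killed + suspicious) / total           # suspicious promoted to kills
adjusted = (killed + suspicious) / (total - survived)  # survivors dropped as "equivalent"

print(f"raw      {raw:.2%}")       # 66.67%
print(f"detected {detected:.2%}")  # 91.04%
print(f"adjusted {adjusted:.2%}")  # 100.00%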
§3 What an equivalent mutant actually looks like
Open the surviving-mutants list in MutationTelemetry.json. There are twenty-five of them. Twenty-one follow one pattern; four follow others. Start with the first.
The mutation is mutmut's string-replacement strategy. It found a string literal — "0.00" — and inserted its XX…XX markers. The resulting code, Decimal("XX0.00XX"), is not a valid Decimal constructor argument. If executed, it raises decimal.InvalidOperation.
The line is a dataclass field default. In Python, dataclass defaults are evaluated when the class is defined, not when an instance is created. The mutated default raises immediately on import. The module never loads. Every test fails. A mutation runner that detects import-level test failures should record this as a kill.
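A minimal reproduction of the failure mode, with a simplified stand-in for the module's dataclass:

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation

# The original field default, evaluated once at class-definition time:
@dataclass
class Account:  # simplified stand-in for the module's dataclass
    balance: Decimal = Decimal("0.00")

# mutmut's string mutation turns the literal into "XX0.00XX". Because the
# default is evaluated when the class body runs, the error fires at import,
# before any test can execute.
try:
    @dataclass
    class MutatedAccount:
        balance: Decimal = Decimal("XX0.00XX")
except InvalidOperation:
    print("stillborn: module fails at import; no test ever runs")
```

The mutated class can never be defined, so the question "do the tests catch this change?" is unanswerable: the program containing the change does not exist.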
In the run that produced this telemetry, mutmut v2 didn't. The mechanism is mundane: when a mutated module fails to import, pytest exits with code 2 (collection error). Older versions of mutmut register a kill on exit code 1 specifically and default everything else, including code 2, to "survived." The mutated module never imported, no tests ran, and the runner wrote the result as a survivor. This is a documented bug in the runner's exit-code parsing, not a property of the test suite under examination.
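The misclassification reduces to a few lines. This is a hypothetical sketch of the parsing logic described above, not mutmut's actual source:

```python
# Hypothetical sketch of the exit-code misclassification described above
# (illustrative only -- not mutmut's actual implementation).
OK, TESTS_FAILED, COLLECTION_ERROR = 0, 1, 2

def classify(exit_code: int) -> str:
    if exit_code == TESTS_FAILED:
        return "killed"     # only exit code 1 registers a kill
    return "survived"       # bug: code 2 (import error) falls through here

print(classify(COLLECTION_ERROR))  # a stillborn mutant is recorded as a survivor
```

A correct parser would treat any non-zero exit as evidence the mutant was detected, or at minimum route collection errors into their own bucket rather than the survivor column.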
In the mutation testing literature these are called incompetent or stillborn mutants — mutants that fail to produce a runnable program. PIT, the dominant mutation testing tool in the Java ecosystem, isolates them explicitly as RUN_ERROR or MEMORY_ERROR and excludes them from the mutation score by default. mutmut, working at the Python AST level, generates them by design and then misclassifies them at the parsing layer, which is how twenty-one of them ended up in this telemetry's "survived" column. None of this is a property of the code under test.
This pattern accounts for twenty-one of the twenty-five surviving mutants in the dataset. Sequential dataclass field defaults across the module: balance, frozenAmt, drBalance, crBalance, drTransactionAmt, and on through line 76. None of them survives in any meaningful sense. They are stillborn mutants the runner mislabelled.
The remaining four survivors are different in kind. Each is a genuine equivalent with a reachability argument that can be checked by reading the surrounding code.
Mutant 115 (newBalance = 0.00 → None): the variable is overwritten in both the PLUS and MINUS branches before any read, so the initial value never reaches an observable. Genuine equivalent. The same pattern holds at mutant 218 (newLastBal).
Mutant 126 (< 0.00 → <= 0.00): at the boundary, frozenAmt is exactly zero and both branches normalise it to zero, so the behavioural output is bit-identical. Genuine equivalent.
Mutant 184 (day_gap > 1 → >= 1): the earlier if day_gap == 1 branch catches the equality case before the elif is reached, so the mutation sits on dead code. Genuine equivalent.
These four are real equivalents — each with a one-sentence reachability argument that can be verified by reading the surrounding code. The other twenty-one are not equivalents at all. They are stillborn mutants on a known mutmut v2 parsing bug.
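The dead-initialiser pattern behind mutants 115 and 218 can be sketched like this. The function shape is a hypothetical simplification, not the module's actual posting code:

```python
from decimal import Decimal

# Simplified sketch of the pattern behind mutant 115: new_balance is
# assigned in both branches before any read, so the initialiser is dead
# and mutating it (e.g. to None) cannot change observable behaviour.
def post(balance: Decimal, amount: Decimal, direction: str) -> Decimal:
    new_balance = Decimal("0.00")  # mutation target: never read
    if direction == "PLUS":
        new_balance = balance + amount
    else:
        new_balance = balance - amount
    return new_balance

assert post(Decimal("10.00"), Decimal("2.50"), "PLUS") == Decimal("12.50")
```

The reachability argument is exactly the one-sentence form the framework in §4 asks for: every path from the initialiser to a return passes through an unconditional reassignment.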
The "all 25 are equivalent" classification that produces a 100% adjusted score is correct numerically. It is also wrong informationally. Without the classification, a team reading "100% adjusted" cannot know that 84% of the exemptions are tooling errors and only 16% are genuine reachability proofs. Both readings produce 100%. Only one tells you what the test suite actually proves.
This is the loose-versus-strict distinction in operational terms. The loose reading: equivalent. The strict reading: twenty-one Class A stillborn mutants on a known mutmut v2 parsing bug; four Class B genuine equivalents with written reachability proofs. The framework that produces the second statement is the next section.
| ID | Line | Pattern | Mutation | Class |
|---|---|---|---|---|
| 30 | 33 | dataclass default | balance → "XX0.00XX" | A |
| 32 | 34 | dataclass default | frozenAmt → "XX0.00XX" | A |
| 34 | 35 | dataclass default | drBalance → "XX0.00XX" | A |
| 36 | 36 | dataclass default | crBalance → "XX0.00XX" | A |
| 38 | 37 | dataclass default | drTransactionAmt → "XX0.00XX" | A |
| 40 | 38 | dataclass default | crTransactionAmt → "XX0.00XX" | A |
| 42 | 39 | dataclass default | lastBalance → "XX0.00XX" | A |
| 44 | 40 | dataclass default | lastDrBalance → "XX0.00XX" | A |
| 46 | 41 | dataclass default | lastCrBalance → "XX0.00XX" | A |
| 48 | 42 | dataclass default | lastDrTransactionAmt → "XX0.00XX" | A |
| 50 | 43 | dataclass default | lastCrTransactionAmt → "XX0.00XX" | A |
| 53 | 45 | dataclass default | balanceAccum → "XX0.00XX" | A |
| 61 | 52 | dataclass default | amount → "XX0.00XX" | A |
| 64 | 54 | dataclass default | balance → "XX0.00XX" | A |
| 66 | 55 | dataclass default | balanceAccum → "XX0.00XX" | A |
| 76 | 63 | dataclass default | amount → "XX0.00XX" | A |
| 83 | 72 | dataclass default | drTransAmtLimit → "XX0.00XX" | A |
| 85 | 73 | dataclass default | crTransAmtLimit → "XX0.00XX" | A |
| 87 | 74 | dataclass default | transAmtLimit → "XX0.00XX" | A |
| 89 | 75 | dataclass default | MaxBalanceLimit → "XX0.00XX" | A |
| 91 | 76 | dataclass default | MinBalanceLimit → "XX0.00XX" | A |
| 115 | 122 | dead initialiser | newBalance = 0.00 → None | B |
| 126 | 134 | boundary normalisation | < 0.00 → <= 0.00 | B |
| 184 | 168 | branch ordering | day_gap > 1 → >= 1 | B |
| 218 | 191 | dead initialiser | newLastBal = 0.00 → None | B |
§4 A classification framework
The framework below distinguishes the three things a "surviving" mutant can actually represent. Two of the categories are well-established in the mutation testing literature: Class A is what Offutt and others call incompetent or stillborn mutants; Class B is the much-discussed Equivalent Mutant Problem, formally undecidable in the general case, which is why the framework demands written justification per mutant rather than algorithmic detection. The contribution here is operational rather than theoretical: every survivor is classified A, B, or C, with written justification per mutant, included as an appendix to any report that cites an adjusted score.
| Class | Meaning | Disposition | Example from this dataset |
|---|---|---|---|
| A | Stillborn: the mutation fails to produce a runnable program; a tooling artefact, not a coverage fact | Exclude from the denominator, citing the runner defect | mutmut v2 classifying a pytest exit code 2 (module collection error) as survival: the mutated module never loaded, no tests ran |
| B | Genuine equivalent: semantically identical to the original under all execution paths the system reaches | Exclude, with a written per-mutant reachability proof | Mutant 115: newBalance is overwritten in both branches before any read |
| C | Genuine survivor: the tests fail to verify the affected behaviour | Count as a miss; no exemption | None in this dataset's survivor list |

The framework deliberately ignores cases that don't map cleanly to behaviour: mutations on log strings, debug paths, or __repr__ output. These are real survivors, but extending the test suite to assert on them is itself an anti-pattern. Where they appear in volume on a real codebase, the right response is a separate exclusion list maintained by the team, not a fourth category that pretends every line of code is a coverage candidate.
The order of the columns matters more than the categories themselves. Most mutation reports are produced by computing the score first, then writing the methodology to fit. The framework above inverts this: classification rules are committed to before the survivors are inspected, and the score is whatever falls out.
Applied to the dataset in §1, the framework treats twenty-one survivors as Class A — stillborn mutants that should never have entered the denominator at all, since they fail to produce a runnable program. The remaining four are Class B: genuine equivalents, excluded with per-mutant reachability justification. The adjusted score is still 100%, but for two distinct reasons: 21 mutants that don't count, and 4 that count but cancel. The number is unchanged. The meaning of the number is not.
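The classification pass itself is small enough to sketch. The field names here (exit_code, reachability_proof) are assumed for illustration; the real MutationTelemetry.json schema may differ:

```python
# Sketch of the A/B/C pass over the survivor list. Field names are
# assumptions for illustration, not the real MutationTelemetry.json schema.
STILLBORN_EXIT_CODES = {2}  # pytest collection error: module never imported

def classify_survivor(mutant: dict) -> str:
    if mutant.get("exit_code") in STILLBORN_EXIT_CODES:
        return "A"  # stillborn: drop from the denominator, cite the tooling bug
    if mutant.get("reachability_proof"):
        return "B"  # genuine equivalent: exclude with written justification
    return "C"      # genuine miss: counts against the suite

survivors = [
    {"id": 30, "exit_code": 2},
    {"id": 115, "exit_code": 1, "reachability_proof": "overwritten before read"},
    {"id": 999, "exit_code": 1},  # hypothetical Class C survivor
]
print([classify_survivor(m) for m in survivors])  # ['A', 'B', 'C']
```

The point of writing the rule as code is the commitment order §4 demands: the function exists before the survivor list is inspected, so the score is whatever falls out of applying it.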
§5 How to read a mutation report
Three questions to bring to any mutation report someone else produced:
- What is the raw kill rate?
  It is the only number that cannot be inflated by methodology choices. Treat it as the floor. If a report does not state it explicitly, the report is asking the reader to take a downstream number on trust.
- What is in the suspicious cohort, and has each member been classified?
  "Suspicious" is a tooling decision, not a verdict. A report that promotes the suspicious cohort to "killed" without per-mutant review is reporting a tooling decision as a coverage claim. Different runners flag different things as suspicious; the convention is not portable across tools.
- What justifies any equivalence claim?
  Equivalence is a strong claim and requires written proof per mutant — typically a survey of execution paths showing the mutated value cannot affect observable behaviour. Anything less is either a coverage gap or a stillborn mutant dressed up as an exemption. The dataset in §3 is a worked example: twenty-one of the twenty-five "equivalents" were stillborn — the mutated module never ran — not equivalents at all.
And one for any mutation report being produced: write the classification rule before computing the score. The order is the difference between reporting what the test suite actually verifies and reporting what you want to claim it verifies.