IID.systems

The Problem of Absent Specification in TDD

Why TDD Thrived, and Why It Reaches Its Limits in the AI Era

Why TDD Became Popular — Compatibility with Human Cognition

TDD's widespread adoption owes much to its compatibility with human cognitive patterns. "Given this input, this output should be returned" — this style of concrete example is far more intuitive than writing abstract specifications. For most developers, enumerating specific input/output pairs aligns with natural thought processes more than formally describing mathematical invariants and preconditions.

The characteristics of web development, the domain where TDD primarily took hold, are also significant. Web applications have a relatively high tolerance for defects compared to mission-critical systems. A fix for a critical bug can be deployed immediately, and users pick it up with a page refresh. Outside of payment processing and a few other areas, rapid iteration is prioritized over strict correctness.

In this environment, the "write tests quickly and fix things as you go" approach offered better economic returns than "define a perfect specification upfront." TDD's popularity stems less from inherent methodological superiority than from its high practicality within the specific context of web development.


Where Is the Specification? — TDD's Structural Flaw

TDD's principle of "write the test first" sidesteps a fundamental question: what guarantees the correctness of that test?

Test cases are fragments of a specification. But in TDD, that specification is never materialized. It exists only implicitly, fragmentarily, and likely incompletely in the test author's mind. The developer writes tests based on their belief about how "this API should behave" — but that "should" rests on subjective understanding.

The critical point is that tests function as a substitute for the specification itself. In TDD it is often said that "tests are executable specifications," but this conflates two distinct concepts. Tests are verification mechanisms that reflect a portion of the specification — they are not the specification itself. A finite set of input/output pairs merely samples from the infinite input space that the specification defines.
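The sampling gap can be sketched in a few lines. This is an illustration only: the function `absolute`, its deliberate defect, and the predicate `spec_holds` are hypothetical names introduced here, not taken from any real project.

```python
def absolute(x: int) -> int:
    """Hypothetical implementation with a defect the examples never hit."""
    if x == -7:  # deliberately wrong for one unsampled input
        return -7
    return x if x >= 0 else -x


def spec_holds(x: int, result: int) -> bool:
    """The specification as a predicate over every input, not a list of pairs."""
    return result >= 0 and result in (x, -x)


# Example-based tests: three samples from an unbounded input space.
examples = {3: 3, -3: 3, 0: 0}
examples_pass = all(absolute(x) == expected for x, expected in examples.items())

# Checking the predicate over a wider range finds what the samples missed.
violations = [x for x in range(-100, 101) if not spec_holds(x, absolute(x))]
```

All three examples pass, yet the predicate is violated at an input nobody sampled. Even the wider range check is still only a larger sample; the predicate is what actually states the requirement.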

Dependence on the Test Author's Tacit Knowledge

Test quality depends entirely on the test author's knowledge and experience. Experienced developers think of edge cases, but even they cannot test "the cases they didn't think of." When multiple people write tests, their implicit understanding of the specification may differ. Person A understands "empty strings are errors" while Person B understands "empty strings are allowed" — such inconsistencies can quietly coexist within a test suite, undetected.

Risk of Contradictory Tacit Knowledge

If a formal specification exists, such contradictions are detectable at the specification level. If a VDM-SL specification states "name must not be empty," any test contradicting this is clearly identified as a specification violation. But in TDD, no mechanism exists to ensure consistency among the tests themselves.
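As a rough sketch of what detection at the specification level could look like, the rule below is written as a Python predicate standing in for a formal invariant (it is not VDM-SL). The `Expectation` record and the author labels are assumptions made for illustration.

```python
from dataclasses import dataclass


def name_is_valid(name: str) -> bool:
    """Stand-in for a formal invariant: name must not be empty."""
    return len(name) > 0


@dataclass(frozen=True)
class Expectation:
    author: str
    name: str
    expects_valid: bool


# Two authors encode different tacit assumptions about empty strings.
expectations = [
    Expectation("Person A", "", expects_valid=False),
    Expectation("Person B", "", expects_valid=True),
    Expectation("Person B", "Alice", expects_valid=True),
]

# With a written rule, each expectation can be checked against it.
contradicting_spec = [
    e for e in expectations if e.expects_valid != name_is_valid(e.name)
]
```

Without `name_is_valid`, nothing decides which of the two empty-string expectations is wrong. With it, Person B's first expectation is flagged before any implementation exists.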


Generating Tests from Existing Code — Institutionalizing Bugs

TDD's ideal is "write the test first," but in real-world projects, adding tests to existing code after the fact is extremely common: legacy code being refactored, inherited projects, or code that simply started without tests. When writing tests in these situations, what does the developer reference?

The answer is "the current implementation." They read the code, understand its behavior, and write tests that reproduce that behavior. This process has a structural defect.

Codifying Bugs as "Correct"

When code contains bugs, that buggy behavior gets embedded in the tests as the "correct specification." For instance, if tax calculation code rounds up when it should round down, the test author writes the rounded-up result as the "expected value." The test passes, but the business rule is wrong.

    import math

    # Existing code (bug: rounds up instead of down)
    def calc_tax(price):
        return math.ceil(price * 0.1)  # Should be math.floor

    # The test author infers the spec from the code
    def test_calc_tax():
        assert calc_tax(105) == 11  # Test passes, but the correct answer is 10

This test only guarantees that "the code behaves as it currently does" — not that "the code behaves correctly." A green test is not evidence of correctness; it is evidence of status quo preservation.
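The difference between the two sources of expected values can be made concrete. The sketch below assumes the business rule is "10% tax, fractions rounded down," as the example implies; `spec_tax` and `characterization_expected` are hypothetical helper names.

```python
import math


def calc_tax(price: int) -> int:
    """The existing implementation from the example above, bug included."""
    return math.ceil(price * 0.1)


def spec_tax(price: int) -> int:
    """Assumed business rule: 10% tax, fractions rounded down."""
    return price // 10


def characterization_expected(price: int) -> int:
    """Expected value copied from current behavior, as the test author did."""
    return calc_tax(price)


prices = [100, 105, 199]

# Expected values derived from the code: green by construction.
green_by_construction = all(
    calc_tax(p) == characterization_expected(p) for p in prices
)

# Expected values derived from the rule: exposes where the code disagrees.
disagreements = [p for p in prices if calc_tax(p) != spec_tax(p)]
```

The first check can never fail, because its oracle is the code under test. Only the second check, whose oracle is independent of the implementation, carries information about correctness.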

The Illusion of Safe Refactoring

"We can refactor safely because we have tests" is considered a key TDD benefit. But when tests are based on a buggy specification, "tests pass after refactoring" is synonymous with "bugs have been preserved." Tests function as a safety net only when the specification underlying them is correct.


Why TDD Does Not Fit the AI Era

The structural problems described above were accepted as "tolerable limitations" in human-centered development. The humans writing tests possess domain knowledge, maintain implicit specifications in their heads, and catch inconsistencies through test reviews — human judgment mitigated the problems.

But when AI becomes the center of development, these assumptions collapse at their foundation.

Tacit-Knowledge Sharing Is Not Guaranteed Between AIs

In human teams, domain knowledge is implicitly shared. "This is the industry convention for such calculations" or "This client insists on this specification" — such knowledge is transmitted through conversation and review even without documentation. Between agents of different models, vendors, and versions, there is no guarantee that this sharing holds — and the human team's corrective mechanism of implicit alignment through conversation and review does not operate.

Specification Agreement Between Agents Is Impossible

In multi-agent development, multiple AI agents handle different modules. When Agent A handles authentication, Agent B handles payments, and Agent C handles the frontend, the interface specifications between agents must be precisely agreed upon. TDD cannot achieve this agreement. Each agent writes tests based on "its own understanding," and each agent's tests pass individually — but the system fails when integrated. The absence of materialized specification is fatal for inter-agent collaboration.

Tests verify behavior of individual modules; they do not define contracts between modules. "Agent A's output is of this type and satisfies these preconditions." "Agent B accepts this input and guarantees these postconditions." Such contracts should be explicitly defined as formal specifications, not as collections of test cases.
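A minimal sketch of such a contract, written as runtime checks in Python rather than in a formal language, might look like the following. The modules, the `AuthResult` type, and the rules themselves are hypothetical and chosen only to show where preconditions and postconditions sit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthResult:
    user_id: str
    authenticated: bool


def authenticate(user_id: str, password: str) -> AuthResult:
    """Agent A's module (hypothetical)."""
    result = AuthResult(user_id=user_id, authenticated=bool(password))
    # Postcondition: the result refers to the user that was asked about.
    assert result.user_id == user_id
    return result


def charge(auth: AuthResult, amount_cents: int) -> int:
    """Agent B's module (hypothetical). Returns the amount charged."""
    # Preconditions: stated at the boundary, not inferred by each agent.
    if not auth.authenticated:
        raise ValueError("precondition violated: caller must be authenticated")
    if amount_cents <= 0:
        raise ValueError("precondition violated: amount_cents must be positive")
    charged = amount_cents
    # Postcondition: exactly the requested amount is charged.
    assert charged == amount_cents
    return charged
```

Runtime checks like these only catch violations when the code runs. A formal specification states the same conditions in a form that can be analyzed before the modules are integrated.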


Structural Resolution Through Formal Methods

Formal methods resolve TDD's problems at their root. By writing specifications in a formal language such as VDM-SL, the specification becomes a concrete artifact. Test cases are derived from the specification rather than depending on developers' tacit knowledge.

Specification Materialization

Specifications are made explicit in mathematically rigorous notation rather than remaining tacit knowledge. The specification itself becomes a verifiable artifact.

Test Derivation

Test cases are systematically derived from the specification. Test strategies can then be designed against the input space the specification defines, rather than against whatever cases the author happens to think of.

Inter-Agent Contracts

Each agent's interface is formally defined as preconditions, postconditions, and invariants. Contract consistency can be verified before integration.

Prevention of Bug Codification

Because the basis of tests is the formal specification, code bugs do not propagate into tests. Divergence between specification and code is detectable.
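The four points above can be sketched together using the tax example. The code below is a Python stand-in, not VDM-SL. The precondition, the postcondition, and the chosen inputs are assumptions made for illustration, and the inputs are hand-picked around rounding boundaries rather than generated by a tool.

```python
from typing import Callable, List


def pre(price: int) -> bool:
    """Precondition: price is a non-negative integer amount."""
    return price >= 0


def post(price: int, tax: int) -> bool:
    """Postcondition: tax is 10% of price, rounded down."""
    return 10 * tax <= price < 10 * (tax + 1)


def derive_inputs() -> List[int]:
    """Inputs chosen from the spec: the lower bound and rounding boundaries."""
    return [0, 1, 9, 10, 11, 19, 20, 105]


def divergences(impl: Callable[[int], int]) -> List[int]:
    """Inputs satisfying pre for which impl violates post."""
    return [p for p in derive_inputs() if pre(p) and not post(p, impl(p))]


def tax_floor(price: int) -> int:
    return price // 10


def tax_ceil(price: int) -> int:
    return -(-price // 10)  # integer ceiling division
```

Here `divergences(tax_floor)` is empty, while `divergences(tax_ceil)` lists every derived input where rounding up breaks the postcondition. The expected behavior comes from `post`, so a bug in the implementation cannot leak into the oracle.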