April 2026 · 8 min read

Why Does AI-Generated Code Ignore Your Architecture?

AI coding agents don't understand your architecture. They understand patterns from their training data. Unless you make your conventions machine-readable, agents will generate code that compiles and passes tests but violates your patterns, naming conventions, and structural decisions. The fix is a three-layer enforcement stack: convention files that agents read before generating, static analysis that catches violations after generating, and architecture fitness functions that prevent drift over time.

DN · Founder, Astrohive

What actually goes wrong with AI-generated architecture?

The code compiles. The tests pass. And the architecture slowly erodes.

A study testing GPT-5.1, Claude 4.5 Sonnet, and Llama 3 8B on implementing hexagonal architecture microservices (Slater, December 2025) found that proprietary models achieved 0% architectural violation rate, but Llama 3 had an 80% violation rate, bypassing interface adapters and creating illegal circular dependencies. The open-weights model also generated 60% fewer logical lines of code, a pattern the researcher called "implementation laziness."

CodeRabbit's analysis of 470 real-world PRs (December 2025) measured the convention damage: AI-authored code produces 2.66x more formatting issues, 2x more naming inconsistencies, and 3x more readability problems than human code. Even teams with linters and formatters saw elevated noise, though CI-enforced tools eliminated entire categories before review.

SonarSource reports that 42% of all committed code is now AI-generated or AI-assisted (February 2026). That's nearly half your codebase being written by tools that don't know your team's naming convention, your preferred error handling pattern, or the architectural decision you made six months ago about where business logic lives.

Three failure modes compound:

  • Pattern drift: Code works but doesn't follow conventions. A component uses a different state management approach than the rest of the codebase.
  • Convention ignorance: The agent doesn't know project-specific rules. It generates REST endpoints in a codebase that uses GraphQL.
  • Architecture erosion: Small deviations compound. Dependencies form that violate your layering. Business logic leaks into the presentation layer. After six months of AI-assisted development, the architecture diagram no longer describes reality.

Why aren't linters enough?

Linters catch syntax-level violations. They can enforce formatting, naming rules, and import ordering. They can't enforce architectural decisions.

"Should this business logic live in the service layer or the controller?" is not a linting question. "Should this component use the shared design system or create its own styled components?" is not a linting question. "We decided to use the repository pattern for data access, not direct ORM calls" is an architectural decision that no linter enforces.

Microsoft Research's CORE system (Wadhwa et al., FSE 2024) showed what happens when you combine LLMs with static analysis. Their dual-LLM approach (one proposes fixes, another ranks them) revised 59.2% of Python files to pass both automated tools AND human review. The LLM ranker reduced false positives by 25.8%. For Java: 76.8% pass rate, comparable to the specialized repair tool Sorald (78.3%) but with far less engineering.

Semgrep's multimodal approach (March 2026) combines deterministic rule-based analysis with AI reasoning and finds up to 8x more true positives while cutting noise by 50% compared to foundation models alone.
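Rule-based checks can encode some of the architectural decisions above directly. A sketch of a Semgrep rule enforcing the repository-pattern decision (the rule id, paths, and pattern are illustrative, not from any real ruleset):

```yaml
rules:
  - id: no-direct-orm-in-controllers
    # Architectural decision: data access goes through the repository
    # layer, never via direct ORM calls from controllers.
    pattern: db.session.query(...)
    paths:
      include:
        - "app/controllers/"
    message: "Use the repository layer for data access, not direct ORM calls."
    languages: [python]
    severity: ERROR
```

A rule like this is deterministic: it never misses the pattern it names, but it also can't judge novel cases, which is exactly the gap the AI-reasoning layer fills.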

The pattern is clear: neither linting alone nor AI alone catches everything. The hybrid of rules plus reasoning is what works.

How do convention files actually work?

Convention files are documents your AI coding agent reads before generating any code. They contain the rules, patterns, and decisions that make your codebase yours instead of generic training data output.

The industry has converged on a standard. AGENTS.md was donated to the Linux Foundation's Agentic AI Foundation (AAIF) in December 2025. GitHub Copilot supports three tiers of instruction files: repository-wide .github/copilot-instructions.md, path-specific .instructions.md targeting directories via glob patterns, and agent-specific files (AGENTS.md, CLAUDE.md).

Thoughtworks' Technology Radar (Volume 33, November 2025) endorsed AGENTS.md as the convention file standard for AI agents.

But there's a critical nuance. An ETH Zurich study found that LLM-generated context files actually reduced task success in 5 of 8 settings while increasing costs by more than 20%. The lesson: human-curated convention files that document only non-inferable details (custom commands, tooling choices, architectural constraints) are effective. Auto-generated ones that regurgitate obvious patterns are noise.

| Section | Example | Why it matters |
|---|---|---|
| Architecture pattern | "We use hexagonal architecture. Business logic in /core, never imports from /infrastructure" | Prevents the #1 AI architecture violation |
| Naming conventions | "Services: PascalCase + 'Service' suffix. Files: kebab-case. No abbreviations." | Prevents the 2x naming inconsistency problem |
| Error handling | "Use Result types, not try/catch. Errors propagate through return values." | One of the most common AI pattern violations |
| Testing approach | "Integration tests hit real DB (no mocks). Unit tests for pure functions only." | AI defaults to mocking everything |
| Dependencies | "No new npm packages without team review. Use existing utilities in /lib." | Prevents duplication increase in AI code |
| Design system | "All UI uses shared components from /design-system. No inline styles." | Prevents design drift across AI-generated features |
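Concretely, a minimal human-curated AGENTS.md covering those sections might read like this (the project details are invented for illustration):

```markdown
# AGENTS.md

## Architecture
Hexagonal architecture. Business logic lives in /core and must never
import from /infrastructure. Adapters implement the ports defined in /core.

## Naming
Services: PascalCase with a `Service` suffix. Files: kebab-case.
No abbreviations.

## Error handling
Use Result types, not try/catch. Errors propagate through return values.

## Testing
Integration tests hit a real database; no mocks. Unit tests are for
pure functions only.

## Dependencies
No new npm packages without team review. Check /lib for existing
utilities first.
```

Note what's absent: language, framework, and file layout, all of which the agent can infer by reading the code.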

What are architecture fitness functions?

Fitness functions are automated checks that verify your architecture is intact. They're the concept from Neal Ford, Rebecca Parsons, and Patrick Kua's Building Evolutionary Architectures (Thoughtworks/O'Reilly, 2nd edition 2022), adapted for the AI era.

Where convention files prevent bad code from being written, fitness functions prevent bad code from being merged. They run in CI and answer questions like:

  • Does any code in /core import from /infrastructure? (layering violation)
  • Are there any direct database calls outside the repository layer? (pattern violation)
  • Does every public API endpoint have rate limiting? (security convention)
  • Are all new components using the shared design system? (design drift)

Track 3-5 fitness functions maximum, reviewed weekly. More than that and teams start ignoring them.
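The first question above can be written as an ordinary test that runs in CI. A minimal sketch in Python, assuming business logic lives under src/core (the path, regex, and function names are illustrative):

```python
import re
from pathlib import Path

# Matches "import ...infrastructure..." or "from ...infrastructure... import"
# anywhere a core-layer module reaches into the infrastructure layer.
FORBIDDEN = re.compile(r"^\s*(?:from|import)\s+\S*infrastructure", re.MULTILINE)

def layering_violations(core_dir: str) -> list[str]:
    """Return paths of core-layer files that import from infrastructure."""
    return sorted(
        str(p)
        for p in Path(core_dir).rglob("*.py")
        if FORBIDDEN.search(p.read_text())
    )

def test_core_never_imports_infrastructure():
    violations = layering_violations("src/core")
    assert not violations, f"Layering violations: {violations}"
```

Purpose-built tools do this more robustly: import-linter enforces import contracts in Python, and ArchUnit plays the same role for Java. The point is that the rule is executable, not prose.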

How do ADRs help AI agents maintain consistency?

Architecture Decision Records document the "why" behind your conventions. A convention file says "use the repository pattern." An ADR says "we chose the repository pattern because we need to swap databases for different clients without changing business logic (ADR-007, January 2024)."

The "why" matters for AI agents because they face novel situations your convention file doesn't cover. An agent that knows you use repository pattern AND why you chose it can make better decisions when the convention file is silent. Should this new data source go through the repository layer? If the agent knows the reason is database abstraction, it can reason about whether this case requires it.

Research on making ADRs machine-readable (Stuhmer, July 2025) found that traditional ADRs are "digital dust collectors." Adding structured metadata that AI tools can parse closes the gap between architectural intent and actual practice.
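Following that direction, an ADR can carry structured front matter that tools parse alongside the human-readable rationale. A sketch (the metadata fields and values are invented for illustration, not a standard):

```markdown
---
id: ADR-007
status: accepted
date: 2024-01-15
decision: repository-pattern
applies_to: ["src/core/**"]
forbids: ["direct ORM calls outside the repository layer"]
---

# ADR-007: Use the repository pattern for data access

## Context
We need to swap databases for different clients without changing
business logic.

## Decision
All data access goes through repository interfaces defined in /core.
```

The prose sections give an agent the "why"; the front matter gives enforcement tools something they can check mechanically.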

  • 80%: architectural violation rate for Llama 3 on hexagonal architecture, vs 0% for proprietary models (Slater, December 2025)
  • 42%: of committed code is now AI-generated or AI-assisted (SonarSource, February 2026)
  • 1.7x: more issues per PR in AI-authored code, with 2x naming and 3x readability problems (CodeRabbit, December 2025)
  • 59.2%: file-level quality from hybrid LLM + static analysis, with 25.8% fewer false positives (Microsoft CORE, FSE 2024)
  • 8x: more true positives when combining rule-based analysis with AI reasoning, with 50% less noise (Semgrep, March 2026)
  • 3.2%: false positive rate across 137 million code issues (SonarQube, 2025)
  • 20%+: cost increase from auto-generated convention files, with reduced task success in 5 of 8 settings (ETH Zurich, via Augment Code)

What the research says

"Vibe coding has practically disappeared; we now see a concerted and serious effort to think through problems of context, infrastructure and security."

Rachel Laycock, CTO, Thoughtworks Technology Radar, November 2025

"CORE's ranker LLM reduced false positives by 25.8% compared to the proposer alone, demonstrating that the two-LLM architecture is critical."

Wadhwa et al., Microsoft Research, FSE 2024

"Traditional ADRs are digital dust collectors. Making them machine-readable with structured metadata lets AI coding assistants actively enforce architectural decisions in real time."

Martin Stuhmer, Instruction by Design, July 2025

Our take

The convention problem is fundamentally a context problem. AI agents generate generic, training-data-average code because they don't know what makes your codebase different. The fix isn't better models. It's better context.

What we've found works is a three-layer stack. Convention files (AGENTS.md, CLAUDE.md) give the agent your rules before it generates anything. Static analysis plus AI reasoning (the CORE/Semgrep hybrid approach) catches what slipped through. And fitness functions in CI prevent architectural violations from ever reaching main.

The teams that skip the convention file layer and rely only on post-generation linting end up in a constant review cycle: the agent generates, the linter catches 30% of issues, the human reviewer catches another 40%, and 30% of convention violations make it through. Six months later, the architecture has drifted enough that the convention file itself needs updating. The teams that get this right invest 2-3 hours writing a good convention file upfront and save hundreds of hours in review downstream.

Key takeaway

42% of your code is now AI-generated. Without machine-readable conventions, that code will follow training data patterns, not your architecture. Three layers: convention files before generation, hybrid static analysis + AI after generation, and fitness functions in CI to prevent drift. The convention file is 2-3 hours of work that saves hundreds of hours in review.

FAQ

Why doesn't my AI coding tool follow our style guide?

Because it can't read your style guide unless you put it in a format it understands. Convention files (AGENTS.md, CLAUDE.md, .github/copilot-instructions.md) are the mechanism. They need to be human-curated, focused on non-obvious conventions, and kept in the repository root where the tool will find them.

What should go in a convention file?

Only things the AI can't infer from reading your code: architecture patterns and layering rules, naming conventions, error handling approach, testing philosophy (mocks vs real services), dependency management rules, and design system usage. Don't document things the AI can see (language, framework, file structure).

Are fitness functions the same as linting?

No. Linting checks syntax-level rules (formatting, imports, naming). Fitness functions check architectural rules (dependency direction, layer boundaries, pattern adherence). A function in the wrong layer will pass every linter but fail a fitness function.

Can AI generate good convention files automatically?

Research says no. ETH Zurich found that LLM-generated context files reduced task success in 5 of 8 settings while increasing costs 20%+. The effective approach is human-curated files that document only what the AI can't infer from the codebase itself.

How do I handle convention violations in AI-generated code?

Three checkpoints: pre-generation (convention file loaded into agent context), post-generation (hybrid static analysis + AI review catches pattern violations), and CI gate (fitness functions block merges that violate architectural rules). Each layer catches different violation types.
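The second and third checkpoints can live in a single CI workflow. A GitHub Actions sketch (job names, paths, and commands are placeholders for your own tooling):

```yaml
name: architecture-gate
on: [pull_request]

jobs:
  fitness-functions:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Post-generation layer: rule-based static analysis over the diff
      - run: semgrep scan --config .semgrep/ --error
      # CI gate layer: architecture fitness functions block the merge
      - run: pytest tests/architecture/ --maxfail=1
```

The pre-generation checkpoint has no CI equivalent; it depends on the convention file sitting in the repository root where the agent loads it.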

Does the model matter for convention adherence?

Yes. Proprietary models (GPT-5.1, Claude 4.5) achieved 0% architectural violation rate on hexagonal architecture, while open-weights Llama 3 had 80%. But even proprietary models produce 2x more naming inconsistencies than human code without convention files. The model sets the ceiling. The convention file raises the floor.
