Run an LLM-based code reviewer on a code change 10 times and it flags a SQL injection vulnerability 7 times. The other 3 runs come back clean. Same code, same vulnerability, different result.

LLMs are probabilistic, but security requirements are binary. “Usually catches security issues” is a bug, not a feature.

When LLMs do catch something, the finding is often good because they pick up on subtle, contextual patterns that would be hard to write rules for from scratch. The trick is using them differently: not for the analysis itself, but for discovering what to look for, then turning those discoveries into deterministic rules.

Say the LLM notices developers building SQL queries by concatenating user input. Encode that as a rule: glob patterns to select the relevant files, regex or AST analysis to match string concatenation in database query contexts. Now the pattern gets caught 10 out of 10 times.
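A minimal sketch of what codifying that pattern could look like, using Python's standard `ast` module. The function name `find_sql_concat` and the set of query entry points are illustrative assumptions, not from any particular tool:

```python
# Sketch: a deterministic AST rule for "string concatenation in a query context".
# `find_sql_concat` and QUERY_CALLS are illustrative names, not a real tool's API.
import ast

QUERY_CALLS = {"execute", "executemany"}  # assumed DB-API entry points

def find_sql_concat(source: str) -> list[int]:
    """Return line numbers where a query call receives a concatenated string."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in QUERY_CALLS):
            for arg in node.args:
                # Flag `"..." + x` concatenation or an f-string in the query arg;
                # parameterized queries pass plain constants and are ignored.
                if isinstance(arg, (ast.BinOp, ast.JoinedStr)):
                    findings.append(node.lineno)
    return findings

snippet = '''
cursor.execute("SELECT * FROM users WHERE name = '" + name + "'")
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
'''
print(find_sql_concat(snippet))  # → [2]: only the concatenated query is flagged
```

Unlike the LLM pass, this check returns the same answer on every run, which is the whole point of the conversion.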

You keep the LLM running to find new things, but once you understand a pattern well enough, you codify it and stop relying on the LLM for that one.

There’s a cost angle too. LLM analysis is slow and expensive, while rule execution is fast and cheap, so converting common patterns to rules means scanning massive codebases in seconds instead of hours.

The architecture that falls out of this is layered: an LLM discovery layer hunting for new issues, a rule execution layer catching known patterns every time, and a feedback loop turning discoveries into rules. Building it takes real work because LLMs pick up on context that’s hard to encode, and you need infrastructure for rule lifecycles. But developers get consistent feedback, and security teams get results they can trust.
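The shape of that feedback loop can be sketched in a few lines. Everything here is a simplified stand-in: the `Rule`/`Analyzer` names and the regex-based matcher are assumptions for illustration, and a real system would use AST matchers and a human confirmation step before promoting a discovery:

```python
# Sketch of the layered architecture: a deterministic rule layer plus a
# promotion step that turns confirmed LLM discoveries into rules.
# Rule, Analyzer, and the regex pattern are illustrative, not a real tool's API.
import re
from dataclasses import dataclass, field

@dataclass
class Rule:
    name: str
    pattern: re.Pattern  # regex for brevity; AST matchers slot in the same way

@dataclass
class Analyzer:
    rules: list[Rule] = field(default_factory=list)

    def run_rules(self, code: str) -> list[str]:
        # Rule execution layer: same input, same findings, every run.
        return [r.name for r in self.rules if r.pattern.search(code)]

    def promote(self, name: str, pattern: str) -> None:
        # Feedback loop: once a discovery is understood, codify it
        # and stop relying on the LLM for that pattern.
        self.rules.append(Rule(name, re.compile(pattern)))

analyzer = Analyzer()
# Suppose the LLM discovery layer surfaced concatenated SQL and a human
# confirmed it; the pattern is promoted into the deterministic layer:
analyzer.promote("sql-concat", r'execute\([^)]*"\s*\+')
print(analyzer.run_rules('cursor.execute("SELECT ... " + user_input)'))
# → ['sql-concat']
```

The "real work" in the paragraph above lives mostly in the promotion step: deciding when a pattern is understood well enough to codify, and retiring rules when they go stale.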

This is where code analysis ends up. Not GPT-N finding more bugs on its own, but systems that use LLMs to figure out what the bugs look like, then never miss them again.