Apr 7, 2026 · 12 minutes
Static analysis is not enough
In December I wrote about using LLMs as a discovery layer for security bugs. The idea was simple: LLMs find the patterns, you codify them into deterministic rules, and the rules run every time. Discovery is probabilistic, execution isn’t.
Nicholas Carlini gave a talk at [un]prompted 2026 showing what LLMs actually find when you point them at real codebases. And then Anthropic published their Mythos Preview results. The discovery layer works better than I expected. The “turn it into rules” part is harder than I thought.
What Carlini found
Carlini is on Anthropic’s Frontier Red Team. His setup was minimal: Claude in a VM, told it’s playing a CTF, asked to find the worst bug it can. No fancy scaffolding.
Ghost CMS (CVE-2026-26980, CVSS 9.4). Ghost has 50K+ GitHub stars and had never had a critical-severity vulnerability in its history. Claude found an unauthenticated blind SQL injection in the Content API’s slug filter ordering. The model wrote the exploit, extracted the production database credentials and the admin API key. Full account takeover, no login required. Found in about 90 minutes.
Linux kernel NFS (CVE-2026-31402). A heap buffer overflow in the NFSv4.0 LOCK replay cache. The server allocates a 112-byte buffer for cached responses (NFSD4_REPLAY_ISIZE). But when a LOCK request is denied, the denial response includes the conflicting lock’s owner ID, which the NFS protocol allows to be up to 1024 bytes (NFS4_OPAQUE_LIMIT). The response overflows the buffer by up to 944 bytes. Triggering it requires two cooperating NFS clients: one sets a lock with a large owner ID, the other requests a conflicting lock. The bug had been sitting in the kernel since March 2003, predating git itself.
Why static analysis missed these
Coverity has scanned the Linux kernel since around 2006. The kernel also gets regular passes from smatch, sparse, Coccinelle, and CodeQL. Ghost is a Node.js project with access to Semgrep, Snyk Code, SonarQube, and everything else in the JavaScript security tooling ecosystem. OpenBSD runs its own static analysis and has one of the most security-conscious development cultures in open source. None of these bugs were caught.
The Ghost bug
The taint path looks like this: user input arrives as a filter query parameter on the Content API. Ghost parses it through NQL, its custom filter query language. Slug values get extracted by regex in a helper function called slugFilterOrder(). That function builds a raw SQL CASE statement by string concatenation:
// The vulnerable code
let order = 'CASE ';
orderSlugs.forEach((slug, index) => {
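// slug was regex-extracted from the user-controlled filter parameter; nothing escapes it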
order += `WHEN \`${table}\`.\`slug\` = '${slug}' THEN ${index} `;
});
order += 'END ASC';
This string gets returned to a different file, which passes it to Knex’s orderByRaw(). Knex parameterizes queries automatically, but orderByRaw takes a pre-built string. The SQL was already assembled before Knex ever saw it.
SAST tools model standard sources like req.query and standard sinks like knex.raw(). But this path goes: HTTP parameter -> custom NQL parser -> regex extraction -> helper function return value -> orderByRaw in a different file. That’s four or five abstraction layers. Semgrep’s Knex SQL injection rule looks for tainted data flowing directly into Knex raw methods. It doesn’t trace through a helper function that returns a string that later becomes raw SQL. CodeQL could theoretically catch it with a custom query modeling Ghost’s NQL parser as a taint source, but someone would need to write that query with Ghost-specific knowledge first.
Real-world SAST detection rates for SQL injection sit between 11% and 27%. Combining multiple tools gets you to around 39%. The hard ones slip through.
The kernel bug
This one is worse from a static analysis perspective. No single line of code is wrong. The 112-byte buffer is allocated correctly. The read_bytes_from_xdr_buf() call copies len bytes correctly. The len variable is the actual encoded response size, computed correctly. The problem is that nobody checked whether len could exceed 112.
The two constants (NFSD4_REPLAY_ISIZE at 112 bytes and NFS4_OPAQUE_LIMIT at 1024 bytes) live in different files. They’re connected through NFS protocol semantics, not through dataflow. The buffer was sized for OPEN responses back in 2003. LOCK denial responses can be larger because they include the conflicting owner ID. This is a design mismatch between two independently correct subsystems.
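Here’s the shape of the bug as a sketch. The two constants are the kernel’s real ones; the struct and function are simplified stand-ins I made up, not the actual nfsd code:

// Simplified stand-in for the replay cache, not the actual nfsd code
#include <string.h>

#define NFSD4_REPLAY_ISIZE 112  /* replay buffer, sized for OPEN responses in 2003 */
#define NFS4_OPAQUE_LIMIT 1024  /* max owner ID length the protocol allows */

struct replay {
    unsigned int len;
    char buf[NFSD4_REPLAY_ISIZE];
};

/* len is the encoded response size, computed correctly elsewhere. A LOCK
   denial embeds the conflicting lock owner's ID, so len can exceed the
   buffer by up to 944 bytes. Every line here is locally correct; the
   absent length check is the entire bug. */
void cache_reply(struct replay *rp, const char *resp, unsigned int len)
{
    /* if (len > NFSD4_REPLAY_ISIZE) return;  <- the check nobody wrote */
    memcpy(rp->buf, resp, len);  /* overflows when a LOCK denial arrives */
    rp->len = len;
}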
You can’t write a Coverity checker for “these two #define constants in different subsystems should be compatible given the protocol spec.” That’s a question about intent: what is this code supposed to do versus what does it actually do?
The OpenBSD SACK bug
Mythos found this one: CVE-2026-24882, a remote denial-of-service in OpenBSD’s TCP SACK implementation. It had been there since 1998, when OpenBSD added SACK support. 27 years.
Some background on SACK. When you send data over TCP, the receiver normally acknowledges bytes in order: “I got everything up to byte 5000.” If packets arrive out of order (say bytes 5000-6000 are lost but 6000-8000 arrive fine), the receiver can only say “I got up to 5000” and the sender has to resend everything from 5000 onward. SACK (Selective Acknowledgment) fixes this. The receiver says “I got up to 5000, and I also have 6000-8000.” Now the sender only resends the missing chunk. The sender tracks the gaps (“holes”) in a linked list: which byte ranges haven’t been acknowledged yet.
Three things go wrong together.
First, when the sender receives a SACK block, it validates that the end of the acknowledged range falls within the send window (the range of bytes the sender has actually sent). But it never checks the start. On its own, this is harmless. A SACK block claiming to start before the send window just describes a range that overlaps with already-acknowledged data. The code handles it fine.
Second, there’s a linked list corruption that can happen when processing a SACK block. The sender walks its list of holes (unacknowledged ranges) and, depending on where the SACK block falls, either shrinks a hole, splits it, or deletes it entirely. The bug: if a single SACK block simultaneously causes the only hole in the list to be deleted and triggers the code path that tries to append a new hole, the walk frees the only node and there’s nothing left to link onto. The code dereferences a NULL pointer. Kernel panic, machine crashes. But under normal conditions, this can’t happen. A legitimate SACK block can’t simultaneously delete all holes and create new ones. The two operations are mutually exclusive for any valid byte range.
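To make “mutually exclusive” concrete, here’s a deliberately stripped-down model. None of this is OpenBSD’s actual tcp_sack_option() code: struct hole, walk(), and the wants_append flag (standing in for the append condition that no valid block can satisfy alongside the delete) are all made up for illustration:

// Toy model of the hole walk, not OpenBSD's code
#include <stdlib.h>

struct hole { unsigned int start, end; struct hole *next; };

static void walk(struct hole **head, unsigned int sb_start, unsigned int sb_end,
                 int wants_append)
{
    struct hole *cur = *head;

    /* The SACK block swallows the only hole entirely: delete it. */
    if (cur != NULL && sb_start <= cur->start && sb_end >= cur->end) {
        *head = cur->next;  /* the list is now empty */
        free(cur);
    }

    /* The same block also claims space past the last hole: append after
       the tail. For geometrically valid coordinates, this branch and the
       delete above can't both run. The wrapped comparison lets a crafted
       block take both. */
    if (wants_append) {
        struct hole *tail = *head;
        while (tail->next != NULL)  /* tail is NULL here: dereference, panic */
            tail = tail->next;
        /* ... allocate and link the new hole after tail ... */
    }
}

int main(void)
{
    struct hole *h = malloc(sizeof *h);
    h->start = 5000; h->end = 6000; h->next = NULL;
    walk(&h, 4000, 7000, 1);  /* crashes in the append branch */
    return 0;
}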
The third piece is what makes it exploitable. TCP sequence numbers are 32-bit unsigned integers that wrap around (after 4,294,967,295 comes 0). To compare them, the standard trick is (int)(a - b) < 0, which handles wraparound correctly as long as a and b are within 2^31 of each other. But “as long as” is doing a lot of work. An attacker crafts a SACK block with a start position roughly 2^31 bytes away from the legitimate sequence space. The unsigned subtraction wraps, the cast to signed int flips the sign, and the comparison gives the wrong answer. The code thinks the SACK block starts inside the send window when it actually starts far outside it. Now the first bug matters: since the start was never validated independently, this bogus SACK block enters the hole-walking logic with coordinates that make no geometric sense. It hits the exact condition where the only hole gets deleted and a new append is attempted simultaneously. NULL dereference. Crash.
An attacker can do this remotely. Send a few crafted TCP packets to any OpenBSD machine that responds over TCP, and the kernel panics.
So what about static analysis? It actually has a real shot at part of this. Casting an out-of-range unsigned difference to a signed int is implementation-defined behavior in C (the subtraction itself wraps legally; it’s the conversion that can go wrong), and analyzers like Coverity can flag the (int)(unsigned - unsigned) < 0 idiom as a suspicious conversion. The problem is that exactly this idiom for TCP sequence number comparison is ubiquitous across every network stack. Every OS does it. Linux does it. FreeBSD does it. It’s in RFC 1323’s reference implementation. Flagging it produces a wall of false positives against code that has worked fine for decades. So analyzers either skip the pattern entirely or nobody reads past the first hundred results.
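Here’s the idiom in isolation. SEQ_LT is the classic BSD macro; snd_una and the test values are made up for the demo, and the flip relies on the two’s-complement conversion every real machine performs:

// The classic BSD sequence-space comparison, shown in isolation
#include <stdio.h>
#include <stdint.h>

#define SEQ_LT(a, b) ((int32_t)((a) - (b)) < 0)

int main(void)
{
    uint32_t snd_una = 1000;  /* left edge of the send window (made-up value) */

    printf("%d\n", SEQ_LT(snd_una - 10, snd_una));  /* 1: behind the window, correct */
    printf("%d\n", SEQ_LT(snd_una + 10, snd_una));  /* 0: ahead of it, correct */

    /* A start roughly 2^31 away: far ahead of the window, but the signed
       cast reports it as behind. This is the wrong answer the SACK
       validation inherits. */
    printf("%d\n", SEQ_LT(snd_una + 0x80000001u, snd_una));  /* 1: wrong */
    return 0;
}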
The linked list corruption and the missing bounds check on the range start are harder. You’d need to reason about what combinations of SACK blocks can reach which code paths, and what the linked list looks like after each operation. A static analyzer would need to follow the whole chain: this comparison can return a wrong answer for extreme inputs, which allows this range to enter the walk with impossible coordinates, which triggers this specific delete-and-append condition, which dereferences NULL. Dataflow analysis can’t follow that. The chain involves protocol semantics, integer representation, and data structure invariants interacting across multiple functions.
The discovery-to-rule gap
My December post proposed a clean pipeline: LLM discovers a pattern, you encode it as a deterministic rule, the rule catches it every time going forward.
For the Ghost bug, you could write a rule after the fact. It would need to model Ghost’s NQL parser as a taint source and all Knex raw methods (including orderByRaw) as sinks. That’s a custom rule per framework, per ORM, per query builder. It’s doable but doesn’t scale. And it only catches the exact shape of bug the LLM already found.
For the kernel bug, there’s no reasonable rule to write. “Check that all buffer sizes are compatible with all possible payload sizes across all protocol state machine paths” is not a rule. It’s a research project. The vulnerability class (design mismatches between subsystems connected only through protocol semantics) can’t really be turned into a rule.
The OpenBSD bug is somewhere in between. You could write a rule for the unsafe cast in the sequence number comparison: flag (int)(a - b) < 0 where a and b are unsigned. But that pattern is everywhere in network code, and the rule would fire on thousands of correct callsites. The linked list corruption and the missing bounds check can’t be captured without modeling the protocol state machine. You’d be writing a partial rule that catches the least interesting part of the bug.
Some discoveries convert to rules cleanly, others don’t, and the harder the bug, the less likely it converts.
The capability curve
Then there’s Mythos Preview. Beyond the OpenBSD SACK bug, it wrote FreeBSD remote root exploits using ROP chains split across six packets. It chained four browser vulnerabilities using JIT heap sprays. Where Opus 4.6 produced 2 working Firefox exploits from a set of known vulnerabilities, Mythos produced 181 from the same set.
Carlini noted in his talk that models from six months ago couldn’t find these bugs at all. Capability roughly doubles every four months, and the bugs these models find keep getting harder to turn into static rules.
The Anthropic Frontier Red Team paper documented 500+ validated high-severity vulnerabilities found by Claude in production open-source software. The IRIS paper (ICLR 2025) showed that combining LLM reasoning with CodeQL’s dataflow analysis found 103.7% more vulnerabilities than CodeQL alone, including 4 previously unknown bugs that no existing tool could detect. The hybrid approach works, but the LLM is the one finding the hard cases.
What about cheap models?
If LLMs have to stay in the loop, maybe we can use cheap ones. Three tiers: frontier for discovery, small models for scoped analysis, rules for patterns we already understand. If that works, costs come back down.
AISLE ran that experiment and tested 25+ models against the bugs Mythos found. A $0.11-per-million-tokens model (GPT-OSS-20b, 3.6B active params) found the FreeBSD NFS overflow. GPT-OSS-120b got the SACK bug. Their conclusion: “the moat is the system, not the model.”
But AISLE handed models the vulnerable function directly, with hints like “consider wraparound behavior.” That’s not discovery, that’s confirmation. And their own follow-up data showed most small models also flagged the patched FreeBSD code as vulnerable. A model that can’t tell patched from unpatched code isn’t understanding the bug. It’s matching a pattern.
antirez went further. His argument: bug-finding is not proof of work. With proof of work, more compute eventually finds a hash collision. With bugs, more samples from a weaker model don’t substitute for deeper reasoning. You hit an intelligence ceiling, and the ceiling matters more than the token budget. He tested small models on the SACK bug himself. They couldn’t reason about the three-way interaction. They hallucinated patterns or missed the chain.
He goes another step: moderately strong models actually perform worse than weak ones. Weak models hallucinate freely and sometimes stumble onto something that looks right. Medium models hallucinate less but still can’t follow multi-step chains. They miss what the weak models luck into, and fail at what only the frontier can do.
If he’s right, the middle tier doesn’t work. The cheap model either hallucinates (false positive rate kills you) or misses (same problem as static analysis). The hard bugs need the frontier model.
So now what
The layered model from my December post still holds for a class of bugs. LLMs discover SQL injection through string concatenation, you write an AST rule, done. That pipeline works and should be built.
But there’s a growing set of vulnerabilities (cross-component design mismatches, protocol-level semantic bugs, multi-step state machine errors) where the LLM isn’t only doing discovery. The analysis lives in the model too, because there’s no downstream rule to delegate it to. No rule captures “understand what this code is supposed to do according to the protocol spec and check whether the implementation matches.” That requires reasoning, not pattern matching.
So LLMs probably need to stay in the loop as a permanent analysis layer, not just a temporary discovery phase that feeds into rules. The cost and consistency problems I raised in December still matter. Running an LLM on every commit is expensive, and the same model flags a bug 7 out of 10 times. But for bugs Coverity missed for 20 years, a flaky model that catches them 7 times out of 10 beats a deterministic checker that catches them zero times out of 10.
Carlini said LLMs are the most significant event in security since the internet. I don’t think he’s wrong. The question is how to build systems that use them reliably, knowing that “convert to deterministic rules” only covers part of the problem.
Sources
- CVE-2026-26980: Ghost CMS SQL injection (GitHub Advisory)
- CVE-2026-31402: Linux kernel NFS heap overflow (Tenable)
- CVE-2026-24882: OpenBSD TCP SACK kernel panic (OpenBSD Errata)
- Ghost PR #26419: The fix
- Nicholas Carlini - Black-hat LLMs | [un]prompted 2026
- Evaluating and mitigating the growing risk of LLM-discovered 0-days (Anthropic Frontier Red Team)
- Claude Mythos Preview’s Cybersecurity Capabilities (Anthropic)
- AI Finds Vulns You Can’t (Security Cryptography Whatever podcast)
- LLMs vs Static Code Analysis: A Systematic Benchmark (arXiv)
- IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities (ICLR 2025)
- Testing Static Analysis Tools Using Exploitable Buffer Overflows (MIT Lincoln Laboratory)
- Hardening Firefox with Anthropic’s Red Team (Mozilla)
- AI cybersecurity after Mythos: the jagged frontier (AISLE)
- AI cybersecurity is not proof of work (antirez)