
A security researcher published six vulnerabilities in llama.cpp's model-file parser to the oss-security mailing list on May 15, 2026 — and none of them carry an assigned CVE number, meaning standard scanner-driven patch workflows will not catch them. The most severe flaw, catalogued V-01, allows a maliciously crafted GGUF file to trigger an integer overflow inside the GGML_PAD macro on 32-bit systems, producing an arbitrary file seek followed by an out-of-bounds memory read before inference ever begins. Anyone who downloads AI models from public repositories — including Hugging Face — and loads them into Ollama, LM Studio, or any other llama.cpp-backed tool is in the attack window right now.
What Is GGUF, and Why Does Its Parser Matter?
GGUF (short for GPT-Generated Unified Format) is the binary serialization format used to package and distribute quantized large language model weights for local inference. It replaced the older GGML format and is now the dominant distribution format for consumer-grade AI models. Because llama.cpp is the core inference backend for Ollama, LM Studio, and dozens of other tools, every security flaw in its parser is inherited by every application that calls it. A vulnerability at the parsing layer fires the moment a file is loaded: before the model runs, before the user interacts with it, and before any application-level safeguard has a chance to intervene.
The GGUF format is broad and expressive, carrying not just model weights but metadata fields, tensor descriptions, tokenizer data, and embedded chat templates. That richness is exactly what makes its parser an attractive attack surface: each field parsed from attacker-controlled file content is a potential injection point.
Six Vulnerabilities, Ranked by Severity
The May 15 advisory enumerates six distinct issues, labeled V-01 through V-06, spanning severity ratings from critical to medium.
V-01 is the most dangerous. The general.alignment field in a GGUF file is validated only for being a power of two and non-zero — but there is no upper-bound check. Setting this field to 0x80000000, or any value at or above 2^16, causes an integer overflow in the GGML_PAD macro on 32-bit systems. The overflow result is then passed directly to a gguf_fseek() call at line 703 of gguf.cpp, enabling an arbitrary file seek followed by an out-of-bounds read. The same logic flaw exists in the Python reference implementation, gguf_reader.py. The advisory classifies this as a CVE candidate and recommends an immediate fix: reject any alignment value where alignment < 4 || alignment > 1048576.
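The wraparound behind V-01 can be reproduced in a few lines. The sketch below is written in Python and masks every intermediate result to 32 bits to imitate a 32-bit size_t; that masking, the function names, and the constant names are illustrative assumptions, not code from llama.cpp. The padding expression follows the usual form of GGML_PAD(x, n), which is ((x + n - 1) & ~(n - 1)), and the range check mirrors the bounds the advisory recommends.

```python
MASK32 = 0xFFFFFFFF  # imitates arithmetic on a 32-bit size_t (assumption)

# Bounds recommended by the advisory for general.alignment.
MIN_ALIGNMENT = 4
MAX_ALIGNMENT = 1048576


def ggml_pad_32(x: int, n: int) -> int:
    """Round x up to a multiple of n, wrapping at 32 bits as C would."""
    total = (x + n - 1) & MASK32  # the addition can wrap past 2**32
    return total & (~(n - 1) & MASK32)


def is_acceptable_alignment(alignment: int) -> bool:
    """Existing check (non-zero power of two) plus the advisory's range check."""
    is_power_of_two = alignment != 0 and (alignment & (alignment - 1)) == 0
    return is_power_of_two and MIN_ALIGNMENT <= alignment <= MAX_ALIGNMENT


if __name__ == "__main__":
    # A sane alignment rounds the offset up, as intended.
    print(hex(ggml_pad_32(100, 32)))
    # With a huge alignment the sum wraps, so the "padded" offset lands
    # before the original offset instead of after it.
    print(hex(ggml_pad_32(0x80000001, 0x80000000)))
    print(is_acceptable_alignment(0x80000000))
```

The point of the second call is that the padded result is smaller than the input offset, which is the condition that turns a routine alignment step into an attacker-chosen seek target.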
V-02 enables memory exhaustion. Two preprocessor constants — GGUF_MAX_STRING_LENGTH and GGUF_MAX_ARRAY_ELEMENTS — are each set to one gigabyte. A crafted file that declares a 1 GB string inside a 1 GB array can attempt allocations on a scale that crashes 32-bit systems with std::bad_alloc. The fix is to reduce GGUF_MAX_STRING_LENGTH to 64 MB.
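One way to blunt V-02 is to check a declared length against both a cap and the bytes actually remaining before allocating anything. The following is a minimal sketch, assuming the GGUF convention of a little-endian 64-bit length prefix followed by the string bytes; read_gguf_string and MAX_STRING_LENGTH are hypothetical names, and the 64 MB cap is the value the advisory suggests.

```python
import struct

MAX_STRING_LENGTH = 64 * 1024 * 1024  # 64 MB cap suggested by the advisory


def read_gguf_string(buf: bytes, offset: int) -> tuple[bytes, int]:
    """Read a length-prefixed string; return (data, offset after the string)."""
    if offset < 0 or offset + 8 > len(buf):
        raise ValueError("truncated string length")
    (length,) = struct.unpack_from("<Q", buf, offset)
    start = offset + 8
    # Reject the declared length before any allocation or slice happens.
    if length > MAX_STRING_LENGTH:
        raise ValueError(f"declared string length {length} exceeds cap")
    if length > len(buf) - start:
        raise ValueError("declared string length exceeds remaining data")
    return buf[start:start + length], start + length
```

Checking against the remaining data as well as the cap means a small file cannot claim a large payload, whatever the cap is set to.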
V-03 hits Python tooling specifically. The C++ parser rejects tensors with more than four dimensions; the Python equivalent in gguf_reader.py applies no such check. Specifying n_dims = 0xFFFFFFFF triggers an approximately 32 GB memory-map attempt. This flaw affects every Python-based GGUF conversion and inspection tool in the ecosystem.
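The 32 GB figure follows from simple arithmetic, assuming each tensor dimension is stored as a 64-bit integer, as in the GGUF layout: 0xFFFFFFFF dimensions at 8 bytes each is roughly 32 GiB. The sketch below shows that calculation and the kind of bounds check the advisory asks for. read_n_dims, dims_bytes, and MAX_DIMS are illustrative names, not identifiers from gguf_reader.py.

```python
import struct

MAX_DIMS = 4  # the C++ parser's limit, per the advisory


def dims_bytes(n_dims: int) -> int:
    """Bytes needed to hold n_dims dimensions of 8 bytes each (assumed uint64)."""
    return n_dims * 8


def read_n_dims(buf: bytes, offset: int) -> int:
    """Read the little-endian uint32 dimension count and bounds-check it."""
    (n_dims,) = struct.unpack_from("<I", buf, offset)
    if n_dims > MAX_DIMS:
        raise ValueError(f"n_dims {n_dims} exceeds maximum of {MAX_DIMS}")
    return n_dims


if __name__ == "__main__":
    # Size in GiB of the read that an unchecked n_dims would request.
    print(dims_bytes(0xFFFFFFFF) / 2**30)
```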
V-04 through V-06 are medium severity. V-04 involves an implicit signed-to-unsigned type conversion that can allow oversized inputs to bypass validation on platforms where int64_t and size_t differ in size. V-05 casts a raw 32-bit integer from the file directly to a gguf_type enum without any bounds check, which can cause gguf_type_size() to return zero and set up a division-by-zero condition. V-06 realizes that condition: if ggml_blck_size() returns zero for a tensor's quantization type, the expression ne[0] / blck_size at lines 662–668 of gguf.cpp divides by zero, crashing the process.
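V-05 and V-06 share one remedy: validate the raw type value before using it, and never divide by a block size that has not been checked. The sketch below illustrates the idea with a deliberately tiny lookup table; the table contents, BLOCK_SIZES, and block_count are hypothetical stand-ins for the real type tables in ggml, not a copy of them.

```python
# Hypothetical subset of a type table: raw type id -> elements per block.
# A real parser would consult ggml's own tables instead.
BLOCK_SIZES = {0: 1, 1: 1, 2: 32}


def block_count(ne0: int, raw_type: int) -> int:
    """Return the number of blocks in a row of ne0 elements, or raise."""
    # V-05: reject unknown values instead of casting them to an enum blindly.
    if raw_type not in BLOCK_SIZES:
        raise ValueError(f"unknown tensor type {raw_type}")
    blck_size = BLOCK_SIZES[raw_type]
    # V-06: guard the division explicitly, even for a known type.
    if blck_size == 0:
        raise ValueError(f"tensor type {raw_type} has zero block size")
    if ne0 % blck_size != 0:
        raise ValueError("row length is not a multiple of the block size")
    return ne0 // blck_size
```

Keeping the zero check separate from the type check is intentional: it still protects the division if the table itself ever gains an entry with a zero block size.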
How a Malicious GGUF File Exploit Reaches Your Machine
The attack path requires no network exploit. A researcher, developer, or enthusiast downloads a GGUF model from a public repository such as Hugging Face — a routine step in any local-AI workflow — and loads it into their inference stack. Parsing begins immediately. The malicious payload fires before the first token is generated.
This AI supply chain attack vector has been independently documented by multiple security firms. In July 2025, Pillar Security disclosed a technique it called "Poisoned GGUF Templates," in which attackers embed malicious Jinja2 instructions directly into GGUF chat template metadata, compromising every interaction a user has with the model while leaving no visible trace in logs. Pillar's disclosure noted that over 1.5 million GGUF files are distributed on public platforms, and that LM Studio determined at the time that users are responsible for reviewing and downloading trusted models. SGLang, a separate inference framework, received a CVSS 9.8 advisory in April 2026 for a remote code execution flaw triggered by loading malicious GGUF files through an unsandboxed Jinja2 rendering path.
The pattern Databricks documented in 2024 has not changed: the GGUF format, like image and archive formats before it, generates a long tail of parser vulnerabilities because it is expressive, binary, and processed before any trust decision is made.
Ollama Security Flaw Context: Bleeding Llama Is Separate but Related
The six new flaws are distinct from Bleeding Llama — CVE-2026-7482, scored 9.1 — which Cyera researcher Dor Attias discovered in Ollama's Go-language GGUF model loader and disclosed publicly in early May 2026. Bleeding Llama exploits Ollama's use of Go's unsafe package in the quantization pipeline: an unauthenticated attacker who can reach the Ollama HTTP API sends a crafted GGUF file with inflated tensor dimensions to the /api/create endpoint, causing the application to read beyond its allocated heap buffer and leak process memory — including environment variables, API keys, system prompts, and concurrent users' conversation data. That leaked data is then exfiltrated via the /api/push endpoint in three unauthenticated API calls, leaving no error in the logs. The fix is in Ollama 0.17.1; any version before that should be treated as compromised if the instance was internet-accessible.
The May 15 advisory covers the C++ parser in gguf.cpp and the Python gguf_reader.py — a different code path in the same ecosystem. Operators who patched Bleeding Llama have not addressed V-01 through V-06.
What Does This Mean for Local Model Users?
For the estimated hundreds of thousands of Ollama deployments and the larger population of LM Studio and direct llama.cpp users, downloading and loading a model file is now an explicit step in the local-AI threat model, not an implicit one. A malicious upload to Hugging Face, or to any public model repository without automated integrity checking, can reach a developer's laptop or an organization's inference server before a CVE alert is issued, before a scanner detects it, and before a patch is available.
The May 15 advisory is a case in point: the six vulnerabilities were discovered and published on the same day, with no coordinated disclosure period and no CVE numbers. Organizations relying on National Vulnerability Database-driven scanners had nothing to alert on. A similar gap occurred with Bleeding Llama: the fix shipped in Ollama 0.17.1 on February 24, but a CVE was not assigned until April 28, leaving a window of just over two months in which automated detection failed entirely.
How to Protect Your llama.cpp Installation Right Now
Until the llama.cpp maintainers release a patched build addressing V-01 through V-06, the advisory recommends the following mitigations.
For V-01, apply an upper-bound check rejecting any general.alignment value less than 4 or greater than 1,048,576. For V-02, patch the preprocessor constants so that maximum string and array lengths are capped at 64 MB. For V-03, add a bounds check to gguf_reader.py that raises a ValueError if n_dims exceeds 4. For V-04, use consistent unsigned types or check against PTRDIFF_MAX. For V-05, validate the raw integer before casting it to the gguf_type enum. For V-06, add an explicit zero-check in tensor parsing before any division operation.
If you cannot apply source patches immediately, the highest-impact interim measure is to restrict model loading to sources you have verified through a hash or cryptographic signature — and to stop loading GGUF files from uncurated uploads on public model-sharing platforms. Operators running Ollama, LM Studio, or llama.cpp-backed inference servers exposed to a local network or the internet should treat this with particular urgency.
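Hash verification can be scripted so it happens before a file ever reaches the parser. The following is a minimal sketch using Python's standard hashlib module; the function names are assumptions for illustration, and the expected digest should come from a source you trust independently of the download itself.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Stream the file in chunks so large models do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: str, expected_sha256: str) -> bool:
    """Return True only if the file's SHA-256 matches the expected hex digest."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.strip().lower())
```

A loader wrapper would call verify_model first and refuse to open the file on a mismatch. A matching hash only shows the file is the one the publisher listed; it says nothing about whether that publisher is trustworthy.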
TechTimes has reached out to the llama.cpp maintainers and Ollama for comment and will update this article as patches and CVE assignments become available.
Frequently Asked Questions
What is the llama.cpp GGUF parser vulnerability disclosed in May 2026?
A researcher published six security flaws in llama.cpp's GGUF file-format parser to the oss-security mailing list on May 15, 2026. The most critical flaw — V-01 — allows a specially crafted GGUF model file to trigger an integer overflow that enables an arbitrary file seek and out-of-bounds memory read, affecting every version of llama.cpp that has used the GGUF format since version 3. No CVE numbers have been formally assigned to any of the six flaws.
How do I protect my local AI setup from malicious GGUF files?
Until the llama.cpp maintainers release a patched build, the most effective interim measure is to load GGUF model files only from sources you have verified through a cryptographic hash or a trusted, curated registry. Avoid loading files from uncurated public uploads on model-sharing platforms. Organizations running Ollama should also confirm they are running version 0.17.1 or later to address the separate Bleeding Llama vulnerability.
Is Ollama affected by these new llama.cpp flaws?
The six new flaws disclosed May 15, 2026 target the C++ and Python GGUF parsers inside llama.cpp itself — a different code path from the Bleeding Llama flaw (CVE-2026-7482) that was separately fixed in Ollama 0.17.1. Because Ollama uses llama.cpp as its inference backend, it inherits parser-level exposure to the new flaws. Ollama has not yet issued a statement on V-01 through V-06 as of this writing.
What is the Bleeding Llama vulnerability?
Bleeding Llama (CVE-2026-7482, CVSS 9.1) is a heap out-of-bounds read in Ollama's GGUF model loader discovered by Cyera researcher Dor Attias. An unauthenticated attacker who can reach the Ollama API can send a crafted GGUF file to leak process memory — including environment variables, API keys, and user conversation data — in three API calls. The fix is Ollama version 0.17.1.
ⓒ 2026 TECHTIMES.com All rights reserved. Do not reproduce without permission.




