MiniMax M3 Open-Weight Coding Model: Frontier Claims, Unverified Benchmarks

MiniMax’s M3 posts 59% on SWE-Bench Pro but all scores are vendor-run; model weights have not shipped.

A mobile phone's screen shows the logo of Chinese AI company MiniMax in Beijing on January 21, 2026. WANG Zhao/Getty Images

Shanghai-based MiniMax launched its M3 foundation model on Monday, June 1, 2026, positioning it as the first open-weight system to combine frontier coding-agent performance, a one-million-token context window, and native multimodal capabilities — including image, video, and desktop computer operation — in a single model. The API is live today through the company's platform and third-party routing services. Developers deciding whether to route their coding workflows through M3 need to weigh three things before the pricing math: benchmark scores are company-reported and run on MiniMax's own infrastructure; promised open weights have not been released; and China's 2017 National Intelligence Law requires MiniMax to "support, assist, and cooperate" with Chinese government intelligence work, an obligation that applies to every prompt processed through the company's API endpoint, regardless of where the user is located.

How MiniMax Sparse Attention Speeds Up Million-Token Inference

The technical centerpiece of M3 is MiniMax Sparse Attention, or MSA — a new attention architecture designed to make one-million-token context windows economically viable for production use. Standard transformer models process attention across every token in a context window, a calculation that grows quadratically as context length increases. MSA replaces that with a two-stage mechanism: a lightweight index branch first scans incoming tokens and selects which blocks of the key-value cache are actually relevant, then runs the expensive attention computation only on those selected blocks.
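The two-stage idea described above can be illustrated with a toy sketch in plain Python. This is not MiniMax's implementation, which had not been published at launch. The block summary (the mean of each block's keys), the dot-product scoring, and the function names `select_blocks` and `block_sparse_attention` are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(query, keys, block_size, top_k):
    """Stage 1 (index branch): score each KV block cheaply, keep the top_k.

    Assumption: a block is summarized by the mean of its key vectors.
    """
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(query, mean_key), start))
    scores.sort(reverse=True)
    return sorted(start for _, start in scores[:top_k])

def block_sparse_attention(query, keys, values, block_size=2, top_k=1):
    """Stage 2: softmax attention restricted to the selected blocks.

    Attention uses the original, uncompressed keys and values.
    """
    idx = [i for s in select_blocks(query, keys, block_size, top_k)
           for i in range(s, min(s + block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
           for d in range(dim)]
    return out, idx

# Four cached tokens in two blocks; the query resembles the second block.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
values = [[1.0], [2.0], [3.0], [4.0]]
output, attended = block_sparse_attention([0.0, 1.0], keys, values)
```

In this toy case only the second block (token positions 2 and 3) is attended to, so the expensive step touches half the cache. At a one-million-token context the fraction of blocks selected would be far smaller, which is where the claimed savings would come from.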

MiniMax says MSA cuts per-token compute at a one-million-token context to 1/20th of that of its previous M2 generation, accelerates input processing approximately 9.7 times, and speeds up response generation roughly 15.6 times at full context length. Unlike DeepSeek's Multi-head Latent Attention, MSA operates on uncompressed key-value data rather than a compressed representation, which the company says avoids precision loss at long contexts. Independent researcher Elie Bakouch characterized the approach as "block level selection like in CSA but attention is done on the real KV, not in the compressed dimension."

These speed claims come from MiniMax's own benchmarking. The company's technical report and the reproducible implementation details that would let outside engineers verify the numbers under independent conditions had not been published as of Monday afternoon. Until they are, the architectural claims — like the performance claims below — remain company assertions.

What MiniMax M3 Claims on SWE-Bench Pro and Agent Benchmarks

On SWE-Bench Pro — a harder benchmark than the widely saturated SWE-Bench Verified, designed around 1,865 real pull requests from 41 actively maintained open-source repositories — MiniMax reports M3 scored 59.0%. By comparison, the company says GPT-5.5 scored 58.6% and Gemini 3.1 Pro scored 54.2% on the same benchmark. MiniMax ran the SWE-Bench Pro evaluation on its own internal infrastructure using Claude Code as scaffolding, with evaluation logic aligned to the official methodology.

On Terminal-Bench 2.1, which measures agentic execution in terminal environments, M3 scored 66.0% — roughly matching what the company says was Anthropic's Claude Opus 4.7 baseline of 66.1%. On OSWorld-Verified, which measures a model's ability to operate desktop graphical interfaces, M3 reached 70.0%. On BrowseComp, an autonomous web-search benchmark, M3 scored 83.5% — which MiniMax reports as the highest among the models it tested.

To illustrate sustained autonomous performance, MiniMax ran three internal demonstrations. In one, M3 independently reproduced core experiments from an ICLR 2025 Outstanding Paper on LLM fine-tuning over nearly 12 hours, generating 18 commits and 23 experimental figures. In another, it optimized a matrix multiplication kernel on NVIDIA Hopper GPUs over 24 hours, completing 147 benchmark submissions and 1,959 tool calls, improving peak hardware utilization from 7.6% to 71.3%. MiniMax describes these as demonstrations of long-horizon autonomous execution, not controlled benchmark evaluations.

Where MiniMax M3 Still Trails Claude Opus 4.8

Anthropic released Claude Opus 4.8 on Monday, May 25, 2026, the week before MiniMax M3's launch. When VentureBeat compared M3's self-reported scores against Opus 4.8's benchmark figures, the gap was consistent across directly comparable agent evaluations. On SWE-Bench Pro, M3's 59.0% trails Opus 4.8's reported 69.2%. On Terminal-Bench 2.1, M3's 66.0% falls below Opus 4.8's 74.6%. On OSWorld-Verified, M3's 70.0% is behind Opus 4.8's 83.4%.
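The size of each gap follows directly from the figures quoted above. A short sketch of the arithmetic, using only the scores reported in this article (the variable names are illustrative):

```python
# Scores as reported in this article, in percent.
m3 = {"SWE-Bench Pro": 59.0, "Terminal-Bench 2.1": 66.0, "OSWorld-Verified": 70.0}
opus_48 = {"SWE-Bench Pro": 69.2, "Terminal-Bench 2.1": 74.6, "OSWorld-Verified": 83.4}

# Gap in percentage points, rounded to one decimal place.
gaps = {name: round(opus_48[name] - m3[name], 1) for name in m3}

for name, gap in gaps.items():
    print(f"{name}: M3 trails by {gap} points")
```

The gaps work out to 10.2, 8.6, and 13.4 percentage points respectively.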

MiniMax's comparison baseline in its own materials uses Claude Opus 4.7, not the more recently released Opus 4.8. That framing is not inaccurate — Opus 4.7 was the available frontier reference when M3's evaluation was designed — but developers evaluating M3 against the current benchmark ceiling should use the Opus 4.8 figures, which place M3 further from the frontier than the launch announcement implies.

The pattern is consistent with the broader Chinese open-weight group. Kili Technology, which runs production AI evaluations, found in 2026 research that enterprise agentic AI systems show a 37% average gap between lab benchmark scores and real-world deployment performance. The Stanford HAI 2026 AI Index noted that invalid question rates on major benchmarks range from 2% to 42%, complicating direct comparisons between models regardless of who ran the evaluation.

How MiniMax M3 Benchmarks Were Run

Every benchmark figure in MiniMax's launch materials was produced by MiniMax on its own internal infrastructure, using evaluation environments MiniMax configured, with baselines MiniMax selected. Independent reviewer Thomas Wiegold observed at launch that "every one of those numbers is vendor-run, on MiniMax's own infrastructure, with baselines they picked, often using Claude Code as the scaffolding." He added: "That's not an accusation of cheating, it's just how launch-day benchmarks work."

As of Monday afternoon, independent scores from Artificial Analysis and LMArena, the two most widely cited third-party evaluation services for production AI model comparisons, were still pending for M3. OpenRouter's listing for M3 noted that Artificial Analysis benchmarks were available, but no scores had been posted at the time of publication.

Developers building production agent workflows, particularly those with high-stakes or regulated workloads, should treat MiniMax's launch benchmarks as a preliminary signal rather than a settled ranking and run evaluations on their own representative tasks before committing to a production architecture.
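One way to structure such an in-house evaluation is a small pass-rate harness. The sketch below is hypothetical: `Task`, `pass_rate`, and `echo_model` are illustrative names, and `echo_model` is a stand-in that would be replaced by a call to whichever model client a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable

def pass_rate(model: Callable[[str], str], tasks: list[Task]) -> float:
    """Fraction of tasks whose model output passes that task's own check."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for t in tasks if t.check(model(t.prompt)))
    return passed / len(tasks)

# Stand-in model for demonstration only; it just upper-cases the prompt.
def echo_model(prompt: str) -> str:
    return prompt.upper()

tasks = [
    Task("fix the bug", lambda out: "BUG" in out),
    Task("add a test", lambda out: "pytest" in out),
]
rate = pass_rate(echo_model, tasks)
```

Running the same task list against each candidate model gives a comparison on a team's own workload rather than on a vendor's chosen baselines.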

What Does MiniMax M3 Cost?

M3 API pricing at launch is $0.60 per million input tokens and $2.40 per million output tokens. A 50% launch discount applies for the first week, reducing input pricing to $0.30 and output pricing to $1.20 per million tokens. For comparison, Anthropic's Claude Opus 4.7 costs approximately $5 per million input tokens and $25 per million output tokens, a gap that compounds quickly in agentic workloads that make thousands of model calls per session.
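To see how the per-token gap compounds, the sketch below prices one agent session at the rates quoted above. The workload (2,000 calls averaging 20,000 input and 1,000 output tokens) is a made-up assumption, as is the helper name `session_cost`; the Opus 4.7 prices are the approximate figures cited in this article.

```python
def session_cost(input_tokens, output_tokens, in_price, out_price, discount=0.0):
    """USD cost of one session; prices are USD per million tokens."""
    full = input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price
    return full * (1.0 - discount)

# Hypothetical workload: 2,000 calls, 20,000 input and 1,000 output tokens each.
total_in = 2000 * 20_000   # 40 million input tokens
total_out = 2000 * 1_000   # 2 million output tokens

m3_list = session_cost(total_in, total_out, 0.60, 2.40)
m3_launch = session_cost(total_in, total_out, 0.60, 2.40, discount=0.5)
opus_47 = session_cost(total_in, total_out, 5.00, 25.00)  # approximate prices

print(f"M3 list: ${m3_list:.2f}, launch week: ${m3_launch:.2f}, Opus 4.7: ${opus_47:.2f}")
```

Under these assumed volumes the session costs about $28.80 at M3's list price, $14.40 during the launch week, and about $250 at the quoted Opus 4.7 prices.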

MiniMax also offers subscription plans through MiniMax Code, its dedicated coding interface. The Plus plan costs $20 per month and includes approximately 1.7 billion M3 tokens; the Max plan costs $50 per month for approximately 5.1 billion tokens; and the Ultra plan costs $120 per month for approximately 9.8 billion tokens. The API supports both thinking and non-thinking inference modes and applies the same rate structure to inputs above and below 512,000 tokens.

Quota-based subscription economics behave differently from raw per-token pricing once developers factor in context length, output length, task retries, and priority access. Teams planning to use M3 at scale should model their actual workload token consumption before selecting a subscription tier.
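A first-pass version of that modeling can be sketched as follows, using the plan prices and approximate quotas quoted above. The helper names `cheapest_plan` and `effective_rate` are illustrative, and the sketch ignores priority access and any quota rules MiniMax has not published.

```python
# (name, USD per month, approximate included tokens) as quoted in this article.
PLANS = [
    ("Plus", 20, 1.7e9),
    ("Max", 50, 5.1e9),
    ("Ultra", 120, 9.8e9),
]

def cheapest_plan(monthly_tokens):
    """Return the lowest-priced plan whose quota covers the workload, or None."""
    fitting = [p for p in PLANS if p[2] >= monthly_tokens]
    return min(fitting, key=lambda p: p[1])[0] if fitting else None

def effective_rate(plan_name):
    """USD per million tokens if the whole monthly quota is consumed."""
    _, price, tokens = next(p for p in PLANS if p[0] == plan_name)
    return price / (tokens / 1e6)

# Hypothetical team consuming 3 billion tokens a month, retries included.
choice = cheapest_plan(3.0e9)
```

The `monthly_tokens` estimate should include retried tasks and long-context prompts, since both count against the quota. A workload larger than the Ultra quota returns `None`, which signals that per-token API pricing needs to be modeled instead.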

What Does China's National Intelligence Law Require of MiniMax?

MiniMax is headquartered in Shanghai. The company listed on the Hong Kong Stock Exchange in January 2026, but its operational headquarters remain in China. Under China's National Intelligence Law, enacted in 2017, every Chinese company — including MiniMax — is legally required to "support, assist, and cooperate with state intelligence work." The obligation is not conditional on a request being made in advance; it applies continuously and provides no legal pathway for the company to refuse compliance when a government request arrives.

The American Enterprise Institute named MiniMax specifically in an April 2026 analysis of the law's application to Chinese AI labs, noting that users sharing code, contracts, and strategic documents with these systems are "in effect, depositing them into a Chinese government-accessible database." The U.S. House Committee on Homeland Security and the House Select Committee on China announced a joint investigation on April 29, 2026, into national security and cybersecurity risks posed by Chinese AI models, naming MiniMax alongside Moonshot AI, Alibaba, and DeepSeek. In February 2026, Anthropic publicly alleged that MiniMax, DeepSeek, and Moonshot AI had conducted industrial-scale distillation campaigns against its Claude models, creating over 24,000 fraudulent accounts and generating more than 16 million exchanges in violation of Anthropic's terms of service. MiniMax had not issued a public denial of the distillation allegations as of publication.

No backdoor in M3 has been confirmed, and no incident of M3 user data being shared with Chinese authorities has been documented. The National Intelligence Law obligation, however, is structural and legally confirmed; it does not require a demonstrated breach to create risk. For agentic coding workloads involving proprietary source code, customer data, or sensitive organizational information, any prompt processed through MiniMax's API endpoint falls under Chinese jurisdiction, regardless of the user's location or MiniMax's stated privacy policy.

The company also faces an active copyright lawsuit filed in September 2025 by Disney, Universal, and Warner Bros. Discovery, alleging that MiniMax trained its Hailuo AI video and image service on copyrighted characters without authorization. A federal judge denied MiniMax's motion to dismiss that case on May 26, 2026, allowing it to proceed. Hailuo AI is a separate product from M3, but the litigation reflects the legal environment in which MiniMax is expanding its commercial footprint internationally.

Open Weight or Closed Weight: The Promise That Has Not Shipped

MiniMax describes M3 as an open-weight model, but the definition matters here. Open weight means the trained model parameters are made available for download and local deployment. Open source, in the stricter sense, means the training data and training code are also released and the license permits unrestricted commercial use. MiniMax has used a modified-MIT license for prior models, which is closer to open weight than to fully open source.

At the time of launch, neither the weights nor the technical report had been released. MiniMax said both would be made available within ten days of launch, targeting publication on Hugging Face and GitHub for private cluster deployment and fine-tuning. That means developers reading this article cannot yet inspect the architecture details, verify the training setup, assess the safety behavior under edge cases, or confirm the licensing terms.

Until the weights ship and independent engineers can reproduce the architecture claims, M3's open-weight designation is a company commitment — not a verifiable fact. The release date of approximately June 11, 2026 is when that verification process can actually begin.


Frequently Asked Questions

What is MiniMax M3?

MiniMax M3 is a foundation model released on June 1, 2026 by Shanghai-based AI company MiniMax. It supports text, image, and video inputs and is designed for long-horizon coding-agent workflows, with a context window of up to one million tokens. MiniMax describes it as the first open-weight model to combine frontier coding performance, million-token context, and native multimodal capabilities, though independent verification of those claims requires the model weights, which had not been released at launch.

How does MiniMax Sparse Attention work?

MiniMax Sparse Attention, or MSA, replaces the standard transformer's full-attention computation with a two-stage process: a lightweight index branch selects which blocks of the key-value cache are relevant to a given input, and the main attention layer then processes only those selected blocks. By skipping the less relevant portions of a long context, MSA reduces per-token compute at one million tokens to roughly 1/20th of the previous generation, according to MiniMax's own measurements. Independent verification of these speed figures is pending.

Is MiniMax M3 better than ChatGPT or Claude for AI coding agents?

On MiniMax's own benchmarks, M3 scored 59.0% on SWE-Bench Pro — ahead of GPT-5.5 (58.6%) and Gemini 3.1 Pro (54.2%) on that specific test, but below Claude Opus 4.8's reported 69.2%. All of these figures come from the respective companies' own evaluations; independent third-party scores from Artificial Analysis and LMArena for M3 had not been published at launch. Developers should run evaluations on their own representative tasks before drawing production-readiness conclusions.

Is MiniMax M3 safe to use with confidential code or business data?

For non-sensitive general coding tasks, M3 is technically available and competitively priced. For workloads involving proprietary source code, customer data, or confidential business information, any prompt processed through MiniMax's API falls under the jurisdiction of China's 2017 National Intelligence Law, which requires MiniMax to cooperate with Chinese government intelligence requests. That obligation is not altered by server location, Western subsidiaries, or MiniMax's stated privacy policy, and applies regardless of whether a government request has been confirmed.

M3 represents MiniMax's most technically ambitious model release to date — a system that pairs a genuinely novel sparse attention architecture with frontier-adjacent benchmark scores at a price point well below Western closed-source competitors. The cost gap is real: M3's launch pricing is roughly one-tenth the input cost of Claude Opus 4.7 and GPT-5.5, a difference that compounds materially in agentic workflows. But the decision framework for any developer considering a production deployment includes four conditions that the price comparison cannot override: M3's benchmark scores are company-reported and not yet independently verified; M3 trails Claude Opus 4.8 by meaningful margins on all three directly comparable agent evaluations; the promised open weights have not shipped, making the architecture and safety behavior unverifiable until approximately June 11, 2026; and every prompt processed through MiniMax's API is legally accessible to the Chinese government under the 2017 National Intelligence Law, regardless of the user's location, MiniMax's privacy policy, or the physical location of its servers. The cost advantage is a factor. It is not the only factor.

ⓒ 2026 TECHTIMES.com All rights reserved. Do not reproduce without permission.
