Xiaomi MiMo Hits 1000 TPS With 1T Model, Redefining AI Speed

In a landmark achievement for large language model inference, Xiaomi's MiMo team has announced MiMo-V2.5-Pro-UltraSpeed, a 1-trillion-parameter model capable of generating over 1000 tokens per second (TPS). The result was achieved not on exotic hardware but on standard commodity GPU nodes, through deep technical collaboration with the TileRT systems team.

The release, detailed in a company blog post, marks the first time decode speed has breached the 1000 TPS barrier at the trillion-parameter scale. Xiaomi positions this not merely as a performance bump but as a fundamental shift in how AI can be applied, turning latency-bound tools into real-time cognitive partners.

The UltraSpeed Model: Access and Pricing

Access to this cutting-edge capability will be tightly controlled. The MiMo-V2.5-Pro-UltraSpeed API is launching as a limited-time trial from June 9 to June 23, 2026. It will be available only via an application process, with priority given to enterprises and professional developers with demonstrable business needs.

Pricing is set at a promotional rate of three times the cost of the standard MiMo-V2.5-Pro API, a premium Xiaomi justifies by promising approximately 10 times the generation speed. Approved users will also receive free access to a chat interface during the trial window, albeit with daily session limits to manage resource constraints.
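The trade-off in that pricing can be worked through with a little arithmetic. The sketch below is illustrative only: the 3x price and roughly 10x speed ratios come from the announcement, while the baseline price per token and the baseline speed of 100 TPS are placeholder assumptions, and `compare_tiers` is a hypothetical helper, not part of any Xiaomi API.

```python
# Ratios taken from the article; every other number is a placeholder.
PRICE_MULTIPLIER = 3.0   # UltraSpeed price relative to the standard API
SPEED_MULTIPLIER = 10.0  # approximate UltraSpeed speed relative to standard


def compare_tiers(tokens: int, base_price_per_token: float, base_tps: float) -> dict:
    """Compare cost and wall-clock time for one generation on both tiers."""
    base_cost = tokens * base_price_per_token
    base_seconds = tokens / base_tps
    fast_cost = base_cost * PRICE_MULTIPLIER
    fast_seconds = base_seconds / SPEED_MULTIPLIER
    return {
        "base_cost": base_cost,
        "base_seconds": base_seconds,
        "fast_cost": fast_cost,
        "fast_seconds": fast_seconds,
        # Spend per second of generation rises by price multiplier x speed multiplier.
        "spend_rate_ratio": (fast_cost / fast_seconds) / (base_cost / base_seconds),
    }


# Assumed baseline: 1.0 price unit per token at 100 tokens per second.
result = compare_tiers(tokens=2000, base_price_per_token=1.0, base_tps=100.0)
print(result)
```

Under these assumptions, the same 2000-token response costs three times as much but arrives in a tenth of the time, so an application that keeps the model busy continuously would spend about 30 times as much per second of wall-clock time.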

Why 1000 TPS Is a Paradigm Shift

Reaching this speed with a model of this scale changes the application calculus for frontier AI. At this throughput, raw speed can be traded for answer quality. In the time it once took to generate a single response, the model can now explore many reasoning paths or produce many candidate responses and keep the best one, using techniques like Best-of-N sampling to improve depth and accuracy.
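Best-of-N sampling itself is simple to sketch: draw N candidate responses, score each one, and return the highest-scoring candidate. The example below is a minimal illustration, not Xiaomi's implementation. The `toy_generate` and `toy_score` functions are hypothetical stand-ins for a model call and a verifier or reward model.

```python
import random
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(
    prompt: str,
    generate: Callable[[str], T],
    score: Callable[[str, T], float],
    n: int,
) -> Tuple[float, T]:
    """Sample n candidates and return (score, candidate) for the best one."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])


# Hypothetical stand-ins: a "model" that guesses an integer and a
# "verifier" that rewards guesses close to the true answer of 17 * 3.
def toy_generate(prompt: str, rng: random.Random) -> int:
    return rng.randint(0, 100)


def toy_score(prompt: str, candidate: int) -> float:
    return -abs(candidate - 51)


rng_single = random.Random(0)
single = best_of_n("17 * 3 = ?", lambda p: toy_generate(p, rng_single), toy_score, n=1)

rng_many = random.Random(0)
many = best_of_n("17 * 3 = ?", lambda p: toy_generate(p, rng_many), toy_score, n=16)

print("best of 1:", single)
print("best of 16:", many)
```

Because the cost of Best-of-N grows linearly with N, a faster decoder directly raises the number of candidates that fit inside a fixed latency budget. The quality gain depends on how reliable the scoring function is.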

For coding agents, this removes much of the time developers spend waiting for code generation. More profoundly, it enables trillion-parameter models to enter real-time decision loops. This opens the door to latency-critical applications such as high-frequency trading signal generation, instant fraud detection, and, most significantly, real-time medical analysis, where response time can directly affect life-saving outcomes.

The Technical Breakthrough: Model-System Co-Design

Achieving this speed required innovations across both the model architecture and the underlying inference system, moving beyond isolated optimizations to a holistic co-design philosophy.

FP4 Quantization for MoE Experts: To overcome the memory bandwidth bottleneck on commodity hardware, the team applied FP4 (MXFP4) quantization. Critically, this was applied selectively only to the Mixture of Experts (MoE) layers, which constitute the bulk of the model's parameters and are highly quantization-tolerant. This dramatically reduces model size and memory pressure while preserving the core model's capabilities, as benchmark results showed performance on par with higher-precision versions.
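To make the format concrete, here is a rough sketch of MXFP4-style block quantization as described in the OCP Microscaling specification: 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 number. This is a simplified illustration rather than the MiMo team's code. It rounds to the nearest grid point by simple search instead of hardware round-to-nearest-even, and the 16-bit baseline used in the size comparison is an assumption.

```python
import math
from typing import List, Tuple

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # elements that share one scale in MXFP4
E2M1_EMAX = 2     # exponent of the largest power of two on the grid (4 = 2**2)


def quantize_block(block: List[float]) -> Tuple[int, List[float]]:
    """Return (shared_exponent, element values on the E2M1 grid) for one block."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2(max_abs)) = e - 1.
    _, e = math.frexp(max_abs)
    shared_exp = (e - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for x in block:
        magnitude = min(abs(x) / scale, FP4_E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - magnitude))
        elements.append(math.copysign(nearest, x))
    return shared_exp, elements


def dequantize_block(shared_exp: int, elements: List[float]) -> List[float]:
    """Reconstruct approximate weights from the shared scale and elements."""
    scale = 2.0 ** shared_exp
    return [v * scale for v in elements]


def bits_per_weight() -> float:
    """4 bits per element plus one 8-bit shared scale per block."""
    return 4 + 8 / BLOCK_SIZE


weights = [6.0, -3.0, 0.5, 0.0] + [0.0] * (BLOCK_SIZE - 4)
exp, q = quantize_block(weights)
print(exp, dequantize_block(exp, q)[:4], bits_per_weight())
```

At 4.25 bits per weight, expert parameters stored this way take a little over a quarter of the space of a 16-bit format, which is why applying it to the MoE layers, where most parameters live, relieves memory bandwidth pressure.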

DFlash Speculative Decoding: The team deployed an innovative block-level masked parallel prediction method called DFlash. Unlike traditional speculative decoding, in which a small draft model proposes tokens one at a time, DFlash fills an entire block of masked positions in one forward pass. Custom-tuned for the trillion-parameter MoE architecture, this approach achieved high acceptance rates, meaning the large model accepts many drafted tokens in each verification pass. In coding scenarios, the average acceptance length reached 6.30 tokens per verification round.
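The verification side of speculative decoding can be sketched in a few lines. The example below shows a generic greedy acceptance rule, not DFlash itself: the target model checks a drafted block in one pass, keeps the longest matching prefix, and contributes one token of its own. How DFlash performs verification, and whether its reported acceptance length counts that extra token, is not stated in the article, so treat `verify_block` as a hypothetical illustration. The timing helper also assumes the 1000 TPS and 6.30 figures describe the same workload.

```python
from typing import List


def verify_block(draft_tokens: List[int], target_tokens: List[int]) -> List[int]:
    """Greedy verification of one drafted block.

    target_tokens[i] is the token the large model would emit at position i
    given the prefix plus draft_tokens[:i]; it has len(draft_tokens) + 1 entries.
    """
    accepted: List[int] = []
    for i, drafted in enumerate(draft_tokens):
        if drafted == target_tokens[i]:
            accepted.append(drafted)
        else:
            # First mismatch: keep the target model's token and stop.
            accepted.append(target_tokens[i])
            return accepted
    # Every drafted token matched, so the target's next token comes for free.
    accepted.append(target_tokens[len(draft_tokens)])
    return accepted


def round_budget_ms(tokens_per_round: float, tokens_per_second: float) -> float:
    """Time available per verification round at a given throughput."""
    return 1000.0 * tokens_per_round / tokens_per_second


print(verify_block([11, 12, 13, 14], [11, 12, 99, 0, 0]))  # draft diverges at position 2
print(verify_block([11, 12], [11, 12, 13]))                 # whole block accepted
print(round_budget_ms(6.30, 1000.0))                        # milliseconds per round
```

Under that assumption, sustaining 1000 tokens per second at 6.30 tokens per round leaves roughly 6.3 milliseconds for each draft-and-verify cycle of the trillion-parameter model.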

The TileRT System: Eliminating Microsecond Gaps

The model-side innovations were matched by a revolutionary systems approach from TileRT. At 1000 TPS, each operation's lifecycle is measured in microseconds, making traditional operator launch and synchronization overhead crippling. TileRT introduced a new execution model that eliminates these "execution gaps" at a fundamental level.
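A back-of-the-envelope calculation shows why launch overhead matters at this speed. The sketch below is purely illustrative: the kernel count per forward pass and the per-launch overhead are hypothetical placeholders rather than measurements from TileRT or Xiaomi, and only the 1000 TPS and 6.30 tokens-per-round figures come from the article.

```python
def launch_overhead_fraction(
    kernels_per_pass: int,
    launch_overhead_us: float,
    tokens_per_pass: float,
    tokens_per_second: float,
) -> float:
    """Fraction of one forward pass's time budget spent on launch and sync gaps."""
    pass_budget_us = tokens_per_pass / tokens_per_second * 1_000_000
    total_overhead_us = kernels_per_pass * launch_overhead_us
    return total_overhead_us / pass_budget_us


# Hypothetical inputs: 1000 separately launched kernels per pass,
# each paying 5 microseconds of launch and synchronization overhead.
fraction = launch_overhead_fraction(
    kernels_per_pass=1000,
    launch_overhead_us=5.0,
    tokens_per_pass=6.30,
    tokens_per_second=1000.0,
)
print(f"{fraction:.0%} of the per-pass budget goes to overhead")
```

With these placeholder numbers, most of the time budget would be spent between kernels rather than inside them, which is the kind of gap that fusing work into long-running kernels is meant to close.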

This involved creating persistent kernels, tile pipelines, and deep hardware-software co-design that allowed the FP4-quantized, DFlash-optimized model to run with extreme efficiency on a standard 8-GPU node. The collaboration demonstrates that extreme inference speed is achievable without resorting to wafer-scale or pure-SRAM custom silicon, a path chosen by companies like Cerebras and Groq.

Market Context and Competitive Landscape

This announcement arrives amid a feverish pace of AI hardware and model innovation. The same week, Noctua revealed its first AIO CPU coolers, and Google launched its Gemma 4 12B model, which is optimized for laptops and uses Multi-Token Prediction for efficiency, a different approach to speeding up inference. Meanwhile, Microsoft and others are pushing powerful new models for enterprise, often highlighting legal and compliance features.

Xiaomi's achievement stands out by focusing purely on raw inference throughput for a massive model. It directly challenges the notion that such speeds require proprietary, exotic hardware. By demonstrating this on commodity GPUs, Xiaomi and TileRT are making a compelling case for the power of software and algorithmic innovation.

Implications and Future Outlook

The immediate implication is a new tier of AI service for time-sensitive, high-value applications. The limited trial suggests Xiaomi is initially targeting professional and enterprise use cases where speed directly translates to competitive advantage or operational necessity.

Looking ahead, this breakthrough validates model-system co-design as a critical path forward for AI efficiency. As the industry grapples with the soaring costs of training and inference, techniques like selective ultra-low-bit quantization and advanced speculative decoding will become essential. The race is no longer just about building bigger models, but about making them radically more accessible and responsive.

The release of MiMo-V2.5-Pro-UltraSpeed is a clear signal that the frontier of AI is expanding along the axis of speed. It redefines what is possible with a trillion-parameter model, moving it from a batch-processing engine to a real-time reasoning partner.