Qwen 3 Benchmarks Surpassing Gemini 2.5 Pro, and Grok-3
After four months, Alibaba's new model family may surpass DeepSeek-R1, the top open-weights big language model.
Qwen3 is the latest big language model from Qwen. Qwen3-235B-A22B flagship model exceeds DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro in math, coding, and general capabilities. A tiny MoE model, Qwen3-30B-A3B, beats QwQ-32B with ten times as many active parameters, and even Qwen3-4B can compete with Qwen2.5-72B-Instruct.
We are open-weighting two MoE models: Qwen3-235B-A22B, a big model with 235 billion total parameters and 22 billion activated parameters, and Qwen3-30B-A3B, a smaller model with 30 billion total parameters and 3 billion activated parameters.
Six dense models—Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B—are also open-weighted under Apache 2.0.
Hugging Face, ModelScope, and Kaggle now provide post-trained and pre-trained models like Qwen3-30B-A3B-Base. It recommends SGLang and vLLM for deployment. Ollama, LMStudio, MLX, llama.cpp, and KTransformers are recommended for local usage. These solutions make Qwen3 easy to integrate into development, production, and research workflows.
Qwen 3 allows researchers, developers, and organisations worldwide to design unique solutions using these cutting-edge models.
Try Qwen3 on the mobile app and chat.qwen.ai!
Qwen3 models introduce hybrid problem-solving. They offer two modes:
Thinking Mode: The model deliberates before responding. This is ideal for complex topics that require more thought.
Non-Thinking Mode: The model replies almost rapidly, making it suitable for simpler questions where depth is less important than speed.
As previously established, Qwen 3 delivers smooth and scalable performance benefits connected to computational reasoning budget. This design makes task-specific budgets easier to configure, improving inference quality and cost.
Supports several languages
Qwen 3 models accommodate 119 dialects. Due to their multilingual capabilities, these models may be used worldwide, opening up new possibilities.
Increased Agentic Capability
It optimised Qwen 3 models for coding and agentic capabilities and strengthened MCP support. The following examples show how Qwen3 thinks and acts.
Qwen3 has a much larger pretraining dataset than Qwen2.5. Qwen2.5 was pre-trained on 18 trillion tokens, whereas Qwen3 uses 36 trillion over 119 languages and dialects. Qwen2.5-VL applied these research to enhance it. To add math and code data, Qwen2.5-Math and Qwen2.5-Coder developed synthetic data. Code samples, textbooks, and Q&As are included.
It takes three stages to prepare for training. The model was pretrained on about 30 trillion tokens with a 4K context length in stage 1 (S1). The model learnt basic language and general knowledge at this time. In stage 2 (S2), we added STEM, coding, and reasoning challenges to the dataset. The model was pretrained with 5 trillion extra tokens. High-quality long-context data was used to extend the context to 32K tokens in the last stage. This assures the model can efficiently handle longer inputs.
Qwen 3 dense base models perform similarly to Qwen2.5 base models with more parameters due to model architectural advancements, more training data, and more efficient training methods. Qwen2.5-3B/7B/14B/32B/72B-Base and Qwen3-1.7B/4B/8B/14B/32B-Base work similarly. Qwen 3 dense base models outperform Qwen2.5 models in STEM, coding, and reasoning. For Qwen3-MoE basis models, they perform similarly to Qwen2.5 dense base models with 10% of active parameters. Thus, training and inference costs drop dramatically.
The hybrid model, which can reason step-by-step and respond swiftly, was trained using a four-stage pipeline. This pipeline includes reasoning-based reinforcement learning (RL), thinking mode fusion, long chain-of-thought (CoT) cold start, and generic RL.
First, it improved the models using lengthy CoT data from coding, maths, logical reasoning, and STEM issues. Teaching the model fundamental thinking was the goal. The second phase increased reinforcement learning computing power using rule-based incentives to better model exploration and exploitation.
The third phase enhanced the thinking model utilising extended CoT data and regularly used instruction-tuning data to include non-thinking skills. The second stage's upgraded thinking model produced this data, ensuring smooth reasoning and rapid reaction times. The fourth step employed reinforcement learning (RL) on over 20 broad-domain tasks to increase the model's general capabilities and repair undesired behaviours. Agent capabilities, format following, and instruction following were among these duties.
Qwen 3 calls tools well. To fully exploit Qwen3's agentic features, use Qwen-Agent. Qwen-Agent's inherent encapsulation of tool-calling templates and parsers simplifies development.
The MCP configuration file, Qwen-Agent integrated tool, or custom tools can define available tools.