Evaluation
Benchmark results, clearly explained
We compare our distilled Qwen3-4B models against the base model across ARC, GPQA, HellaSwag, MMLU, TruthfulQA, and WinoGrande. All tests use 4-bit quantization and are run with the lm-eval harness.
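As a rough sketch, a comparable run could be launched through the lm-eval Python API. The checkpoint path, the exact task IDs, and the `acc,none` metric key are assumptions and may need adjusting for your harness version and for the specific distilled model under test.

```python
import lm_eval

# Point `pretrained` at the base model or one of the distilled checkpoints.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen3-4B,load_in_4bit=True",
    tasks=["arc_challenge", "gpqa_diamond_zeroshot", "hellaswag",
           "mmlu", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)

# Each task reports accuracy under a key such as "acc,none".
for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))
```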
Key Insights
- Highest average: Gemini 2.5 Flash, 54.0% across all benchmarks (+1.9 vs. base)
- Most consistent: Gemini 2.5 Flash, beating the base model on 5 of 6 benchmarks (+11.4 total)
- Best GPQA: GPT-5 Codex, 43.9% on graduate-level science (+13.6 vs. base)
Performance Comparison
Average accuracy across all 6 benchmarks. Higher is better.
Base model: 52.1%

1. Gemini 2.5 Flash: 54.0% (+1.9)
2. GPT-5 Codex: 53.1% (+1.0)
3. Kimi K2: 51.6% (-0.5)
4. GLM 4.6: 51.5% (-0.6)
5. Gemini 2.5 Pro: 51.3% (-0.8)
6. Claude 4.5 Opus: 51.2% (-0.9)
7. Command A: 51.2% (-0.9)
8. Gemini 3 Pro: 50.8% (-1.3)
9. GPT-5.1: 50.6% (-1.5)
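The averages and deltas above follow directly from the per-benchmark scores in the Full Results table. A quick sketch of the arithmetic, using Gemini 2.5 Flash's row:

```python
# Per-benchmark accuracy (%) in the order:
# ARC, GPQA, HellaSwag, MMLU, TruthfulQA, WinoGrande
base = [48.6, 30.3, 48.0, 65.5, 55.6, 64.6]
flash = [51.2, 35.4, 50.4, 66.2, 55.3, 65.6]

base_avg = sum(base) / len(base)      # 52.1
flash_avg = sum(flash) / len(flash)   # 54.0
print(f"avg={flash_avg:.1f}, delta={flash_avg - base_avg:+.1f}")  # avg=54.0, delta=+1.9
```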
Full Results
All scores are accuracy percentages. The value in parentheses is the change relative to the base model: positive means improvement, negative means regression.
| Model | ARC | GPQA | HellaSwag | MMLU | TruthfulQA | WinoGrande | Avg |
|---|---|---|---|---|---|---|---|
| Base (Qwen3-4B) | 48.6 | 30.3 | 48.0 | 65.5 | 55.6 | 64.6 | 52.1 |
| Gemini 2.5 Flash | 51.2 (+2.6) | 35.4 (+5.1) | 50.4 (+2.5) | 66.2 (+0.6) | 55.3 (-0.3) | 65.6 (+1.0) | 54.0 (+1.9) |
| Claude 4.5 Opus | 48.1 (-0.5) | 31.3 (+1.0) | 49.6 (+1.6) | 63.4 (-2.2) | 52.6 (-3.0) | 62.1 (-2.4) | 51.2 (-0.9) |
| Gemini 2.5 Pro | 48.5 (-0.1) | 30.8 (+0.5) | 48.5 (+0.5) | 64.3 (-1.2) | 54.4 (-1.2) | 61.2 (-3.4) | 51.3 (-0.8) |
| GPT-5 Codex | 45.9 (-2.7) | 43.9 (+13.6) | 47.7 (-0.3) | 62.5 (-3.0) | 57.0 (+1.5) | 61.3 (-3.3) | 53.1 (+1.0) |
| Kimi K2 | 45.8 (-2.8) | 37.4 (+7.1) | 49.2 (+1.2) | 62.0 (-3.5) | 52.5 (-3.1) | 62.7 (-1.9) | 51.6 (-0.5) |
| GLM 4.6 | 48.7 (+0.1) | 32.3 (+2.0) | 48.3 (+0.4) | 64.3 (-1.3) | 53.1 (-2.5) | 62.2 (-2.4) | 51.5 (-0.6) |
| GPT-5.1 | 47.8 (-0.9) | 29.8 (-0.5) | 48.0 (+0.0) | 63.6 (-1.9) | 55.7 (+0.1) | 58.6 (-5.9) | 50.6 (-1.5) |
| Command A | 45.8 (-2.8) | 31.8 (+1.5) | 48.8 (+0.8) | 63.5 (-2.1) | 54.9 (-0.7) | 62.2 (-2.4) | 51.2 (-0.9) |
| Gemini 3 Pro | 46.5 (-2.1) | 34.9 (+4.6) | 48.2 (+0.2) | 62.4 (-3.1) | 50.5 (-5.1) | 62.3 (-2.3) | 50.8 (-1.3) |
Methodology
Test Configuration
- Quantization: 4-bit (matching typical deployment; see the sketch below this list)
- Framework: lm-eval harness
- Temperature: 0.6
- Top-p: 0.95
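As a rough illustration of this configuration, here is how the 4-bit load and sampling settings might look with plain transformers and bitsandbytes. The checkpoint name and prompt are placeholders, and the actual evaluation runs go through the lm-eval harness rather than a manual `generate` call.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "Qwen/Qwen3-4B"  # substitute the distilled checkpoint under test

# 4-bit quantization, mirroring the deployment-style setting used in the evals.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Sampling settings from the test configuration above.
inputs = tokenizer("Question: ...", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```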
Benchmarks Used
- ARC-Challenge: Grade-school science reasoning
- GPQA Diamond: Graduate-level science questions
- HellaSwag: Commonsense reasoning
- MMLU: Multi-task language understanding
- TruthfulQA: Truthfulness evaluation
- WinoGrande: Commonsense pronoun resolution