Evaluation

Benchmark results, clearly explained

We compare our distilled Qwen3-4B models against the base model across ARC, GPQA, HellaSwag, MMLU, TruthfulQA, and WinoGrande. All evaluations run with 4-bit quantization via the lm-eval harness.

Key Insights

HIGHEST AVERAGE

Gemini 2.5 Flash

54.0% average across all benchmarks (+1.9 vs base)

MOST CONSISTENT

Gemini 2.5 Flash

Beats the base model on 5/6 benchmarks (+11.4 total)

BEST GPQA

GPT-5 Codex

43.9% on graduate-level science (+13.6 vs base)

Performance Comparison


Average accuracy across all 6 benchmarks. Higher is better.

Base (Qwen3-4B): 52.1%
#1 Gemini 2.5 Flash: 54.0% (+1.9)
#2 GPT-5 Codex: 53.1% (+1.0)
#3 Kimi K2: 51.6% (-0.5)
#4 GLM 4.6: 51.5% (-0.6)
#5 Gemini 2.5 Pro: 51.3% (-0.8)
#6 Claude 4.5 Opus: 51.2% (-0.9)
#7 Command A: 51.2% (-0.9)
#8 Gemini 3 Pro: 50.8% (-1.3)
#9 GPT-5.1: 50.6% (-1.5)

Full Results

All scores are accuracy percentages; the value in parentheses is the change versus the base model. A short sketch after the table shows how the averages and ranking are derived.

Model | ARC | GPQA | HellaSwag | MMLU | TruthfulQA | WinoGrande | Avg
Base (Qwen3-4B) | 48.6 | 30.3 | 48.0 | 65.5 | 55.6 | 64.6 | 52.1
Gemini 2.5 Flash | 51.2 (+2.6) | 35.4 (+5.1) | 50.4 (+2.5) | 66.2 (+0.6) | 55.3 (-0.3) | 65.6 (+1.0) | 54.0 (+1.9)
Claude 4.5 Opus | 48.1 (-0.5) | 31.3 (+1.0) | 49.6 (+1.6) | 63.4 (-2.2) | 52.6 (-3.0) | 62.1 (-2.4) | 51.2 (-0.9)
Gemini 2.5 Pro | 48.5 (-0.1) | 30.8 (+0.5) | 48.5 (+0.5) | 64.3 (-1.2) | 54.4 (-1.2) | 61.2 (-3.4) | 51.3 (-0.8)
GPT-5 Codex | 45.9 (-2.7) | 43.9 (+13.6) | 47.7 (-0.3) | 62.5 (-3.0) | 57.0 (+1.5) | 61.3 (-3.3) | 53.1 (+1.0)
Kimi K2 | 45.8 (-2.8) | 37.4 (+7.1) | 49.2 (+1.2) | 62.0 (-3.5) | 52.5 (-3.1) | 62.7 (-1.9) | 51.6 (-0.5)
GLM 4.6 | 48.7 (+0.1) | 32.3 (+2.0) | 48.3 (+0.4) | 64.3 (-1.3) | 53.1 (-2.5) | 62.2 (-2.4) | 51.5 (-0.6)
GPT-5.1 | 47.8 (-0.9) | 29.8 (-0.5) | 48.0 (+0.0) | 63.6 (-1.9) | 55.7 (+0.1) | 58.6 (-5.9) | 50.6 (-1.5)
Command A | 45.8 (-2.8) | 31.8 (+1.5) | 48.8 (+0.8) | 63.5 (-2.1) | 54.9 (-0.7) | 62.2 (-2.4) | 51.2 (-0.9)
Gemini 3 Pro | 46.5 (-2.1) | 34.9 (+4.6) | 48.2 (+0.2) | 62.4 (-3.1) | 50.5 (-5.1) | 62.3 (-2.3) | 50.8 (-1.3)
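
The Avg column and the ranking above are derived the simple way: each model's average is the unweighted mean of its six benchmark accuracies, and the delta is measured against the base model's mean. The sketch below recomputes this from the rounded values in the table (only a few rows are shown); because the published deltas come from unrounded scores, recomputed values can differ by about 0.1.

```python
# Minimal sketch: recompute per-model averages and deltas vs the base model
# from the rounded scores in the Full Results table above.
BENCHMARKS = ["ARC", "GPQA", "HellaSwag", "MMLU", "TruthfulQA", "WinoGrande"]

scores = {
    "Base (Qwen3-4B)":  [48.6, 30.3, 48.0, 65.5, 55.6, 64.6],
    "Gemini 2.5 Flash": [51.2, 35.4, 50.4, 66.2, 55.3, 65.6],
    "GPT-5 Codex":      [45.9, 43.9, 47.7, 62.5, 57.0, 61.3],
    # ...remaining rows follow the same pattern
}

base_avg = sum(scores["Base (Qwen3-4B)"]) / len(BENCHMARKS)

# Unweighted mean over the six benchmarks, ranked best to worst.
ranking = sorted(
    ((name, sum(vals) / len(vals)) for name, vals in scores.items()
     if name != "Base (Qwen3-4B)"),
    key=lambda item: item[1],
    reverse=True,
)

for rank, (name, avg) in enumerate(ranking, start=1):
    print(f"#{rank} {name}: {avg:.1f}% ({avg - base_avg:+.1f} vs base)")
```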

Methodology

Test Configuration

  • Quantization: 4-bit (matching typical deployment)
  • Framework: lm-eval harness (example invocation below)
  • Temperature: 0.6
  • Top-p: 0.95
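
As a rough reproduction guide, the configuration above maps onto the lm-eval Python API roughly as in the sketch below. This is not our exact run script: the model path, task identifiers, and quantization argument are illustrative assumptions, and the exact names depend on your lm-eval version and checkpoint location.

```python
# Illustrative sketch (not the exact run script): evaluating a 4-bit-quantized
# checkpoint with the lm-eval harness Python API. Model path and task names
# are assumptions; adjust them for your lm-eval version and local checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # load_in_4bit is forwarded to transformers/bitsandbytes for 4-bit loading
    model_args="pretrained=Qwen/Qwen3-4B,load_in_4bit=True",
    tasks=[
        "arc_challenge",
        "gpqa_diamond_zeroshot",
        "hellaswag",
        "mmlu",
        "truthfulqa_mc2",
        "winogrande",
    ],
    batch_size="auto",
    # sampling settings only affect generative tasks
    gen_kwargs="temperature=0.6,top_p=0.95",
)

# Per-task metric dictionaries (e.g. accuracy) as reported by the harness.
for task, metrics in results["results"].items():
    print(task, metrics)
```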

Benchmarks Used

  • ARC-Challenge: Science reasoning
  • GPQA Diamond: Graduate-level science
  • HellaSwag: Commonsense reasoning
  • MMLU: Multi-task language understanding
  • TruthfulQA: Truthfulness evaluation
  • WinoGrande: Pronoun resolution