koboldcpp/examples/llama-eval
Georgi Gerganov d5dc2e0a02
llama-eval : add AIME 2026 dataset support (#23058)
Add Aime2026Dataset class loading from MathArena/aime_2026 on
HuggingFace. 30 problems (two sets of 15), single config/split.

Usage: --dataset aime2026

Assisted-by: llama.cpp:local pi
2026-05-15 13:58:30 +03:00
llama-eval.py llama-eval : add AIME 2026 dataset support (#23058) 2026-05-15 13:58:30 +03:00
llama-server-simulator.py examples : add llama-eval (#21152) 2026-05-12 15:07:00 +03:00
README.md examples : add llama-eval (#21152) 2026-05-12 15:07:00 +03:00
test-simulator.sh examples : add llama-eval (#21152) 2026-05-12 15:07:00 +03:00

llama-eval

Simple evaluation tool for llama.cpp with support for multiple datasets.

For a full description, usage examples, and sample results, see:

Quick start

# Single server
python3 llama-eval.py \
  --server http://localhost:8033 \
  --model my-model \
  --dataset gsm8k --n_cases 100 \
  --grader-type regex --threads 32

# Multiple servers (comma-separated URLs and thread counts)
python3 llama-eval.py \
  --server http://server1:8033,http://server2:8033 \
  --server-name server1,server2 \
  --threads 16,16 \
  --dataset aime2025 --n_cases 240 \
  --grader-type regex
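
The multi-server form above passes `--server`, `--server-name`, and `--threads` as comma-separated lists of the same length. As a rough illustration of how such lists can be paired up by position, here is a minimal sketch. It is not taken from llama-eval.py: the `ServerSpec` class and the `parse_server_specs` function are hypothetical names, and the rule that all three lists must have the same length is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ServerSpec:
    # Hypothetical container for one server's settings (not from llama-eval.py).
    url: str
    name: str
    threads: int


def parse_server_specs(servers: str, names: str, threads: str) -> list[ServerSpec]:
    """Pair comma-separated URLs, names, and thread counts by position."""
    urls = [s.strip() for s in servers.split(",") if s.strip()]
    name_list = [n.strip() for n in names.split(",") if n.strip()]
    thread_list = [int(t) for t in threads.split(",") if t.strip()]

    # Assumption: every server needs exactly one name and one thread count.
    if not (len(urls) == len(name_list) == len(thread_list)):
        raise ValueError(
            f"expected equal-length lists, got {len(urls)} URLs, "
            f"{len(name_list)} names, {len(thread_list)} thread counts"
        )

    return [ServerSpec(u, n, t) for u, n, t in zip(urls, name_list, thread_list)]


if __name__ == "__main__":
    specs = parse_server_specs(
        "http://server1:8033,http://server2:8033",
        "server1,server2",
        "16,16",
    )
    for spec in specs:
        print(spec.name, spec.url, spec.threads)
```

Failing on mismatched list lengths, instead of silently truncating as a bare `zip` would, keeps a typo in one flag from quietly dropping a server.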