llama-eval

A simple evaluation tool for llama.cpp with support for multiple datasets.

For a full description, usage examples, and sample results, see:

Quick start

# Single server
python3 llama-eval.py \
  --server http://localhost:8033 \
  --model my-model \
  --dataset gsm8k --n_cases 100 \
  --grader-type regex --threads 32

# Multiple servers (comma-separated URLs and thread counts)
python3 llama-eval.py \
  --server http://server1:8033,http://server2:8033 \
  --server-name server1,server2 \
  --threads 16,16 \
  --dataset aime2025 --n_cases 240 \
  --grader-type regex
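
The multi-server command above passes URLs, names, and thread counts as parallel comma-separated lists. The following is a minimal sketch of how such lists could be paired by position. It is not taken from llama-eval.py: the `ServerSpec` class and `pair_servers` function are hypothetical names, and the strict equal-length check is an assumption, since the README does not say how mismatched list lengths are handled.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ServerSpec:
    """One evaluation target (hypothetical helper type, not part of llama-eval.py)."""
    url: str
    name: str
    threads: int


def pair_servers(servers: str, names: str, threads: str) -> list[ServerSpec]:
    """Split three comma-separated strings and pair their entries by position."""
    urls = [s.strip() for s in servers.split(",") if s.strip()]
    labels = [s.strip() for s in names.split(",") if s.strip()]
    counts = [int(s) for s in threads.split(",") if s.strip()]

    # Assumption: every list must have the same number of entries.
    if not (len(urls) == len(labels) == len(counts)):
        raise ValueError(
            f"expected equal counts, got {len(urls)} URLs, "
            f"{len(labels)} names, {len(counts)} thread counts"
        )
    return [ServerSpec(u, n, t) for u, n, t in zip(urls, labels, counts)]


if __name__ == "__main__":
    specs = pair_servers(
        "http://server1:8033,http://server2:8033",
        "server1,server2",
        "16,16",
    )
    for spec in specs:
        print(f"{spec.name}: {spec.url} ({spec.threads} threads)")
```

Under these assumptions, the first URL is matched with the first name and the first thread count, and so on, which is why the order of entries in the three flags should agree.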