Commit graph

6 commits

Author SHA1 Message Date
Concedo
9203b6a051 Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/labeler.yml
#	.github/workflows/build-self-hosted.yml
#	.github/workflows/release.yml
#	.github/workflows/server-sanitize.yml
#	.github/workflows/server-self-hosted.yml
#	.github/workflows/server.yml
#	.github/workflows/ui-build.yml
#	.github/workflows/ui-ci.yml
#	.github/workflows/ui-publish.yml
#	.gitignore
#	CMakeLists.txt
#	CODEOWNERS
#	scripts/ui-download.cmake
#	scripts/xxd.cmake
#	tests/test-backend-ops.cpp
#	tests/test-reasoning-budget.cpp
#	tools/CMakeLists.txt
#	tools/server/CMakeLists.txt
#	tools/server/README.md
2026-05-16 22:56:33 +08:00
Georgi Gerganov
d5dc2e0a02
llama-eval : add AIME 2026 dataset support (#23058)
Add Aime2026Dataset class loading from MathArena/aime_2026 on
HuggingFace. 30 problems (two sets of 15), single config/split.

Usage: --dataset aime2026

Assisted-by: llama.cpp:local pi
2026-05-15 13:58:30 +03:00
Concedo
cc82c3164e Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.devops/intel.Dockerfile
#	.github/workflows/build-cross.yml
#	.github/workflows/build-sycl.yml
#	.github/workflows/build.yml
#	.github/workflows/editorconfig.yml
#	.github/workflows/release.yml
#	cmake/riscv64-spacemit-linux-gnu-gcc.cmake
#	docs/backend/OPENVINO.md
#	docs/backend/SYCL.md
#	docs/build-riscv64-spacemit.md
#	docs/ops.md
#	docs/ops/WebGPU.csv
#	embd_res/ggml-vocab-qwen35.gguf
#	embd_res/ggml-vocab-qwen35.gguf.inp
#	embd_res/ggml-vocab-qwen35.gguf.out
#	examples/model-conversion/Makefile
#	ggml/CMakeLists.txt
#	ggml/src/ggml-cpu/CMakeLists.txt
#	ggml/src/ggml-hexagon/ggml-hexagon.cpp
#	ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c
#	ggml/src/ggml-hexagon/htp/hmx-matmul-ops.c
#	ggml/src/ggml-hexagon/htp/hmx-utils.h
#	ggml/src/ggml-hexagon/htp/htp-ops.h
#	ggml/src/ggml-hexagon/htp/hvx-utils.h
#	ggml/src/ggml-hexagon/htp/main.c
#	ggml/src/ggml-hexagon/htp/unary-ops.c
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	ggml/src/ggml-sycl/CMakeLists.txt
#	ggml/src/ggml-sycl/common.cpp
#	ggml/src/ggml-sycl/common.hpp
#	ggml/src/ggml-sycl/ggml-sycl.cpp
#	ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp
#	ggml/src/ggml-webgpu/ggml-webgpu.cpp
#	ggml/src/ggml-webgpu/wgsl-shaders/common_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_tile.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_reduce.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/get_rows.wgsl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_vec_acc.tmpl
#	ggml/src/ggml-webgpu/wgsl-shaders/unary.wgsl
#	ggml/src/ggml-zendnn/CMakeLists.txt
#	ggml/src/ggml-zendnn/ggml-zendnn.cpp
#	scripts/snapdragon/adb/run-completion.sh
#	tests/CMakeLists.txt
#	tools/cli/README.md
#	tools/completion/README.md
#	tools/mtmd/clip-impl.h
#	tools/mtmd/clip.cpp
#	tools/mtmd/clip.h
#	tools/server/README.md
2026-05-14 19:04:04 +08:00
Sigbjørn Skjæret
bcfe63fc53
llama-eval : enable type check (#22988) 2026-05-13 09:14:24 +03:00
Concedo
f7923b261f need to fix cuda compile. Merge branch 'upstream' into concedo_experimental
# Conflicts:
#	.github/workflows/python-type-check.yml
#	examples/speculative-simple/README.md
#	examples/speculative-simple/speculative-simple.cpp
#	ggml/src/ggml-cuda/im2col.cu
#	ggml/src/ggml-opencl/CMakeLists.txt
#	ggml/src/ggml-opencl/ggml-opencl.cpp
#	ggml/src/ggml-opencl/kernels/cvt.cl
#	tests/test-backend-ops.cpp
#	tools/cli/README.md
#	tools/mtmd/CMakeLists.txt
#	tools/server/README.md
2026-05-12 20:47:07 +08:00
Georgi Gerganov
fde69a3607
examples : add llama-eval (#21152)
Some checks failed
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
* working llama-eval mc and math suite

* multi source llama-eval

* Add readme

* add checkpointing

* examples: add llama-server simulator for testing eval scripts

Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:

- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation for testing and understanding
the simulator functionality.

* examples: refactor test-simulator.sh for better readability

Extract repeating question string into TEST_QUESTION variable and
create make_request() helper function to reduce code duplication.
Add proper error handling for error responses.

* docs: update llama-eval-discussion.md with session work summary

Add summary of llama-server-simulator implementation work including
features, testing results, technical decisions, and refactoring.

* examples: add simplified llama-eval-new.py for AIME evaluation

- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching - eval script only sends requests and validates answers

* docs: remove README.md from llama-eval

* examples: implement flexible grader system for answer validation

- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers

* examples: use HF_HUB_OFFLINE to avoid HF Hub warnings

* examples: remove HF_HUB_OFFLINE to allow dataset download

* examples: use cached dataset path to avoid HF Hub requests

* examples: use cached dataset path in simulator to avoid HF Hub requests

* docs: update llama-eval-discussion.md with session work summary

* examples: add threading support and model parameter to llama-eval-new.py

- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution

* docs: update llama-eval-discussion.md with threading and model parameter updates

- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features

* examples: add task summary table to llama-eval-new.py

* eval : print progress

* eval : add prompts

* test : fix path

* sim : fix answer matching

* eval : support multiple dataset runs

* minor

* improve grader

* docs

* remove old files

* datasets : add gsm8k

* add gpqa + sampling + docs

* rename

* grader : improve example answers

* cont

* datasets : add aime2025

* grader : update prompt

* grade : improve regex + logs

* datasets : fix aime2025

* cleanup

* add AGENTS.md

* ignore errors

* resume eval

* cleanup

* fix counts

* simplify

* fix prompts

* add html

* store full response

* add tokens

* resoning and error handling

* refactor

* track total time

* remove junk

* eval : unify "judge" terminology to "grader"

Replace all occurrences of "judge" with "grader" for consistency
across the codebase (CLI args, Grader class fields, help text).

Assisted-by: llama.cpp:local pi

* eval : add Wilson score confidence interval to results

Compute 95% CI on-the-fly from completed cases. Displayed in
terminal output, HTML report, and JSON state.

* llama-eval : add per-task generation speed from server timings

Extract predicted_per_second from the server timings response and store
it as tps_gen per task. Display in console progress, print_all_tasks,
and HTML report.

Assisted-by: llama.cpp:local pi

* llama-eval : add per-task generation time from server timings

Extract predicted_ms from the server timings response and store it as
t_gen_ms per task. Display in seconds with one decimal digit in console
progress, print_all_tasks, and HTML report.

Assisted-by: llama.cpp:local pi

* llama-eval : rename display, escaped, and count variables to use prefix convention

- _display suffix → display_ prefix (answer, tokens, tps, t_gen)
- _escaped suffix → escaped_ prefix (response, prompt, reasoning)
- _count suffix → n_ prefix (correct, incorrect, pending)

Assisted-by: llama.cpp:local pi

* llama-eval : support multiple evaluation endpoints with dynamic task distribution

- Add ServerConfig dataclass (url, threads, name)
- Accept comma-separated --server, --threads, --server-name CLI args
- Dynamic shared-queue task distribution across servers (fast servers do more work)
- One ThreadPoolExecutor per server, workers pull from shared Queue
- Track which server processed each task (server_name in results)
- Thread-safe EvalState with threading.Lock for concurrent mutations
- Server column in HTML report and console output
- Backward compatible: single server works as before

Assisted-by: llama.cpp:local pi

* llama-server-simulator : replace Flask with stdlib http.server

- Use HTTPServer + BaseHTTPRequestHandler instead of Flask
- RequestHandler handles POST /v1/chat/completions
- Server runs in daemon thread with clean Ctrl+C shutdown
- Remove flask and unused asdict imports

Assisted-by: llama.cpp:local pi

* llama-eval : update README with PR link and quick-start examples

Assisted-by: llama.cpp:local pi

* llama-eval : track model name in eval state and verify on resume

- Store model_name in EvalState and JSON output
- Display model in HTML summary table
- Verify --model matches stored model when resuming

Assisted-by: llama.cpp:local pi

* llama-server-simulator : fix comment - Dice coefficient, not Levenshtein

Assisted-by: llama.cpp:local pi

* llama-eval : require --grader-model or --model when using --grader-type llm

Assisted-by: llama.cpp:local pi

* llama-eval : protect dump() with lock for thread safety

Assisted-by: llama.cpp:local pi

* llama-eval : compact HTML report output

- Replace verbose summary table with single inline bar
- Shorten status text: '✓'/'✗'/'–'/'!' instead of full words
- Flatten CSS: remove box-shadows, border-radius, reduce padding
- Use system-ui font, 13px table, 12px details
- Conditional reasoning section (only shown when present)
- Single toggle JS function instead of two
- Shorter column headers

Assisted-by: llama.cpp:local pi

* llama-eval : check server connectivity on startup

- Hit /v1/models for each server before evaluation
- Exit with error if any server is unreachable
- Print comma-separated model IDs per server in startup output
- Sequential checks, no retries, no timeout override

Assisted-by: llama.cpp:local pi

* llama-eval : use server1/server2 instead of gpu1/gpu2 in README

Assisted-by: llama.cpp:local pi

---------

Co-authored-by: gatbontonpc <gatbontonpc@gmail.com>
2026-05-12 15:07:00 +03:00