Delete model-efficiency-scorecard.md

2026-07-09 17:08:29 +00:00 · 2026-05-11 11:50:30 +02:00 · 2026-05-11 11:50:30 +02:00 · a972f870b0
commit a972f870b0
parent f17198e126
1 changed files with 0 additions and 67 deletions
--- a/docs/agents/model-efficiency-scorecard.md
+++ b/docs/agents/model-efficiency-scorecard.md
@ -1,67 +0,0 @@
-# Agent Zero Model Efficiency Scorecard
-
-Date: 2026-05-10
-
-This scorecard synthesizes the untracked `EVIDENCE-*.md` provider sweeps. It is not a general intelligence benchmark. It scores how efficiently each provider/model pairing drove Agent Zero's live tool suite under short, median-user prompts.
-
-## Scoring
-
-Score is a 0-100 review score based on:
-
- first-pass correct tool routing
- successful completion without repair turns
- local/server vs host/remote locality discipline
- correct use of skill-gated tools
- provider support for vision and remote/CUA surfaces
- low final-answer drift, repeated tool loops, and hallucinated success
-
-Repairs, partial routing, provider-specific limitations, unnecessary tool probes, and cleanup hazards lower the score even when the final user-visible answer was correct.
-
-## Chart
-
-| Rank | Provider / model | Score | Chart | Main read |
-|---:|---|---:|---|---|
-| 1 | Nebius AI / `Qwen/Qwen3.5-397B-A17B` | 96 | `###################-` | Best broad reliability; most tools passed, with repair noise mostly confined to patch/scheduler. |
-| 2 | A0 Venice / `z-ai-glm-5v-turbo` | 92 | `##################--` | Strong unified-tool behavior; Venice image validation and patch repair remain the main costs. |
-| 3 | OpenRouter / `anthropic/claude-sonnet-4.6` | 89 | `##################--` | Cleanest locality and patching; remaining gaps are higher-level routing and semantic scope. |
-| 4 | OpenRouter / `xiaomi/mimo-v2.5-pro` | 88 | `##################--` | Strong explicit-tool performer; natural remote patch and unsupported vision endpoint hurt reliability. |
-| 5 | Moonshot AI / `kimi-2.6` | 87 | `#################---` | Good primitives and scheduler; skill read_file, document query routing, and vision perception were weaker. |
-| 6 | OpenRouter / `google/gemini-3.1-flash-lite` | 85 | `#################---` | Strong tool surface, but behavior/memory persistence leaked temporary rules into later chats. |
-| 7 | A0 Venice / `google-gemma-4-26b-a4b-it` | 83 | `#################---` | Good for a smaller open model after prompt cleanup; memory wording, scheduler, and Venice vision were fragile. |
-| 8 | OpenRouter / `anthropic/claude-haiku-4.5` | 81 | `################----` | Usable when explicit; natural memory, scheduler shape, CUA status, and notify defaults are less reliable. |
-| 9 | SambaNova / `MiniMax-M2.7` | 71 | `##############------` | Concrete tools often worked, but provider noise, natural memory, notify loops, CUA, and vision were risky. |
-| 10 | OpenRouter / `openai/gpt-4.1-mini` | 67 | `#############-------` | Basic syntax is competent, but natural routing often looks successful while using the wrong tool path. |
-| 11 | Nebius AI / `nvidia/Nemotron-3-Nano-Omni` | 66 | `#############-------` | Most affected by stale final answers, local/remote confusion, memory misses, and patch instability. |
-
-## Cross-Provider Failure Clusters
-
- Natural document questions often use `text_editor` instead of `document_query`.
- Skill requests sometimes jump straight to `load` and skip `search`, even when the user asks to find a skill.
- Scheduler reminders often start with ISO timestamps in `schedule` instead of cron fields or `plan`.
- Normal notifications often omit `priority: 10` or use `success` styling for plain notes.
- Patch requests are still cognitively expensive for smaller models, especially after locality ambiguity.
- Natural memory requests conflict with prompt-include guidance unless the prompt separates durable memory from project instruction files.
- Behavior adjustments can leave vector-memory residue even when `behaviour.md` is restored.
- Vision reliability is provider-specific; some endpoints reject image input or validate embedded media differently.
- CUA should stay skill-gated and beta; status is generally safer than capture/action testing.
-
-## Improvements Applied From This Pass
-
- Renamed high-impact skills to task-oriented names and moved plugin-owned skills into their plugin folders.
- Updated skill frontmatter on renamed skills toward the official `name` + `description` standard.
- Clarified memory-vs-promptinclude guidance so "remember/forget" routes to memory tools.
- Clarified scheduler one-time vs cron task shapes and timezone handling.
- Clarified `document_query` as the preferred document-QA tool for document paths and URLs.
- Clarified `skills_tool` search/load/read_file order and required arguments.
- Clarified normal notification priority/type.
- Clarified local and host text-editor patch guidance when the user says not to rewrite.
-
-## Next Candidates
-
- Make memory delete/forget clean up derived fragments more predictably, or expose fragment cleanup as an explicit backend result.
- Add behavior-adjustment scoping and cleanup protection so temporary rules do not become persistent vector memories.
- Consider a scheduler convenience path for one-time reminders that maps ordinary time phrases to `create_planned_task` without relying on model cron synthesis.
- Add provider capability checks before `vision_load` injects image content into text-only or image-rejecting endpoints.
- Improve patch ergonomics with a simpler replace-by-exact-text path for both local and host file editors.
- Add notify de-duplication/loop protection at the tool layer for repeated identical notifications.
- Improve A2A error text around required response history so models do not treat `(no response)` as success.