---
title: "MemScore"
description: "A composite metric for comparing memory providers across quality, latency, and token efficiency"
---
## What is MemScore?
MemScore is a composite metric that captures three dimensions of memory provider performance in a single line:
```
accuracy% / latencyMs / contextTok
```
For example:
```
85% / 120ms / 1500tok
```
This tells you the provider achieved **85% accuracy**, with an average search latency of **120ms**, while sending an average of **1,500 tokens** of context to the answering model per question.
## Components
| Component | What it measures | Source |
|-----------|-----------------|--------|
| **Quality** | Answer accuracy as a percentage | `(correct / total) * 100` from judge evaluations |
| **Latency** | Average search response time in milliseconds | Mean of all search phase durations |
| **Tokens** | Average context tokens sent to the answering model | Client-side token count of retrieved context per question |
<Note>
MemScore is not a single number — it's a triple. This is intentional. Collapsing quality, latency, and cost into one score hides important tradeoffs. A provider with 90% accuracy at 5,000 tokens is very different from one with 90% accuracy at 500 tokens.
</Note>
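As a concrete illustration of how the three components combine, here is a minimal TypeScript sketch. The `QuestionResult` shape and `formatMemScore` helper are hypothetical stand-ins for illustration, not MemoryBench's actual internals:

```typescript
// Illustrative sketch only: the QuestionResult shape and helper name
// are hypothetical, not MemoryBench's real types.
interface QuestionResult {
  correct: boolean;      // judge verdict for this question
  searchMs: number;      // search phase duration in milliseconds
  contextTokens: number; // tokens in the retrieved context string
}

function formatMemScore(results: QuestionResult[]): string {
  const n = results.length; // assumes at least one question was run
  const quality = Math.round(
    (results.filter((r) => r.correct).length / n) * 100,
  );
  const latencyMs = Math.round(
    results.reduce((sum, r) => sum + r.searchMs, 0) / n,
  );
  const contextTokens = Math.round(
    results.reduce((sum, r) => sum + r.contextTokens, 0) / n,
  );
  return `${quality}% / ${latencyMs}ms / ${contextTokens}tok`;
}
```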
## How token counting works
MemoryBench counts tokens client-side using provider-specific tokenizers:
| Model provider | Tokenizer | Method |
|----------------|-----------|--------|
| **OpenAI** | `js-tiktoken` | Exact count using `o200k_base` or `cl100k_base` encoding |
| **Anthropic** | `@anthropic-ai/tokenizer` | Exact count using Anthropic's tokenizer |
| **Google** | Approximation | `Math.ceil(text.length / 4)` |
Three token values are tracked per question:
- **`promptTokens`** — Total tokens in the full prompt (instructions + context + question)
- **`basePromptTokens`** — Tokens in the prompt without any retrieved context
- **`contextTokens`** — Tokens in just the retrieved context string
MemScore uses `contextTokens` because it isolates what the memory provider actually contributed to the prompt.
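As a rough sketch of the counting logic described in the table above (the `countContextTokens` helper and its `provider` switch are illustrative assumptions, not MemoryBench's actual code):

```typescript
import { getEncoding } from "js-tiktoken";
import { countTokens } from "@anthropic-ai/tokenizer";

// Illustrative only: MemoryBench's real counting code may differ.
function countContextTokens(
  text: string,
  provider: "openai" | "anthropic" | "google",
): number {
  switch (provider) {
    case "openai":
      // Exact count; choose the encoding matching the answering model.
      return getEncoding("o200k_base").encode(text).length;
    case "anthropic":
      // Exact count using Anthropic's published tokenizer package.
      return countTokens(text);
    case "google":
      // No client-side tokenizer, so approximate at ~4 chars per token.
      return Math.ceil(text.length / 4);
  }
}
```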
## Where MemScore appears
### CLI output
After a benchmark run completes, MemScore is printed in the summary:
```
SUMMARY:
Total Questions: 50
Correct: 43
Accuracy: 86.00%
Quality: 86%
Latency: 145ms (avg)
Tokens: 1,823 (avg context sent to answering model)
MemScore: 86% / 145ms / 1823tok
```
### Web UI
The MemScore card appears at the top of the run overview page. Per-question token counts are shown next to each model answer in both the question list and detail views.
### Report JSON
The `report.json` file includes both a display string and structured components:
```json
{
  "memscore": "86% / 145ms / 1823tok",
  "memscoreComponents": {
    "quality": 86,
    "latencyMs": 145,
    "contextTokens": 1823
  },
  "tokens": {
    "totalTokens": 142500,
    "basePromptTokens": 21000,
    "contextTokens": 91150,
    "avgTokensPerQuestion": 2850,
    "avgBasePromptTokens": 420,
    "avgContextTokens": 1823
  }
}
```
Use `memscoreComponents` for programmatic comparisons — it avoids parsing the display string.
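For instance, a comparison script can read the structured fields directly. This is a minimal sketch; the report path and the trimmed type are assumptions for illustration:

```typescript
import { readFileSync } from "node:fs";

// Sketch only: the path and this trimmed type are illustrative assumptions.
interface MemScoreComponents {
  quality: number;       // accuracy percentage
  latencyMs: number;     // average search latency in milliseconds
  contextTokens: number; // average context tokens per question
}

const report = JSON.parse(readFileSync("report.json", "utf8")) as {
  memscoreComponents: MemScoreComponents;
};

const { quality, latencyMs, contextTokens } = report.memscoreComponents;
console.log(`${quality}% / ${latencyMs}ms / ${contextTokens}tok`);
```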
## Comparing providers
MemScore is most useful when comparing providers on the same benchmark:
```bash
bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -j gpt-4o
```
Each provider's report will include its own MemScore, making it easy to see tradeoffs at a glance:
| Provider | MemScore |
|----------|----------|
| Provider A | `88% / 145ms / 1200tok` |
| Provider B | `82% / 80ms / 2400tok` |
| Provider C | `85% / 110ms / 1800tok` |
In this example, Provider A has the highest accuracy and the leanest context, but the slowest search. Provider B is the fastest, yet it sends the most context while landing at the lowest accuracy, suggesting its retrieval is less precise. Provider C sits in the middle on all three axes. There is no single "winner": the right choice depends on whether you prioritize quality, speed, or token efficiency.
## Backward compatibility
Runs from before MemScore was added will still work. If token data is not present in the checkpoint, the `memscore`, `memscoreComponents`, and `tokens` fields will be `undefined` in the report. The CLI and web UI gracefully skip the MemScore display when data is unavailable.
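If you consume reports from mixed-age runs, treat these fields as optional. A tiny sketch (the `memScoreDisplay` helper is hypothetical):

```typescript
// Sketch: reports from pre-MemScore runs omit these fields entirely.
function memScoreDisplay(report: { memscore?: string }): string {
  return report.memscore ?? "MemScore unavailable (run predates token tracking)";
}
```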