mirror of https://github.com/AgentSeal/codeburn.git
synced 2026-05-17 03:56:45 +00:00
4 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
daa673449c
|
Menubar and CLI hardening from multi-agent audit (#257)
Two passes of validators across CLI accuracy, dashboard UX, menubar
Swift, performance, security, and end-to-end smoke tests on real
session data.
Data-correctness fixes:
- parseLocalDate rejects month/day overflow. JS Date silently rolled
Feb 31 to Mar 3, so --from 2026-02-31 --to 2026-03-15 quietly
dropped sessions on Feb 28 - Mar 2. Now throws "Invalid date" with
a clear reason. Leap-day case covered (2024-02-29 valid,
2025-02-29 rejected).
- CSV/JSON exports use the active currency's natural decimal places.
The previous round2 helper produced ¥412.37 in CSV while the
dashboard rendered ¥412; finance teams comparing the two surfaces
saw a discrepancy. New roundForActiveCurrency consults
Intl.NumberFormat for the right precision (0 for JPY/KRW/CLP, 2 for
USD/EUR, etc).
- Copilot toolRequests is Array.isArray-guarded in both modern and
legacy event branches. Previously a corrupt session with
toolRequests=null or a string aborted the whole file's parse loop
and silently dropped every legitimate call after it.
- Codex token_count dedup uses a null sentinel for
prevCumulativeTotal so the first event is never confused with a
duplicate. Sessions that emit only last_token_usage (no
total_token_usage) report cumulativeTotal=0 on every event; with
the previous 0-initialized prev, the first event matched the dedup
guard and was dropped.
- LiteLLM pricing values are clamped to [0, 1] per token via
safePerTokenRate. Defense in depth against a tampered upstream JSON
shipping negative or absurdly large per-token costs that would
otherwise propagate into all cost totals.
Performance:
- Cursor SQLite parse no longer pegs at minutes on multi-GB DBs. Two
changes: per-conversation user-message buffer uses an index pointer
instead of Array.shift() (which was O(n) per call); and a real
ROWID cutoff via subquery limits the scan to the most recent 250k
bubbles with a stderr warning so power users get a partial report
rather than a stalled CLI.
- Spawned codeburn CLI subprocesses are terminated when the calling
Task is cancelled. Without this, rapid period/provider tab clicks
in the menubar cancelled the Task but left the subprocess running
to completion, piling up zombie processes.
UX:
- Dashboard period switch flips to loading and clears projects
synchronously before reloadData runs, eliminating the frame where
the new period label rendered over the old period's projects.
- Optimize findings tab paginates 3-at-a-time with j/k scroll. With
4 new detectors plus 7 originals, 8-10 findings * 6 lines was
scrolling the StatusBar off the alt buffer top.
- Custom --from/--to ranges hide the period tab strip and disable
the 1-5 / arrow keys so a stray period press no longer abandons the
user's explicit range. A "Custom range: X to Y" banner replaces the
tab strip.
- OpenCode storage-format warning is per-table-set, rate-limited to
once per process, and points the user at OpenCode's migration step
or the issue tracker. The previous all-or-nothing check fired the
generic "format not recognized" string for any schema mismatch.
Menubar / OAuth:
- Both Claude and Codex bootstrap (Reconnect button) now honour the
usageBlockedUntil 429 backoff that refreshIfBootstrapped respects.
Spamming Reconnect during sustained rate-limit windows previously
hammered the upstream endpoint on every click.
- Codex Retry-After HTTP header is parsed (delta-seconds plus
IMF-fixdate fallback) so we don't over-back-off when ChatGPT tells
us a shorter window than our 5-minute floor.
- Both credential cache files are written via SafeFile.write
(O_CREAT | O_EXCL | O_NOFOLLOW with explicit 0600) so there is no
race window where the temp file briefly exists at default umask,
and a symlink at the destination cannot redirect the write. Reads
now route through SafeFile.read with a 64 KiB cap, closing the
symlink-follow gap on Data(contentsOf:).
CI signal:
- TypeScript strict typecheck (tsc --noEmit) is now zero errors. The
six errors in src/providers/copilot.ts came from a
discriminated-union catch-all branch whose
`data: Record<string, unknown>` shape TS picked over the specific
event branches when narrowing on `type`. Removed the catch-all;
runtime falls through unknown event types via the existing if/else
chain.
Tests added: 16 new (now 555 total)
- date-range-filter: month/day/year overflow rejection, leap-day
correctness
- currency-rounding: convertCost no-rounding contract,
roundForActiveCurrency for USD/JPY/KRW/EUR
- providers/copilot: malformed toolRequests does not abort the parse
- providers/cursor-bubble-dedup: re-parse after token mutation does
not double-count, single parse yields one call per bubble
- providers/codex: first event with cumulativeTotal=0 not dropped,
consecutive zero-cumulative duplicates still deduped |
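The month/day overflow rejection described for parseLocalDate can be illustrated with a round-trip check. This is a minimal sketch, not the code from the commit: the function name `parseLocalDateStrict` and the error wording are assumptions made for illustration.

```typescript
// Hypothetical sketch; the real parseLocalDate in codeburn may differ.
function parseLocalDateStrict(input: string): Date {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(input);
  if (!match) {
    throw new Error(`Invalid date "${input}": expected YYYY-MM-DD`);
  }
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);

  // The Date constructor normalizes out-of-range fields instead of failing,
  // e.g. new Date(2026, 1, 31) becomes March 3, 2026.
  const date = new Date(year, month - 1, day);

  // Reject the input if the normalized date no longer matches what was typed.
  const roundTrips =
    date.getFullYear() === year &&
    date.getMonth() === month - 1 &&
    date.getDate() === day;
  if (!roundTrips) {
    throw new Error(`Invalid date "${input}": no such calendar day`);
  }
  return date;
}
```

With this check, `2024-02-29` parses while `2025-02-29` and `2026-02-31` throw, matching the leap-day cases listed above.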
||
|
|
afd0ee7011
|
Validator hardenings on the bug-hunt batch (#254)
* Five correctness fixes from multi-agent bug hunt
A multi-agent audit of the codeburn correctness surface found five
real bugs, each producing visibly wrong numbers or risking data loss.
All five fixes were validated by parallel review agents and exercised
end-to-end against real session data on this machine.
- src/cli.ts: --refresh <seconds> was using bare parseInt as the
commander callback. Commander invokes the callback as
parseInt(value, previous), so previous becomes the radix:
--refresh 30 was being parsed as parseInt('30', 30) = 90, and
--refresh 60 became NaN. Replaced with parseInteger (already
defined at line 48 with radix locked to 10) at all three sites.
- src/providers/cursor.ts: parseAgentKv was timestamping every
agentKv call as new Date().toISOString() because the Cursor
SQLite schema has no per-message timestamp. Result: every
Cursor agent call regardless of when it happened landed in
today's date bucket. Now uses statSync(dbPath).mtimeMs as a
bounded ceiling so calls land at the actual last-write time of
the Cursor database, not today. Verified locally: a 1904-call
Cursor history with a March 22 mtime now correctly buckets into
all-time only and shows 0 calls for today/week/30days.
- src/providers/codex.ts: prev token counters were only updated
inside the cumulative-fallback branch, so a session emitting N
events with last_token_usage followed by one cumulative-only
event computed the next delta against prev=0 and double-counted
the entire cumulative window. Cost could be inflated 10-100x
for any mixed-format Codex session. Now prev advances to the
current cumulative state regardless of which branch ran.
- src/providers/gemini.ts: totalOutput accumulated output+thoughts
while totalThoughts was tracked separately. The result was
outputTokens = output+thoughts AND reasoningTokens = thoughts;
any consumer summing the two double-counted thoughts. Now
totalOutput holds just output, reasoningTokens holds thoughts,
and the cost calc folds thoughts into the output count to keep
pricing correct (Google bills thoughts at the output rate;
calculateCost has no reasoning parameter).
- src/export.ts: exportJson had no safety check before writeFile,
so codeburn export -f json -o ~/important.json would silently
clobber the user's file. CSV path had a marker-file guard; JSON
did not. Now refuses to overwrite a file unless its first 4KB
contain the codeburn schema marker. Uses a streaming partial
read so a large existing file does not OOM Node's ~512MB
string limit. Refuses directories outright.
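The --refresh radix bug from the first item above is easy to reproduce in isolation. The sketch below simulates an option parser that calls its callback as (value, previous); `applyOption` is a hypothetical stand-in for that behaviour, the body of `parseInteger` is an assumption about the helper the commit mentions, and the `previous` values are chosen only for illustration.

```typescript
// Radix-locked parser, similar in spirit to the parseInteger helper named
// in the commit message (this body is an assumption).
function parseInteger(value: string): number {
  return parseInt(value, 10);
}

// Hypothetical stand-in for an option parser that invokes the callback
// with the raw value and the previous/default value.
function applyOption<T>(
  parse: (value: string, previous: T) => T,
  value: string,
  previous: T,
): T {
  return parse(value, previous);
}

// Passing bare parseInt means `previous` is silently used as the radix.
const buggy = applyOption<number>(parseInt, "30", 30); // parseInt("30", 30) is 90
const outOfRange = applyOption<number>(parseInt, "60", 60); // radix above 36 gives NaN
const fixed = applyOption<number>(parseInteger, "30", 30); // 30

console.log({ buggy, outOfRange, fixed });
```

The wrapper ignores its second argument entirely, which is why locking the radix inside the helper is enough to fix every call site.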
Skipped intentionally: cursor-auto/copilot-auto/cline-auto/
qwen-auto are aliased to claude-sonnet-4-5. The audit flagged
this as wrong pricing for non-Anthropic auto-routed turns, but
Cursor's "auto" mode does not expose the actual model and any
alternative estimate is equally arbitrary. README already
documents this as a Sonnet-based estimate.
vitest run: 38 files, 529 tests pass.
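The Codex double-counting fix described above comes down to advancing the previous cumulative total on every event, not only in the fallback branch. The following is a simplified sketch under assumed names: the `TokenEvent` shape and its fields are hypothetical and much smaller than the real Codex event format.

```typescript
// Hypothetical, simplified event shape (not the real Codex schema).
interface TokenEvent {
  lastTokens?: number; // per-event usage, when the session emits it
  totalTokens?: number; // cumulative usage, when the session emits it
}

function sumBilledTokens(events: TokenEvent[]): number {
  let prevTotal = 0;
  let billed = 0;
  for (const event of events) {
    let delta: number;
    if (event.lastTokens !== undefined) {
      // Preferred branch: the event reports its own usage.
      delta = event.lastTokens;
    } else if (event.totalTokens !== undefined) {
      // Fallback branch: derive usage from the cumulative counter.
      delta = Math.max(0, event.totalTokens - prevTotal);
    } else {
      continue;
    }
    billed += delta;
    // The fix: advance prev regardless of which branch ran, so a later
    // cumulative-only event is not diffed against a stale 0.
    prevTotal = event.totalTokens ?? prevTotal + delta;
  }
  return billed;
}

const mixedSession: TokenEvent[] = [
  { lastTokens: 100, totalTokens: 100 },
  { lastTokens: 50, totalTokens: 150 },
  { totalTokens: 200 }, // cumulative-only event
];
console.log(sumBilledTokens(mixedSession)); // 200
```

If `prevTotal` were updated only inside the fallback branch, the last event would contribute 200 instead of 50 and the session would total 350.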
* Five more correctness fixes from the bug-hunt round
This commit closes out the remaining critical-tier findings from the
multi-agent audit, with one item documented as a known limitation.
- src/providers/cursor.ts: bubble dedup key included mutable
inputTokens/outputTokens. Cursor mutates token counts on the row in
place when streaming completes, so re-parsing the same DB produced
a fresh dedup key per bubble and silently double-counted. Switched
to the SQLite row key (`bubbleId:<unique>`) which is stable per
bubble. Adjusted BubbleRow type and BUBBLE_QUERY_BASE to expose
`key as bubble_key`.
- src/providers/pi.ts: usage fields were destructured non-optionally,
but real Pi/OMP session files sometimes omit individual fields.
`calculateCost(model, undefined, ...)` returned NaN, and that NaN
propagated into every aggregate cost total. Coerce each field to
0 with `?? 0`.
- src/models.ts: getShortModelName and the getModelCosts startsWith
fallback both walked the dictionary in insertion order. A model id
like `gpt-5-mini` could resolve to the entry for `gpt-5` (matched
by startsWith first) and silently get GPT-5's display name and
pricing tier. Iterate longest keys first so more-specific prefixes
win. Tightened the cost fallback's match condition from
`startsWith(key) || startsWith(key + '-')` to require either an
exact match or a `key + '-'` continuation, removing accidental
matches like `gpt-50` against `gpt-5`.
- src/models.ts: calculateCost returned 0 silently for any model
missing from the pricing snapshot. New Anthropic / OpenAI models
shipped between snapshot refreshes look free until the user
notices. Now warns once per unknown model name per process to
stderr. Skips the warning for the `<synthetic>` placeholder so
the noise floor stays low.
- src/yield.ts: revert detection was broken on the canonical case.
Two problems: (1) `subject.toLowerCase().includes('revert')`
matched any commit whose subject mentioned the word ("Add revert
button" was misclassified). (2) The window logic only counted
reverts within the original session's 1-hour boundary, but real
`git revert` commits land in later sessions, so original sessions
always looked productive. Now: getRevertedShas runs once with
`--grep=^This reverts commit` and parses bodies to build a Set of
SHAs that were the target of a revert anywhere in history.
CommitInfo.wasReverted is set when this commit's SHA appears in
that set. categorizeSession then flags a session as reverted when
its in-main commits were later reverted, regardless of when the
revert itself happened.
- src/providers/droid.ts: SKIPPED with comment. Droid records token
usage only at session level. The current behavior splits evenly
across emitted assistant calls and prices all of them at
settings.model (the latest model). For sessions where the user
switched models mid-stream, costs are approximate. Added an
inline comment documenting this; a real fix requires per-message
model data that isn't in the Droid JSONL schema.
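The longest-key-first lookup described for src/models.ts above can be sketched as follows. The table contents and the name `resolveModelKey` are placeholders for illustration; only the matching rule (exact match or a `key + '-'` continuation, longest keys tried first) comes from the commit message.

```typescript
// Placeholder table; the real pricing snapshot has many more entries.
const MODEL_TABLE: Record<string, { displayName: string }> = {
  "gpt-5": { displayName: "GPT-5" },
  "gpt-5-mini": { displayName: "GPT-5 Mini" },
};

// Sort once so more-specific (longer) keys are tried before their prefixes.
const KEYS_LONGEST_FIRST = Object.keys(MODEL_TABLE).sort(
  (a, b) => b.length - a.length,
);

function resolveModelKey(modelId: string): string | undefined {
  return KEYS_LONGEST_FIRST.find(
    // Exact match, or the key followed by a '-' continuation.
    (key) => modelId === key || modelId.startsWith(key + "-"),
  );
}

console.log(resolveModelKey("gpt-5-mini")); // "gpt-5-mini"
console.log(resolveModelKey("gpt-5-2025-08-07")); // "gpt-5"
console.log(resolveModelKey("gpt-50")); // undefined
```

Requiring the `-` separator is what keeps `gpt-50` from matching `gpt-5`, while the length ordering keeps `gpt-5-mini` from falling through to `gpt-5`.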
Verified end-to-end on this machine:
- vitest run: 38 files, 529 tests pass
- `codeburn report --format json` produces valid JSON
- `codeburn yield -p week` runs without crashing, finds 0 reverts
in the user's recent git history (plausible: the fix changed the
detection from "subject contains revert" to "this commit's SHA
appears in a later 'This reverts commit ...' body")
- Stderr now warns for unknown model ids: `openai/gpt-5.3`,
`qwen3.6:35b-a3b-bf16`, `big-pickle`. These previously priced
silently at $0.
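The revert detection change verified above replaces a subject-line substring test with a scan of commit bodies for the line that `git revert` writes. A minimal sketch of that parsing step, assuming the bodies have already been collected as strings (the name `collectRevertedShas` is hypothetical and this is not the code from src/yield.ts):

```typescript
// Builds the set of SHAs that some later commit body says it reverts.
function collectRevertedShas(commitBodies: string[]): Set<string> {
  const reverted = new Set<string>();
  // git revert writes "This reverts commit <sha>." on its own line.
  const pattern = /^This reverts commit ([0-9a-f]{7,40})\b/gm;
  for (const body of commitBodies) {
    pattern.lastIndex = 0; // reset the global regex for each body
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(body)) !== null) {
      const sha = match[1];
      if (sha) reverted.add(sha);
    }
  }
  return reverted;
}

const bodies = [
  'Revert "Add feature"\n\nThis reverts commit 0123456789abcdef0123456789abcdef01234567.',
  "Add revert button", // mentions the word but reverts nothing
];
const revertedShas = collectRevertedShas(bodies);
console.log(revertedShas.size); // 1
```

A commit is then treated as reverted when its own SHA is in the set, independent of which session the revert landed in.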
* Four high-severity fixes from the bug-hunt round
- src/currency.ts: getExchangeRate wrapped fetchRate and cacheRate in
one try/catch. If fetchRate succeeded but cacheRate threw (disk
full, ENOSPC, no permissions on the cache dir), the catch block
swallowed the error and returned 1. Every cost rendered after that
point became USD-equivalent silently. Now the fetch and the cache
write live in separate paths: a successful fetch returns the rate
even if the persist fails, and the cache write is fire-and-forget
with its error dropped, so transient disk problems do not corrupt the
user's currency display.
- src/cursor-cache.ts: writeFile was non-atomic. Two concurrent
codeburn invocations writing to cursor-results.json could
interleave bytes mid-write, leaving a truncated file that
failed to parse on the next read and forced a full SQLite re-scan every
run. Switched to the temp-file + rename pattern with a randomized
temp name so each writer gets its own staging file and the rename
is atomic on POSIX. Crash mid-write also leaves only a leftover
temp file, which gets unlinked in the catch path; the destination
is never half-written.
- mac/.../CodeBurnApp.swift refresh loop on sleep: the loop's
Task.sleep keeps a wakeup pending across system sleep, so on wake
the natural tick fires the same instant the wake observers do.
Combined with didWakeNotification, screensDidWakeNotification, and
the launchd com.codeburn.refresh distributed notification, that
produced 2-3 concurrent CLI spawns within ms of every wake. Now:
willSleepNotification cancels the loop task; didWakeNotification
restarts it. The loop also reads lastRefreshTime and skips its
natural tick if a wake/manual/distributed-notification refresh ran
within the last 5 seconds, coalescing the two sources of refresh
into one CLI spawn per wake event.
- mac/.../CodeBurnApp.swift observeStore: the read closure had an
implicit strong self capture (it accessed store.* without a
capture annotation), pinning self for the lifetime of any
unfired observation. Added [weak self] and a guard to make the
capture explicit. withObservationTracking is one-shot per call,
so there is at most one active subscription at a time; the
earlier audit's claim of an unbounded leak overstated the issue,
but tightening the capture pattern is still cleaner.
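The getExchangeRate change in the first item above separates the fetch path from the cache-write path. A minimal sketch of that structure with injected dependencies; the function name, the `FetchRate`/`CacheRate` types, and the fallback of 1 on a failed fetch are assumptions based on the description, not the code in src/currency.ts.

```typescript
// Hypothetical dependency types so the sketch runs without network or disk.
type FetchRate = (currency: string) => Promise<number>;
type CacheRate = (currency: string, rate: number) => Promise<void>;

async function getExchangeRateSketch(
  currency: string,
  fetchRate: FetchRate,
  cacheRate: CacheRate,
): Promise<number> {
  let rate: number;
  try {
    rate = await fetchRate(currency);
  } catch {
    // Only a failed fetch falls back to 1 (USD-equivalent display).
    return 1;
  }
  // Fire-and-forget persist: a cache-write failure is dropped and cannot
  // replace the rate that was just fetched.
  void cacheRate(currency, rate).catch(() => {});
  return rate;
}
```

Because the persist happens outside the try block that guards the fetch, a full disk no longer turns a good rate into the fallback value.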
Verified:
- vitest run: 38 files, 529 tests pass
- swift build -c release --arch arm64 --arch x86_64: clean, no
diagnostics, no MainActor warnings
- mac/Scripts/package-app.sh dev produces a valid universal bundle
- Menubar launches and runs without crash
* Eleven medium-severity fixes from the bug-hunt round
- src/format.ts formatTokens: guard against Infinity, NaN, and
negative input. Previously a corrupt aggregate could leak into
the UI as the literal strings "NaN" or "Infinity". Negatives now
render as "0" rather than "-500" with no scaling.
- src/cli-date.ts parseDateRangeFlags: the missing-from default
was new Date(0), which opened a 55-year scan from 1970 epoch
whenever the user passed only --to. Default now anchors at 6
months back from now, matching the dashboard's all-time period.
Test updated to assert the new bounded window.
- src/cli-date.ts toPeriod: previously fell back silently to "week"
for any unknown input, so a typo like `-p mounth` produced a
quiet 7-day report while the user thought they were viewing the
month. Now exits with a clear stderr error and exit code 1.
Test updated to assert the loud-failure behavior.
- src/optimize.ts urgencyScore: rebalanced weights so a high-impact
finding with zero observed tokens cannot outrank a medium-impact
finding with millions of tokens. Old 0.7/0.3 split made high+0
(0.70) beat medium+1B (0.65). New 0.5/0.5 split makes medium+1B
(0.75) beat high+0 (0.50). Token normalization lifted to 5M so
the ramp covers a realistic spend range.
- src/models.ts calculateCost: clamp negative or non-finite token
inputs to 0 before pricing. A corrupt JSONL emitting a negative
count would otherwise produce a negative cost that silently
subtracted from real spend in aggregates.
- src/currency.ts convertCost: stop rounding during aggregation.
For zero-fraction currencies (JPY, KRW, CLP) this clamped every
per-session cost to a whole unit before sum, so a project of
1000 sessions averaging ¥0.4 each aggregated to ¥0 instead of
¥400. formatCost still rounds at the display boundary.
- src/config.ts saveConfig: the temp file path was a fixed
`${configPath}.tmp` suffix. Two simultaneous saveConfig calls
(overlapping menubar and CLI runs) raced on the same staging
file and could leave one writer reading partial bytes from the
other. Randomized the temp suffix per call.
- src/providers/antigravity.ts flushCache: the early return on
`!cacheDirty` short-circuited eviction when liveCascadeIds was
supplied but no cascade had been added or updated this run. As
a result, deleted .pb files persisted in the cache forever once
the user stopped writing to it. Eviction now runs whenever
liveCascadeIds is provided, marks the cache dirty if anything
was removed, and only then short-circuits if there is nothing
to write.
- src/daily-cache.ts addNewDays: cap retention at 2 years. The
days array previously merged forever, growing the cache file by
hundreds of bytes per day until JSON parse on every CLI
invocation became measurable. The 6-month UI period plus the
365-day BACKFILL_DAYS bootstrap both fit comfortably inside the
cap, with headroom for a future longer window.
- src/dashboard.tsx useInput: period number keys (1-5) and arrow
keys triggered a reload while the compare view was mounted. The
parent's data state changed underneath the user with no visual
affordance back to the dashboard. Now those keys are gated on
view !== 'compare', and `b` / Esc inside compare returns to the
dashboard.
- mac/.../HeatmapSection.swift formatters: prettyDate,
buildTrendBars, computeTrendStats, computeForecast, and computeAllStats
each allocated a fresh DateFormatter (and Calendar) on every
call. SwiftUI re-evaluates these views many times per second
during hover scrubbing on the trend chart, so the allocations
were a measurable hot spot. Lifted the yyyy-MM-dd / "EEE MMM d"
/ "MMM d" formatters and the gregorian Calendar to fileprivate
cached singletons.
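The urgencyScore rebalancing in the list above can be checked by hand. The sketch below assumes impact values of 1.0 for high and 0.5 for medium, which reproduce the scores quoted in the commit message; those values and the linear formula are inferred assumptions, not taken from src/optimize.ts. It also applies the 5M token ceiling to both weightings, since 1B tokens saturates the ramp either way.

```typescript
type Impact = "high" | "medium";

// Assumed impact values: they reproduce the 0.70 / 0.65 / 0.75 / 0.50
// figures quoted above, but are not confirmed by the source.
const IMPACT_VALUE: Record<Impact, number> = { high: 1.0, medium: 0.5 };
const TOKEN_CEILING = 5_000_000; // normalization ceiling named in the commit

function urgencyScoreSketch(
  impact: Impact,
  tokens: number,
  impactWeight: number,
): number {
  // Linear ramp from 0 tokens to the ceiling, clamped to [0, 1].
  const tokenScore = Math.min(Math.max(tokens, 0) / TOKEN_CEILING, 1);
  return impactWeight * IMPACT_VALUE[impact] + (1 - impactWeight) * tokenScore;
}

// Old 0.7/0.3 split: high+0 (about 0.70) outranks medium+1B (about 0.65).
const oldHigh = urgencyScoreSketch("high", 0, 0.7);
const oldMedium = urgencyScoreSketch("medium", 1_000_000_000, 0.7);

// New 0.5/0.5 split: medium+1B (0.75) outranks high+0 (0.50).
const newHigh = urgencyScoreSketch("high", 0, 0.5);
const newMedium = urgencyScoreSketch("medium", 1_000_000_000, 0.5);

console.log({ oldHigh, oldMedium, newHigh, newMedium });
```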
Two findings from the same bucket were not addressed here:
- UpdateChecker SHA-256 / codesign verification is already
performed by src/menubar-installer.ts (verifyChecksum at line
85). The Swift side just kicks off `codeburn menubar --force`
which runs that path. The audit's claim of missing verification
was a misread.
- NSDistributedNotificationCenter sender validation: the
`com.codeburn.refresh` listener accepts from any sender, but
forceRefresh has a 5-second rate-limit gate so the abuse
ceiling is one CLI spawn per 5 seconds. Mitigations (Mach IPC,
per-launch shared secret) are disproportionate to the impact.
vitest run: 38 files, 529 tests pass.
swift build -c release: clean, no warnings.
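The convertCost item above (stop rounding during aggregation) is the difference between rounding each addend and rounding the sum. A small sketch of the ¥0.4 example, using Intl.NumberFormat to look up a currency's decimal places; the helper names `currencyDecimals` and `roundForCurrency` are hypothetical and are not the functions in src/currency.ts.

```typescript
// Number of fraction digits the currency normally displays (0 for JPY).
function currencyDecimals(currency: string): number {
  const formatter = new Intl.NumberFormat("en-US", {
    style: "currency",
    currency,
  });
  return formatter.resolvedOptions().maximumFractionDigits ?? 2;
}

function roundForCurrency(amount: number, currency: string): number {
  const factor = 10 ** currencyDecimals(currency);
  return Math.round(amount * factor) / factor;
}

// 1000 sessions at 0.4 yen each, as in the commit message.
const sessionCostsJpy = Array.from({ length: 1000 }, () => 0.4);

// Rounding each session first collapses every addend to 0.
const roundedEarly = sessionCostsJpy.reduce(
  (sum, cost) => sum + roundForCurrency(cost, "JPY"),
  0,
);

// Summing first and rounding once at the display boundary keeps the total.
const roundedLate = roundForCurrency(
  sessionCostsJpy.reduce((sum, cost) => sum + cost, 0),
  "JPY",
);

console.log(roundedEarly, roundedLate); // 0 400
```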
* Validator hardenings on the bug-hunt batch
Hoist the per-call sort in getModelCosts and getShortModelName to module
scope so model lookups on the hot path stop reallocating sorted key arrays.
Sanitize the unknown-model stderr warning by stripping C0/C1 controls
and capping length, so a hostile or corrupt JSONL cannot inject terminal
escape sequences via the model field.
Skip the daily-cache prune when newestDate fails to parse. The previous
code produced a NaN cutoff and silently dropped every cached day on the
next merge.
Adds tests locking down the stable resolution of common model names
(gpt-5-mini vs gpt-5, claude-haiku-4-5 vs claude-3-5-haiku, etc.) and
the prune NaN guard.
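The stderr sanitization described above (strip C0/C1 controls, cap the length) can be sketched in a few lines. The function name, the 80-character cap, the trailing "..." marker, and the inclusion of DEL (U+007F) alongside the C0/C1 ranges are all assumptions for illustration; the commit message does not give those details.

```typescript
const MAX_MODEL_NAME_LENGTH = 80; // hypothetical cap, not from the source

function sanitizeForStderr(raw: string): string {
  // C0 controls (U+0000 to U+001F), DEL (U+007F), C1 controls (U+0080 to U+009F).
  const stripped = raw.replace(/[\u0000-\u001f\u007f-\u009f]/g, "");
  return stripped.length > MAX_MODEL_NAME_LENGTH
    ? stripped.slice(0, MAX_MODEL_NAME_LENGTH) + "..."
    : stripped;
}

// The ESC byte is removed, so the sequence can no longer change terminal
// colours; the harmless printable remainder stays visible.
console.log(sanitizeForStderr("gpt\u001b[31m-x")); // "gpt[31m-x"
```

Removing only the control characters, rather than rejecting the whole name, keeps the warning useful for spotting which model id was unknown.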
|
||
|
|
fc4c4f0091
|
feat(export): support custom date ranges | ||
|
|
c634b10560
|
feat(report): add --from/--to date range filtering and avgCostPerSession (#80)
* test(cli): failing tests for parseDateRangeFlags helper
* feat(cli): add parseDateRangeFlags helper with local-time dates
* feat(report): add --from/--to date range filtering
* feat(report): add avgCostPerSession to JSON report and CSV/JSON export |