* docs(sparse-attn): plain-language README intro, SEO, and tutorial gist
- Rewrite README opening for non-experts: what it is, why it matters,
who it's for, what it is NOT. Adds a Table of Contents and an FAQ.
- Document the new FastGRNN-gated near-linear path with a measured
scaling table and runnable example pointer.
- Add SEO-friendly keyword block at the bottom (rust llm inference,
sparse attention rust, near-linear attention, edge ai rust,
raspberry pi llm, gguf rust, mistral / llama / smollm2 / phi-2).
- New docs/TUTORIAL.md walks through the full pipeline end-to-end
(Cargo.toml → forward → KvCache decode → FP16 KV → FastGRNN gate
→ cross-compile to Pi). Published as
https://gist.github.com/ruvnet/790214c832928d6f2ec7ebe593bb3def
Co-Authored-By: claude-flow <ruv@ruv.net>
* chore(sparse-attn): add crates.io metadata for v0.1.0 publish
- repository, documentation, homepage URLs
- keywords (llm, attention, transformer, inference, edge)
- categories (algorithms, science, mathematics)
- expanded description mentioning subquadratic + FastGRNN near-linear
- rust-version = "1.77" (matches workspace MSRV)
Published v0.1.0 to crates.io: https://crates.io/crates/ruvllm_sparse_attention
Co-Authored-By: claude-flow <ruv@ruv.net>
* feat(sparse-attn): FastGRNN salience gate + forward_gated for near-linear scale
Adds a recurrent O(N · D_h²) FastGRNN pass that produces a per-token
salience score, then prunes the sparse-attention candidate set against
that score. Combined cost is O(N · (D_h² + W + G + K_keep + dim)),
linear in seq when the gate budget K_keep is constant.
New module `fastgrnn_gate`:
- FastGrnnGate cell (matches cognitum-agent's sparse_fastgrnn math
so weights round-trip via from_weights / score_sequence)
- score_sequence / score_kv: per-position salience over a sequence
- keep_mask_quantile / keep_mask_top_k: turn salience into a binary
keep-mask the attention candidate selector consumes
- step_with_hidden: streaming variant for online inference
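A minimal sketch of how a top-K keep-mask can be derived from salience
scores. This illustrates the idea only and is not the crate's
`keep_mask_top_k`; the function name, signature, and tie-breaking rule
below are assumptions.

```rust
/// Hypothetical stand-in for a top-K keep-mask: marks the `k` highest
/// salience scores as `true`. Equal scores keep the lower index first.
fn keep_mask_top_k_sketch(salience: &[f32], k: usize) -> Vec<bool> {
    let mut order: Vec<usize> = (0..salience.len()).collect();
    // Sort indices by descending score. `total_cmp` is a total order on
    // f32, so NaN inputs cannot make the comparison inconsistent, and
    // `sort_by` is stable, so ties stay in index order.
    order.sort_by(|&a, &b| salience[b].total_cmp(&salience[a]));

    let mut mask = vec![false; salience.len()];
    // `take(k)` simply stops early when k exceeds the sequence length.
    for &i in order.iter().take(k) {
        mask[i] = true;
    }
    mask
}
```

With scores `[0.1, 0.9, 0.4, 0.7]` and `k = 2`, positions 1 and 3 are
kept. A quantile variant would pick `k` from a fraction of the sequence
length instead of taking it directly.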
New methods on SubquadraticSparseAttention:
- forward_gated(q, k, v, keep_mask) — drops below-threshold tokens
from the long-range candidate set; window + globals + current
are always retained (causality preservation)
- forward_gated_with_fastgrnn(q, k, v, gate, top_k) — convenience
wrapper that does FastGRNN scoring + top-K masking + gated forward
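The retention rule described for forward_gated can be sketched as a
candidate-set builder. This is an illustration under assumptions, not
the crate's code: the function name, its parameters, and the way the
long-range candidates are supplied are all hypothetical. `std`'s
BTreeSet is used here; in no_std it would come from `alloc::collections`.

```rust
use std::collections::BTreeSet;

/// Hypothetical candidate selection for query position `i`.
/// Window, globals, and the current position are always kept;
/// long-range candidates are kept only where `keep_mask` is true.
fn gated_candidates_sketch(
    i: usize,
    window: usize,
    globals: &[usize],
    long_range: &[usize],
    keep_mask: &[bool],
) -> BTreeSet<usize> {
    let mut out = BTreeSet::new();
    // The `window` positions before `i`, plus `i` itself (never gated).
    out.extend(i.saturating_sub(window)..=i);
    // Global tokens at or before `i`, so causality holds (never gated).
    out.extend(globals.iter().copied().filter(|&g| g <= i));
    // Long-range candidates must be causal and pass the gate. Indices
    // outside the mask are treated as dropped.
    out.extend(
        long_range
            .iter()
            .copied()
            .filter(|&j| j <= i && keep_mask.get(j).copied().unwrap_or(false)),
    );
    out
}
```

With an all-true mask this returns the same set an ungated selector
would, and with an all-false mask only the window, globals, and current
position remain, which is the behaviour the tests below check on the
real implementation.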
Tests (5 new + 8 gate tests, all passing alongside 25 baseline):
- all-true mask is bit-identical to plain forward
- all-false mask preserves window + globals + current, output finite
- wrong mask length returns InvalidConfig
- smaller top_k reduces total candidate count
- end-to-end FastGRNN-driven path produces finite output
Scaling demo (examples/fastgrnn_gated_scaling.rs):
seq  | ungated/N | gated/N
-----|-----------|--------
128  | 0.0021    | 0.0029
2048 | 0.0029    | 0.0036
Growth over the 16× increase in seq (128 → 2048):
ungated/N grows ~1.38× (log-linear);
gated/N grows ~1.24× (sub-logarithmic, near-linear).
Zero new runtime dependencies (ADR-183 invariant preserved).
Co-Authored-By: claude-flow <ruv@ruv.net>
* feat(sparse-attn): no_std + alloc support, ESP32-S3 cross-compile verified
ADR-192 implementation. Crate is now no_std + alloc behind a default-on
`std` feature (purely additive — std consumers see zero behavioural change).
Changes:
- lib.rs: #![cfg_attr(not(feature = "std"), no_std)] + extern crate alloc
- F32Ext trait restores .exp/.sqrt/.tanh/.powi method syntax via libm
in no_std mode; std mode uses inherent f32 methods unchanged
- attention.rs / fastgrnn_gate.rs / tensor.rs: replace std:: with
core:: and alloc:: imports; HashSet → BTreeSet (alloc has no HashSet)
- Error trait impl gated on std (core::error::Error needs MSRV bump)
- Cargo.toml: std default-on, parallel = ["std", "rayon"], libm always-on
Verified:
- cargo test --lib: 38/38 pass
- cargo build --no-default-features: clean
- cargo build --no-default-features --features fp16: clean
- cargo +esp build --target xtensa-esp32s3-none-elf: 1.02 s release
build, 376 KB rlib
- examples/esp32s3_smoke: runs natively, all checks passed
Tested against attached hardware: ESP32-S3 v0.2, MAC ac:a7:04:e2:66:24,
16 MB flash, on /dev/ttyACM0 (USB-Serial-JTAG).
Bump version 0.1.0 → 0.1.1 (patch — additive). Adds "no-std" to crates.io
categories. Adds libm 0.2 as always-on dep (~60 KB, pure Rust).
Co-Authored-By: claude-flow <ruv@ruv.net>
* docs(adr): ADR-191 Pi Zero 2W production hardening for ruvllm_sparse_attention
Proposes four additive changes to the sparse-attention crate based on
production data from the cognitum-agent deployment on cognitum-v0
(Pi Zero 2W, SmolLM2-135M Q4_0, cognitum-one/seed PR #133):
1. decode_step_with_deadline / decode_step_f16_with_deadline /
decode_batch_with_deadline — sub-step wall-clock deadline so
integrators can bound latency at finer granularity than per-token.
Returns AttentionError::DeadlineExceeded { elapsed_ms, checkpoint }.
2. SparseAttentionConfig::pi_zero_2w() — codify the empirically
validated window=64, tile=16, FP16 KV preset that cognitum-agent
currently records as a Cargo.toml comment.
3. SubquadraticSparseAttention::warm_up() — synthetic 1-token decode
to prime caches and shrink the measured 99 s → 56 s cold→warm gap
before the first user inference.
4. Stochastic Q4 dequant pass-through for KV cache reload (feature-gated,
off by default). Reuses the splitmix64 seeding pattern from
cognitum-agent commit 1675c20 — naive `seed | 1` xorshift collapses
adjacent seeds 42 and 43 to the same state, an outright bug.
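The seed collapse named in item 4 is easy to reproduce. The sketch
below contrasts the naive `seed | 1` initialisation with one round of
the standard SplitMix64 mixer. Function names are hypothetical, and how
cognitum-agent wires the mixed value into its generator is not shown
here.

```rust
/// Naive seeding: sets the low bit so an xorshift state is never zero.
/// Any even seed and its odd successor map to the same state.
fn naive_seed(seed: u64) -> u64 {
    seed | 1
}

/// One round of the standard SplitMix64 mixer. Every stage is a
/// bijection on u64, so distinct seeds always yield distinct outputs.
fn splitmix64(seed: u64) -> u64 {
    let mut z = seed.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}
```

`naive_seed(42)` and `naive_seed(43)` are both 43, while
`splitmix64(42)` and `splitmix64(43)` differ. Because the mixer is a
bijection, exactly one seed maps to zero, so an xorshift consumer would
still need a guard for that single case.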
Status: proposed. Test plan covers correctness (deadline does not
perturb output), unbiasedness (mean within 0.06 of deterministic over
256 trials), and a cluster bench comparing pre/post cold first-decode
latency on cognitum-v0.
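One possible shape for the sub-step deadline in item 1, as a sketch
only. The error enum mirrors the proposed DeadlineExceeded fields, but
the field types, the `Deadline` helper, the checkpoint names, and the
toy decode body are assumptions rather than the ADR's design.

```rust
use std::time::{Duration, Instant};

/// Hypothetical error mirroring the proposed DeadlineExceeded variant.
#[derive(Debug, PartialEq)]
enum SketchError {
    DeadlineExceeded { elapsed_ms: u64, checkpoint: &'static str },
}

/// Wall-clock budget that is polled at named checkpoints.
struct Deadline {
    start: Instant,
    budget: Duration,
}

impl Deadline {
    fn new(budget: Duration) -> Self {
        Self { start: Instant::now(), budget }
    }

    /// Returns an error naming the checkpoint once the budget is spent.
    fn check(&self, checkpoint: &'static str) -> Result<(), SketchError> {
        let elapsed = self.start.elapsed();
        if elapsed >= self.budget {
            let elapsed_ms = elapsed.as_millis() as u64;
            Err(SketchError::DeadlineExceeded { elapsed_ms, checkpoint })
        } else {
            Ok(())
        }
    }
}

/// Toy decode step: the arithmetic stands in for real attention work.
/// The deadline is only read, never mixed into the computed value.
fn decode_step_sketch(x: f32, deadline: &Deadline) -> Result<f32, SketchError> {
    deadline.check("before_scores")?;
    let scored = x * 2.0;
    deadline.check("before_output")?;
    Ok(scored + 1.0)
}
```

A zero budget fails at the first checkpoint, while a generous budget
returns the same value the step would produce with no deadline at all,
which is the "deadline does not perturb output" property from the test
plan.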
Co-Authored-By: claude-flow <ruv@ruv.net>
* style(sparse-attn): cargo fmt over crate sources after no_std refactor
Co-Authored-By: claude-flow <ruv@ruv.net>
---------
Co-authored-by: ruvnet <ruvnet@gmail.com>