
Troubleshooting

🌐 Languages: 🇺🇸 English | 🇧🇷 Português (Brasil) | 🇪🇸 Español | 🇫🇷 Français | 🇮🇹 Italiano | 🇷🇺 Русский | 🇨🇳 中文 (简体) | 🇩🇪 Deutsch | 🇮🇳 हिन्दी | 🇹🇭 ไทย | 🇺🇦 Українська | 🇸🇦 العربية | 🇯🇵 日本語 | 🇻🇳 Tiếng Việt | 🇧🇬 Български | 🇩🇰 Dansk | 🇫🇮 Suomi | 🇮🇱 עברית | 🇭🇺 Magyar | 🇮🇩 Bahasa Indonesia | 🇰🇷 한국어 | 🇲🇾 Bahasa Melayu | 🇳🇱 Nederlands | 🇳🇴 Norsk | 🇵🇹 Português (Portugal) | 🇷🇴 Română | 🇵🇱 Polski | 🇸🇰 Slovenčina | 🇸🇪 Svenska | 🇵🇭 Filipino | 🇨🇿 Čeština

Common problems and solutions for OmniRoute.


Quick Fixes

| Problem | Solution |
| --- | --- |
| First login not working | Set `INITIAL_PASSWORD` in `.env` (no hardcoded default) |
| Dashboard opens on wrong port | Set `PORT=20128` and `NEXT_PUBLIC_BASE_URL=http://localhost:20128` |
| No logs written to disk | Set `APP_LOG_TO_FILE=true` and verify call log capture is enabled |
| `EACCES: permission denied` | Set `DATA_DIR=/path/to/writable/dir` to override `~/.omniroute` |
| Routing strategy not saving | Update to v1.4.11+ (Zod schema fix for settings persistence) |
| Login crash / blank page | You may be on Node.js 24+ — see Node.js Compatibility below |
| Proxy "fetch failed" | Ensure proxy config is set at the correct level — see Proxy Issues below |
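Several of the quick fixes above are plain environment variables. A minimal `.env` sketch applying them (values are illustrative; keep only the lines you need):

```shell
INITIAL_PASSWORD=change-me-now                 # required for first login; no hardcoded default
PORT=20128                                     # dashboard port
NEXT_PUBLIC_BASE_URL=http://localhost:20128    # must match the port above
APP_LOG_TO_FILE=true                           # write application logs to disk
DATA_DIR=/var/lib/omniroute                    # any writable dir; overrides ~/.omniroute
```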

Node.js Compatibility

Login page crashes or shows "Module self-registration" error

Cause: You are running Node.js 24+. The better-sqlite3 native binary is not compatible with Node.js 24, which causes a fatal crash when the server tries to initialize the database.

Symptoms:

  • Login page shows a blank screen or a server error
  • Console shows Error: Module did not self-register or similar native binding errors
  • Starting with v3.5.5, the login page shows an orange warning banner with your Node version if incompatibility is detected

Fix:

  1. Install Node.js 22 LTS (recommended): `nvm install 22 && nvm use 22`
  2. Verify your version: `node --version` should show `v22.x.x`
  3. Reinstall OmniRoute: `npm install -g omniroute`
  4. Restart: `omniroute`

Supported versions: Node.js 18, 20, or 22 LTS. Node.js 24+ is not supported.
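The version check can be scripted. A hedged helper sketch (the `supported_node` function name is illustrative; the supported majors 18/20/22 come from this doc):

```shell
# Return success only for Node.js major versions OmniRoute supports (18, 20, 22).
supported_node() {
  ver="${1:-v0.0.0}"        # e.g. "v22.11.0"
  major="${ver#v}"          # strip leading "v"
  major="${major%%.*}"      # keep only the major component
  case "$major" in
    18|20|22) return 0 ;;
    *)        return 1 ;;
  esac
}

if supported_node "$(node --version 2>/dev/null || true)"; then
  echo "Node version OK"
else
  echo "Unsupported Node; run: nvm install 22 && nvm use 22"
fi
```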


Proxy Issues

Provider validation shows "fetch failed"

Cause: The API key validation endpoint (POST /api/providers/validate) was previously bypassing proxy configuration, causing failures in environments that require proxy routing.

Fix (v3.5.5+): This is now fixed. Provider validation routes through runWithProxyContext, honoring provider-level and global proxy settings automatically.

Token health check fails with "fetch failed"

Cause: Background OAuth token refresh was not resolving proxy configuration per connection.

Fix (v3.5.5+): The token health check scheduler now resolves proxy config per connection before attempting refresh. Update to v3.5.5+.

SOCKS5 proxy returns "invalid onRequestStart method"

Cause: On Node.js 22, the undici@8 dispatcher is incompatible with Node's built-in fetch() implementation.

Fix (v3.5.5+): OmniRoute now uses undici's own fetch() function when a proxy dispatcher is active, ensuring consistent behavior. Update to v3.5.5+.


Provider Issues

"Language model did not provide messages"

Cause: The provider returned an empty response, usually because its quota is exhausted.

Fix:

  1. Check dashboard quota tracker
  2. Use a combo with fallback tiers
  3. Switch to cheaper/free tier

Rate Limiting

Cause: Subscription quota exhausted.

Fix:

  • Add fallback: cc/claude-opus-4-6 → glm/glm-4.7 → if/kimi-k2-thinking
  • Use GLM/MiniMax as cheap backup

OAuth Token Expired

OmniRoute auto-refreshes tokens. If issues persist:

  1. Dashboard → Provider → Reconnect
  2. Delete and re-add the provider connection

Cloud Issues

Cloud Sync Errors

  1. Verify BASE_URL points to your running instance (e.g., http://localhost:20128)
  2. Verify CLOUD_URL points to your cloud endpoint (e.g., https://omniroute.dev)
  3. Keep NEXT_PUBLIC_* values aligned with server-side values
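A minimal `.env` sketch of those settings (the URLs are the examples given above):

```shell
BASE_URL=http://localhost:20128                # your running OmniRoute instance
CLOUD_URL=https://omniroute.dev                # your cloud endpoint
NEXT_PUBLIC_BASE_URL=http://localhost:20128    # keep aligned with BASE_URL
```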

Cloud stream=false Returns 500

Symptom: Unexpected token 'd'... on cloud endpoint for non-streaming calls.

Cause: Upstream returns SSE payload while client expects JSON.

Workaround: Use stream=true for cloud direct calls. Local runtime includes SSE→JSON fallback.

Cloud Says Connected but "Invalid API key"

  1. Create a fresh key from local dashboard (/api/keys)
  2. Run cloud sync: Enable Cloud → Sync Now
  3. Old/non-synced keys can still return 401 on cloud

Docker Issues

CLI Tool Shows Not Installed

  1. Check runtime fields: `curl http://localhost:20128/api/cli-tools/runtime/codex | jq`
  2. For portable mode: use the `runner-cli` image target (bundled CLIs)
  3. For host-mount mode: set `CLI_EXTRA_PATHS` and mount the host bin directory as read-only
  4. If `installed=true` but `runnable=false`: the binary was found but failed its healthcheck
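A hedged sketch of interpreting those fields without eyeballing the JSON (the `classify_runtime` helper is illustrative; the `installed`/`runnable` field names come from the runtime endpoint above, and in practice you would parse with `jq` rather than pattern matching):

```shell
# Classify the JSON returned by /api/cli-tools/runtime/<tool>.
classify_runtime() {
  case "$1" in
    *'"installed":true'*'"runnable":true'*)  echo "ok" ;;
    *'"installed":true'*)                    echo "binary found but failed healthcheck" ;;
    *)                                       echo "not installed" ;;
  esac
}

classify_runtime "$(curl -s http://localhost:20128/api/cli-tools/runtime/codex 2>/dev/null || true)"
```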

Quick Runtime Validation

```shell
curl -s http://localhost:20128/api/cli-tools/codex-settings | jq '{installed,runnable,commandPath,runtimeMode,reason}'
curl -s http://localhost:20128/api/cli-tools/claude-settings | jq '{installed,runnable,commandPath,runtimeMode,reason}'
curl -s http://localhost:20128/api/cli-tools/openclaw-settings | jq '{installed,runnable,commandPath,runtimeMode,reason}'
```

Cost Issues

High Costs

  1. Check usage stats in Dashboard → Usage
  2. Switch primary model to GLM/MiniMax
  3. Use free tier (Gemini CLI, Qoder) for non-critical tasks
  4. Set cost budgets per API key: Dashboard → API Keys → Budget

Debugging

Enable Log Files

Set APP_LOG_TO_FILE=true in your .env file. Application logs are written under logs/. Request artifacts are stored under ${DATA_DIR}/call_logs/ when the call log pipeline is enabled in settings.
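A minimal sketch tying these settings together (`APP_LOG_TO_FILE`, `DATA_DIR`, and the date-stamped `call_logs/` layout are taken from this doc):

```shell
# Ensure file logging is enabled, then look for today's call-log artifacts.
grep -q '^APP_LOG_TO_FILE=' .env 2>/dev/null || echo 'APP_LOG_TO_FILE=true' >> .env

DATA_DIR="${DATA_DIR:-$HOME/.omniroute}"
today_dir="$DATA_DIR/call_logs/$(date +%F)"   # call_logs/YYYY-MM-DD/
ls "$today_dir" 2>/dev/null || echo "no call-log artifacts for $(date +%F)"
```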

Check Provider Health

```shell
# Health dashboard (open in a browser)
# http://localhost:20128/dashboard/health

# API health check
curl http://localhost:20128/api/monitoring/health
```

Runtime Storage

  • Main state: ${DATA_DIR}/storage.sqlite (providers, combos, aliases, keys, settings)
  • Usage: SQLite tables in storage.sqlite (usage_history, call_logs, proxy_logs) + optional ${DATA_DIR}/call_logs/
  • Application logs: <repo>/logs/... (when APP_LOG_TO_FILE=true)
  • Call log artifacts: ${DATA_DIR}/call_logs/YYYY-MM-DD/... when the call log pipeline is enabled

Circuit Breaker Issues

Provider stuck in OPEN state

When a provider's circuit breaker is OPEN, requests are blocked until the cooldown expires.

Fix:

  1. Go to Dashboard → Settings → Resilience
  2. Check the circuit breaker card for the affected provider
  3. Click Reset All to clear all breakers, or wait for the cooldown to expire
  4. Verify the provider is actually available before resetting

Provider keeps tripping the circuit breaker

If a provider repeatedly enters OPEN state:

  1. Check Dashboard → Health → Provider Health for the failure pattern
  2. Go to Settings → Resilience → Provider Profiles and increase the failure threshold
  3. Check if the provider has changed API limits or requires re-authentication
  4. Review latency telemetry — high latency may cause timeout-based failures

Audio Transcription Issues

"Unsupported model" error

  • Ensure you're using the correct prefix: deepgram/nova-3 or assemblyai/best
  • Verify the provider is connected in Dashboard → Providers

Transcription returns empty or fails

  • Check supported audio formats: mp3, wav, m4a, flac, ogg, webm
  • Verify file size is within provider limits (typically < 25MB)
  • Check provider API key validity in the provider card

Translator Debugging

Use Dashboard → Translator to debug format translation issues:

| Mode | When to Use |
| --- | --- |
| Playground | Compare input/output formats side by side — paste a failing request to see how it translates |
| Chat Tester | Send live messages and inspect the full request/response payload including headers |
| Test Bench | Run batch tests across format combinations to find which translations are broken |
| Live Monitor | Watch real-time request flow to catch intermittent translation issues |

Common format issues

  • Thinking tags not appearing — Check if the target provider supports thinking and the thinking budget setting
  • Tool calls dropping — Some format translations may strip unsupported fields; verify in Playground mode
  • System prompt missing — Claude and Gemini handle system prompts differently; check translation output
  • SDK returns raw string instead of object — Fixed in v1.1.0: response sanitizer now strips non-standard fields (x_groq, usage_breakdown, etc.) that cause OpenAI SDK Pydantic validation failures
  • GLM/ERNIE rejects system role — Fixed in v1.1.0: role normalizer automatically merges system messages into user messages for incompatible models
  • developer role not recognized — Fixed in v1.1.0: automatically converted to system for non-OpenAI providers
  • json_schema not working with Gemini — Fixed in v1.1.0: response_format is now converted to Gemini's responseMimeType + responseSchema

Resilience Settings

Auto rate-limit not triggering

  • Auto rate-limit only applies to API key providers (not OAuth/subscription)
  • Verify Settings → Resilience → Provider Profiles has auto-rate-limit enabled
  • Check if the provider returns 429 status codes or Retry-After headers

Tuning exponential backoff

Provider profiles support these settings:

  • Base delay — Initial wait time after first failure (default: 1s)
  • Max delay — Maximum wait time cap (default: 30s)
  • Multiplier — How much to increase delay per consecutive failure (default: 2x)
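A worked example of the delays those defaults produce, sketched in shell (values from the list above: base 1s, multiplier 2, cap 30s):

```shell
base=1; mult=2; cap=30
delay=$base
for attempt in 1 2 3 4 5 6; do
  echo "failure $attempt -> wait ${delay}s"
  delay=$((delay * mult))                      # exponential growth per failure
  if [ "$delay" -gt "$cap" ]; then delay=$cap; fi   # clamp at the max delay
done
# prints waits of 1s, 2s, 4s, 8s, 16s, 30s
```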

Anti-thundering herd

When many concurrent requests hit a rate-limited provider, OmniRoute uses mutex + auto rate-limiting to serialize requests and prevent cascading failures. This is automatic for API key providers.
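A conceptual sketch of mutex-style serialization using a directory lock; this is illustrative only, not OmniRoute's internal implementation (which handles this automatically for API key providers):

```shell
# Only one caller at a time proceeds past the lock; others spin until it is free.
run_serialized() {
  while ! mkdir /tmp/omniroute-provider.lock 2>/dev/null; do
    sleep 0.1                                  # another request holds the slot
  done
  echo "request $1 holds the provider slot"    # ...call the provider here...
  rmdir /tmp/omniroute-provider.lock           # release the mutex
}

run_serialized A
run_serialized B
```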


Optional RAG / LLM failure taxonomy (16 problems)

Some OmniRoute users place the gateway in front of RAG or agent stacks. In those setups a confusing pattern is common: OmniRoute looks healthy (providers up, routing profiles OK, no rate-limit alerts), but the final answer is still wrong.

In practice these incidents usually come from the downstream RAG pipeline, not from the gateway itself.

If you want a shared vocabulary to describe those failures, you can use the WFGY ProblemMap, an external MIT-licensed text resource that defines sixteen recurring RAG/LLM failure patterns. At a high level it covers:

  • retrieval drift and broken context boundaries
  • empty or stale indexes and vector stores
  • embedding versus semantic mismatch
  • prompt assembly and context window issues
  • logic collapse and overconfident answers
  • long chain and agent coordination failures
  • multi agent memory and role drift
  • deployment and bootstrap ordering problems

The idea is simple:

  1. When you investigate a bad response, capture:
    • user task and request
    • route or provider combo in OmniRoute
    • any RAG context used downstream (retrieved documents, tool calls, etc)
  2. Map the incident to one or two WFGY ProblemMap numbers (No.1 through No.16).
  3. Store the number in your own dashboard, runbook, or incident tracker next to the OmniRoute logs.
  4. Use the corresponding WFGY page to decide whether you need to change your RAG stack, retriever, or routing strategy.

Full text and concrete recipes live here (MIT license, text only):

WFGY ProblemMap README

You can ignore this section if you do not run RAG or agent pipelines behind OmniRoute.


Still Stuck?