Update README.md

PSBigBig × MiniPS 2026-02-10 16:30:53 +08:00 committed by GitHub
parent 807f4466cc
commit d23ee4524a
No known key found for this signature in database
GPG key ID: B5690EEEBB952194


@@ -1,3 +1,114 @@
<!--
Search Anchor:
safety prompt integrity global fix map
prompt level safety failures
prompt injection jailbreak role confusion
tool call json drift malformed outputs
schema integrity for tools and agents
multi agent safety failures
system prompt overridden by user content
eval pipeline safety drift
citation first safety contract
memory fence state key leaks
llm json mode broken responses
tool selection and timeout bugs
template library gaps across agents
eval prompts and safety checks
rogue tool execution
uncontrolled free text in tool schema
When to use this folder:
jailbreak attempts still bypass guardrails
hidden instructions overwrite system prompt
role instructions collapse between system user assistant
tool calls come back as free text or broken json
citations vanish or retrieval is bypassed
tools get called with missing or wrong arguments
different agents use slightly different prompt templates
safety evals give inconsistent results across runs
state leaks between conversations without a clear key
eval harness shows high delta s even when retrieval is correct
logs show tool calls triggered by user text that should be inert
model executes text that should only be treated as data
Key metrics and targets:
delta s question retrieved <= 0.45
coverage of cited section >= 0.70
lambda convergent across 3 paraphrases and 2 seeds
no uncontrolled free text execution inside json or tool mode
citation first enforced in >= 95 percent of eval runs
tool call json well formed in >= 99 percent of eval runs
no tool execution when safety check fails
role ordering stable across providers and environments
memory fence keys present for all long running sessions
eval prompts reproducible with scripted harness
Core pages in this folder:
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/prompt_injection.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/jailbreaks_and_overrides.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/role_confusion.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/memory_fences_and_state_keys.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/json_mode_and_tool_calls.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/citation_first.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/anti_prompt_injection_recipes.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/tool_selection_and_timeouts.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/system_user_role_order.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/template_library_min.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/eval_prompts_and_checks.md
Related structural fixes:
ProblemMap/prompt-injection.md
ProblemMap/semantic-firewall.md
ProblemMap/retrieval-traceability.md
ProblemMap/data-contracts.md
ProblemMap/rag-architecture-and-recovery.md
ProblemMap/retrieval-playbook.md
ProblemMap/GlobalFixMap/Reasoning/README.md
ProblemMap/GlobalFixMap/MemoryLongContext/README.md
ProblemMap/SemanticClinicIndex.md
Safety and prompt scenarios:
user pastes entire email with hidden instructions at the end
markdown link or html tag hides a hostile instruction
tool description is used as an injection vector
system prompt is forgotten after a few turns
assistant explains safety policy instead of following it
model starts role playing as user or tool instead of assistant
json response includes trailing commentary text
function arguments contain extra natural language payloads
tool is called even though safety check failed
eval harness shows pass on short prompts but fail on long ones
retrieval is correct but answer ignores citations
different providers give different safety behavior for same prompt
safety fix applied in one template but not in others
user manages to change logging or redaction behavior via prompt
Signals to check:
delta s high between question and cited snippet on safety tests
lambda flips when paraphrasing the jailbreak prompt
coverage low for true safety guidance snippets
missing citation fields in answers that should cite sources
json responses not parseable by strict parser
tool invocation missing required keys or has unknown keys
state_key or mem_rev absent in long running threads
system user assistant messages out of expected order
templates differ across agents within same product
eval prompts not stored or versioned in repository
Normalization and contracts:
define strict system user assistant role order and document it
require citation first pattern in answer templates
enforce json only mode for tool responses with no trailing text
validate tool arguments against schema before execution
store mem_rev and state_key for every conversation state write
record which template id was used for each request
log provider model name and version alongside safety decisions
tie safety fixes to explicit template versions not ad hoc edits
capture eval prompts and expected safety behavior in repo
treat external content as data unless explicitly whitelisted
-->
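The contracts listed in the comment above ask for JSON only tool responses with no trailing text, and for tool arguments to be validated against a schema before execution. A minimal Python sketch of that gate could look like the following. The `tool` and `arguments` field names, the `TOOL_SCHEMAS` table, and the `search_docs` tool are assumptions made for illustration and are not defined by this repository.

```python
import json

# Hypothetical schema table for illustration only:
# tool name -> (required argument keys, optional argument keys).
TOOL_SCHEMAS = {
    "search_docs": ({"query"}, {"top_k"}),
}


class ToolCallError(ValueError):
    """Raised when a model response must not be executed as a tool call."""


def parse_tool_call(raw: str) -> dict:
    """Accept a response only if it is exactly one well-formed tool call object."""
    text = raw.strip()
    try:
        # raw_decode returns the parsed value and the index where parsing stopped.
        call, end = json.JSONDecoder().raw_decode(text)
    except json.JSONDecodeError as exc:
        raise ToolCallError(f"not valid JSON: {exc}") from exc
    if text[end:].strip():
        raise ToolCallError("trailing text after the JSON value")
    if not isinstance(call, dict):
        raise ToolCallError("top-level value must be an object")

    name = call.get("tool")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        raise ToolCallError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ToolCallError("arguments must be an object")

    required, optional = TOOL_SCHEMAS[name]
    missing = required - set(args)
    unknown = set(args) - required - optional
    if missing:
        raise ToolCallError(f"missing required keys: {sorted(missing)}")
    if unknown:
        raise ToolCallError(f"unknown keys: {sorted(unknown)}")
    return call
```

Rejecting the call outright, rather than trimming the trailing commentary and continuing, keeps free text from reaching the tool layer. A production setup would more likely validate against a full JSON Schema than against key sets.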
# Safety & Prompt Integrity — Global Fix Map
<details>
@@ -39,6 +150,32 @@ Each page maps **symptoms → root cause → structural fixes** with measurable
---
<!--
Anchor Menu:
open: prompt injection guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/prompt_injection.md
open: jailbreaks and overrides guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/jailbreaks_and_overrides.md
open: role confusion guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/role_confusion.md
open: memory fences and state keys guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/memory_fences_and_state_keys.md
open: json mode and tool calls guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/json_mode_and_tool_calls.md
open: citation first guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/citation_first.md
open: anti prompt injection recipes guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/anti_prompt_injection_recipes.md
open: tool selection and timeouts guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/tool_selection_and_timeouts.md
open: system user role order guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/system_user_role_order.md
open: template library minimum guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/template_library_min.md
open: eval prompts and checks guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/eval_prompts_and_checks.md
jump: safety and prompt integrity readme ProblemMap/GlobalFixMap/SafetyPromptIntegrity/README.md
jump: reasoning global fix map ProblemMap/GlobalFixMap/Reasoning/README.md
jump: memory and long context global fix map ProblemMap/GlobalFixMap/MemoryLongContext/README.md
jump: multimodal long context global fix map ProblemMap/GlobalFixMap/MultimodalLongContext/README.md
jump: rag architecture and recovery ProblemMap/rag-architecture-and-recovery.md
jump: retrieval playbook ProblemMap/retrieval-playbook.md
jump: retrieval traceability and data contracts ProblemMap/retrieval-traceability.md ProblemMap/data-contracts.md
jump: prompt injection root page ProblemMap/prompt-injection.md
jump: semantic clinic index ProblemMap/SemanticClinicIndex.md
-->
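Three of the targets named in this README's search-anchor comment can be checked mechanically over a batch of eval runs: citation first in at least 95 percent of runs, well-formed tool call JSON in at least 99 percent of runs, and no tool execution when a safety check fails. The sketch below shows one possible way to do that. The `EvalRun` record and its field names are hypothetical and would need to be mapped to whatever the eval harness actually logs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalRun:
    # Hypothetical per-run record; field names are assumptions for illustration.
    citation_first: bool
    json_well_formed: bool
    safety_check_passed: bool
    tool_executed: bool


def check_targets(runs: List[EvalRun]) -> Dict[str, bool]:
    """Return, per target, whether the batch of eval runs meets it."""
    if not runs:
        raise ValueError("no eval runs to check")
    total = len(runs)
    citation_rate = sum(r.citation_first for r in runs) / total
    json_rate = sum(r.json_well_formed for r in runs) / total
    # A single tool execution after a failed safety check breaks the target.
    unsafe_executions = sum(
        1 for r in runs if r.tool_executed and not r.safety_check_passed
    )
    return {
        "citation_first": citation_rate >= 0.95,
        "json_well_formed": json_rate >= 0.99,
        "no_tool_on_failed_safety": unsafe_executions == 0,
    }
```

Returning one boolean per target, instead of a single pass or fail, makes it easier to see which contract regressed when a template or provider changes.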
## Common failure patterns
| Failure mode | What happens | Open this |