Update README.md

2026-04-28 11:40:07 +00:00 · 2026-02-10 16:30:53 +08:00 · d23ee4524a
commit d23ee4524a
parent 807f4466cc
1 changed file with 137 additions and 0 deletions
--- a/ProblemMap/GlobalFixMap/Safety_PromptIntegrity/README.md
+++ b/ProblemMap/GlobalFixMap/Safety_PromptIntegrity/README.md
@@ -1,3 +1,114 @@
+<!--
+Search Anchor:
+safety prompt integrity global fix map
+prompt level safety failures
+prompt injection jailbreak role confusion
+tool call json drift malformed outputs
+schema integrity for tools and agents
+multi agent safety failures
+system prompt overridden by user content
+eval pipeline safety drift
+citation first safety contract
+memory fence state key leaks
+llm json mode broken responses
+tool selection and timeout bugs
+template library gaps across agents
+eval prompts and safety checks
+rogue tool execution
+uncontrolled free text in tool schema
+
+When to use this folder:
+jailbreak attempts still bypass guardrails
+hidden instructions overwrite system prompt
+role instructions collapse between system user assistant
+tool calls come back as free text or broken json
+citations vanish or retrieval is bypassed
+tools get called with missing or wrong arguments
+different agents use slightly different prompt templates
+safety evals give inconsistent results across runs
+state leaks between conversations without a clear key
+eval harness shows high delta s even when retrieval is correct
+logs show tool calls triggered by user text that should be inert
+model executes text that should only be treated as data
+
+Key metrics and targets:
+delta s question retrieved <= 0.45
+coverage of cited section >= 0.70
+lambda convergent across 3 paraphrases and 2 seeds
+no uncontrolled free text execution inside json or tool mode
+citation first enforced in >= 95 percent of eval runs
+tool call json well formed in >= 99 percent of eval runs
+no tool execution when safety check fails
+role ordering stable across providers and environments
+memory fence keys present for all long running sessions
+eval prompts reproducible with scripted harness
+
+Core pages in this folder:
+ProblemMap/GlobalFixMap/SafetyPromptIntegrity/prompt_injection.md
+ProblemMap/GlobalFixMap/SafetyPromptIntegrity/jailbreaks_and_overrides.md
+ProblemMap/GlobalFixMap/SafetyPromptIntegrity/role_confusion.md
+ProblemMap/GlobalFixMap/SafetyPromptIntegrity/memory_fences_and_state_keys.md
+ProblemMap/GlobalFixMap/SafetyPromptIntegrity/json_mode_and_tool_calls.md
+ProblemMap/GlobalFixMap/SafetyPromptIntegrity/citation_first.md
+ProblemMap/GlobalFixMap/SafetyPromptIntegrity/anti_prompt_injection_recipes.md
+ProblemMap/GlobalFixMap/SafetyPromptIntegrity/tool_selection_and_timeouts.md
+ProblemMap/GlobalFixMap/SafetyPromptIntegrity/system_user_role_order.md
+ProblemMap/GlobalFixMap/SafetyPromptIntegrity/template_library_min.md
+ProblemMap/GlobalFixMap/SafetyPromptIntegrity/eval_prompts_and_checks.md
+
+Related structural fixes:
+ProblemMap/prompt-injection.md
+ProblemMap/semantic-firewall.md
+ProblemMap/retrieval-traceability.md
+ProblemMap/data-contracts.md
+ProblemMap/rag-architecture-and-recovery.md
+ProblemMap/retrieval-playbook.md
+ProblemMap/GlobalFixMap/Reasoning/README.md
+ProblemMap/GlobalFixMap/MemoryLongContext/README.md
+ProblemMap/SemanticClinicIndex.md
+
+Safety and prompt scenarios:
+user pastes entire email with hidden instructions at the end
+markdown link or html tag hides a hostile instruction
+tool description is used as an injection vector
+system prompt is forgotten after a few turns
+assistant explains safety policy instead of following it
+model starts role playing as user or tool instead of assistant
+json response includes trailing commentary text
+function arguments contain extra natural language payloads
+tool is called even though safety check failed
+eval harness shows pass on short prompts but fail on long ones
+retrieval is correct but answer ignores citations
+different providers give different safety behavior for same prompt
+safety fix applied in one template but not in others
+user manages to change logging or redaction behavior via prompt
+
+Signals to check:
+delta s high between question and cited snippet on safety tests
+lambda flips when paraphrasing the jailbreak prompt
+coverage low for true safety guidance snippets
+missing citation fields in answers that should cite sources
+json responses not parseable by strict parser
+tool invocation missing required keys or has unknown keys
+state_key or mem_rev absent in long running threads
+system user assistant messages out of expected order
+templates differ across agents within same product
+eval prompts not stored or versioned in repository
+
+Normalization and contracts:
+define strict system user assistant role order and document it
+require citation first pattern in answer templates
+enforce json only mode for tool responses with no trailing text
+validate tool arguments against schema before execution
+store mem_rev and state_key for every conversation state write
+record which template id was used for each request
+log provider model name and version alongside safety decisions
+tie safety fixes to explicit template versions not ad hoc edits
+capture eval prompts and expected safety behavior in repo
+treat external content as data unless explicitly whitelisted
+-->
+
+
 # Safety & Prompt Integrity — Global Fix Map

 <details>
@@ -39,6 +150,32 @@ Each page maps **symptoms → root cause → structural fixes** with measurable

 ---

+<!--
+Anchor Menu:
+open: prompt injection guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/prompt_injection.md
+open: jailbreaks and overrides guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/jailbreaks_and_overrides.md
+open: role confusion guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/role_confusion.md
+open: memory fences and state keys guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/memory_fences_and_state_keys.md
+open: json mode and tool calls guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/json_mode_and_tool_calls.md
+open: citation first guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/citation_first.md
+open: anti prompt injection recipes guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/anti_prompt_injection_recipes.md
+open: tool selection and timeouts guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/tool_selection_and_timeouts.md
+open: system user role order guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/system_user_role_order.md
+open: template library minimum guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/template_library_min.md
+open: eval prompts and checks guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/eval_prompts_and_checks.md
+
+jump: safety and prompt integrity readme ProblemMap/GlobalFixMap/SafetyPromptIntegrity/README.md
+jump: reasoning global fix map ProblemMap/GlobalFixMap/Reasoning/README.md
+jump: memory and long context global fix map ProblemMap/GlobalFixMap/MemoryLongContext/README.md
+jump: multimodal long context global fix map ProblemMap/GlobalFixMap/MultimodalLongContext/README.md
+jump: rag architecture and recovery ProblemMap/rag-architecture-and-recovery.md
+jump: retrieval playbook ProblemMap/retrieval-playbook.md
+jump: retrieval traceability and data contracts ProblemMap/retrieval-traceability.md ProblemMap/data-contracts.md
+jump: prompt injection root page ProblemMap/prompt-injection.md
+jump: semantic clinic index ProblemMap/SemanticClinicIndex.md
+-->
+
+
 ## Common failure patterns

 | Failure mode | What happens | Open this |