mirror of https://github.com/onestardao/WFGY.git, synced 2026-04-28 11:40:07 +00:00
Update README.md
parent 807f4466cc
commit d23ee4524a
1 changed file with 137 additions and 0 deletions
@@ -1,3 +1,114 @@
<!--
Search Anchor:
safety prompt integrity global fix map
prompt level safety failures
prompt injection jailbreak role confusion
tool call json drift malformed outputs
schema integrity for tools and agents
multi agent safety failures
system prompt overridden by user content
eval pipeline safety drift
citation first safety contract
memory fence state key leaks
llm json mode broken responses
tool selection and timeout bugs
template library gaps across agents
eval prompts and safety checks
rogue tool execution
uncontrolled free text in tool schema

When to use this folder:
jailbreak attempts still bypass guardrails
hidden instructions overwrite system prompt
role instructions collapse between system user assistant
tool calls come back as free text or broken json
citations vanish or retrieval is bypassed
tools get called with missing or wrong arguments
different agents use slightly different prompt templates
safety evals give inconsistent results across runs
state leaks between conversations without a clear key
eval harness shows high delta s even when retrieval is correct
logs show tool calls triggered by user text that should be inert
model executes text that should only be treated as data

Key metrics and targets:
delta s question retrieved <= 0.45
coverage of cited section >= 0.70
lambda convergent across 3 paraphrases and 2 seeds
no uncontrolled free text execution inside json or tool mode
citation first enforced in >= 95 percent of eval runs
tool call json well formed in >= 99 percent of eval runs
no tool execution when safety check fails
role ordering stable across providers and environments
memory fence keys present for all long running sessions
eval prompts reproducible with scripted harness

Core pages in this folder:
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/prompt_injection.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/jailbreaks_and_overrides.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/role_confusion.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/memory_fences_and_state_keys.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/json_mode_and_tool_calls.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/citation_first.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/anti_prompt_injection_recipes.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/tool_selection_and_timeouts.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/system_user_role_order.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/template_library_min.md
ProblemMap/GlobalFixMap/SafetyPromptIntegrity/eval_prompts_and_checks.md

Related structural fixes:
ProblemMap/prompt-injection.md
ProblemMap/semantic-firewall.md
ProblemMap/retrieval-traceability.md
ProblemMap/data-contracts.md
ProblemMap/rag-architecture-and-recovery.md
ProblemMap/retrieval-playbook.md
ProblemMap/GlobalFixMap/Reasoning/README.md
ProblemMap/GlobalFixMap/MemoryLongContext/README.md
ProblemMap/SemanticClinicIndex.md

Safety and prompt scenarios:
user pastes entire email with hidden instructions at the end
markdown link or html tag hides a hostile instruction
tool description is used as an injection vector
system prompt is forgotten after a few turns
assistant explains safety policy instead of following it
model starts role playing as user or tool instead of assistant
json response includes trailing commentary text
function arguments contain extra natural language payloads
tool is called even though safety check failed
eval harness shows pass on short prompts but fail on long ones
retrieval is correct but answer ignores citations
different providers give different safety behavior for same prompt
safety fix applied in one template but not in others
user manages to change logging or redaction behavior via prompt

Signals to check:
delta s high between question and cited snippet on safety tests
lambda flips when paraphrasing the jailbreak prompt
coverage low for true safety guidance snippets
missing citation fields in answers that should cite sources
json responses not parseable by strict parser
tool invocation missing required keys or has unknown keys
state_key or mem_rev absent in long running threads
system user assistant messages out of expected order
templates differ across agents within same product
eval prompts not stored or versioned in repository

Normalization and contracts:
define strict system user assistant role order and document it
require citation first pattern in answer templates
enforce json only mode for tool responses with no trailing text
validate tool arguments against schema before execution
store mem_rev and state_key for every conversation state write
record which template id was used for each request
log provider model name and version alongside safety decisions
tie safety fixes to explicit template versions not ad hoc edits
capture eval prompts and expected safety behavior in repo
treat external content as data unless explicitly whitelisted
-->
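Three of the contracts listed in the comment above (JSON-only tool responses with no trailing text, schema validation before execution, and no tool execution when the safety check fails) can be sketched in a few lines of Python. This is a minimal illustration, not code from the WFGY repository: the `search_docs` tool, the `TOOL_SCHEMAS` table, and the `parse_tool_call` and `run_tool` helpers are all hypothetical names, and the schema format (argument name mapped to a Python type) is an assumption chosen for brevity.

```python
import json

# Hypothetical schema table for illustration: tool name -> {argument name: required type}.
TOOL_SCHEMAS = {
    "search_docs": {"query": str, "top_k": int},
}


class ToolCallError(ValueError):
    """Raised when a tool call violates the JSON-only or schema contract."""


def parse_tool_call(raw: str) -> dict:
    """Parse model output as a strict JSON tool call and validate its arguments."""
    try:
        # json.loads raises on trailing commentary after the JSON value.
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ToolCallError(f"not strict JSON: {exc.msg}") from exc

    if not isinstance(call, dict) or set(call) != {"name", "arguments"}:
        raise ToolCallError("expected an object with exactly 'name' and 'arguments'")
    name, args = call["name"], call["arguments"]
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        raise ToolCallError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ToolCallError("'arguments' must be an object")

    schema = TOOL_SCHEMAS[name]
    missing = sorted(set(schema) - set(args))
    unknown = sorted(set(args) - set(schema))
    if missing or unknown:
        raise ToolCallError(f"missing keys {missing}, unknown keys {unknown}")

    for key, expected in schema.items():
        value = args[key]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            raise ToolCallError(f"argument {key!r} must be {expected.__name__}")
    return call


def run_tool(raw: str, safety_ok: bool, tools: dict):
    """Execute a tool only if the safety check passed and the call validates."""
    if not safety_ok:
        raise ToolCallError("safety check failed; tool not executed")
    call = parse_tool_call(raw)
    return tools[call["name"]](**call["arguments"])
```

Rejecting unknown keys, rather than silently dropping them, is a deliberate choice in this sketch: it surfaces the "extra natural language payloads" case as an error that can be logged and counted against the well-formedness target.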
# Safety & Prompt Integrity — Global Fix Map
<details>
@@ -39,6 +150,32 @@ Each page maps **symptoms → root cause → structural fixes** with measurable
---
<!--
Anchor Menu:
open: prompt injection guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/prompt_injection.md
open: jailbreaks and overrides guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/jailbreaks_and_overrides.md
open: role confusion guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/role_confusion.md
open: memory fences and state keys guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/memory_fences_and_state_keys.md
open: json mode and tool calls guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/json_mode_and_tool_calls.md
open: citation first guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/citation_first.md
open: anti prompt injection recipes guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/anti_prompt_injection_recipes.md
open: tool selection and timeouts guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/tool_selection_and_timeouts.md
open: system user role order guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/system_user_role_order.md
open: template library minimum guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/template_library_min.md
open: eval prompts and checks guide ProblemMap/GlobalFixMap/SafetyPromptIntegrity/eval_prompts_and_checks.md

jump: safety and prompt integrity readme ProblemMap/GlobalFixMap/SafetyPromptIntegrity/README.md
jump: reasoning global fix map ProblemMap/GlobalFixMap/Reasoning/README.md
jump: memory and long context global fix map ProblemMap/GlobalFixMap/MemoryLongContext/README.md
jump: multimodal long context global fix map ProblemMap/GlobalFixMap/MultimodalLongContext/README.md
jump: rag architecture and recovery ProblemMap/rag-architecture-and-recovery.md
jump: retrieval playbook ProblemMap/retrieval-playbook.md
jump: retrieval traceability and data contracts ProblemMap/retrieval-traceability.md ProblemMap/data-contracts.md
jump: prompt injection root page ProblemMap/prompt-injection.md
jump: semantic clinic index ProblemMap/SemanticClinicIndex.md
-->
## Common failure patterns
| Failure mode | What happens | Open this |