Shield Rules

ClawNex’s Prompt Shield contains 155 rules organized into 10 categories. Every LLM prompt and response is scanned against these rules to detect threats.

Rule Categories

Category	Rules	Severity Range	False Positive Risk	Notes
secrets	21	CRITICAL-HIGH	Low	API keys, tokens, credentials. Very reliable.
commands	19	CRITICAL-HIGH	Low-Medium	Shell commands, reverse shells. Can trigger on code-generation tasks.
trust-exploit	18	HIGH-MEDIUM	Medium	“Ignore previous instructions” and similar injection patterns.
sensitive-paths	14	HIGH-MEDIUM	Low	References to credential files (.ssh, .env, etc.).
c2	12	CRITICAL-HIGH	Low	C2 beacons, webhook exfiltration, cloud metadata attacks.
cognitive-file	8	CRITICAL-HIGH	High (internal)	References to SOUL.md, MEMORY.md. Produces false positives on agent system prompts; use the whitelist.
jailbreak	8	HIGH	Medium	Known jailbreak patterns (grandma exploit, token smuggling).
financial	7	CRITICAL-MEDIUM	Medium	Credit cards, SSNs, IBANs. FIN-SWIFT-CODE may match all-caps words.
steganography	6	HIGH-MEDIUM	Low	Zero-width characters, homoglyphs, BIDI overrides.
encoding	6	MEDIUM	Medium	Base64, hex, ROT13 encoded payloads.

Scoring and Verdicts

Score = SUM(severity_weight x confidence x min(matchCount, 5))
  capped at 100

Severity weights: CRITICAL = 30, HIGH = 20, MEDIUM = 10, LOW = 5

Condition	Verdict
Any CRITICAL detection	BLOCK (regardless of score)
Score >= 60	BLOCK
Score >= 25	REVIEW
Score < 25	ALLOW
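
The formula and verdict table above can be sketched in a few lines of Python. This is a minimal illustration, not the shield's actual implementation: the `Detection` class and its field names are hypothetical, and confidence is assumed to be a value between 0 and 1.

```python
from dataclasses import dataclass

# Severity weights as listed above.
SEVERITY_WEIGHTS = {"CRITICAL": 30, "HIGH": 20, "MEDIUM": 10, "LOW": 5}


@dataclass
class Detection:
    """One matched rule (hypothetical shape, for illustration only)."""
    rule_id: str
    severity: str      # "CRITICAL", "HIGH", "MEDIUM" or "LOW"
    confidence: float  # assumed to lie in [0, 1]
    match_count: int


def score(detections):
    """Sum weight x confidence x min(matchCount, 5) over detections, capped at 100."""
    total = sum(
        SEVERITY_WEIGHTS[d.severity] * d.confidence * min(d.match_count, 5)
        for d in detections
    )
    return min(total, 100)


def verdict(detections):
    """Apply the verdict table from top to bottom."""
    if any(d.severity == "CRITICAL" for d in detections):
        return "BLOCK"  # a CRITICAL detection blocks regardless of score
    s = score(detections)
    if s >= 60:
        return "BLOCK"
    if s >= 25:
        return "REVIEW"
    return "ALLOW"
```

Because `matchCount` is clamped to 5, a single MEDIUM rule at full confidence contributes at most 50 points, so in this sketch it can reach REVIEW on its own but needs at least one other detection to reach BLOCK.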

Three-Layer Detection

Layer	Mechanism	Timing	Can Block?
Pre-call scan	LiteLLM `async_pre_call_hook`	Before model sees the prompt	Yes (if block mode is on)
Post-call scan	LiteLLM `log_success_event`	After response received	No (informational)
Retroactive scan	Session Watcher polls JSONL logs	Minutes to hours later	No (detection only)
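
The difference between the first two layers can be sketched as follows. The method names and parameter lists mirror LiteLLM's custom callback hooks, but this is a standalone stand-in so that it runs without LiteLLM installed; a real integration would subclass LiteLLM's `CustomLogger`. The `scan` function, the `ShieldBlocked` exception, and the `block_mode` flag are hypothetical placeholders, and the retroactive layer is not shown.

```python
import asyncio


class ShieldBlocked(Exception):
    """Raised by the pre-call hook to reject a request (hypothetical name)."""


def scan(text):
    """Placeholder scanner that stands in for the real rule engine."""
    return "BLOCK" if "ignore previous instructions" in text.lower() else "ALLOW"


class ShieldHooks:
    """Stand-in with the same hook names as a LiteLLM custom callback."""

    def __init__(self, block_mode=False):
        self.block_mode = block_mode
        self.findings = []  # (layer, verdict) pairs, kept for inspection

    async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
        # Layer 1: runs before the model sees the prompt, so it can reject.
        prompt = " ".join(str(m.get("content", "")) for m in data.get("messages", []))
        result = scan(prompt)
        self.findings.append(("pre-call", result))
        if result == "BLOCK" and self.block_mode:
            raise ShieldBlocked("prompt rejected by shield")
        return data

    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        # Layer 2: the response already exists, so this only records a finding.
        self.findings.append(("post-call", scan(str(response_obj))))
```

The pre-call hook is the only place where raising an exception can stop a request; by the time the post-call hook runs, the response has already been produced, which is why that layer is informational.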

Outbound Detection (Data Leak Prevention)

Applied to model responses:

Pattern	Severity
Private key material	CRITICAL
Password assignments	HIGH
Environment variable leaks	MEDIUM
Internal IP addresses	MEDIUM
Database connection URIs	HIGH
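
To give a rough sense of what outbound patterns like these might look like, here is a small regex-based sketch. The regular expressions, pattern names, and the `scan_response` helper are illustrative assumptions; the shield's actual outbound rules are not shown in this document and are likely more thorough.

```python
import re

# Illustrative patterns only, paired with the severities from the table above.
OUTBOUND_PATTERNS = [
    ("private-key", "CRITICAL",
     re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----")),
    ("password-assignment", "HIGH",
     re.compile(r"(?i)\bpassword\s*[:=]\s*\S+")),
    ("db-connection-uri", "HIGH",
     re.compile(r"\b(?:postgres(?:ql)?|mysql|mongodb(?:\+srv)?)://\S+")),
    ("env-var-leak", "MEDIUM",
     re.compile(r"\b[A-Z][A-Z0-9_]*(?:SECRET|TOKEN|KEY)\s*=\s*\S+")),
    # RFC 1918 private ranges: 10/8, 172.16/12, 192.168/16.
    ("internal-ip", "MEDIUM",
     re.compile(r"\b(?:10\.\d{1,3}|192\.168|172\.(?:1[6-9]|2\d|3[01]))\.\d{1,3}\.\d{1,3}\b")),
]


def scan_response(text):
    """Return (name, severity) for every illustrative pattern found in text."""
    return [(name, sev) for name, sev, rx in OUTBOUND_PATTERNS if rx.search(text)]
```

For example, `scan_response("connect to 10.0.3.7")` reports the `internal-ip` pattern, while a public address such as `8.8.8.8` produces no finding.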

Shield Modes

Mode	Behavior
OBSERVE (default)	Traffic is scanned and logged. Threats are flagged but not blocked.
BLOCK	Requests that receive a BLOCK verdict are rejected before they reach the AI model.

Start in OBSERVE mode. Review traffic for a few days to understand what your agents send. When confident the shield is not producing false positives on legitimate traffic, switch to BLOCK mode.
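
One way to approach that review period is to tally which rules would have caused a rejection had BLOCK mode been on. The sketch below assumes findings are available as JSONL records with `rule_id` and `verdict` fields; that log shape and both helper names are hypothetical, so adapt them to whatever your deployment actually records.

```python
import json
from collections import Counter


def should_reject(mode, verdict):
    """A request is rejected only in BLOCK mode with a BLOCK verdict."""
    return mode == "BLOCK" and verdict == "BLOCK"


def would_block_by_rule(jsonl_lines):
    """Count BLOCK verdicts per rule in logged findings (hypothetical fields)."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("verdict") == "BLOCK":
            counts[record.get("rule_id", "unknown")] += 1
    return counts
```

Calling `would_block_by_rule(lines).most_common()` on a few days of OBSERVE-mode findings lists the rules that would reject the most traffic, which is a reasonable place to start looking for false positives.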
