Shield Rules
ClawNex’s Prompt Shield contains 155 rules organized into 10 categories. Every LLM prompt and response is scanned against these rules to detect threats.
Rule Categories
| Category | Rules | Severity Range | False Positive Risk | Notes |
|---|---|---|---|---|
| secrets | 21 | CRITICAL-HIGH | Low | API keys, tokens, credentials. Very reliable. |
| commands | 19 | CRITICAL-HIGH | Low-Medium | Shell commands, reverse shells. Can trigger on code-generation tasks. |
| trust-exploit | 18 | HIGH-MEDIUM | Medium | ”Ignore previous instructions” and similar injection patterns. |
| sensitive-paths | 14 | HIGH-MEDIUM | Low | References to credential files (.ssh, .env, etc.). |
| c2 | 12 | CRITICAL-HIGH | Low | C2 beacons, webhook exfiltration, cloud metadata attacks. |
| cognitive-file | 8 | CRITICAL-HIGH | High (internal) | References to SOUL.md, MEMORY.md. Will false-positive on agent system prompts — use the whitelist. |
| jailbreak | 8 | HIGH | Medium | Known jailbreak patterns (grandma exploit, token smuggling). |
| financial | 7 | CRITICAL-MEDIUM | Medium | Credit cards, SSNs, IBANs. FIN-SWIFT-CODE may match all-caps words. |
| steganography | 6 | HIGH-MEDIUM | Low | Zero-width characters, homoglyphs, BIDI overrides. |
| encoding | 6 | MEDIUM | Medium | Base64, hex, ROT13 encoded payloads. |
Scoring and Verdicts
Score = SUM(severity_weight x confidence x min(matchCount, 5))
capped at 100
Severity weights: CRITICAL = 30, HIGH = 20, MEDIUM = 10, LOW = 5| Condition | Verdict |
|---|---|
| Any CRITICAL detection | BLOCK (regardless of score) |
| Score >= 60 | BLOCK |
| Score >= 25 | REVIEW |
| Score < 25 | ALLOW |
Three-Layer Detection
| Layer | Mechanism | Timing | Can Block? |
|---|---|---|---|
| Pre-call scan | LiteLLM async_pre_call_hook | Before model sees the prompt | Yes (if block mode is on) |
| Post-call scan | LiteLLM log_success_event | After response received | No (informational) |
| Retroactive scan | Session Watcher polls JSONL logs | Minutes to hours later | No (detection only) |
Outbound Detection (Data Leak Prevention)
Applied to model responses:
| Pattern | Severity |
|---|---|
| Private key material | CRITICAL |
| Password assignments | HIGH |
| Environment variable leaks | MEDIUM |
| Internal IP addresses | MEDIUM |
| Database connection URIs | HIGH |
Shield Modes
| Mode | Behavior |
|---|---|
| OBSERVE (default) | Traffic is scanned and logged. Threats are flagged but not blocked. |
| BLOCK | Threats that score BLOCK are actively rejected before reaching the AI model. |
Start in OBSERVE mode. Review traffic for a few days to understand what your agents send. When confident the shield is not producing false positives on legitimate traffic, switch to BLOCK mode.