Shield Rules

ClawNex’s Prompt Shield contains 155 rules organized into 10 categories. Every LLM prompt and response is scanned against these rules to detect threats.

Rule Categories

| Category | Rules | Severity Range | False Positive Risk | Notes |
|---|---|---|---|---|
| secrets | 21 | CRITICAL-HIGH | Low | API keys, tokens, credentials. Very reliable. |
| commands | 19 | CRITICAL-HIGH | Low-Medium | Shell commands, reverse shells. Can trigger on code-generation tasks. |
| trust-exploit | 18 | HIGH-MEDIUM | Medium | "Ignore previous instructions" and similar injection patterns. |
| sensitive-paths | 14 | HIGH-MEDIUM | Low | References to credential files (.ssh, .env, etc.). |
| c2 | 12 | CRITICAL-HIGH | Low | C2 beacons, webhook exfiltration, cloud metadata attacks. |
| cognitive-file | 8 | CRITICAL-HIGH | High (internal) | References to SOUL.md, MEMORY.md. Will false-positive on agent system prompts; use the whitelist. |
| jailbreak | 8 | HIGH | Medium | Known jailbreak patterns (grandma exploit, token smuggling). |
| financial | 7 | CRITICAL-MEDIUM | Medium | Credit cards, SSNs, IBANs. FIN-SWIFT-CODE may match all-caps words. |
| steganography | 6 | HIGH-MEDIUM | Low | Zero-width characters, homoglyphs, BIDI overrides. |
| encoding | 6 | MEDIUM | Medium | Base64, hex, ROT13 encoded payloads. |

Scoring and Verdicts

Score = SUM(severity_weight x confidence x min(matchCount, 5))
  capped at 100

Severity weights: CRITICAL = 30, HIGH = 20, MEDIUM = 10, LOW = 5

| Condition | Verdict |
|---|---|
| Any CRITICAL detection | BLOCK (regardless of score) |
| Score >= 60 | BLOCK |
| Score >= 25 | REVIEW |
| Score < 25 | ALLOW |
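
The scoring formula and verdict thresholds above can be sketched in a few lines of Python. This is an illustrative sketch, not ClawNex's implementation: the `Detection` type, its field names, and the assumption that `confidence` lies between 0.0 and 1.0 are all hypothetical.

```python
from dataclasses import dataclass

# Weights taken from the "Severity weights" line above.
SEVERITY_WEIGHTS = {"CRITICAL": 30, "HIGH": 20, "MEDIUM": 10, "LOW": 5}


@dataclass
class Detection:
    """One matched rule. Field names are hypothetical."""
    rule_id: str
    severity: str       # one of the SEVERITY_WEIGHTS keys
    confidence: float   # assumed to lie in [0.0, 1.0]
    match_count: int


def score(detections):
    """Sum weight x confidence x min(match_count, 5), capped at 100."""
    total = sum(
        SEVERITY_WEIGHTS[d.severity] * d.confidence * min(d.match_count, 5)
        for d in detections
    )
    return min(total, 100)


def verdict(detections):
    """Apply the verdict table: any CRITICAL blocks, then score thresholds."""
    if any(d.severity == "CRITICAL" for d in detections):
        return "BLOCK"
    s = score(detections)
    if s >= 60:
        return "BLOCK"
    if s >= 25:
        return "REVIEW"
    return "ALLOW"
```

For example, a single HIGH detection with confidence 0.5 and two matches contributes 20 x 0.5 x 2 = 20, which falls below the REVIEW threshold, while any CRITICAL detection yields BLOCK even when its score contribution is small.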

Three-Layer Detection

| Layer | Mechanism | Timing | Can Block? |
|---|---|---|---|
| Pre-call scan | LiteLLM async_pre_call_hook | Before model sees the prompt | Yes (if block mode is on) |
| Post-call scan | LiteLLM log_success_event | After response received | No (informational) |
| Retroactive scan | Session Watcher polls JSONL logs | Minutes to hours later | No (detection only) |
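
The difference between the first two layers can be illustrated with a stand-alone class whose method names mirror the hooks in the table. This is a sketch under assumptions: it does not import LiteLLM, the parameter lists follow LiteLLM's custom-callback hooks as commonly documented, and `ShieldHooks`, `scan`, and `findings` are hypothetical names rather than ClawNex internals. The retroactive layer is omitted.

```python
import asyncio


class ShieldHooks:
    """Stand-in for a LiteLLM callback class (hypothetical, for illustration)."""

    def __init__(self, scan, block_mode=False):
        self.scan = scan              # hypothetical callable: text -> verdict string
        self.block_mode = block_mode  # True corresponds to BLOCK mode
        self.findings = []            # record of non-ALLOW verdicts

    async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
        # Pre-call: runs before the model sees the prompt, so it can reject it.
        prompt = " ".join(str(m.get("content", "")) for m in data.get("messages", []))
        verdict = self.scan(prompt)
        if verdict != "ALLOW":
            self.findings.append(("pre-call", verdict))
        if verdict == "BLOCK" and self.block_mode:
            raise ValueError("Prompt rejected by shield")
        return data

    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        # Post-call: the response has already been produced, so only record it.
        verdict = self.scan(str(response_obj))
        if verdict != "ALLOW":
            self.findings.append(("post-call", verdict))
```

Only the pre-call hook raises, which matches the "Can Block?" column: the post-call path has no way to stop a response that has already been received.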

Outbound Detection (Data Leak Prevention)

Applied to model responses:

| Pattern | Severity |
|---|---|
| Private key material | CRITICAL |
| Password assignments | HIGH |
| Environment variable leaks | MEDIUM |
| Internal IP addresses | MEDIUM |
| Database connection URIs | HIGH |
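
To make the outbound categories concrete, here is a minimal sketch of a response scanner. The regular expressions and pattern names are illustrative assumptions, not the shield's actual rules; only the category-to-severity mapping comes from the table above.

```python
import re

# (name, severity, pattern). All patterns are simplified, hypothetical examples.
OUTBOUND_PATTERNS = [
    ("private-key", "CRITICAL",
     re.compile(r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----")),
    ("password-assignment", "HIGH",
     re.compile(r"(?i)\bpassword\s*[:=]\s*\S+")),
    ("db-connection-uri", "HIGH",
     re.compile(r"\b(?:postgres(?:ql)?|mysql|mongodb(?:\+srv)?|redis)://\S+")),
    ("env-var-leak", "MEDIUM",
     re.compile(r"\b[A-Z][A-Z0-9_]*(?:KEY|TOKEN|SECRET)\s*=\s*\S+")),
    ("internal-ip", "MEDIUM",
     re.compile(
         r"\b(?:10\.\d{1,3}\.\d{1,3}\.\d{1,3}"
         r"|192\.168\.\d{1,3}\.\d{1,3}"
         r"|172\.(?:1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3})\b"
     )),
]


def scan_response(text):
    """Return (name, severity) for every outbound pattern found in text."""
    return [(name, sev) for name, sev, pat in OUTBOUND_PATTERNS if pat.search(text)]
```

A response containing `postgres://user:pw@10.0.0.5:5432/db` would be reported under both the database URI and internal IP patterns in this sketch, since the categories are checked independently.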

Shield Modes

| Mode | Behavior |
|---|---|
| OBSERVE (default) | Traffic is scanned and logged. Threats are flagged but not blocked. |
| BLOCK | Prompts that receive a BLOCK verdict are rejected before they reach the AI model. |

Start in OBSERVE mode. Review traffic for a few days to understand what your agents send. Once you are confident the shield is not producing false positives on legitimate traffic, switch to BLOCK mode.
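
How the mode interacts with a verdict can be summarized as a small decision function. This is a sketch only: the function name and the returned action labels are hypothetical, and it assumes that REVIEW verdicts are flagged but never rejected in either mode.

```python
def action_for(verdict, mode="OBSERVE"):
    """Map a verdict and shield mode to an action (labels are hypothetical)."""
    if verdict == "ALLOW":
        return "forward"
    if verdict == "BLOCK" and mode == "BLOCK":
        return "reject"
    # OBSERVE mode, or a REVIEW verdict: log the finding but let traffic through.
    return "forward_and_flag"
```

In OBSERVE mode the function never returns `"reject"`, which is why that mode is safe for an initial evaluation period.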