LLM-based guardrails (roadmap)¶
Status: Not yet implemented — design is in flight; this doc captures the shape we're aiming at so policy authors and contributors can reason about it before code lands.
OpenAgentLock's evaluator is intentionally deterministic YAML: a path-shape match plus a verdict. That is the right default — fast, auditable, no LLM-shaped failure modes in the hot path. But there are policy questions that genuinely cannot be decided by regex, and forcing them through command_regex produces either dangerous false negatives or unworkable false positives. Examples we've hit in the wild:
- "Block prompt-injected requests to leak credentials" — the malicious instruction may be in any tool input, in any wording.
- "Refuse to assist with self-harm or weapons synthesis" when the agent is wrapping a chat assistant.
- "Block exfiltration of company-confidential content" when the content is extracted from real source files (not a regex away).
For these we want a second tier of evaluator that calls into a small safety classifier model, running locally, and returns a per-category severity verdict that policy can act on.
Severity model¶
Each category in a guardrail verdict carries a severity drawn from a fixed ordered scale; the examples in this doc use none, medium, and high, in increasing order of severity.
Comparison operators (>=, >, =) in the policy's on_threshold block are evaluated against this ordering. The literal category: any matches if any category meets the threshold.
The classifier we wrap may emit raw boolean per-category flags rather than severities directly. The OpenAgentLock-side adapter normalizes those into the scale above using a per-model mapping table — for the NemoGuard models, a true flag maps to high by default, with overrides shipped in the model registry entry.
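For concreteness, here is a sketch of what a registry entry's mapping table could look like. The field names (backend, severity_mapping, default, overrides) and the override values are illustrative assumptions, not the settled registry format.

```yaml
# Sketch of a model-registry entry; field names are assumptions, not the
# settled schema. A raw `true` flag maps to the default severity unless a
# per-category override says otherwise.
models:
  - id: nemoguard-8b
    backend: ollama
    severity_mapping:
      default: high            # raw true -> high, the NemoGuard default
      overrides:
        privacy: medium        # example override: treat bare privacy flags as medium
```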
Reference: Llama 3.1 NemoGuard¶
The shape we're targeting is close to NVIDIA's llama-3.1-nemotron-safety-guard-8b-v3. Input: a tool-call or message payload + the agent's recent context. Output: structured JSON over a fixed taxonomy. The classifier emits booleans; the daemon-side adapter rewrites them into severity scores before the gate's on_threshold block evaluates:
// Classifier output, before adapter rewrite:
{
  "safe": false,
  "categories": {
    "indiscriminate_weapons": true,
    "privacy": false,
    "self_harm": false
    // ... etc
  },
  "rationale": "request mentions enrichment of fissile material..."
}

// After OpenAgentLock adapter rewrite — what the policy sees:
{
  "safe": false,
  "categories": {
    "indiscriminate_weapons": { "severity": "high", "raw": true },
    "privacy": { "severity": "none", "raw": false },
    "self_harm": { "severity": "none", "raw": false }
  },
  "rationale": "request mentions enrichment of fissile material..."
}
A guardrail-typed gate then matches severity: ">= high" against indiscriminate_weapons, and the category: any rule fires.
Proposed gate shape¶
gates:
  - id: guardrail.injection-resistance
    match:
      tool: "*"                  # apply to every tool call
    evaluate:
      - kind: llm_guardrail
        model: nemoguard-8b      # local registry of models
        on_threshold:
          # any high-severity category trips deny
          - category: any
            severity: ">= high"
            action: deny
          # specific categories mapped explicitly
          - category: privacy
            severity: ">= medium"
            action: deny
        cache_ttl: 60s           # avoid re-classifying identical payloads
        max_latency: 400ms       # cap; on timeout the rule abstains
Key invariants:
- Local-first. Model serving runs on the same host as the daemon. No prompts leave the box. Default backend is ollama (which already supports the NeMo Guard models); we'll wire vllm and llama.cpp as alternates.
- Abstain, not deny, on failure. If the model is slow/unavailable, the rule abstains (returns skip) rather than denying; guardrails must not become a DoS surface. The monitor policy mode logs the abstention.
- Deterministic core stays primary. Guardrail evaluators run after the deterministic regex/glob matchers. The simple matchers absorb the easy cases; the LLM only sees what the policy explicitly routes to it.
- Same audit trail. Verdicts include the classifier output; ledger entries record signer, model, and the structured taxonomy verdict so a verifier can reproduce the call.
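To make the last point concrete, a ledger entry for a guardrail verdict might look something like the sketch below; the field names and layout are assumptions for illustration, not the settled ledger schema.

```yaml
# Hypothetical ledger entry for a guardrail verdict (field names are
# illustrative, not the settled ledger schema).
event: tool_call.verdict
gate: guardrail.injection-resistance
evaluator:
  kind: llm_guardrail
  model: nemoguard-8b
  signer: registry-key-01        # placeholder signer id
verdict:
  action: deny
  matched_threshold: { category: any, severity: ">= high" }
  categories:
    indiscriminate_weapons: { severity: high, raw: true }
    privacy: { severity: none, raw: false }
    self_harm: { severity: none, raw: false }
  rationale: "request mentions enrichment of fissile material..."
```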
Wire shape¶
A new evaluator kind in the policy schema:
- kind: llm_guardrail
  model: <model-id>              # required; resolved via /v1/llm/models
  prompt_field: input.command    # which path on the tool call to send
  context_fields:                # optional extra context
    - input.file_path
    - session.policy_hash
  on_threshold: [...]
  cache_ttl: 60s
  max_latency: 400ms
Policies can mix-and-match — most rules will stay regex; a small number will route through the classifier when the regex tier reports an ambiguous match.
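A sketch of that mix follows. The llm_guardrail fields come from the wire shape above; the command_regex evaluator shape and the send_email tool are assumptions made up for contrast, not settled schema.

```yaml
# Illustrative policy mixing the two tiers. Only the llm_guardrail shape is
# sketched in this doc; the command_regex shape here is an assumption.
gates:
  # Deterministic tier: cheap, auditable, absorbs the easy cases.
  - id: shell.block-curl-pipe-sh
    match:
      tool: shell
    evaluate:
      - kind: command_regex
        pattern: 'curl .*\|\s*(sh|bash)'
        action: deny

  # Guardrail tier: only what the policy explicitly routes to the classifier.
  - id: guardrail.outbound-privacy
    match:
      tool: send_email             # hypothetical tool; scope the LLM to the risky surface
    evaluate:
      - kind: llm_guardrail
        model: nemoguard-8b
        prompt_field: input.body   # assumed field on the hypothetical tool
        on_threshold:
          - category: privacy
            severity: ">= medium"
            action: deny
        cache_ttl: 60s
        max_latency: 400ms
```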
Rollout plan¶
| Step | Status |
|---|---|
| Roadmap doc (this file) | Shipped |
| Add kind: llm_guardrail to the policy schema, parser only | Not yet implemented |
| /v1/llm/models registry endpoint | Not yet implemented |
| First evaluator implementation against ollama running NeMo Guard locally | Not yet implemented |
| Latency / cache / abstain semantics + tests | Not yet implemented |
| Community-rule shape (schema_version: 2 adds the kind) | Not yet implemented |
| Dashboard UI shows model verdicts in the event row | Not yet implemented |
Why a roadmap doc instead of just an issue¶
Guardrail rules will look very different from regex rules: they ship the model id, a threshold table, and a context-extraction recipe. We want contributors to the openagentlock/rules registry to be able to author against a stable wire shape before the evaluator lands, without hitting "is the schema settled yet?", so the rules are ready when the daemon ships support.
If you're interested in helping land any of the not-yet-implemented rows, file an issue on the main repo. The two highest-leverage starting points are the /v1/llm/models registry and an ollama-backed first evaluator.