LLM-based guardrails (roadmap)¶
Status: Initial external-provider slice shipped — OpenAgentLock now has startup-env provider configuration, catalog discovery, runtime classifier evaluation for NVIDIA-style guardrails, and web/TUI visibility. Broader policy-schema integration and community-rule contribution flows remain roadmap.
OpenAgentLock's evaluator is intentionally deterministic YAML: a path-shape match plus a verdict. That is the right default — fast, auditable, no LLM-shaped failure modes in the hot path. But there are policy questions that genuinely cannot be decided by regex, and forcing them through command_regex produces either dangerous false negatives or unworkable false positives. Examples we've hit in the wild:
- "Block prompt-injected requests to leak credentials" — the malicious instruction may be in any tool input, in any wording.
- "Refuse to assist with self-harm or weapons synthesis" when the agent is wrapping a chat assistant.
- "Block exfiltration of company-confidential content" when the content is extracted from real source files (not a regex away).
For these we want a second tier of evaluator that calls into a safety classifier model and returns a structured result that OpenAgentLock can audit. The first shipped slice supports external hosted providers as an explicit opt-in. Local-only operation remains the default when no provider is configured.
Severity model¶
Each category in a guardrail verdict carries a severity from a fixed ordered scale:
Comparison operators (>=, >, =) in the policy's on_threshold block are evaluated against this ordering. The literal category: any matches if any category meets the threshold.
The classifier we wrap may emit raw boolean per-category flags rather than severities directly. The OpenAgentLock-side adapter normalizes those into the scale above using a per-model mapping table — for the NemoGuard models, a true flag maps to high by default, with overrides shipped in the model registry entry.
Reference: Llama 3.1 NemoGuard¶
The shape we're targeting is close to NVIDIA's llama-3.1-nemotron-safety-guard-8b-v3. Input: a tool-call or message payload + the agent's recent context. Output: structured JSON over a fixed taxonomy. The classifier emits booleans; the daemon-side adapter rewrites them into severity scores before the gate's on_threshold block evaluates:
// Classifier output, before adapter rewrite:
{
"safe": false,
"categories": {
"indiscriminate_weapons": true,
"privacy": false,
"self_harm": false
// ... etc
},
"rationale": "request mentions enrichment of fissile material..."
}
// After OpenAgentLock adapter rewrite — what the policy sees:
{
"safe": false,
"categories": {
"indiscriminate_weapons": { "severity": "high", "raw": true },
"privacy": { "severity": "none", "raw": false },
"self_harm": { "severity": "none", "raw": false }
},
"rationale": "request mentions enrichment of fissile material..."
}
A guardrail-typed gate then matches severity: ">= high" against indiscriminate_weapons and the category: any rule fires.
Proposed gate shape¶
gates:
- id: guardrail.injection-resistance
match:
tool: "*" # apply to every tool call
evaluate:
- kind: llm_guardrail
model: nemoguard-8b # local registry of models
on_threshold:
# any high-severity category trips deny
- category: any
severity: ">= high"
action: deny
# specific categories mapped explicitly
- category: privacy
severity: ">= medium"
action: deny
cache_ttl: 60s # avoid re-classifying identical payloads
max_latency: 400ms # cap; on timeout the rule abstains
Key invariants:
- Local policy first. Deterministic YAML policy runs before external guardrails. Local deny stops immediately; external guardrails only run after a local allow.
- Abstain, not deny, on failure. If the model is slow/unavailable, the rule abstains (returns
skip) rather than denying — guardrails must not become a DoS surface. Themonitorpolicy mode logs the abstention. - Deterministic core stays primary. Guardrail evaluators run after deterministic regex/glob matchers. The simple matchers absorb the easy cases; the LLM only sees what the policy explicitly routes to it.
- Same audit trail. Verdicts keep local policy, guardrail, and final verdicts distinct. A guardrail deny is not presented as a deterministic YAML rule hit.
Shipped external-provider slice¶
The daemon exposes:
GET /v1/guardrails/providersPOST /v1/guardrails/providers/{id}/testGET /v1/guardrails/catalogPUT /v1/guardrails/enabledGET /v1/guardrails/traces/{seq}
Provider credentials are read from environment variables when the control plane starts, not from the web dashboard or CLI:
The daemon stores these keys in RAM only. They are never written by the dashboard and are cleared on daemon restart.
Provider behavior:
- NVIDIA catalog entries are normalized as
classifier_modelentries and can run in the post-local-policy runtime stage. - OpenRouter guardrails are normalized as
account_policyentries. They are visible in catalog surfaces but do not run as OpenAgentLock runtime classifiers in this slice, so they should be treated as catalog visibility rather than enforcement. - Provider errors, unsupported runtime entries, and malformed classifier responses produce
abstain, not implicit allow or deny.
Wire shape¶
A new evaluator kind in the policy schema:
- kind: llm_guardrail
model: <model-id> # required; resolved via /v1/llm/models
prompt_field: input.command # which path on the tool call to send
context_fields: # optional extra context
- input.file_path
- session.policy_hash
on_threshold: [...]
cache_ttl: 60s
max_latency: 400ms
Policies can mix-and-match — most rules will stay regex; a small number will route through the classifier when the regex tier reports an ambiguous match.
Rollout plan¶
| Step | Status |
|---|---|
| Roadmap doc (this file) | Shipped |
/v1/guardrails/providers provider registry endpoint |
Shipped |
/v1/guardrails/catalog normalized provider catalog |
Shipped |
| NVIDIA runtime classifier integration | Shipped |
| OpenRouter account-policy catalog visibility | Shipped |
| Startup-env RAM provider key configuration | Shipped |
| Catalog cache / abstain semantics + tests | Shipped |
| Provider-measured runtime latency in traces | Not yet implemented |
| Dashboard and TUI multi-stage trace visibility | Shipped |
Add kind: llm_guardrail to the policy schema, parser only |
Not yet implemented |
Local ollama / vllm / llama.cpp backends |
Not yet implemented |
Community-rule shape (schema_version: 2 adds the kind) |
Not yet implemented |
| Opt-in community contribution pipeline | Not yet implemented |
Why a roadmap doc instead of just an issue¶
Guardrail rules will look very different from regex rules — they ship the model id, threshold table, and a context-extraction recipe. We want contributors to the openagentlock/rules registry to be able to author these without running into "is the schema settled yet?" — and to be able to author against a stable wire shape before the evaluator lands so the rules are ready when the daemon ships support.
If you're interested in helping land any of the not-yet-implemented rows, file an issue on the main repo. The two highest-leverage starting points are the /v1/llm/models registry and an ollama-backed first evaluator.