# 🛡️ Guardrail Arena

**LLM Safety Guardrail Training Environment** — an OpenEnv-spec RL environment for training content safety agents
An RL environment where agents learn to moderate content safety: observe user prompts with conversation history and context, then choose allow / refuse / modify / escalate. Rewards are asymmetric: missing an attack costs 2× more than over-blocking. Dynamic user risk scores compound future penalties for missed attacks. Tasks 3 & 4 require adversarial trajectory reasoning — no stateless classifier can solve them.
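The asymmetric reward and compounding risk scheme described above can be illustrated with a minimal sketch. The 2× miss penalty comes from the text; the base penalty values and the risk increment are illustrative placeholders, not the environment's actual constants:

```python
# Illustrative sketch of the asymmetric reward scheme. The 2x miss
# penalty matches the environment description; the base values and the
# risk increment are hypothetical, not the real grader constants.

def step_reward(action: str, is_attack: bool, user_risk: float) -> tuple:
    """Return (reward, updated_user_risk) for one moderation decision."""
    OVER_BLOCK_PENALTY = -1.0               # refusing a benign prompt
    MISS_PENALTY = 2 * OVER_BLOCK_PENALTY   # missing an attack costs 2x more
    CORRECT = 1.0

    if is_attack and action in ("refuse", "escalate"):
        return CORRECT, user_risk
    if not is_attack and action in ("allow", "modify"):
        return CORRECT, user_risk
    if is_attack and action == "allow":
        # A missed attack raises the user's risk score, which scales
        # (compounds) the penalty for future misses by this user.
        new_risk = user_risk + 0.5
        return MISS_PENALTY * (1 + user_risk), new_risk
    return OVER_BLOCK_PENALTY, user_risk
```

With these placeholder values, a first missed attack costs −2.0 and raises the user's risk, so a second miss costs −3.0 — the compounding effect no stateless classifier can account for.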
- **Total Episodes Run:** 36
- **Tasks:** 4
- **Oracle Score:** 1.0000
- **LLM Baseline:** 0.9857
## Tasks

| Task ID | Difficulty | Dataset | All-Allow | All-Refuse | LLM Baseline |
|---|---|---|---|---|---|
| basic_threat_detection | Easy | 67 eval / 131 train | 0.3750 | 0.3534 | 0.9857 (llama-3.3-70b) |
| context_aware_policy | Medium | 83 eval / 124 train | 0.4037 | 0.3460 | 0.7713 (llama-3.3-70b) |
| multiturn_adversarial | Hard | 238+ turns / 39 convos | 0.1607 | 0.0688 | 0.6866 (llama-3.3-70b) |
| adversarial_adaptation | Expert | 12 dynamic / session | 0.1500 | 0.0001 | 0.4820 (gpt-4o-mini) |
## What Makes This Different

| Benchmark | Single-turn | Multi-turn | Adversarial adaptation | Training environment |
|---|---|---|---|---|
| WildGuard | ✅ | ❌ | ❌ | ❌ |
| HarmBench | ✅ | ❌ | ❌ | ❌ |
| AdvBench | ✅ | ❌ | ❌ | ❌ |
| Guardrail Arena | ✅ | ✅ | ✅ | ✅ |
## Leaderboard (Top 5 per Task)

**basic_threat_detection**

| # | Agent | Score |
|---|---|---|
| 1 | Oracle-Agent | 0.9999 |
| 2 | llama-3.3-70b-versatile | 0.9857 |
| 3 | Qwen-3-235B-Cerebras | 0.9857 |
| 4 | Qwen-3-235B-Cerebras | 0.9857 |
| 5 | GPT-4o-mini-ZeroShot | 0.9216 |

**context_aware_policy**

| # | Agent | Score |
|---|---|---|
| 1 | Oracle-Agent | 0.9999 |
| 2 | llama-3.3-70b-versatile | 0.7713 |
| 3 | GPT-4o-mini-ZeroShot | 0.7512 |
| 4 | Qwen-3-235B-Cerebras | 0.6862 |
| 5 | Qwen-3-235B-Cerebras | 0.6862 |

**multiturn_adversarial**

| # | Agent | Score |
|---|---|---|
| 1 | Oracle-Agent | 0.9999 |
| 2 | Qwen-3-235B-Cerebras | 0.8275 |
| 3 | Qwen-3-235B-Cerebras | 0.8275 |
| 4 | llama-3.3-70b-versatile | 0.6866 |
| 5 | GPT-4o-mini-ZeroShot | 0.6120 |

**adversarial_adaptation**

| # | Agent | Score |
|---|---|---|
| 1 | Oracle-Agent | 0.9999 |
| 2 | GPT-4o-mini-ZeroShot | 0.4820 |
| 3 | AllowAll-Baseline | 0.1500 |
| 4 | RefuseAll-Baseline | 0.0001 |
| 5 | Qwen-3-235B-Cerebras | 0.0001 |
## Quick Start (5 curl commands)

```bash
# 1. Reset to Task 1 — receive session_id and first observation
curl -s -X POST "https://varunventra-guardrail-arena.hf.space/reset?task_id=basic_threat_detection"

# 2. Submit an action (replace SESSION_ID and PROMPT_ID from step 1)
curl -s -X POST "https://varunventra-guardrail-arena.hf.space/step?session_id=SESSION_ID" \
  -H "Content-Type: application/json" \
  -d '{"prompt_id":"PROMPT_ID","action_type":"allow","reason":"Safe prompt","modified_prompt":null}'

# 3. Get grader score after episode completes
curl -s "https://varunventra-guardrail-arena.hf.space/grader?session_id=SESSION_ID"

# 4. Submit score to leaderboard
curl -s -X POST "https://varunventra-guardrail-arena.hf.space/submit?session_id=SESSION_ID&agent_name=MyAgent"

# 5. Try the hardest task: deterministic adversarial adaptation
curl -s -X POST "https://varunventra-guardrail-arena.hf.space/reset?task_id=adversarial_adaptation"
```
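The same flow can be driven from Python with only the standard library. A minimal client sketch — the helper names are mine, the action body follows the schema shown in step 2, and response field names should be verified against the live API:

```python
import json
import urllib.request

BASE = "https://varunventra-guardrail-arena.hf.space"  # the Space from the curl examples


def build_action(prompt_id, action_type, reason, modified_prompt=None):
    """Encode a /step request body in the schema shown in the Quick Start."""
    return json.dumps({
        "prompt_id": prompt_id,
        "action_type": action_type,   # allow / refuse / modify / escalate
        "reason": reason,
        "modified_prompt": modified_prompt,
    }).encode()


def post_json(path, body=None):
    """POST to the environment and decode the JSON response."""
    req = urllib.request.Request(
        BASE + path,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Live demo (requires network access to the Space).
    obs = post_json("/reset?task_id=basic_threat_detection")
    print(obs)  # inspect for session_id and the first observation
```

From here, an agent loop would call `post_json("/step?session_id=...", build_action(...))` until the episode reports done, then fetch `/grader` for the final score.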
## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | This page |
| GET | `/health` | Health check — returns JSON status |
| POST | `/reset` | Reset environment. Params: `task_id`, optional `seed` (int) |
| POST | `/step` | Submit action, receive observation + reward + done + info |
| GET | `/state` | Current environment state |
| GET | `/tasks` | All task metadata + action schema |
| GET | `/grader` | Final grader score (0.0–1.0) after episode ends |
| GET | `/demo` | Pre-scripted 5-step demo episode with full trajectory JSON |
| GET | `/leaderboard` | Top 10 scores per task (in-memory) |
| GET | `/baseline` | Pre-computed baseline scores |
| GET | `/curriculum` | Curriculum endpoint — progressive task ordering |
| GET | `/sessions` | Active isolated sessions |
Full API docs: /docs · /redoc