🛡️ Guardrail Arena

LLM Safety Guardrail Training Environment — OpenEnv-spec RL environment for training content safety agents

An RL environment in which agents learn content-safety moderation: they observe a user prompt together with its conversation history and context, then choose one of four actions: allow, refuse, modify, or escalate. Rewards are asymmetric: a missed attack costs 2× as much as an over-block. A dynamic per-user risk score compounds the penalty for later missed attacks. Tasks 3 and 4 (multiturn_adversarial and adversarial_adaptation) require reasoning over the adversarial trajectory, so no stateless classifier can solve them.
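The reward asymmetry described above can be sketched as follows. This is a minimal illustration, not the environment's actual reward function: the unit cost, the way the risk score scales the penalty, and the risk increment are all hypothetical, and only the 2× ratio comes from the description. The modify and escalate actions are left out because their scoring is not stated.

```python
# Hypothetical constants: only the 2x ratio is taken from the description.
OVER_BLOCK_PENALTY = 1.0          # assumed unit cost of refusing a benign prompt
MISSED_ATTACK_MULTIPLIER = 2.0    # "missing an attack costs 2x" over-blocking
RISK_INCREMENT = 0.25             # assumed bump to the user's risk score per miss


def step_penalty(is_attack, action_type, risk_score=0.0):
    """Return the penalty for one decision (0.0 means a correct decision).

    Only "allow" and "refuse" are modelled in this sketch.
    """
    if action_type not in ("allow", "refuse"):
        raise ValueError("this sketch only models 'allow' and 'refuse'")
    if is_attack and action_type == "allow":
        # Missed attack: base cost doubled, then scaled up by the user's risk.
        return OVER_BLOCK_PENALTY * MISSED_ATTACK_MULTIPLIER * (1.0 + risk_score)
    if not is_attack and action_type == "refuse":
        # Over-block: flat cost, independent of risk in this sketch.
        return OVER_BLOCK_PENALTY
    return 0.0


def update_risk(risk_score, missed_attack):
    """Raise the per-user risk score after a missed attack, capped at 1.0."""
    if missed_attack:
        return min(1.0, risk_score + RISK_INCREMENT)
    return risk_score
```

Under these assumptions a second missed attack from the same user costs more than the first, which is the compounding effect the description refers to.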

Total Episodes Run: 36
Tasks: 4
Oracle Score: 1.0000
LLM Baseline: 0.9857

Tasks

| Task ID | Difficulty | Dataset | All-Allow | All-Refuse | LLM Baseline |
|---|---|---|---|---|---|
| basic_threat_detection | Easy | 67 eval / 131 train | 0.3750 | 0.3534 | 0.9857 (llama-3.3-70b) |
| context_aware_policy | Medium | 83 eval / 124 train | 0.4037 | 0.3460 | 0.7713 (llama-3.3-70b) |
| multiturn_adversarial | Hard | 238+ turns / 39 convos | 0.1607 | 0.0688 | 0.6866 (llama-3.3-70b) |
| adversarial_adaptation | Expert | 12 dynamic / session | 0.1500 | 0.0001 | 0.4820 (gpt-4o-mini) |
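To read the baseline columns at a glance, the sketch below computes how far the LLM baseline sits above the better of the two trivial policies on each task. The scores are copied from the table above; the `margin_over_trivial` helper is a hypothetical name introduced for illustration.

```python
# Scores copied from the Tasks table: (all_allow, all_refuse, llm_baseline).
BASELINES = {
    "basic_threat_detection": (0.3750, 0.3534, 0.9857),
    "context_aware_policy": (0.4037, 0.3460, 0.7713),
    "multiturn_adversarial": (0.1607, 0.0688, 0.6866),
    "adversarial_adaptation": (0.1500, 0.0001, 0.4820),
}


def margin_over_trivial(task_id):
    """LLM baseline score minus the best trivial (all-allow/all-refuse) score."""
    all_allow, all_refuse, llm = BASELINES[task_id]
    return round(llm - max(all_allow, all_refuse), 4)


if __name__ == "__main__":
    for task_id in BASELINES:
        print(f"{task_id}: {margin_over_trivial(task_id):.4f}")
```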

What Makes This Different

Comparison table: WildGuard, HarmBench, AdvBench, and Guardrail Arena, compared on four properties (single-turn, multi-turn, adversarial adaptation, training environment).

Leaderboard (Top 5 per Task)

basic_threat_detection

| # | Agent | Score |
|---|---|---|
| 1 | Oracle-Agent | 0.9999 |
| 2 | llama-3.3-70b-versatile | 0.9857 |
| 3 | Qwen-3-235B-Cerebras | 0.9857 |
| 4 | Qwen-3-235B-Cerebras | 0.9857 |
| 5 | GPT-4o-mini-ZeroShot | 0.9216 |

context_aware_policy

| # | Agent | Score |
|---|---|---|
| 1 | Oracle-Agent | 0.9999 |
| 2 | llama-3.3-70b-versatile | 0.7713 |
| 3 | GPT-4o-mini-ZeroShot | 0.7512 |
| 4 | Qwen-3-235B-Cerebras | 0.6862 |
| 5 | Qwen-3-235B-Cerebras | 0.6862 |

multiturn_adversarial

| # | Agent | Score |
|---|---|---|
| 1 | Oracle-Agent | 0.9999 |
| 2 | Qwen-3-235B-Cerebras | 0.8275 |
| 3 | Qwen-3-235B-Cerebras | 0.8275 |
| 4 | llama-3.3-70b-versatile | 0.6866 |
| 5 | GPT-4o-mini-ZeroShot | 0.6120 |

adversarial_adaptation

| # | Agent | Score |
|---|---|---|
| 1 | Oracle-Agent | 0.9999 |
| 2 | GPT-4o-mini-ZeroShot | 0.4820 |
| 3 | AllowAll-Baseline | 0.1500 |
| 4 | RefuseAll-Baseline | 0.0001 |
| 5 | Qwen-3-235B-Cerebras | 0.0001 |

Quick Start (5 curl commands)

# 1. Reset to Task 1 — receive session_id and first observation
curl -s -X POST "https://varunventra-guardrail-arena.hf.space/reset?task_id=basic_threat_detection"

# 2. Submit an action (replace SESSION_ID and PROMPT_ID from step 1)
curl -s -X POST "https://varunventra-guardrail-arena.hf.space/step?session_id=SESSION_ID" \
  -H "Content-Type: application/json" \
  -d '{"prompt_id":"PROMPT_ID","action_type":"allow","reason":"Safe prompt","modified_prompt":null}'

# 3. Get grader score after episode completes
curl -s "https://varunventra-guardrail-arena.hf.space/grader?session_id=SESSION_ID"

# 4. Submit score to leaderboard
curl -s -X POST "https://varunventra-guardrail-arena.hf.space/submit?session_id=SESSION_ID&agent_name=MyAgent"

# 5. Try the hardest task: deterministic adversarial adaptation
curl -s -X POST "https://varunventra-guardrail-arena.hf.space/reset?task_id=adversarial_adaptation"
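
The same reset / step / grader sequence can be driven from Python. The sketch below uses only the standard library and mirrors the curl calls above. The response keys it reads (`session_id`, `observation`, `prompt_id`, `done`) are assumptions inferred from the comments and endpoint descriptions on this page, not a confirmed schema; check /docs before relying on them. `run_episode` and its `policy` and `send` parameters are hypothetical names.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://varunventra-guardrail-arena.hf.space"
ACTIONS = ("allow", "refuse", "modify", "escalate")


def build_url(path, **params):
    """Join BASE_URL, an endpoint path, and URL-encoded query parameters."""
    query = urllib.parse.urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def build_action(prompt_id, action_type, reason, modified_prompt=None):
    """Build the JSON body for POST /step, mirroring the curl example."""
    if action_type not in ACTIONS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return {
        "prompt_id": prompt_id,
        "action_type": action_type,
        "reason": reason,
        "modified_prompt": modified_prompt,
    }


def http_json(method, url, body=None):
    """Send one request and decode the JSON response (performs network I/O)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    request = urllib.request.Request(url, data=data, headers=headers, method=method)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode(task_id, policy, send=http_json, max_steps=1000):
    """Play one episode and return the /grader response.

    `policy` maps an observation to an (action_type, reason) pair.
    The response keys used below are assumed, not documented here.
    """
    reset = send("POST", build_url("/reset", task_id=task_id))
    session_id = reset["session_id"]        # assumed key
    observation = reset["observation"]      # assumed key
    for _ in range(max_steps):
        action_type, reason = policy(observation)
        body = build_action(observation["prompt_id"], action_type, reason)
        result = send("POST", build_url("/step", session_id=session_id), body)
        if result["done"]:                  # assumed key
            break
        observation = result["observation"]
    return send("GET", build_url("/grader", session_id=session_id))
```

For example, `run_episode("basic_threat_detection", lambda obs: ("allow", "always allow"))` would play the All-Allow baseline against the live server. Passing a custom `send` function lets the loop be exercised offline.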

API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| GET | / | This page |
| GET | /health | Health check; returns JSON status |
| POST | /reset | Reset environment. Params: task_id, optional seed (int) |
| POST | /step | Submit action, receive observation + reward + done + info |
| GET | /state | Current environment state |
| GET | /tasks | All task metadata + action schema |
| GET | /grader | Final grader score (0.0–1.0) after episode ends |
| GET | /demo | Pre-scripted 5-step demo episode with full trajectory JSON |
| GET | /leaderboard | Top 10 scores per task (in-memory) |
| GET | /baseline | Pre-computed baseline scores |
| GET | /curriculum | Curriculum endpoint: progressive task ordering |
| GET | /sessions | Active isolated sessions |
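
The /reset endpoint accepts an optional integer seed alongside task_id. A minimal sketch of building that URL is shown below; the `reset_url` helper is a hypothetical name, and whether a fixed seed makes episodes fully reproducible is an assumption this page does not confirm.

```python
import urllib.parse

BASE_URL = "https://varunventra-guardrail-arena.hf.space"

# Task IDs as listed in the Tasks table.
TASK_IDS = (
    "basic_threat_detection",
    "context_aware_policy",
    "multiturn_adversarial",
    "adversarial_adaptation",
)


def reset_url(task_id, seed=None):
    """Build the POST /reset URL, adding the optional integer seed if given."""
    if task_id not in TASK_IDS:
        raise ValueError(f"unknown task_id: {task_id!r}")
    params = {"task_id": task_id}
    if seed is not None:
        params["seed"] = int(seed)  # the endpoint table lists seed as an int
    return f"{BASE_URL}/reset?{urllib.parse.urlencode(params)}"
```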

Full API docs: /docs · /redoc