GRPO OpenEnv Live
Welcome to the CounterFeint Arena! Click Run Auto Match to watch a live simulation using deterministic scripted agents: a reactive Fraudster, a heuristic Investigator, and a rule-based Auditor. Watch reward curves evolve in real time as the agents compete step by step. For LLM-based agent results and trained-model comparisons, see the Results tab.
Arena diagram: the Fraudster proposes and modifies deceptive ads; they land in the 📜 Shared Ad Queue, where ads accumulate and both agents see them; the Investigator investigates ads and renders verdicts; the Auditor (post-hoc reasoning and plausibility auditor) audits both agents.
Match phases: Ready → 🤖 Fraudster Turn → 🔍 Investigator Turn → ⚖ Audit Phase → ✔ Done
Match stats (shown as dashes until a match runs): Round · Total Steps · Proposals Used · Grader Score · End Reason
Agent Reward Trajectories chart (series: Fraudster, Investigator)
Fraudster (adversarial ad proposer), reward 0.00: run a match to see fraudster actions.
📜 Ad Queue (0 ads): no ads yet.
Investigator (evidence-based reviewer), reward 0.00: run a match to see investigator actions.
Auditor (post-hoc reasoning & plausibility auditor), reward 0.00: the Auditor acts after the match concludes; run a match to see audit results.
🕑 Match Timeline (0 events)
Step into the Investigator's shoes! Select a task difficulty below, click Reset environment to begin, then investigate ads by examining advertiser histories, landing pages, and payment methods. Render your verdict (approve, reject, or escalate) on each ad. Every investigation costs budget, so balance thoroughness against efficiency to maximize your cumulative reward.
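A toy sketch of that loop; all names, probabilities, and payoff numbers here are illustrative stand-ins, not the environment's real internals:

```python
import random

# Toy stand-in for the playground: each investigation spends budget,
# each verdict is scored against hidden ground truth.
random.seed(7)
ads = [{"is_fraud": random.random() < 0.3} for _ in range(12)]
budget, cum_reward = 1.0, 0.0

for ad in ads:
    if budget < 0.02:
        break                      # out of budget: remaining ads go unreviewed
    budget -= 0.02                 # each investigation costs budget
    cum_reward -= 0.02
    # Evidence is noisy: the investigation is right ~80% of the time.
    suspicious = ad["is_fraud"] if random.random() < 0.8 else not ad["is_fraud"]
    verdict = "reject" if suspicious else "approve"
    correct = (verdict == "reject") == ad["is_fraud"]
    cum_reward += 0.3 if correct else -0.4   # illustrative payoff
print(f"budget left {budget:.2f}, cumulative reward {cum_reward:+.2f}")
```

Investigating more thoroughly raises verdict accuracy but drains budget faster, which is exactly the tradeoff the playground asks you to manage.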
Session stats (shown as dashes until you play): Total ads · Reviewed · Budget left · Step · Env score · Cum. reward
Cumulative Reward chart: select a task and reset to begin.
Panels: Ad queue · Subject profile · Investigation findings · RL intelligence log · Take action · Verdict history
Training overview & model comparison. Explore baseline scores, reward design, and the GRPO training pipeline below. Use Run Demo Match to see a live animated simulation. Model comparison curves will be updated as new models are trained and evaluated.
GRPO Training Curves — Qwen3-0.6B
GRPO training — Loss, Mean Reward, KL Divergence
GRPO loss converges, mean reward trends upward, and KL divergence stays controlled, indicating stable policy improvement over 25 training steps.
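For reference, a minimal sketch of the group-relative advantage at the core of GRPO, assuming G scored completions per prompt; this illustrates the idea and is not the app's training code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_prompts, G) grader scores for G completions per prompt.
    GRPO normalizes each reward against its own group's mean and std,
    so no learned value baseline is needed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-4)

rewards = torch.tensor([[0.1, 0.6, 0.3, 0.9]])   # one prompt, G = 4 rollouts
print(group_relative_advantages(rewards))
```

Because advantages are normalized within each sampled group, GRPO needs no learned critic, which keeps training a 0.6B policy lightweight.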
Reward Design
Action            | Reward         | Rationale
Investigation     | -0.02          | Time/latency cost
Correct rejection | +0.30 to +0.40 | Scaled by severity
Correct approval  | +0.10          | Revenue preserved
False positive    | -0.35          | Lost advertiser revenue
False negative    | -0.50          | Fraud goes live
Correct link      | +0.40          | Ring detection
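A hedged sketch of this table as a reward function; the function name, argument names, and the linear severity scaling are assumptions, while the numeric values come from the table above.

```python
def step_reward(action: str, *, is_fraud: bool = False,
                severity: float = 0.0, linked_ring: bool = False) -> float:
    if action == "investigate":
        return -0.02                       # time/latency cost
    if action == "reject":
        if is_fraud:
            return 0.30 + 0.10 * severity  # +0.30..+0.40 for severity in [0, 1]
        return -0.35                       # false positive: lost advertiser revenue
    if action == "approve":
        return -0.50 if is_fraud else 0.10 # false negative vs. revenue preserved
    if action == "link" and linked_ring:
        return 0.40                        # correct ring detection
    return 0.0

print(step_reward("reject", is_fraud=True, severity=1.0))  # 0.40
```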
Multi-Agent Reward Functions
Fraudster Reward
∑ severity × plausibility for fraud ads not rejected, minus penalty per rejected ad. Higher plausibility = more reward for evasion.
Investigator Reward
Base grader score + plausibility-weighted clean rationale bonus − capped inconsistency penalty. Track A flags strip the bonus.
Auditor Reward
Reward for true-positive flags vs ground truth, minus false-positive penalty. Deterministic rule-based scorecards.
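A minimal sketch of the Fraudster reward under these definitions, assuming each ad record carries severity, plausibility, and the Investigator's verdict; the per-rejection penalty value is an illustrative assumption.

```python
def fraudster_reward(ads, rejection_penalty=0.25):
    reward = 0.0
    for ad in ads:
        if ad["is_fraud"] and ad["verdict"] != "reject":
            reward += ad["severity"] * ad["plausibility"]  # evaded detection
        if ad["verdict"] == "reject":
            reward -= rejection_penalty                    # caught by investigator
    return reward

ads = [
    {"is_fraud": True, "severity": 0.8, "plausibility": 0.9, "verdict": "approve"},
    {"is_fraud": True, "severity": 0.5, "plausibility": 0.4, "verdict": "reject"},
]
print(f"fraudster reward: {fraudster_reward(ads):+.2f}")  # 0.72 - 0.25 = +0.47
```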
Training Pipeline — GRPO Self-Play
🤖 Frozen Fraudster: llama3.1:8b via Ollama (8B params, frozen)
🤖 Trainable Investigator: Qwen3-0.6B + QLoRA (GRPO training)
📋 Deterministic Auditor: rule-based scorecards (reward source)
Sequential self-play: train one agent at a time against frozen opponents (AlphaGo paradigm)
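A skeleton of this setup, assuming the ollama Python client for the frozen Fraudster and TRL's GRPOTrainer for the trainable Investigator; the prompt, scorecard, and QLoRA wiring are simplified stand-ins rather than the project's actual training script.

```python
import ollama
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def frozen_fraudster(brief: str) -> str:
    # llama3.1:8b stays frozen; it only generates candidate deceptive ads.
    resp = ollama.chat(model="llama3.1:8b",
                       messages=[{"role": "user", "content": brief}])
    return resp["message"]["content"]

def auditor_reward(completions, **kwargs):
    # Deterministic rule-based stand-in: the auditor scores each verdict.
    return [1.0 if ("reject" in c.lower() or "approve" in c.lower()) else 0.0
            for c in completions]

ad = frozen_fraudster("Write a plausible but deceptive product ad.")
match_prompts = Dataset.from_dict(
    {"prompt": [f"Investigate this ad and give a verdict:\n{ad}"]})

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",          # trainable investigator (QLoRA config omitted)
    reward_funcs=auditor_reward,      # auditor acts as the reward source
    args=GRPOConfig(output_dir="grpo-investigator", max_steps=25),
    train_dataset=match_prompts,
)
trainer.train()                       # one sequential pass: one agent trains at a time
```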
📈 Live Match Reward Curves
Click “Run Demo Match” to generate animated reward curves from a live simulation.
🧠 Model Comparison — Investigator Reward Curves
Model comparison curves will appear here as training progresses.
Planned models: untrained Qwen3-0.6B, fine-tuned Qwen3-0.6B, and more.