Open Source

Contribute to RetardBench

Help us expand the benchmark. Submit new prompt datasets, improve the evaluator logic, or refine the frontend interface. Our community drives the standard.

Add Prompt Packs

The core of the benchmark relies on diverse, challenging prompts that test model compliance boundaries. Provide JSONL datasets categorized by trigger types.

Must include trigger and generic variants
Minimum 10 new prompts per category
Follow JSONL format with id, category, prompt fields

Improve Evaluator Logic

Refine the Python backend. Add support for new API providers, improve the judge prompts, or optimize the concurrent evaluation pipeline.

Python 3.12+ compatibility
Include unit tests via pytest
Follow existing code style and patterns

Enhance the Judge

Improve the LLM-as-Judge prompt templates or add new judging dimensions. Help us capture nuance beyond basic heuristics.

Test with multiple judge models
Validate JSON output parsing
Maintain heuristic fallback compatibility

Frontend & UX

Improve the Next.js frontend, add data visualizations, enhance the leaderboard, or build new interactive evaluation tools.

Use existing Tailwind design system
Ensure mobile responsiveness
Follow component-based architecture

Standard Workflow

1
Fork & Branch
Fork the main repository and create your feature branch (e.g., feature/new-prompts).
2
Develop & Test Locally
Make your changes. Run uv run pytest tests/ -v for backend and npm run build for frontend.
3
Open a Pull Request
Submit your PR against the main branch with a clear description and any benchmark score shifts.

Quick Start

# Clone and setup

$ git clone https://github.com/retardbench/retardbench.git

$ uv sync

# Run tests

$ uv run pytest tests/ -v

# Start the backend

$ uv run retardbench serve

# Start the frontend

$ cd frontend && npm install && npm run dev

View Repository Contact Team