Help us expand the benchmark. Submit new prompt datasets, improve the evaluator logic, or refine the frontend interface. Our community drives the standard.
The core of the benchmark relies on diverse, challenging prompts that test model compliance boundaries. Provide JSONL datasets categorized by trigger types.
Refine the Python backend. Add support for new API providers, improve the judge prompts, or optimize the concurrent evaluation pipeline.
Improve the LLM-as-Judge prompt templates or add new judging dimensions. Help us capture nuance beyond basic heuristics.
Improve the Next.js frontend, add data visualizations, enhance the leaderboard, or build new interactive evaluation tools.
Fork the main repository and create your feature branch (e.g., feature/new-prompts).
Make your changes. Run uv run pytest tests/ -v for backend and npm run build for frontend.
Submit your PR against the main branch with a clear description and any benchmark score shifts.
# Clone and setup
$ git clone https://github.com/retardbench/retardbench.git
$ uv sync
# Run tests
$ uv run pytest tests/ -v
# Start the backend
$ uv run retardbench serve
# Start the frontend
$ cd frontend && npm install && npm run dev