Open Source

Contribute to RetardBench

Help us expand the benchmark. Submit new prompt datasets, improve the evaluator logic, or refine the frontend interface. Our community drives the standard.

Add Prompt Packs

The core of the benchmark relies on diverse, challenging prompts that test model compliance boundaries. Provide JSONL datasets categorized by trigger types.

  • Must include trigger and generic variants
  • Minimum 10 new prompts per category
  • Follow JSONL format with id, category, prompt fields
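A minimal validator sketch for a prompt-pack file can make the expected shape concrete. The required fields (id, category, prompt) come from the guidelines above; the "variant" field name used to mark trigger vs. generic entries is an assumption for illustration, not the project's confirmed schema.

```python
import json

REQUIRED_FIELDS = {"id", "category", "prompt"}

def validate_prompt_pack(lines):
    """Parse JSONL lines and check each record has the required fields."""
    records = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {n}: missing fields {sorted(missing)}")
        records.append(record)
    return records

# Hypothetical records showing a trigger/generic pair in one category.
sample = [
    '{"id": "p-001", "category": "refusal", "prompt": "...", "variant": "trigger"}',
    '{"id": "p-002", "category": "refusal", "prompt": "...", "variant": "generic"}',
]
print(len(validate_prompt_pack(sample)))  # → 2
```

Running a check like this locally before opening a PR catches malformed lines early.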

Improve Evaluator Logic

Refine the Python backend. Add support for new API providers, improve the judge prompts, or optimize the concurrent evaluation pipeline.

  • Python 3.12+ compatibility
  • Include unit tests via pytest
  • Follow existing code style and patterns
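As a rough sketch of what "concurrent evaluation pipeline" means here, the pattern below fans prompts out to a provider with a bounded semaphore. The provider interface (an async complete method) is hypothetical, not the actual RetardBench API.

```python
import asyncio

async def evaluate_all(provider, prompts, max_concurrency=8):
    """Evaluate prompts concurrently, capped by a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with sem:
            return await provider.complete(prompt)

    return await asyncio.gather(*(one(p) for p in prompts))

class EchoProvider:
    """Stand-in provider used only to exercise the pipeline."""
    async def complete(self, prompt):
        await asyncio.sleep(0)  # yield control, as a real API call would
        return f"echo: {prompt}"

results = asyncio.run(evaluate_all(EchoProvider(), ["a", "b", "c"]))
print(results)  # → ['echo: a', 'echo: b', 'echo: c']
```

The semaphore keeps the number of in-flight API calls bounded, which is the usual way to avoid provider rate limits while still overlapping requests.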

Enhance the Judge

Improve the LLM-as-Judge prompt templates or add new judging dimensions. Help us capture nuance beyond basic heuristics.

  • Test with multiple judge models
  • Validate JSON output parsing
  • Maintain heuristic fallback compatibility
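One way to satisfy both "validate JSON output parsing" and "maintain heuristic fallback compatibility" is shown below: try strict JSON first, then fall back to extracting the first JSON-looking block from prose. The verdict/score shape is an assumption for illustration, not the project's actual judge schema.

```python
import json
import re

def parse_judge_output(text):
    """Return the judge's JSON verdict, falling back to regex extraction."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Heuristic fallback: pull the outermost {...} span out of surrounding
    # prose or a markdown code fence.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None  # caller decides how to score an unparseable verdict

clean = parse_judge_output('{"verdict": "comply", "score": 0.9}')
wrapped = parse_judge_output(
    'Here is my ruling:\n```json\n{"verdict": "refuse", "score": 0.1}\n```'
)
```

Keeping the fallback path alongside strict parsing means a new judge model that wraps its answer in commentary does not silently break scoring.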

Frontend & UX

Improve the Next.js frontend, add data visualizations, enhance the leaderboard, or build new interactive evaluation tools.

  • Use existing Tailwind design system
  • Ensure mobile responsiveness
  • Follow component-based architecture

Standard Workflow

  1. Fork & Branch

    Fork the main repository and create your feature branch (e.g., feature/new-prompts).

  2. Develop & Test Locally

    Make your changes, then run uv run pytest tests/ -v for the backend and npm run build for the frontend.

  3. Open a Pull Request

    Submit your PR against the main branch with a clear description and any benchmark score shifts.

Quick Start

# Clone and setup
$ git clone https://github.com/retardbench/retardbench.git
$ uv sync

# Run tests
$ uv run pytest tests/ -v

# Start the backend
$ uv run retardbench serve

# Start the frontend
$ cd frontend && npm install && npm run dev