Run Evaluations in CI/CD

Block deploys when agent quality regresses. Run your dataset evaluations in CI and fail the pipeline if scores drop below thresholds. This recipe walks through setting up a GitHub Actions workflow that runs your evaluators against a regression test dataset on every pull request.

Step 1: Create a Regression Test Dataset

Build a dataset in the dashboard with 20–50 representative inputs that cover happy paths, edge cases, and known failure modes. Each item should include an input (the prompt or user message your agent receives) and an expected_output (the response you consider correct). Keep the dataset small enough to run in under five minutes but broad enough to catch meaningful regressions.
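The exact import format depends on your dashboard, but conceptually each item pairs an input with an expected_output. A hypothetical two-item sketch in YAML (field values are illustrative):

```yaml
# Two hypothetical regression-test items. Each pairs the message the agent
# receives with the response you consider correct.
- input: "How do I reset my password?"
  expected_output: "Go to Settings > Account and choose Reset Password; a link is emailed to you."
- input: "Can you refund a charge from 2019?"
  expected_output: "Refunds are only available for charges made within the last 90 days."
```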

Step 2: GitHub Actions Workflow

Add the following workflow to .github/workflows/agent-eval.yml in your repository. It installs the 2Signal CLI, runs your evaluators against the dataset, and fails the job if any evaluator drops below the configured threshold.

name: Agent Evaluation
on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install 2Signal CLI
        run: pip install twosignal

      - name: Run regression evaluations
        env:
          TWOSIGNAL_API_KEY: ${{ secrets.TWOSIGNAL_API_KEY }}
        run: |
          twosignal eval run \
            --dataset regression-tests \
            --evaluators helpfulness,format-check,latency-sla \
            --fail-below 0.85 \
            --timeout 300

      - name: Run cost check
        env:
          TWOSIGNAL_API_KEY: ${{ secrets.TWOSIGNAL_API_KEY }}
        run: |
          twosignal eval run \
            --dataset regression-tests \
            --evaluators cost-budget \
            --fail-below 0.90 \
            --timeout 300

Store your TWOSIGNAL_API_KEY as a repository secret in Settings → Secrets and variables → Actions. The key should be scoped to the project that owns the dataset and evaluators.

Step 3: Interpret Results

The CLI prints a summary table after each run. For every evaluator you will see:

  • Evaluator name — the name you gave the evaluator when you created it.
  • Pass rate — the percentage of dataset items that scored above the pass threshold (typically 0.5).
  • Average score — the mean score across all dataset items, between 0.0 and 1.0.
  • Threshold met — whether the pass rate met or exceeded the --fail-below value you configured.

If any evaluator falls below the --fail-below threshold, the step exits with code 1 and the GitHub Actions job fails, blocking the pull request from merging.
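The arithmetic behind those columns is simple. As a minimal Python sketch (a hypothetical helper, not the CLI's actual implementation) of how pass rate, average score, and the threshold check relate:

```python
def summarize(scores, fail_below, pass_threshold=0.5):
    """Return (pass_rate, average_score, threshold_met) for one evaluator's
    per-item scores, mirroring the summary-table columns described above."""
    pass_rate = sum(s > pass_threshold for s in scores) / len(scores)
    average = sum(scores) / len(scores)
    return pass_rate, average, pass_rate >= fail_below

# One item below the per-item cutoff (0.4) drags the pass rate to 0.8,
# which misses a --fail-below of 0.85, so the step would exit 1.
rate, avg, met = summarize([0.9, 0.8, 0.4, 1.0, 0.7], fail_below=0.85)
print(rate, round(avg, 2), met)  # → 0.8 0.76 False
```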

Step 4: Set Thresholds Wisely

Not all evaluators deserve the same threshold. Structural checks should be strict, while subjective LLM-based evaluations need more room for variance.

Evaluator Type                              Recommended Threshold   Rationale
Structural (Contains, Regex, JSON Schema)   0.95–1.0                Format violations are bugs
Latency                                     0.85–0.95               Allow some variance
Cost                                        0.85–0.95               Allow some variance
LLM Judge                                   0.75–0.85               Subjective, expect variance
Similarity                                  0.80–0.90               Output won't be identical
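Because --fail-below applies a single value per invocation, differentiated thresholds usually mean separate CLI invocations (as the cost step above already does) or a post-processing check. A minimal Python sketch of the latter, with hypothetical evaluator names and thresholds drawn from the table above:

```python
# Hypothetical per-evaluator thresholds mirroring the table above:
# strict for structural checks, looser for subjective ones.
THRESHOLDS = {
    "format-check": 0.95,  # structural: format violations are bugs
    "latency-sla": 0.90,   # latency: allow some variance
    "cost-budget": 0.90,   # cost: allow some variance
    "helpfulness": 0.80,   # LLM judge: subjective, expect variance
}

def failing(pass_rates):
    """Return the evaluators whose pass rate misses their threshold."""
    return [name for name, rate in pass_rates.items()
            if rate < THRESHOLDS.get(name, 0.85)]
```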

Tips

  • Run structural evaluators first. They are fast, free, and catch obvious regressions before you spend time and tokens on LLM Judge evaluations.
  • Separate cost checks into their own step so that format failures don't mask cost issues. If the regression step fails, the cost step still runs and you get visibility into both dimensions.
  • Version your config with your code. Store your dataset ID and evaluator names in a config file in your repository so they stay in sync with the code they test.
  • Use --timeout to prevent CI from hanging. LLM Judge evaluations call an external model and can be slow. A 300-second timeout is a reasonable default.
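As a sketch of the "version your config with your code" tip, a hypothetical eval-config.yml (the filename and keys are illustrative; the CLI itself takes flags, so a small CI script would read these values and pass them on the command line):

```yaml
# eval-config.yml — hypothetical layout, committed alongside the agent code
dataset: regression-tests
timeout: 300
evaluations:
  regression:
    evaluators: [helpfulness, format-check, latency-sla]
    fail_below: 0.85
  cost:
    evaluators: [cost-budget]
    fail_below: 0.90
```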

What's Next

  • Datasets — Learn how to create and manage datasets for evaluation.
  • Evaluators — Explore all available evaluator types and their configuration options.
