Human Review & Labeling
This recipe walks through setting up a review queue, labeling production traces, and exporting the results to a dataset for regression testing or fine-tuning.
Prerequisites
- A 2Signal project with traces already flowing in
- MEMBER or higher role on the project
Step 1: Create a Review Queue
Go to Dashboard → Project → Review and click Create Queue.
Name: "Weekly QA — March 2026"
Description: "Sample of customer support traces for quality review"
Step 2: Add Traces
Navigate to Traces, filter for the traces you want to review (e.g., traces from the last 7 days with error status or low eval scores), select them, and click Add to Review Queue.
You can add traces individually or in bulk. Each trace becomes a pending review item.
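The selection logic above can be sketched in plain Python. This is a minimal, hypothetical model of the filter, not the 2Signal SDK: the `Trace` fields (`created_at`, `status`, `eval_score`), the `select_for_review` helper, and the `PENDING` item status are illustrative assumptions based on the filters named in this step.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical trace record; field names mirror the UI filters
# described above and are not the real 2Signal schema.
@dataclass
class Trace:
    id: str
    created_at: datetime
    status: str        # e.g. "ok" or "error"
    eval_score: float  # 0.0 to 1.0

def select_for_review(traces, days=7, score_threshold=0.5):
    """Pick traces from the last `days` days that errored or scored low."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    return [
        t for t in traces
        if t.created_at >= cutoff
        and (t.status == "error" or t.eval_score < score_threshold)
    ]

def add_to_queue(queue, traces):
    """Bulk add: each selected trace becomes a pending review item."""
    queue.extend({"trace_id": t.id, "status": "PENDING"} for t in traces)
```

Whether you add items one at a time or in bulk, the result is the same: one pending review item per trace.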
Step 3: Review in Sequential Mode
Open the queue and click Start Reviewing to enter sequential mode. The screen splits into two panes:
- Left — Full trace detail (spans, timeline, scores)
- Right — Review panel (sentiment, label, notes)
Use keyboard shortcuts for speed:
1 → POSITIVE sentiment
2 → NEUTRAL sentiment
3 → NEGATIVE sentiment
Enter → Submit and advance to next trace
Add an optional label (e.g., "hallucination", "good response", "off-topic") and notes for context. Each submission also creates a TraceAnnotation on the trace.
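A submission can be modeled as a small function: the keymap comes straight from the shortcuts above, while the dict shape of the resulting annotation is an illustrative assumption, not the actual TraceAnnotation schema.

```python
# Keymap from the sequential-mode shortcuts described above.
SENTIMENT_KEYS = {"1": "POSITIVE", "2": "NEUTRAL", "3": "NEGATIVE"}

def submit_review(trace_id, key, label=None, notes=""):
    """Build the annotation a single submission creates.

    In the UI, pressing Enter submits this and advances to the
    next trace. The dict layout here is hypothetical.
    """
    return {
        "trace_id": trace_id,
        "sentiment": SENTIMENT_KEYS[key],
        "label": label,   # e.g. "hallucination", "good response"
        "notes": notes,
    }
```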
Step 4: Export to Dataset
Once you've reviewed all items (or a large enough sample), click Export to Dataset. Only APPROVED items are included. Choose an existing dataset or create a new one.
The exported dataset items use the trace input/output as the dataset item input/expectedOutput, giving you a curated set of golden examples from real production traffic.
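The export boils down to a filter plus a field mapping. A minimal sketch, assuming review items carry a `status` field and a nested `trace` dict; these names are illustrative, not the 2Signal export format:

```python
def export_to_dataset(review_items):
    """Only APPROVED items are exported; the trace's input/output
    become the dataset item's input/expectedOutput."""
    return [
        {
            "input": item["trace"]["input"],
            "expectedOutput": item["trace"]["output"],
        }
        for item in review_items
        if item["status"] == "APPROVED"
    ]
```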
Step 5: Run Batch Eval
With your new dataset, go to Datasets, click Run Batch Eval, and select evaluators. This creates an eval run that scores every item in the dataset, giving you a baseline for regression testing.
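Conceptually, a batch eval run scores every dataset item with every selected evaluator. This sketch assumes a `model_fn` that produces an output for each input and evaluators that compare it to `expectedOutput`; both are hypothetical stand-ins for what the platform runs internally:

```python
def run_batch_eval(dataset, model_fn, evaluators):
    """Score every dataset item with every evaluator.

    Returns one row per (item, evaluator) pair: the baseline
    you compare against after model or prompt changes.
    """
    results = []
    for item in dataset:
        output = model_fn(item["input"])
        for name, evaluate in evaluators.items():
            results.append({
                "input": item["input"],
                "evaluator": name,
                "score": evaluate(output, item["expectedOutput"]),
            })
    return results
```

Re-running the same function after a model change and diffing the scores is exactly the regression signal this step is after.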
Tips
- Review 50–100 traces per session to avoid fatigue. Sequential mode with keyboard shortcuts lets you review ~30 traces in 10 minutes.
- Use consistent labels across reviewers. Agree on a label taxonomy before starting (e.g., "correct", "partially correct", "hallucination", "refusal").
- Export regularly and re-run batch evals after model or prompt changes to catch regressions.