Human Review & Labeling
This recipe walks through setting up a review queue, labeling production traces, and exporting the results to a dataset for regression testing or fine-tuning.
Prerequisites
- A 2Signal project with traces already flowing in
- MEMBER or higher role on the project
Step 1: Create a Review Queue
Go to Dashboard → Project → Review and click Create Queue.
Name: "Weekly QA — March 2026"
Description: "Sample of customer support traces for quality review"
Step 2: Add Traces
Navigate to Traces, filter for the traces you want to review (e.g., traces from the last 7 days with error status or low eval scores), select them, and click Add to Review Queue.
You can add traces individually or in bulk. Each trace becomes a pending review item.
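The selection logic above can be sketched in plain Python. This is a minimal, hypothetical model of the filter, not the 2Signal SDK: the `Trace` fields (`created_at`, `status`, `eval_score`), the `select_for_review` helper, and the `PENDING` item status are illustrative assumptions based on the filters named in this step.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical trace record; field names mirror the UI filters
# described above and are not the real 2Signal schema.
@dataclass
class Trace:
    id: str
    created_at: datetime
    status: str        # e.g. "ok" or "error"
    eval_score: float  # 0.0 to 1.0

def select_for_review(traces, days=7, score_threshold=0.5):
    """Pick traces from the last `days` days that errored or scored low."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    return [
        t for t in traces
        if t.created_at >= cutoff
        and (t.status == "error" or t.eval_score < score_threshold)
    ]

def add_to_queue(queue, traces):
    """Bulk add: each selected trace becomes a pending review item."""
    queue.extend({"trace_id": t.id, "status": "PENDING"} for t in traces)
```

Whether you add items one at a time or in bulk, the result is the same: one pending review item per trace.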
Step 3: Review in Sequential Mode
Open the queue and click Start Reviewing to enter sequential mode. The screen splits into two panes:
- Left — Full trace detail (spans, timeline, scores)
- Right — Review panel (sentiment, label, notes)
Use keyboard shortcuts for speed:
1 → POSITIVE sentiment
2 → NEUTRAL sentiment
3 → NEGATIVE sentiment
Enter → Submit and advance to next trace
Add an optional label (e.g., "hallucination", "good response", "off-topic") and notes for context. Each submission also creates a TraceAnnotation on the trace.
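A submission can be modeled as a small function: the keymap comes straight from the shortcuts above, while the dict shape of the resulting annotation is an illustrative assumption, not the actual TraceAnnotation schema.

```python
# Keymap from the sequential-mode shortcuts described above.
SENTIMENT_KEYS = {"1": "POSITIVE", "2": "NEUTRAL", "3": "NEGATIVE"}

def submit_review(trace_id, key, label=None, notes=""):
    """Build the annotation a single submission creates.

    In the UI, pressing Enter submits this and advances to the
    next trace. The dict layout here is hypothetical.
    """
    return {
        "trace_id": trace_id,
        "sentiment": SENTIMENT_KEYS[key],
        "label": label,   # e.g. "hallucination", "good response"
        "notes": notes,
    }
```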
Step 4: Export to Dataset
Once you've reviewed all items (or a large enough sample), click Export to Dataset. Only APPROVED items are included. Choose an existing dataset or create a new one.
The exported dataset items use the trace input/output as the dataset item input/expectedOutput, giving you a curated set of golden examples from real production traffic.
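The export boils down to a filter plus a field mapping. A minimal sketch, assuming review items carry a `status` field and a nested `trace` dict; these names are illustrative, not the 2Signal export format:

```python
def export_to_dataset(review_items):
    """Only APPROVED items are exported; the trace's input/output
    become the dataset item's input/expectedOutput."""
    return [
        {
            "input": item["trace"]["input"],
            "expectedOutput": item["trace"]["output"],
        }
        for item in review_items
        if item["status"] == "APPROVED"
    ]
```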
Step 5: Run Batch Eval
With your new dataset, go to Datasets, click Run Batch Eval, and select evaluators. This creates an eval run that scores every item in the dataset, giving you a baseline for regression testing.
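Conceptually, a batch eval run scores every dataset item with every selected evaluator. This sketch assumes a `model_fn` that produces an output for each input and evaluators that compare it to `expectedOutput`; both are hypothetical stand-ins for what the platform runs internally:

```python
def run_batch_eval(dataset, model_fn, evaluators):
    """Score every dataset item with every evaluator.

    Returns one row per (item, evaluator) pair: the baseline
    you compare against after model or prompt changes.
    """
    results = []
    for item in dataset:
        output = model_fn(item["input"])
        for name, evaluate in evaluators.items():
            results.append({
                "input": item["input"],
                "evaluator": name,
                "score": evaluate(output, item["expectedOutput"]),
            })
    return results
```

Re-running the same function after a model change and diffing the scores is exactly the regression signal this step is after.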
Tips
- Review 50–100 traces per session to avoid fatigue. Sequential mode with keyboard shortcuts lets you review ~30 traces in 10 minutes.
- Use consistent labels across reviewers. Agree on a label taxonomy before starting (e.g., "correct", "partially correct", "hallucination", "refusal").
- Export regularly and re-run batch evals after model or prompt changes to catch regressions.