LLM-as-a-judge scorers use a language model to evaluate outputs based on natural language criteria. They are best for subjective judgments like tone, helpfulness, or creativity that are difficult to encode in deterministic code. You can define LLM-as-a-judge scorers in three places:
- Inline in SDK code: Define scorers directly in your evaluation scripts for local development or application-specific logic.
- Pushed via CLI: Define scorers in TypeScript or Python files and push them to Braintrust for team-wide sharing and automatic evaluation of production logs.
- Created in UI: Build scorers in the Braintrust web interface for rapid prototyping and simple configurations.
Score spans
Span-level scorers evaluate individual operations or outputs. Use them for measuring single LLM responses, checking specific tool calls, or validating individual outputs. Each matching span receives an independent score. Your prompt template can reference these variables:
- {{input}}: The input to your task
- {{output}}: The output from your task
- {{expected}}: The expected output (optional)
- {{metadata}}: Custom metadata from the test case
Use scorers inline in your evaluation code:
llm_scorer.eval.ts
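The exact contents depend on your task; a minimal sketch using autoevals' LLMClassifierFromTemplate is shown below. The project name, data, and prompt wording are illustrative.

```ts
import { Eval } from "braintrust";
import { LLMClassifierFromTemplate } from "autoevals";

// LLM-as-a-judge scorer: grades each output against the natural-language
// criteria in the prompt template. {{input}} and {{output}} are filled in
// from each evaluation case.
const helpfulness = LLMClassifierFromTemplate({
  name: "Helpfulness",
  promptTemplate: `You are judging an assistant's reply.
Question: {{input}}
Reply: {{output}}
Is the reply helpful and on-topic? Answer Y or N.`,
  choiceScores: { Y: 1, N: 0 },
  useCoT: true,
});

Eval("my-project", {
  data: () => [
    {
      input: "How do I reset my password?",
      expected: "Use the reset link on the login page.",
    },
  ],
  task: async (input) => {
    // Replace with a call to your application or model.
    return "To reset your password, use the reset link on the login page.";
  },
  scores: [helpfulness],
});
```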
Score traces
Trace-level scorers evaluate entire execution traces, including all spans and conversation history. Use these for assessing multi-turn conversation quality, overall workflow completion, or when your scorer needs access to the full execution context. The scorer runs once per trace. Prompt templates for trace-level scorers support the following reserved variables:

| Variable | Type | Description |
|---|---|---|
| {{input}} | any | Input from the root span |
| {{output}} | any | Output from the root span |
| {{expected}} | any | Expected output from the root span (optional) |
| {{metadata}} | object | Metadata from the root span |
| {{thread}} | text | Full conversation rendered as human-readable text |
| {{thread_count}} | number | Total number of messages in the thread |
| {{first_message}} | object | First message in the thread |
| {{last_message}} | object | Last message in the thread |
| {{user_messages}} | array | All user/human messages only |
| {{assistant_messages}} | array | All assistant messages only |
| {{human_ai_pairs}} | array | Turn pairs; each item has {human, assistant} |
Use {{thread}} to pass the full conversation to a judge model as formatted text. {{input}}, {{output}}, {{expected}}, and {{metadata}} come from the root span of the trace.
Trace-level scoring requires TypeScript SDK v2.2.1+, Python SDK v0.5.6+, or Ruby SDK v0.2.1+.
Use scorers inline in your evaluation code:
trace_llm_scorer.eval.ts
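As a sketch of how the reserved variables fit together, a trace-level judge prompt can be kept as a template string like the one below. Only the template text is illustrated here; how it is attached to a scorer follows the pattern in trace_llm_scorer.eval.ts, and the wording is illustrative.

```ts
// Judge prompt for a trace-level scorer, using the reserved variables above.
// {{thread}} renders the whole conversation; {{input}}/{{output}} come from
// the root span of the trace.
const conversationQualityPrompt = `You are reviewing a full conversation between a user and an assistant.

Conversation ({{thread_count}} messages):
{{thread}}

Original request: {{input}}
Final outcome: {{output}}

Did the assistant resolve the user's request by the end of the conversation?
Answer Y or N.`;

export default conversationQualityPrompt;
```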
Set pass thresholds
Define minimum acceptable scores to automatically mark results as passing or failing. When configured, scores that meet or exceed the threshold are marked as passing (green highlighting with a checkmark), while scores below it are marked as failing (red highlighting).
Add __pass_threshold to the scorer's metadata (a value between 0 and 1):
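For example, in a scorer file that you push with the Braintrust CLI, the threshold travels in the scorer's metadata. The project name, scorer fields, and the assumption that the push-style scorers.create call accepts a metadata object are illustrative here; treat this as a sketch and check it against your SDK version.

```ts
import * as braintrust from "braintrust";

// Illustrative project and scorer; pushed with `npx braintrust push`.
const project = braintrust.projects.create({ name: "my-project" });

project.scorers.create({
  name: "Helpfulness",
  slug: "helpfulness",
  messages: [
    {
      role: "user",
      content:
        "Question: {{input}}\nReply: {{output}}\nIs the reply helpful? Answer Y or N.",
    },
  ],
  model: "gpt-4o",
  useCot: true,
  choiceScores: { Y: 1, N: 0 },
  // Results scoring 0.7 or higher are marked as passing.
  metadata: { __pass_threshold: 0.7 },
});
```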
Next steps
- Autoevals for pre-built scorers you can drop in without writing a prompt
- Custom code for deterministic logic or when you need full control
- Run evaluations using your scorers
- Score production logs with online scoring rules