
Run evaluation files against Braintrust. Supports JavaScript and Python.
bt eval is currently macOS and Linux only.

File selection

  • bt eval — discover and run all eval files in the current directory (recursive)
  • bt eval tests/ — discover eval files under a specific directory
  • bt eval "tests/**/*.eval.ts" — glob pattern
  • bt eval a.eval.ts b.eval.ts — one or more explicit files
Files inside node_modules, .venv, venv, site-packages, dist-packages, and __pycache__ are excluded from automatic discovery. Explicit paths and globs bypass these exclusions.

Runtime configuration

Requires Node.js 18.19.0+ or 20.6.0+. Bun 1.0+ and Deno with Node compatibility mode are also supported.

By default, bt eval auto-detects a runner from your project (tsx, vite-node, ts-node, then ts-node-esm). Set one explicitly with --runner / BT_EVAL_RUNNER:
bt eval --runner vite-node tutorial.eval.ts
bt eval --runner tsx tutorial.eval.ts
bt eval automatically resolves locally installed binaries from node_modules/.bin, so you can write, for example, --runner tsx instead of --runner ./node_modules/.bin/tsx. If you see ESM or top-level await errors, try --runner vite-node.

Sampling modes

Run a subset of your evaluation data as a non-final smoke run to catch obvious regressions before committing to the full dataset.
bt eval --first 20 qa.eval.ts          # First 20 examples, non-final
bt eval --sample 20 qa.eval.ts         # Random 20 examples, non-final
bt eval --sample 20 --sample-seed 7 qa.eval.ts  # Reproducible random sample
bt eval qa.eval.ts                     # Full dataset, final
When --first or --sample is used, the experiment summary is labeled as non-final in Braintrust. Omitting both flags runs the full dataset and marks the summary as final.

Flags

| Flag | Env var | Description |
| --- | --- | --- |
| `--runner <RUNNER>` | `BT_EVAL_RUNNER` | Runner binary (`tsx`, `bun`, `ts-node`, `python`, etc.) |
| `--language <LANG>` | `BT_EVAL_LANGUAGE` | Force language: `js` or `py` |
| `--filter <PATTERN>` | `BT_EVAL_FILTER` | Run only evaluators matching the pattern |
| `--first <N>` | `BT_EVAL_FIRST` | Run only the first N examples (non-final smoke run) |
| `--sample <N>` | `BT_EVAL_SAMPLE` | Run a deterministic random sample of N examples (non-final smoke run) |
| `--sample-seed <S>` | `BT_EVAL_SAMPLE_SEED` | Integer seed for `--sample` (default: 0) |
| `--param <KEY=VALUE>` | `BT_EVAL_PARAMS_JSON` | Pass a named parameter into evaluators that declare a parameters schema (repeatable; also accepts a JSON object string) |
| `--matrix-param <KEY=V1,V2,...>` | | Run one experiment per Cartesian-product combination of parameter values (repeatable). Requires exactly one evaluator (use `--filter` to select it). Incompatible with `--watch`, `--dev`, and `--list` |
| `--watch` / `-w` | `BT_EVAL_WATCH` | Re-run when input files change |
| `--no-send-logs` | `BT_EVAL_LOCAL` | Run without sending results to Braintrust |
| `--num-workers <N>` | | Worker threads for Python execution |
| `--verbose` | | Show full errors and stderr from eval files |
| `--list` | | List evaluators without running them |
| `--jsonl` | | Output one JSON summary per evaluator (for scripts). See also the global `--json` flag (overview), which formats all CLI output as JSON rather than per-evaluator summaries. |
| `--terminate-on-failure` | | Stop after the first failing evaluator |
| `--dev` | | Start a local web server for browser-based eval development (default port: 8300) |

Summary output

When using --jsonl or reading SSE output, each evaluator summary object includes these fields:
| Field | Type | Description |
| --- | --- | --- |
| `runMode` | `"full" \| "first" \| "sample"` | How the eval was run |
| `isFinal` | `boolean` | Whether this is a final (full-dataset) run |
| `runLabel` | `string` | Human-readable description of the run mode |
| `sampleCount` | `number` | Number of examples sampled (only present when `--first` or `--sample` is used) |
| `sampleSeed` | `number` | Seed used for random sampling (only present when `--sample` is used) |
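For example, a `--sample` run might emit a summary object shaped like this (field values are illustrative; experiment metadata and scores are omitted here):

```json
{
  "runMode": "sample",
  "isFinal": false,
  "runLabel": "sampled 20 examples (seed 7)",
  "sampleCount": 20,
  "sampleSeed": 7
}
```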

Parameters

--param overrides values for evaluators that declare a parameters schema via loadParameters() (TypeScript) or load_parameters() (Python). This is the same parameters system used by remote evals, where parameters are version-tracked in Braintrust and appear as UI controls in the playground. See Create evaluation parameters for how to define and load parameters.

Each evaluator only receives the keys it declares. Extra keys are silently filtered, so a single command can target multiple evaluators with different schemas without errors.
bt eval --param model=gpt-4o --param count=5 my.eval.ts
bt eval --param '{"model":"gpt-4o","count":5}' my.eval.ts
Parameters are validated against the evaluator’s declared schema before execution. Evaluators without a parameters schema are unaffected.
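As a sketch of the evaluator side, a TypeScript eval file might declare and consume parameters along these lines. The exact loadParameters() signature and schema format are defined in Create evaluation parameters; the shape below is an illustrative assumption, not the authoritative API:

```typescript
import { Eval, loadParameters } from "braintrust";

// Assumed shape (illustrative): loadParameters() returns the declared
// parameters with defaults applied and any --param overrides merged in.
const params = await loadParameters({
  model: { type: "string", default: "gpt-4o" },
  count: { type: "number", default: 5 },
});

Eval("my-project", {
  data: () => [{ input: "hello", expected: "hello" }],
  task: async (input: string) => {
    // Values passed via `bt eval --param model=... --param count=...`
    // land here; undeclared keys are filtered out before this runs.
    return `${params.model} x${params.count}: ${input}`;
  },
  scores: [],
});
```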

Parameter matrix

--matrix-param works with the same parameters system as --param. Specify multiple values for one or more parameters and bt eval runs one experiment per combination, naming each <experiment-name> [key=value, ...].
# Sweep a single parameter across three values
bt eval --matrix-param model=gpt-4o,gpt-4o-mini,o1-mini my.eval.ts

# Sweep two parameters — runs 2 × 3 = 6 experiments
bt eval --matrix-param model=gpt-4o,gpt-4o-mini --matrix-param temperature=0.0,0.5,1.0 my.eval.ts
For values that contain commas, use a JSON array:
bt eval --matrix-param model='["gpt-4o","claude-3-5-haiku-20241022"]' my.eval.ts
--matrix-param requires exactly one evaluator to be selected. If your file exports multiple evaluators, use --filter to narrow down to one. It is not supported for eval files that export btEvalMain.
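With an experiment named qa, for example, the two-parameter sweep above produces six experiments named qa [model=gpt-4o, temperature=0.0], qa [model=gpt-4o, temperature=0.5], and so on, following the naming template above.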

Passing arguments to the eval file

Use -- to forward extra arguments to the eval file via process.argv:
bt eval foo.eval.ts -- --description "Prod" --shard 1/4
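Inside the eval file, the forwarded arguments appear after the script path in process.argv and can be parsed however you like; for example, with Node's built-in parseArgs (a minimal sketch using the flags from the command above):

```typescript
import { parseArgs } from "node:util";

// Everything after `--` on the bt eval command line is forwarded
// to the eval file and shows up in process.argv.
const { values } = parseArgs({
  options: {
    description: { type: "string" },
    shard: { type: "string" },
  },
  strict: false, // ignore any extra flags rather than throwing
});

console.log(values.description); // "Prod"
console.log(values.shard); // "1/4"
```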

Running in CI

Set BRAINTRUST_API_KEY instead of using OAuth login:
# GitHub Actions example
- name: Run evals
  env:
    BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
  run: bt eval tests/
Use --no-input and --json for non-interactive output:
BRAINTRUST_API_KEY=... bt eval tests/ --no-input --json
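Putting it together, a complete workflow might look like this sketch (checkout and setup-node are standard GitHub Actions; swap the install step for however your project installs its dependencies and the bt CLI):

```yaml
name: Evals
on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Install project dependencies (and the bt CLI, per the install docs).
      - run: npm ci
      - name: Run evals
        env:
          BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
        run: bt eval tests/ --no-input --json
```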