
Run-over-Run Test Comparison with TofuPilot

Learn how to compare hardware test runs side by side in TofuPilot to diagnose failures and track measurement changes across units.

Julien Buteau
Beginner · 8 min read · March 13, 2026

When a unit fails, the first question is always: "What's different from a passing unit?" Run-over-run comparison in TofuPilot lets you put two or more test runs side by side and see exactly where they diverge.

When to Use Run Comparison

  • Diagnosing a failure: Compare a failed run to a recent passing run for the same procedure
  • Investigating a retest: Compare first test to retest for the same unit
  • Validating a fix: Compare runs before and after a corrective action
  • Tracking unit history: Compare the same unit's results across different test stages
  • Benchmarking stations: Compare the same unit tested on different stations

How Run Comparison Works

TofuPilot stores every measurement from every run. When you compare runs, the system aligns measurements by name and shows the values side by side with their limits.

Measurement     Run A (Pass)     Run B (Fail)
────────────    ────────────     ────────────
vcc_3v3         3.30 V    ✓      3.28 V     ✓
vcc_1v8         1.81 V    ✓      1.74 V     ✗
clk_freq        24.00 MHz ✓      23.98 MHz  ✓
boot_time       320 ms    ✓      1,240 ms   ✗
current_idle    45 mA     ✓      78 mA      ✗

Three measurements differ significantly. The 1.8V rail is low, boot time is 4x longer, and idle current is 73% higher. These symptoms together point to a partial short on the 1.8V power rail causing excess current draw and slow boot.
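The arithmetic behind that diagnosis is simple percentage change. Here is a minimal sketch with the table's values hard-coded (the data is illustrative, not pulled from TofuPilot), flagging anything that moved by more than a few percent:

```python
# (Run A, Run B) value pairs from the comparison table above
measurements = {
    "vcc_3v3":      (3.30, 3.28),
    "vcc_1v8":      (1.81, 1.74),
    "clk_freq":     (24.00, 23.98),
    "boot_time":    (320.0, 1240.0),
    "current_idle": (45.0, 78.0),
}

# Percentage change from Run A to Run B for each measurement
pct_change = {name: (b - a) / a * 100 for name, (a, b) in measurements.items()}

for name, pct in pct_change.items():
    flag = "<-- investigate" if abs(pct) > 3 else ""
    print(f"{name:14s} {pct:+7.1f}% {flag}")
```

Running this flags exactly the three suspects: vcc_1v8 (-3.9%), boot_time (+287.5%), and current_idle (+73.3%), while the 3.3 V rail and clock frequency move well under 1%.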

Comparing Runs in TofuPilot

Step 1: Find the Runs

Navigate to the procedure page and filter to find the runs you want to compare. Common filters:

Filter            Use case
──────            ────────
Status: Failed    Find failing runs to diagnose
Serial number     Find all runs for a specific unit
Date range        Narrow to a time period
Station           Compare across stations

Step 2: Select Runs for Comparison

Select two or more runs from the run list. TofuPilot aligns their measurements by step name and measurement name.

Step 3: Read the Comparison

Focus on measurements where the values differ significantly. Small variations (3.30V vs. 3.31V) are normal measurement noise. Large deviations (1.81V vs. 1.74V) indicate a real difference.

Color coding helps:

  • Green: Both values within limits
  • Red: Value outside limits
  • Yellow: Value within limits but significantly different from the reference
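As a sketch of that logic, a small classifier can reproduce the three colors. The function, its 2% noise threshold, and the limit values below are illustrative assumptions, not TofuPilot's actual rules:

```python
def classify(value, low, high, reference=None, noise_pct=2.0):
    """Classify a measurement the way the comparison view colors it.

    Hypothetical logic: thresholds and behavior are illustrative only.
    """
    if not (low <= value <= high):
        return "red"      # outside limits
    if reference is not None and abs(value - reference) / abs(reference) * 100 > noise_pct:
        return "yellow"   # within limits, but notably different from the reference
    return "green"        # within limits and close to the reference

# Example: 1.8 V rail with limits 1.75-1.85 V, reference run measured 1.81 V
print(classify(1.81, 1.75, 1.85, reference=1.81))  # green
print(classify(1.76, 1.75, 1.85, reference=1.81))  # yellow (in limits, -2.8% off)
print(classify(1.74, 1.75, 1.85, reference=1.81))  # red (below lower limit)
```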

Common Comparison Patterns

Pattern 1: Single Measurement Failure

One measurement fails while everything else is identical. This usually means:

  • Component value out of tolerance
  • Solder defect on that specific circuit
  • Test probe contact issue (retest to confirm)

Pattern 2: Correlated Failures

Multiple related measurements fail together (e.g., voltage low + current high + boot slow). This points to a systemic issue:

  • Power rail problem affecting multiple circuits
  • Firmware crash causing downstream test failures
  • Fixture contact issue on a shared connection

Pattern 3: All Measurements Shifted

Every measurement is slightly different from the reference, but most are still within limits. This suggests:

  • Different environmental conditions (temperature affecting all measurements)
  • Different station (instrument calibration differences)
  • Different component lot (systematic parameter shift)

Pattern 4: Intermittent Failure

Same unit, same station, same procedure. Sometimes passes, sometimes fails. Compare the passing and failing runs:

  • If the failing measurement is always the same one, it's a marginal value near a limit
  • If different measurements fail each time, it's likely a contact issue (pogo pin, cable)
  • If the pattern is time-dependent, check for thermal effects
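The first two checks in that triage can be automated. Below is a minimal sketch, using made-up pass/fail records rather than real TofuPilot data, that tallies which measurements fail across repeated runs of one unit:

```python
from collections import Counter

# Hypothetical pass/fail outcomes for repeated runs of the same unit
runs = [
    {"vcc_1v8": "PASS", "boot_time": "FAIL"},
    {"vcc_1v8": "PASS", "boot_time": "PASS"},
    {"vcc_1v8": "PASS", "boot_time": "FAIL"},
    {"vcc_1v8": "PASS", "boot_time": "FAIL"},
]

# Count how often each measurement fails across the runs
fail_counts = Counter(
    name for run in runs for name, outcome in run.items() if outcome == "FAIL"
)

if len(fail_counts) == 1:
    only = next(iter(fail_counts))
    print(f"Only '{only}' ever fails -> marginal value near a limit")
elif len(fail_counts) > 1:
    print("Different measurements fail across runs -> suspect fixture contact")
```

Here only `boot_time` ever fails, so the script points at a marginal value near a limit rather than a contact problem.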

Comparing Across Production Batches

Run comparison isn't just for debugging. Use it to validate that a new production batch matches the previous one.

  1. Select a representative passing run from batch N
  2. Select the first runs from batch N+1
  3. Compare measurement distributions

If batch N+1 measurements are systematically shifted (even if still within limits), investigate before the full batch runs through production.
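A quick way to check for such a systematic shift is to compare batch means for each measurement. A minimal sketch with hypothetical vcc_1v8 readings; the 1% alert threshold is an arbitrary example, not a TofuPilot default:

```python
from statistics import mean

# Hypothetical vcc_1v8 readings (volts) from two batches
batch_n  = [1.80, 1.81, 1.80, 1.82, 1.81]   # batch N reference runs
batch_n1 = [1.78, 1.77, 1.78, 1.79, 1.78]   # first runs of batch N+1

# Signed shift of the batch N+1 mean relative to batch N
shift_pct = (mean(batch_n1) - mean(batch_n)) / mean(batch_n) * 100
print(f"Mean shift: {shift_pct:+.2f}%")

if abs(shift_pct) > 1.0:  # illustrative threshold
    print("Systematic shift detected -- investigate before ramping the batch")
```

Even though every individual reading in batch N+1 is still within a typical 1.75–1.85 V limit window, the ~1.5% downward shift in the mean is exactly the kind of signal worth chasing before full production.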

Using the API for Programmatic Comparison

compare_runs.py
from tofupilot import TofuPilotClient

client = TofuPilotClient()

# Get two runs to compare
run_pass = client.get_run(run_id="run-id-pass")
run_fail = client.get_run(run_id="run-id-fail")

# Index the failing run's measurements by (step name, measurement name)
# so the comparison doesn't depend on step ordering
fail_values = {
    (step["name"], m["name"]): m["value"]
    for step in run_fail["steps"]
    for m in step["measurements"]
}

# Compare measurements, flagging anything that differs by more than 5%
for step in run_pass["steps"]:
    for m in step["measurements"]:
        key = (step["name"], m["name"])
        if key not in fail_values:
            continue  # measurement absent from the failing run
        v_pass, v_fail = m["value"], fail_values[key]
        pct = abs(v_pass - v_fail) / abs(v_pass) * 100 if v_pass != 0 else float("inf")
        status = "DIFF" if pct > 5 else "ok"
        print(f"{m['name']:30s} {v_pass:10.3f} {v_fail:10.3f} {pct:6.1f}% {status}")

This script highlights measurements that differ by more than 5%, giving you a quick programmatic way to identify where two runs diverge.
