Root Cause Analysis for Hardware Test Failures

Learn how to use TofuPilot's test data to trace hardware failures back to their root cause using measurement trends and run comparisons.

Julien Buteau
Intermediate | 11 min read | March 14, 2026

A test fails. The question is never just "what failed" but "why did it fail, and will it fail again?" TofuPilot stores every measurement from every run, so you can trace failures back to their source instead of guessing.

The Root Cause Analysis Problem in Hardware

Software bugs leave stack traces. Hardware failures leave measurements. The difference: a stack trace points to one line of code, but an out-of-spec voltage reading could mean a bad solder joint, a drifting power supply, a faulty test fixture, or a component lot issue.

Root cause analysis in hardware means correlating test data across multiple dimensions: time, station, component lot, operator, and environmental conditions.

Step 1: Identify the Failure Pattern

Start by filtering failed runs in TofuPilot's dashboard.

Filter          Purpose
Procedure       Narrow to the specific test
Date range      Find when failures started
Station         Check if failures cluster on one station
Status: Failed  See only failed runs

Look for clustering. If failures concentrate on one station, one shift, or one date range, you've already narrowed the search.
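The same clustering check can be sketched offline on exported run records. This is a minimal sketch, assuming each run is a dict with `station`, `started_at`, and `passed` fields; those field names and the sample data are hypothetical, not TofuPilot's schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical exported run records; the field names are assumptions.
runs = [
    {"station": "ST-1", "started_at": "2026-03-02T09:15:00", "passed": True},
    {"station": "ST-3", "started_at": "2026-03-03T10:05:00", "passed": False},
    {"station": "ST-3", "started_at": "2026-03-03T14:40:00", "passed": False},
    {"station": "ST-2", "started_at": "2026-03-04T08:30:00", "passed": False},
    {"station": "ST-3", "started_at": "2026-03-04T11:20:00", "passed": False},
]


def count_failures_by(runs, key_fn):
    """Count failed runs per group, most frequent group first."""
    counts = Counter(key_fn(run) for run in runs if not run["passed"])
    return counts.most_common()


# Group failures by station, then by calendar day.
by_station = count_failures_by(runs, lambda run: run["station"])
by_day = count_failures_by(
    runs,
    lambda run: datetime.fromisoformat(run["started_at"]).date().isoformat(),
)

print(by_station)
print(by_day)
```

A group that dominates one of these lists is the first place to look. A flat distribution across every grouping suggests the cause lies elsewhere.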

Step 2: Compare Failing vs. Passing Runs

TofuPilot lets you compare runs side by side. Select a failing run and a passing run from the same procedure, then compare their measurements.

Passing run (UNIT-3021):       Failing run (UNIT-3045):
┌─────────────────────────┐    ┌─────────────────────────┐
│ vcc_3v3:   3.30 V     ✓ │    │ vcc_3v3:   3.28 V     ✓ │
│ vcc_1v8:   1.81 V     ✓ │    │ vcc_1v8:   1.76 V     ✗ │
│ clk_freq:  24.00 MHz  ✓ │    │ clk_freq:  23.95 MHz  ✓ │
│ current:   45 mA      ✓ │    │ current:   62 mA      ✗ │
└─────────────────────────┘    └─────────────────────────┘

In this example, the 1.8V rail is low and current draw is high. Taken together, these two symptoms suggest excess load on the 1.8V rail, most likely a short or partial short.
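When a run has many measurements, ranking them by relative deviation makes the outliers easier to spot. The sketch below uses the values from the comparison above. The dict layout and the `relative_deltas` helper are illustrative assumptions, not part of TofuPilot.

```python
# Measurement values from the comparison above; the dict layout is an assumption.
passing = {"vcc_3v3": 3.30, "vcc_1v8": 1.81, "clk_freq": 24.00, "current": 45.0}
failing = {"vcc_3v3": 3.28, "vcc_1v8": 1.76, "clk_freq": 23.95, "current": 62.0}


def relative_deltas(reference, candidate):
    """Return (name, percent change vs. reference), largest magnitude first."""
    deltas = []
    for name, ref_value in reference.items():
        if name not in candidate or ref_value == 0:
            continue  # skip measurements that cannot be compared
        pct = (candidate[name] - ref_value) / ref_value * 100.0
        deltas.append((name, pct))
    return sorted(deltas, key=lambda item: abs(item[1]), reverse=True)


for name, pct in relative_deltas(passing, failing):
    print(f"{name}: {pct:+.2f}%")
```

With these values, `current` and `vcc_1v8` rank first, matching the two measurements that failed.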

Step 3: Check the Measurement Trend

A single comparison shows you the "what." The trend shows you the "when." Open the measurement timeline for vcc_1v8 across all runs.

If the 1.8V reading was stable at 1.81V for weeks and then started dropping on March 3rd, something changed on March 3rd. Check:

  • Was a new component lot introduced?
  • Was the test fixture serviced?
  • Did the station's power supply get recalibrated?
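The date a trend changed can also be estimated from exported readings. This is a minimal sketch, assuming a list of `(date, value)` pairs ordered oldest first. The sample readings, the `first_sustained_shift` helper, and its thresholds are hypothetical. It is a simple baseline-band check, not a formal change-point algorithm.

```python
from statistics import median

# Hypothetical (date, vcc_1v8 reading in volts) pairs, oldest first.
readings = [
    ("2026-02-27", 1.81), ("2026-02-28", 1.81), ("2026-03-01", 1.82),
    ("2026-03-02", 1.81), ("2026-03-03", 1.78), ("2026-03-04", 1.77),
    ("2026-03-05", 1.76),
]


def first_sustained_shift(readings, baseline_n=4, tolerance=0.02, min_run=2):
    """Return the date where readings first leave the baseline band and stay out.

    The baseline is the median of the first `baseline_n` values. A shift is
    reported only when `min_run` consecutive readings deviate from it by more
    than `tolerance`, which filters out single noisy points.
    """
    baseline = median(value for _, value in readings[:baseline_n])
    streak_start = None
    streak_len = 0
    for date, value in readings[baseline_n:]:
        if abs(value - baseline) > tolerance:
            if streak_len == 0:
                streak_start = date
            streak_len += 1
            if streak_len >= min_run:
                return streak_start
        else:
            streak_len = 0  # deviation did not persist; start over
    return None


print(first_sustained_shift(readings))
```

The baseline window must come from a period known to be healthy. A shift that starts inside the window will be absorbed into the baseline and missed.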

Step 4: Correlate with Component Lots

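TofuPilot tracks unit metadata, including component information. If failures cluster around units built with a specific component lot, a supplier quality issue is the likely cause. Compare failure rates rather than raw counts, since a lot used in more units will also show more failures.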

lot_analysis.py
from collections import Counter

from tofupilot import TofuPilotClient

client = TofuPilotClient()

# Get failed runs for a specific procedure (up to 100)
runs = client.get_runs(
    procedure_id="BOARD-FUNCTIONAL",
    run_passed=False,
    limit=100,
)

# Count failures per component lot; units without batch info count as "unknown"
lot_counts = Counter(
    run.get("unit_under_test", {}).get("batch", "unknown") for run in runs
)

# Print lots from most to fewest failures
for lot, count in lot_counts.most_common():
    print(f"Lot {lot}: {count} failures")
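The script above counts failures only, so a lot that went into more units will naturally rank higher. Here is a sketch of the normalized version, assuming you can export `(lot, passed)` pairs for all runs, not just the failed ones. The data and the `failure_rate_by_lot` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical (lot, passed) pairs covering ALL runs, not only failures.
results = [
    ("LOT-A", True), ("LOT-A", True), ("LOT-A", False), ("LOT-A", True),
    ("LOT-B", False), ("LOT-B", False), ("LOT-B", True),
    ("LOT-C", True), ("LOT-C", True),
]


def failure_rate_by_lot(results):
    """Return {lot: failed runs / total runs} for every lot seen."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for lot, passed in results:
        totals[lot] += 1
        if not passed:
            failures[lot] += 1
    return {lot: failures[lot] / totals[lot] for lot in totals}


rates = failure_rate_by_lot(results)
for lot, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"Lot {lot}: {rate:.0%} failure rate")
```

A lot whose rate sits well above the others is a stronger signal than one that merely has the highest count.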

Step 5: Isolate the Variable

Root cause analysis is a process of elimination. TofuPilot helps you hold variables constant while changing one at a time:

Variable       How to control it
Station        Filter by station ID
Operator       Filter by shift/time
Component lot  Filter by batch/lot
Fixture        Check station metadata
Environment    Compare with temperature logs

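When the failure follows a single variable while the others are held constant (e.g., "all failures are from Station 3"), you've isolated the variable that carries the root cause. The physical cause itself, such as a worn probe on that station, still has to be confirmed by inspection.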
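Holding one variable constant while varying another can be sketched on exported records as a filtered failure rate. The record fields (`station`, `lot`, `passed`), the sample data, and the `failure_rate` helper are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical run records; the field names are assumptions.
runs = [
    {"station": "ST-1", "lot": "LOT-A", "passed": True},
    {"station": "ST-1", "lot": "LOT-B", "passed": True},
    {"station": "ST-2", "lot": "LOT-A", "passed": True},
    {"station": "ST-2", "lot": "LOT-B", "passed": True},
    {"station": "ST-3", "lot": "LOT-A", "passed": False},
    {"station": "ST-3", "lot": "LOT-B", "passed": False},
    {"station": "ST-3", "lot": "LOT-B", "passed": True},
]


def failure_rate(runs, vary, hold=None):
    """Failure rate per value of `vary`, using only runs that match `hold`."""
    hold = hold or {}
    totals = defaultdict(int)
    failures = defaultdict(int)
    for run in runs:
        if any(run[field] != value for field, value in hold.items()):
            continue  # run does not match the fixed variables
        totals[run[vary]] += 1
        if not run["passed"]:
            failures[run[vary]] += 1
    return {key: failures[key] / totals[key] for key in totals}


# Hold the lot constant and vary the station.
print(failure_rate(runs, vary="station", hold={"lot": "LOT-A"}))
# Hold the station constant and vary the lot.
print(failure_rate(runs, vary="lot", hold={"station": "ST-1"}))
```

In this sample, the failure rate changes with the station when the lot is fixed, but not with the lot when the station is fixed to ST-1. That points at the station rather than the lot.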

Common Root Cause Patterns

Station-Specific Failures

Failures cluster on one test station. Usually caused by:

  • Worn pogo pins or test probes
  • Loose cable connections
  • Calibration drift on station instruments

Lot-Specific Failures

Failures cluster around units from a specific component lot. Usually caused by:

  • Supplier quality escape
  • Component parameter shift
  • Wrong component revision

Time-Correlated Failures

Failures start at a specific date and continue. Usually caused by:

  • Process change (solder profile, firmware version)
  • Environmental change (humidity, temperature)
  • Fixture wear reaching a threshold

Intermittent Failures

Failures appear randomly across stations and lots. Usually caused by:

  • Marginal design (values close to limits)
  • Test measurement noise
  • Environmental sensitivity
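Marginal design can be checked by measuring how much room the readings leave before the limits. One common way to express this is the process capability index, Cpk = min(USL - mean, mean - LSL) / (3 * sigma). The sketch below computes it with the standard library. The readings and the 1.75 V to 1.85 V limits are hypothetical.

```python
from statistics import mean, stdev


def cpk(values, lower_limit, upper_limit):
    """Process capability index: distance to the nearest limit in units of 3 sigma."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        raise ValueError("Cpk is undefined when all values are identical")
    return min(upper_limit - mu, mu - lower_limit) / (3 * sigma)


# Hypothetical vcc_1v8 readings (volts) and limits.
marginal = [1.78, 1.77, 1.79, 1.76, 1.78, 1.77]
centered = [1.80, 1.81, 1.79, 1.80, 1.80, 1.81]

print(f"marginal: Cpk = {cpk(marginal, 1.75, 1.85):.2f}")
print(f"centered: Cpk = {cpk(centered, 1.75, 1.85):.2f}")
```

A Cpk of about 1.33 or higher is a commonly cited rule of thumb for a capable process. Values below 1 mean the spread regularly reaches a limit, which matches the pattern of random failures across stations and lots.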

From Root Cause to Corrective Action

Once you've identified the root cause, TofuPilot's data helps you verify the fix. Run the same tests after the corrective action and compare the measurement distributions before and after. If the 1.8V rail readings shift back to 1.81V and current draw returns to normal, your fix worked.

Track the corrective action's effectiveness over time. TofuPilot's trend views show whether the fix holds or if the problem returns.
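The before/after comparison can be sketched with summary statistics on exported readings. The sample values, the 1.81 V nominal, the 0.01 V tolerance, and both helper functions are hypothetical. This is a simple mean check, not a statistical significance test.

```python
from statistics import mean, stdev


def summarize(values):
    """Basic distribution summary for one set of readings."""
    return {"n": len(values), "mean": mean(values), "stdev": stdev(values)}


def fix_restored_nominal(before, after, nominal, tolerance):
    """True when the post-fix mean is within tolerance of nominal and the pre-fix mean was not."""
    was_off = abs(mean(before) - nominal) > tolerance
    is_back = abs(mean(after) - nominal) <= tolerance
    return was_off and is_back


# Hypothetical vcc_1v8 readings (volts) before and after the corrective action.
before = [1.76, 1.77, 1.75, 1.76]
after = [1.81, 1.80, 1.81, 1.82]

print("before:", summarize(before))
print("after: ", summarize(after))
print("fix restored nominal:", fix_restored_nominal(before, after, 1.81, 0.01))
```

Re-running the same check on later batches is one way to confirm that the fix holds and the mean is not drifting back.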
