Root Cause Analysis for Hardware Test Failures

A test fails. The question is never just "what failed" but "why did it fail, and will it fail again?" TofuPilot stores every measurement from every run, so you can trace failures back to their source instead of guessing.

The Root Cause Analysis Problem in Hardware

Software bugs leave stack traces. Hardware failures leave measurements. The difference: a stack trace points to one line of code, but an out-of-spec voltage reading could mean a bad solder joint, a drifting power supply, a faulty test fixture, or a component lot issue.

Root cause analysis in hardware means correlating test data across multiple dimensions: time, station, component lot, operator, and environmental conditions.

Step 1: Identify the Failure Pattern

Start by filtering failed runs in TofuPilot's dashboard.

Filter	Purpose
Procedure	Narrow to the specific test
Date range	Find when failures started
Station	Check if failures cluster on one station
Status: Failed	See only failed runs

Look for clustering. If failures concentrate on one station, one shift, or one date range, you've already narrowed the search.
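
The same clustering check can be run on exported data. The sketch below tallies failed runs per station and per day on hand-written records. The field names (`station`, `started_at`), the station IDs, and the timestamps are assumptions for illustration, not TofuPilot's export schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical failed-run records; field names and values are illustrative only.
failed_runs = [
    {"serial": "UNIT-3045", "station": "ST-3", "started_at": "2024-03-03T09:12:00"},
    {"serial": "UNIT-3046", "station": "ST-3", "started_at": "2024-03-03T09:40:00"},
    {"serial": "UNIT-3051", "station": "ST-1", "started_at": "2024-03-04T14:05:00"},
    {"serial": "UNIT-3058", "station": "ST-3", "started_at": "2024-03-04T16:30:00"},
]


def tally(runs, key_fn):
    """Count runs per group and return (group, count) pairs, largest first."""
    return Counter(key_fn(r) for r in runs).most_common()


by_station = tally(failed_runs, lambda r: r["station"])
by_day = tally(
    failed_runs,
    lambda r: datetime.fromisoformat(r["started_at"]).date().isoformat(),
)

for station, count in by_station:
    print(f"{station}: {count} failed runs")
for day, count in by_day:
    print(f"{day}: {count} failed runs")
```

A raw tally only shows where failures land. To compare stations fairly, divide by the total number of runs each station handled.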

Step 2: Compare Failing vs. Passing Runs

TofuPilot lets you compare runs side by side. Select a failing run and a passing run from the same procedure, then compare their measurements.

Passing run (UNIT-3021):          Failing run (UNIT-3045):
┌─────────────────────────┐      ┌─────────────────────────┐
│ vcc_3v3:  3.30 V  ✓     │      │ vcc_3v3:  3.28 V  ✓     │
│ vcc_1v8:  1.81 V  ✓     │      │ vcc_1v8:  1.76 V  ✗     │
│ clk_freq: 24.00 MHz ✓   │      │ clk_freq: 23.95 MHz ✓   │
│ current:  45 mA  ✓      │      │ current:  62 mA  ✗      │
└─────────────────────────┘      └─────────────────────────┘

In this example, the 1.8V rail is low and current draw is high. Together, these two symptoms suggest excess load on the 1.8V rail, such as a short or partial short.
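
When a run has dozens of measurements, ranking them by how far they moved helps decide where to look first. The sketch below does this for the two runs above, using plain dictionaries. The helper name `relative_deltas` is hypothetical, and pass/fail is still decided by the procedure's limits, not by the size of the shift.

```python
# Measurement values copied from the comparison above (units: V, V, MHz, mA).
passing = {"vcc_3v3": 3.30, "vcc_1v8": 1.81, "clk_freq": 24.00, "current": 45.0}
failing = {"vcc_3v3": 3.28, "vcc_1v8": 1.76, "clk_freq": 23.95, "current": 62.0}


def relative_deltas(reference, candidate):
    """Return {name: percent change from reference} for measurements in both runs."""
    return {
        name: 100.0 * (candidate[name] - reference[name]) / reference[name]
        for name in reference
        if name in candidate and reference[name] != 0
    }


deltas = relative_deltas(passing, failing)

# Rank by absolute shift so the largest movers are listed first.
ranked = sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)
for name, pct in ranked:
    print(f"{name:9s} {pct:+6.2f} %")
```

Here `current` and `vcc_1v8` come out on top, matching the two measurements that failed.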

Step 3: Check the Measurement Trend

A single comparison shows you the "what." The trend shows you the "when." Open the measurement timeline for vcc_1v8 across all runs.

If the 1.8V reading was stable at 1.81V for weeks and then started dropping on March 3rd, something changed on March 3rd. Check:

Was a new component lot introduced?
Was the test fixture serviced?
Did the station's power supply get recalibrated?
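
To pin down the date a reading started to move, one simple approach is to compare each new reading against a band built from an earlier, known-good baseline. The sketch below is a minimal version of that idea. The readings are made up to mirror the March 3rd scenario, and `first_shift`, its parameters, and the 3-sigma threshold are illustrative choices rather than anything TofuPilot prescribes.

```python
from statistics import mean, stdev

# Illustrative vcc_1v8 history as (date label, volts); values are made up.
history = [
    ("02-26", 1.81), ("02-27", 1.82), ("02-28", 1.81), ("03-01", 1.80),
    ("03-02", 1.81), ("03-03", 1.78), ("03-04", 1.77), ("03-05", 1.76),
]


def first_shift(points, baseline_size=5, k=3.0, min_sigma=0.005):
    """Return the first label whose value leaves baseline mean +/- k*sigma, or None.

    The baseline is the first `baseline_size` points. `min_sigma` keeps the
    band from collapsing when the baseline happens to be very quiet.
    """
    baseline = [value for _, value in points[:baseline_size]]
    center = mean(baseline)
    sigma = max(stdev(baseline), min_sigma)
    for label, value in points[baseline_size:]:
        if abs(value - center) > k * sigma:
            return label
    return None


print(first_shift(history))
```

The date it reports is the starting point for the questions above, not an answer in itself.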

Step 4: Correlate with Component Lots

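TofuPilot tracks unit metadata including component information. If failures cluster around units built with a specific component lot, a supplier quality issue is a likely cause. A raw count of failures per lot is a quick first check; confirm it against each lot's failure rate, since a larger lot accumulates more failures at the same defect rate.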

lot_analysis.py

from tofupilot import TofuPilotClient

client = TofuPilotClient()

# Get all failed runs for a specific procedure
runs = client.get_runs(
    procedure_id="BOARD-FUNCTIONAL",
    run_passed=False,
    limit=100,
)

# Check if failures correlate with component lots
lot_counts = {}
for run in runs:
    lot = run.get("unit_under_test", {}).get("batch", "unknown")
    lot_counts[lot] = lot_counts.get(lot, 0) + 1

for lot, count in sorted(lot_counts.items(), key=lambda x: -x[1]):
    print(f"Lot {lot}: {count} failures")
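
The script above counts failures per lot. To turn counts into rates you also need the passing runs. The sketch below shows the normalization on plain dictionaries, without calling the TofuPilot client. The `batch` and `passed` field names and the lot names are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical run records covering passed and failed runs; lot names are made up.
runs = (
    [{"batch": "LOT-A", "passed": True}] * 95
    + [{"batch": "LOT-A", "passed": False}] * 5
    + [{"batch": "LOT-B", "passed": True}] * 16
    + [{"batch": "LOT-B", "passed": False}] * 4
)


def failure_rate_by_lot(records):
    """Return {lot: (failures, total, failure_rate)} for every lot seen."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        lot = record.get("batch", "unknown")
        totals[lot] += 1
        if not record["passed"]:
            failures[lot] += 1
    return {
        lot: (failures[lot], totals[lot], failures[lot] / totals[lot])
        for lot in totals
    }


rates = failure_rate_by_lot(runs)
for lot, (failed, total, rate) in sorted(rates.items(), key=lambda kv: -kv[1][2]):
    print(f"Lot {lot}: {failed}/{total} failed ({rate:.1%})")
```

In this made-up data LOT-A has more failures by count, but LOT-B fails four times as often.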

Step 5: Isolate the Variable

Root cause analysis is a process of elimination. TofuPilot helps you hold variables constant while changing one at a time:

Variable	How to control it
Station	Filter by station ID
Operator	Filter by shift/time
Component lot	Filter by batch/lot
Fixture	Check station metadata
Environment	Compare with temperature logs

When the failure follows a single variable (e.g., "all failures are from Station 3") while the other variables make no difference, you've isolated where the root cause lies.
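
Holding one variable constant while varying another can be sketched as a filtered failure-rate calculation. The example below works on hand-written records. The `failure_rates` helper, the field names, and the station and lot values are all hypothetical.

```python
# Hypothetical run records; stations, lots, and outcomes are made up for illustration.
runs = [
    {"station": "ST-1", "batch": "LOT-A", "passed": True},
    {"station": "ST-1", "batch": "LOT-A", "passed": True},
    {"station": "ST-3", "batch": "LOT-A", "passed": False},
    {"station": "ST-3", "batch": "LOT-A", "passed": False},
    {"station": "ST-1", "batch": "LOT-B", "passed": True},
    {"station": "ST-3", "batch": "LOT-B", "passed": False},
    {"station": "ST-3", "batch": "LOT-B", "passed": True},
]


def failure_rates(records, vary, hold=None):
    """Failure rate per value of `vary`, restricted to records matching `hold`."""
    hold = hold or {}
    groups = {}
    for record in records:
        if any(record.get(key) != value for key, value in hold.items()):
            continue  # outside the slice being held constant
        failed, total = groups.get(record[vary], (0, 0))
        groups[record[vary]] = (failed + (not record["passed"]), total + 1)
    return {value: failed / total for value, (failed, total) in groups.items()}


# Hold the lot constant, vary the station.
print(failure_rates(runs, vary="station", hold={"batch": "LOT-A"}))
# Hold the station constant, vary the lot.
print(failure_rates(runs, vary="batch", hold={"station": "ST-3"}))
```

If the rate changes sharply across stations within one lot but barely across lots within one station, the station is the stronger suspect.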

Common Root Cause Patterns

Station-Specific Failures

Failures cluster on one test station. Usually caused by:

Worn pogo pins or test probes
Loose cable connections
Calibration drift on station instruments

Lot-Specific Failures

Failures cluster around units from a specific component lot. Usually caused by:

Supplier quality escape
Component parameter shift
Wrong component revision

Time-Correlated Failures

Failures start at a specific date and continue. Usually caused by:

Process change (solder profile, firmware version)
Environmental change (humidity, temperature)
Fixture wear reaching a threshold

Intermittent Failures

Failures appear randomly across stations and lots. Usually caused by:

Marginal design (values close to limits)
Test measurement noise
Environmental sensitivity
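
One common way to quantify "values close to limits" is the process capability index, Cpk: the distance from the mean to the nearer limit, divided by three standard deviations. The sketch below computes it with the standard library. The readings and the limits are illustrative assumptions and do not come from a real procedure.

```python
from statistics import mean, stdev


def cpk(values, lower_limit, upper_limit):
    """Process capability index: mean-to-nearer-limit distance in units of 3 sigma."""
    center = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs at least 2 values
    if sigma == 0:
        raise ValueError("all readings are identical; Cpk is undefined")
    return min(upper_limit - center, center - lower_limit) / (3 * sigma)


# Illustrative readings (volts) and hypothetical limits; every reading passes.
readings = [1.78, 1.79, 1.80, 1.78, 1.79, 1.80, 1.79]
value = cpk(readings, lower_limit=1.77, upper_limit=1.85)
print(f"Cpk = {value:.2f}")
```

Every reading in this sample is within limits, yet a Cpk below 1 means the nearer limit sits less than three standard deviations from the mean, so occasional failures are expected even when nothing is broken. Many quality programs ask for a value around 1.33 or higher.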

From Root Cause to Corrective Action

Once you've identified the root cause, TofuPilot's data helps you verify the fix. Run the same tests after the corrective action and compare the measurement distributions before and after. If the 1.8V rail readings shift back to 1.81V and current draw returns to normal, your fix worked.
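
A before/after comparison can be reduced to two numbers per group (mean and spread) plus the size of the shift relative to the noise. The sketch below uses made-up vcc_1v8 readings, and the helper names are hypothetical. It expresses the shift as the mean difference divided by the pooled standard deviation (Cohen's d).

```python
from statistics import mean, stdev

# Illustrative vcc_1v8 readings (volts) before and after a corrective action.
before = [1.76, 1.77, 1.75, 1.78, 1.76, 1.77]
after = [1.81, 1.80, 1.82, 1.81, 1.81, 1.80]


def summarize(values):
    """Return (mean, sample standard deviation) for a list of readings."""
    return mean(values), stdev(values)


def shift_in_sigmas(before_values, after_values):
    """Mean shift in units of the pooled standard deviation (Cohen's d)."""
    m_b, s_b = summarize(before_values)
    m_a, s_a = summarize(after_values)
    n_b, n_a = len(before_values), len(after_values)
    pooled = (((n_b - 1) * s_b**2 + (n_a - 1) * s_a**2) / (n_b + n_a - 2)) ** 0.5
    return (m_a - m_b) / pooled


shift = shift_in_sigmas(before, after)
print(f"before: mean={mean(before):.3f} V, after: mean={mean(after):.3f} V")
print(f"shift = {shift:.1f} pooled standard deviations")
```

A shift of several standard deviations back toward the expected value is hard to explain by measurement noise alone. A shift of a fraction of one is not yet evidence that the fix worked.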

Track the corrective action's effectiveness over time. TofuPilot's trend views show whether the fix holds or the problem returns.
