Root Cause Analysis for Hardware Test Failures
A test fails. The question is never just "what failed" but "why did it fail, and will it fail again?" TofuPilot stores every measurement from every run, so you can trace failures back to their source instead of guessing.
The Root Cause Analysis Problem in Hardware
Software bugs leave stack traces. Hardware failures leave measurements. The difference: a stack trace points to one line of code, but an out-of-spec voltage reading could mean a bad solder joint, a drifting power supply, a faulty test fixture, or a component lot issue.
Root cause analysis in hardware means correlating test data across multiple dimensions: time, station, component lot, operator, and environmental conditions.
Step 1: Identify the Failure Pattern
Start by filtering failed runs in TofuPilot's dashboard.
| Filter | Purpose |
|---|---|
| Procedure | Narrow to the specific test |
| Date range | Find when failures started |
| Station | Check if failures cluster on one station |
| Status: Failed | See only failed runs |
Look for clustering. If failures concentrate on one station, one shift, or one date range, you've already narrowed the search.
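The clustering check can also be run on exported data. The sketch below is a minimal illustration: the `failed_runs` records and their `station` and `date` fields are hypothetical stand-ins for whatever your export contains, not TofuPilot's actual schema.

```python
from collections import Counter

# Hypothetical failed-run records; field names and values are assumptions
# made for illustration, not TofuPilot's actual export format.
failed_runs = [
    {"station": "ST-1", "date": "2024-03-03"},
    {"station": "ST-3", "date": "2024-03-03"},
    {"station": "ST-3", "date": "2024-03-04"},
    {"station": "ST-3", "date": "2024-03-04"},
    {"station": "ST-2", "date": "2024-03-05"},
]


def cluster_counts(runs, key):
    """Count failed runs per value of `key`, most frequent first."""
    return Counter(run[key] for run in runs).most_common()


by_station = cluster_counts(failed_runs, "station")
by_date = cluster_counts(failed_runs, "date")

# Report each station's share of all failures.
for station, count in by_station:
    share = count / len(failed_runs)
    print(f"{station}: {count} failures ({share:.0%} of failures)")
```

A station or date that holds a clearly disproportionate share of failures is the first place to look.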
Step 2: Compare Failing vs. Passing Runs
TofuPilot lets you compare runs side by side. Select a failing run and a passing run from the same procedure, then compare their measurements.
| Measurement | Passing run (UNIT-3021) | Failing run (UNIT-3045) |
|---|---|---|
| vcc_3v3 | 3.30 V ✓ | 3.28 V ✓ |
| vcc_1v8 | 1.81 V ✓ | 1.76 V ✗ |
| clk_freq | 24.00 MHz ✓ | 23.95 MHz ✓ |
| current | 45 mA ✓ | 62 mA ✗ |
In this example, the 1.8V rail is low and current draw is high. Together, these two symptoms point to a short or partial short on the 1.8V rail: the extra load draws more current and pulls the rail voltage down.
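The same comparison can be sketched in a few lines of Python. The measurement values are the ones from the example above. The `limits` table is hypothetical, chosen only so that the pass/fail marks match the example; use the limits from your own procedure.

```python
# Measurements from the example above (UNIT-3021 passes, UNIT-3045 fails).
passing = {"vcc_3v3": 3.30, "vcc_1v8": 1.81, "clk_freq": 24.00, "current": 45.0}
failing = {"vcc_3v3": 3.28, "vcc_1v8": 1.76, "clk_freq": 23.95, "current": 62.0}

# Hypothetical (low, high) limits, assumed for illustration only.
limits = {
    "vcc_3v3": (3.20, 3.40),
    "vcc_1v8": (1.78, 1.84),
    "clk_freq": (23.90, 24.10),
    "current": (0.0, 55.0),
}


def compare_runs(reference, suspect, limits):
    """Return (name, reference, suspect, delta, in_spec) for each measurement."""
    rows = []
    for name, (low, high) in limits.items():
        delta = suspect[name] - reference[name]
        in_spec = low <= suspect[name] <= high
        rows.append((name, reference[name], suspect[name], delta, in_spec))
    return rows


rows = compare_runs(passing, failing, limits)
out_of_spec = [name for name, _, _, _, in_spec in rows if not in_spec]

for name, ref, sus, delta, in_spec in rows:
    mark = "ok" if in_spec else "OUT OF SPEC"
    print(f"{name}: {ref} -> {sus} (delta {delta:+.2f}) {mark}")
```

Listing the deltas for in-spec measurements as well is deliberate: a value that moved but still passes (here `vcc_3v3`) can support or contradict a hypothesis about the failing ones.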
Step 3: Check the Measurement Trend
A single comparison shows you the "what." The trend shows you the "when." Open the measurement timeline for vcc_1v8 across all runs.
If the 1.8V reading was stable at 1.81V for weeks and then started dropping on March 3rd, something changed on March 3rd. Check:
- Was a new component lot introduced?
- Was the test fixture serviced?
- Did the station's power supply get recalibrated?
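If you export the readings, the date the trend changed can be estimated programmatically. The sketch below uses one simple approach (an assumption, not a TofuPilot feature): build a baseline from the first few runs, then report the first date on which several consecutive readings fall more than three standard deviations from that baseline. The readings and dates are made-up illustration data.

```python
from statistics import mean, stdev

# Hypothetical (date, vcc_1v8 reading) pairs in chronological order.
readings = [
    ("2024-02-26", 1.81), ("2024-02-27", 1.80), ("2024-02-28", 1.82),
    ("2024-02-29", 1.81), ("2024-03-01", 1.81), ("2024-03-02", 1.80),
    ("2024-03-03", 1.78), ("2024-03-04", 1.77), ("2024-03-05", 1.76),
]


def first_shift(readings, baseline_size=5, sigmas=3.0, consecutive=2):
    """Return the date where a sustained shift from the baseline begins.

    The baseline is the mean and sample standard deviation of the first
    `baseline_size` readings. A shift is reported once `consecutive`
    readings in a row deviate by more than `sigmas` standard deviations.
    Returns None if no sustained shift is found.
    """
    baseline = [value for _, value in readings[:baseline_size]]
    center, spread = mean(baseline), stdev(baseline)
    streak = 0
    for index in range(baseline_size, len(readings)):
        _, value = readings[index]
        if abs(value - center) > sigmas * spread:
            streak += 1
            if streak == consecutive:
                # Report the date of the first reading in the streak.
                return readings[index - consecutive + 1][0]
        else:
            streak = 0
    return None


print(first_shift(readings))
```

Requiring consecutive out-of-band readings keeps a single noisy measurement from being reported as a process change.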
Step 4: Correlate with Component Lots
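TofuPilot tracks unit metadata including component information. If failures cluster around units built with a specific component lot, a supplier quality issue is the likely cause. Compare each lot's failure count against the number of units tested from that lot, since a large lot produces more failures even at a normal failure rate.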
from tofupilot import TofuPilotClient

client = TofuPilotClient()

# Get all failed runs for a specific procedure
runs = client.get_runs(
    procedure_id="BOARD-FUNCTIONAL",
    run_passed=False,
    limit=100,
)

# Check if failures correlate with component lots
lot_counts = {}
for run in runs:
    lot = run.get("unit_under_test", {}).get("batch", "unknown")
    lot_counts[lot] = lot_counts.get(lot, 0) + 1

# Print lots with the most failures first
for lot, count in sorted(lot_counts.items(), key=lambda x: -x[1]):
    print(f"Lot {lot}: {count} failures")

Step 5: Isolate the Variable
Root cause analysis is a process of elimination. TofuPilot helps you hold variables constant while changing one at a time:
| Variable | How to control it |
|---|---|
| Station | Filter by station ID |
| Operator | Filter by shift/time |
| Component lot | Filter by batch/lot |
| Fixture | Check station metadata |
| Environment | Compare with temperature logs |
When holding one variable fixed reproduces the failure (e.g., "all failures are from Station 3"), you've isolated where the root cause lies. The specific cause, such as a worn probe or a loose cable, is the next thing to find.
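One way to make the elimination concrete is to compute the failure rate for each value of each variable and see which variable separates failing from passing runs most sharply. This is a minimal sketch on hypothetical records; the `station`, `lot`, and `passed` field names are assumptions for illustration. Using rates rather than raw counts matters because a busy station or a large lot accumulates more failures even when nothing is wrong with it.

```python
from collections import defaultdict

# Hypothetical run records covering both passing and failing runs.
runs = [
    {"station": "ST-1", "lot": "LOT-A", "passed": True},
    {"station": "ST-1", "lot": "LOT-B", "passed": True},
    {"station": "ST-2", "lot": "LOT-A", "passed": True},
    {"station": "ST-2", "lot": "LOT-B", "passed": True},
    {"station": "ST-3", "lot": "LOT-A", "passed": False},
    {"station": "ST-3", "lot": "LOT-B", "passed": False},
    {"station": "ST-3", "lot": "LOT-A", "passed": True},
    {"station": "ST-3", "lot": "LOT-B", "passed": True},
]


def failure_rates(runs, key):
    """Return {value: failed / total} for each distinct value of `key`."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for run in runs:
        totals[run[key]] += 1
        if not run["passed"]:
            failures[run[key]] += 1
    return {value: failures[value] / totals[value] for value in totals}


def rate_spread(rates):
    """Difference between the highest and lowest failure rate."""
    return max(rates.values()) - min(rates.values())


for variable in ("station", "lot"):
    rates = failure_rates(runs, variable)
    print(variable, rates, f"spread={rate_spread(rates):.2f}")
```

In this made-up data the failure rate differs widely between stations but not between lots, so the station is the variable worth investigating first.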
Common Root Cause Patterns
Station-Specific Failures
Failures cluster on one test station. Usually caused by:
- Worn pogo pins or test probes
- Loose cable connections
- Calibration drift on station instruments
Lot-Specific Failures
Failures cluster around units from a specific component lot. Usually caused by:
- Supplier quality escape
- Component parameter shift
- Wrong component revision
Time-Correlated Failures
Failures start at a specific date and continue. Usually caused by:
- Process change (solder profile, firmware version)
- Environmental change (humidity, temperature)
- Fixture wear reaching a threshold
Intermittent Failures
Failures appear randomly across stations and lots. Usually caused by:
- Marginal design (values close to limits)
- Test measurement noise
- Environmental sensitivity
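A marginal design can be checked numerically with the process capability index Cpk, which compares the distance from the mean to the nearer limit against three standard deviations of the measurement. The sketch below uses hypothetical readings and hypothetical limits; neither comes from TofuPilot.

```python
from statistics import mean, stdev


def cpk(values, low, high):
    """Process capability index: distance to the nearer limit over 3 sigma."""
    center = mean(values)
    spread = stdev(values)
    return min(high - center, center - low) / (3 * spread)


# Hypothetical vcc_1v8 readings from passing units, with assumed limits.
readings = [1.79, 1.80, 1.78, 1.81, 1.79, 1.80]
low_limit, high_limit = 1.78, 1.84

value = cpk(readings, low_limit, high_limit)
print(f"Cpk = {value:.2f}")
```

A Cpk below 1.0 means the normal spread of the measurement already reaches past the nearer limit, so occasional failures are expected even when no individual station or lot is at fault.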
From Root Cause to Corrective Action
Once you've identified the root cause, TofuPilot's data helps you verify the fix. Run the same tests after the corrective action and compare the measurement distributions before and after. If the 1.8V rail readings shift back to 1.81V and current draw returns to normal, your fix worked.
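The before/after comparison can be sketched as follows. The readings are hypothetical, and Welch's t-statistic is used here as one reasonable way to judge whether the shift is large relative to measurement scatter; it is an illustrative choice, not something TofuPilot computes for you.

```python
from statistics import mean, variance

# Hypothetical vcc_1v8 readings before and after the corrective action.
before = [1.76, 1.77, 1.75, 1.76, 1.78]
after = [1.81, 1.80, 1.82, 1.81, 1.81]


def welch_t(first, second):
    """Welch's t-statistic for the difference in means (second - first)."""
    standard_error = (
        variance(first) / len(first) + variance(second) / len(second)
    ) ** 0.5
    return (mean(second) - mean(first)) / standard_error


shift = mean(after) - mean(before)
t_stat = welch_t(before, after)

print(f"mean before: {mean(before):.3f} V")
print(f"mean after:  {mean(after):.3f} V")
print(f"shift: {shift:+.3f} V, Welch t = {t_stat:.1f}")
```

A large t-statistic says the shift is well outside the scatter of the readings. With only a handful of runs the result is indicative at best, so keep collecting data after the fix.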
Track the corrective action's effectiveness over time. TofuPilot's trend views show whether the fix holds or if the problem returns.