A test station that's silently degrading is worse than one that's down. Slow tests eat throughput. Dropping yields hide root causes. You need visibility into station health before problems hit your production line.
This guide covers what to monitor, how to capture station health metrics inside your OpenHTF tests, and how to detect drift using TofuPilot's analytics.
## What to Monitor
Four metrics tell you most of what you need to know about a test station.
- Test throughput. Units per hour, per station. A drop means something changed: slower tests, more retests, or operator delays.
- Pass rate trends. First pass yield (FPY) over time, not just today's number. A slow decline from 97% to 93% over two weeks is easy to miss in daily reports.
- Average test duration. Track per-phase and total. If your calibration phase went from 4s to 12s, the instrument connection is probably degrading.
- Station errors. Uncaught exceptions, instrument timeouts, fixture faults. These don't always fail the DUT, but they signal trouble.
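These metrics can also be computed directly from per-run records. A minimal sketch, assuming a hypothetical list of run dicts; the field names (`outcome`, `duration_s`, `retest`) are illustrative, not a TofuPilot schema:

```python
from statistics import mean

# Hypothetical run records as a station log might store them;
# the field names are illustrative, not a TofuPilot schema.
runs = [
    {"outcome": "PASS", "duration_s": 41.2, "retest": False},
    {"outcome": "FAIL", "duration_s": 43.8, "retest": False},
    {"outcome": "PASS", "duration_s": 44.1, "retest": True},
    {"outcome": "PASS", "duration_s": 42.5, "retest": False},
]

# First pass yield: passes on the first attempt over all first attempts.
first_attempts = [r for r in runs if not r["retest"]]
fpy = sum(r["outcome"] == "PASS" for r in first_attempts) / len(first_attempts)

# Average test duration across all runs, retests included.
avg_duration = mean(r["duration_s"] for r in runs)

print(f"FPY: {fpy:.1%}, avg duration: {avg_duration:.1f}s")
```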
## Capture Station Health Metrics
You can log station health data (CPU, memory, disk) as OpenHTF measurements alongside your DUT tests. This gives you a per-run snapshot of station condition.
```python
import openhtf as htf
import psutil


@htf.measures(
    htf.Measurement("cpu_percent").in_range(maximum=90),
    htf.Measurement("memory_percent").in_range(maximum=85),
    htf.Measurement("disk_percent").in_range(maximum=90),
    htf.Measurement("disk_read_mb"),
    htf.Measurement("cpu_temp"),
)
def station_health_check(test):
    """Capture station health metrics before running DUT tests."""
    test.measurements.cpu_percent = psutil.cpu_percent(interval=1)
    test.measurements.memory_percent = psutil.virtual_memory().percent
    test.measurements.disk_percent = psutil.disk_usage("/").percent
    test.measurements.disk_read_mb = psutil.disk_io_counters().read_bytes / (1024 * 1024)

    # CPU temperature (Linux only; sensors_temperatures() returns an
    # empty dict on unsupported platforms)
    temps = psutil.sensors_temperatures()
    if temps and "coretemp" in temps:
        test.measurements.cpu_temp = temps["coretemp"][0].current
    else:
        test.measurements.cpu_temp = 0.0
```

Add this phase at the start of your test sequence. If CPU or memory is pegged, you'll see it in TofuPilot before it causes flaky test results.
```python
import openhtf as htf
from openhtf.util import units
from tofupilot.openhtf import TofuPilot

from station_health import station_health_check


# Your DUT test phases
@htf.measures(htf.Measurement("voltage_3v3").in_range(3.1, 3.5).with_units(units.VOLT))
def test_power_rail(test):
    test.measurements.voltage_3v3 = 3.28


def main():
    test = htf.Test(
        station_health_check,
        test_power_rail,
    )
    with TofuPilot(test):
        test.execute(test_start=lambda: "DUT-001")


if __name__ == "__main__":
    main()
```

## Detect Performance Drift in TofuPilot
TofuPilot tracks test duration, pass rates, and measurement trends per station automatically. Use the Analytics tab to spot drift:
- Test duration trend. Filter by station and check whether average test time is increasing. A 15%+ increase over baseline signals instrument connection degradation, fixture wear, or background process interference.
- FPY by station. Compare yield across stations running the same procedure. A station with 3+ points lower FPY than its neighbors needs fixture inspection.
- Measurement histograms. Check whether station health measurements (CPU, memory, disk) are creeping toward their limits over time.
- Failure Pareto. If one station accounts for a disproportionate share of failures, investigate that station's fixture and connections.
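The duration-trend rule above can also be checked locally against exported run data. A minimal sketch, assuming a plain list of per-run durations; the window sizes and the 15% threshold are illustrative defaults:

```python
from statistics import mean

def duration_drift(durations, baseline_window=20, recent_window=10, threshold=0.15):
    """Flag drift when the recent average exceeds baseline by more than threshold.

    Window sizes and threshold are illustrative, matching the 15%-over-baseline
    rule described above.
    """
    baseline = mean(durations[:baseline_window])
    recent = mean(durations[-recent_window:])
    return (recent - baseline) / baseline > threshold

# 20 runs near 40s as baseline, then 10 runs creeping up toward 49s.
history = [40.0] * 20 + [45.0, 46.0, 47.0, 48.0, 49.0] * 2
print(duration_drift(history))  # recent avg 47.0s vs 40.0s baseline -> True
```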
## Monitoring Checklist
| Metric | Frequency | Threshold | Action |
|---|---|---|---|
| Test throughput (units/hr) | Hourly | Below 80% of target | Check for operator delays, instrument timeouts |
| First pass yield | Per shift | Below 95% (or your target) | Investigate top failing phases |
| Average test duration | Daily | More than 15% above baseline | Check instrument connections, fixture wear |
| CPU usage | Per run | Above 90% | Close background processes, check for memory leaks |
| Memory usage | Per run | Above 85% | Restart station, check for leaking test processes |
| Disk usage | Daily | Above 90% | Clean logs, archive old data |
| Station errors | Per run | Any uncaught exception | Fix root cause, add error handling |
| Instrument timeout rate | Daily | Above 1% | Check cables, GPIB/USB connections |
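The per-run rows of this checklist can be enforced in code as well. A small sketch; the metric names and dict shape are assumptions, only the thresholds come from the table:

```python
# Per-run limits mirroring the checklist above; the metric names
# and the dict-based input format are illustrative assumptions.
PER_RUN_LIMITS = {
    "cpu_percent": 90,
    "memory_percent": 85,
}

def check_station(metrics):
    """Return the names of metrics that exceed their checklist limit."""
    return [name for name, limit in PER_RUN_LIMITS.items()
            if metrics.get(name, 0) > limit]

violations = check_station({"cpu_percent": 94, "memory_percent": 60})
print(violations)  # ['cpu_percent']
```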