Scaling & Monitoring

Test Station Monitoring and Performance

Learn how to monitor test station health, track uptime metrics, and detect performance degradation using TofuPilot analytics.

Julien Buteau
Intermediate · 8 min read · March 14, 2026

A test station that's silently degrading is worse than one that's down. Slow tests eat throughput. Dropping yields hide root causes. You need visibility into station health before problems hit your production line.

This guide covers what to monitor, how to capture station health metrics inside your OpenHTF tests, and how to detect drift using TofuPilot's analytics.

What to Monitor

Four metrics tell you most of what you need to know about a test station.

Test throughput. Units per hour, per station. A drop means something changed: slower tests, more retests, or operator delays.

Pass rate trends. FPY over time, not just today's number. A slow decline from 97% to 93% over two weeks is easy to miss in daily reports.

Average test duration. Track per-phase and total. If your calibration phase went from 4s to 12s, the instrument connection is probably degrading.

Station errors. Uncaught exceptions, instrument timeouts, fixture faults. These don't always fail the DUT, but they signal trouble.
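The first two metrics are easy to compute from run records you already have. Here is a minimal sketch, using illustrative `(finished_at, passed)` tuples for one station; a real implementation would also exclude retests of the same serial number before computing first-pass yield.

```python
from datetime import datetime, timedelta

# Hypothetical run records for one station: (finished_at, passed).
runs = [
    (datetime(2026, 3, 14, 8, 5), True),
    (datetime(2026, 3, 14, 8, 12), True),
    (datetime(2026, 3, 14, 8, 21), False),
    (datetime(2026, 3, 14, 8, 30), True),
]

# Keep only runs inside the most recent one-hour window.
window = timedelta(hours=1)
latest = max(t for t, _ in runs)
recent = [(t, ok) for t, ok in runs if latest - t <= window]

throughput = len(recent)                          # units per hour
fpy = sum(ok for _, ok in recent) / len(recent)   # pass fraction (retests not filtered here)

print(f"throughput: {throughput}/hr, FPY: {fpy:.1%}")
```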

Capture Station Health Metrics

You can log station health data (CPU, memory, disk) as OpenHTF measurements alongside your DUT tests. This gives you a per-run snapshot of station condition.

station_health.py
import openhtf as htf
import psutil

@htf.measures(
    htf.Measurement("cpu_percent").in_range(maximum=90),
    htf.Measurement("memory_percent").in_range(maximum=85),
    htf.Measurement("disk_percent").in_range(maximum=90),
    htf.Measurement("disk_read_mb"),
    htf.Measurement("cpu_temp"),
)
def station_health_check(test):
    """Capture station health metrics before running DUT tests."""
    test.measurements.cpu_percent = psutil.cpu_percent(interval=1)
    test.measurements.memory_percent = psutil.virtual_memory().percent
    test.measurements.disk_percent = psutil.disk_usage("/").percent
    test.measurements.disk_read_mb = psutil.disk_io_counters().read_bytes / (1024 * 1024)

    # CPU temperature: psutil.sensors_temperatures() is only available on
    # Linux/FreeBSD, so guard so this phase still runs on other platforms
    temps = psutil.sensors_temperatures() if hasattr(psutil, "sensors_temperatures") else {}
    if "coretemp" in temps:
        test.measurements.cpu_temp = temps["coretemp"][0].current
    else:
        test.measurements.cpu_temp = 0.0

Add this phase at the start of your test sequence. If CPU or memory is pegged, you'll see it in TofuPilot before it causes flaky test results.

main_test.py
import openhtf as htf
from openhtf.util import units
from tofupilot.openhtf import TofuPilot
from station_health import station_health_check

# Your DUT test phases
@htf.measures(htf.Measurement("voltage_3v3").in_range(3.1, 3.5).with_units(units.VOLT))
def test_power_rail(test):
    test.measurements.voltage_3v3 = 3.28

def main():
    test = htf.Test(
        station_health_check,
        test_power_rail,
    )
    with TofuPilot(test):
        test.execute(test_start=lambda: "DUT-001")

if __name__ == "__main__":
    main()

Detect Performance Drift in TofuPilot

TofuPilot tracks test duration, pass rates, and measurement trends per station automatically. Use the Analytics tab to spot drift:

  • Test duration trend. Filter by station and check whether average test time is increasing. A 15%+ increase over baseline signals instrument connection degradation, fixture wear, or background process interference.
  • FPY by station. Compare yield across stations running the same procedure. A station with 3+ points lower FPY than its neighbors needs fixture inspection.
  • Measurement histograms. Check whether station health measurements (CPU, memory, disk) are creeping toward their limits over time.
  • Failure Pareto. If one station accounts for a disproportionate share of failures, investigate that station's fixture and connections.
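The duration-drift check above can also run as a simple script against exported per-run durations. This is a sketch with illustrative data; the variable names and sample values are assumptions, but the 15% threshold matches the guidance above.

```python
# Per-run total test durations (seconds) for one station.
baseline_durations = [42.0, 41.5, 43.1, 42.8, 41.9]   # known-good baseline week
recent_durations = [48.2, 49.5, 47.8, 50.1, 48.9]     # current week

baseline = sum(baseline_durations) / len(baseline_durations)
recent = sum(recent_durations) / len(recent_durations)
drift = (recent - baseline) / baseline

# Flag the station once average duration exceeds baseline by more than 15%.
if drift > 0.15:
    print(f"ALERT: test duration up {drift:.0%} vs baseline")
```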

Monitoring Checklist

| Metric | Frequency | Threshold | Action |
| --- | --- | --- | --- |
| Test throughput (units/hr) | Hourly | Below 80% of target | Check for operator delays, instrument timeouts |
| First pass yield | Per shift | Below 95% (or your target) | Investigate top failing phases |
| Average test duration | Daily | More than 15% above baseline | Check instrument connections, fixture wear |
| CPU usage | Per run | Above 90% | Close background processes, check for memory leaks |
| Memory usage | Per run | Above 85% | Restart station, check for leaking test processes |
| Disk usage | Daily | Above 90% | Clean logs, archive old data |
| Station errors | Per run | Any uncaught exception | Fix root cause, add error handling |
| Instrument timeout rate | Daily | Above 1% | Check cables, GPIB/USB connections |
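The per-run rows of this checklist are simple enough to automate. Below is a minimal sketch; the `station_alerts` helper and its input dict are hypothetical, but the thresholds mirror the table and the limits in `station_health_check`.

```python
# Per-run thresholds from the checklist (and station_health_check limits).
THRESHOLDS = {
    "cpu_percent": 90.0,
    "memory_percent": 85.0,
    "disk_percent": 90.0,
}

def station_alerts(metrics: dict) -> list[str]:
    """Return threshold violations for one run's health snapshot."""
    return [
        f"{name} = {metrics[name]:.1f} exceeds {limit:.0f}"
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0.0) > limit
    ]

print(station_alerts({"cpu_percent": 95.2, "memory_percent": 60.0, "disk_percent": 91.0}))
```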
