
Production Test Validation at Scale

Learn how to validate your production test process at scale using TofuPilot's statistical analysis, Cpk tracking, and yield monitoring.

Julien Buteau
Advanced · 11 min read · March 14, 2026


Running a test that works for 10 prototypes is different from running it for 10,000 production units. At scale, you need to validate not just the product, but the test process itself. Are your limits correct? Is your test repeatable? Are you catching real defects without creating false failures? TofuPilot's analytics help answer these questions.

What Production Test Validation Means

Production test validation answers three questions:

  1. Are the test limits correct? Limits that are too tight cause false failures. Limits that are too loose let defective units ship.
  2. Is the test repeatable? The same unit tested twice should give the same result.
  3. Is the test effective? Does it catch the defects it's supposed to catch?

Step 1: Analyze Measurement Distributions

After running your test on the first 100-200 production units, analyze the measurement distributions in TofuPilot.

distribution_analysis.py
import numpy as np
from tofupilot import TofuPilotClient

client = TofuPilotClient()

runs = client.get_runs(
    procedure_id="FINAL-FUNCTIONAL-V3",
    limit=200,
)

# Extract measurement values
vcc_values = []
for run in runs:
    for step in run.get("steps", []):
        for m in step.get("measurements", []):
            if m["name"] == "vcc_3v3":
                vcc_values.append(m["value"])

values = np.array(vcc_values)
print(f"N:      {len(values)}")
print(f"Mean:   {np.mean(values):.4f} V")
print(f"Std:    {np.std(values, ddof=1):.4f} V")
print(f"Min:    {np.min(values):.4f} V")
print(f"Max:    {np.max(values):.4f} V")
print(f"Range:  {np.max(values) - np.min(values):.4f} V")

What to look for:

| Observation | Action |
| --- | --- |
| Distribution centered within limits, Cpk > 1.33 | Limits are well-set |
| Distribution skewed toward one limit | Investigate process bias |
| Distribution wider than expected | Tighten process or widen limits |
| Outliers beyond 3-sigma | Investigate those specific units |
| Bimodal distribution | Two populations, likely mixed lots |
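
The outlier check in particular is easy to script against the same `values` array. A sketch (function and filename are illustrative) using a median/MAD estimate rather than the sample standard deviation, since in a sample of 100-200 units a single extreme reading can inflate the plain std enough to mask itself:

outlier_check.py
```python
import numpy as np

def robust_outliers(values, k=3.0):
    """Return indices of points more than k robust-sigmas from the median.
    Uses the median absolute deviation (MAD) instead of the sample std,
    so one extreme reading cannot inflate the spread and hide itself."""
    v = np.asarray(values, dtype=float)
    med = np.median(v)
    mad = np.median(np.abs(v - med))
    sigma = 1.4826 * mad  # scales MAD to sigma for a normal distribution
    return np.where(np.abs(v - med) > k * sigma)[0]

vals = [3.30, 3.31, 3.29, 3.32, 3.30, 3.31, 3.29, 3.30, 3.33, 3.45]
print(robust_outliers(vals))  # [9]: the 3.45 reading
```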

Step 2: Calculate Process Capability

Cpk tells you how well your process fits within the test limits. TofuPilot provides the measurement data; you calculate the Cpk.

cpk_validation.py
import numpy as np

def calculate_cpk(values, lsl, usl):
    mean = np.mean(values)
    std = np.std(values, ddof=1)
    cpu = (usl - mean) / (3 * std)
    cpl = (mean - lsl) / (3 * std)
    cpk = min(cpu, cpl)
    return cpk, mean, std

# From TofuPilot data
vcc_values = [3.30, 3.31, 3.29, 3.32, 3.30, 3.31, 3.29, 3.30, 3.33, 3.31]
lsl, usl = 3.25, 3.35

cpk, mean, std = calculate_cpk(vcc_values, lsl, usl)
print(f"Cpk: {cpk:.2f}")
print(f"Mean: {mean:.3f} V")
print(f"Std: {std:.4f} V")

if cpk >= 1.67:
    print("Excellent process capability")
elif cpk >= 1.33:
    print("Acceptable process capability")
elif cpk >= 1.0:
    print("Marginal. Consider tightening process or widening limits")
else:
    print("Poor capability. Action required")

| Cpk | Meaning | DPMO (approx.) |
| --- | --- | --- |
| 2.0 | Excellent | 0.002 |
| 1.67 | Very good | 0.6 |
| 1.33 | Good | 63 |
| 1.0 | Marginal | 2,700 |
| 0.67 | Poor | 45,500 |
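
The DPMO column follows from the normal CDF, assuming a centered, normally distributed process (real processes drift, so treat these as best-case numbers). A sketch of the conversion:

cpk_to_dpmo.py
```python
import math

def cpk_to_dpmo(cpk):
    """Approximate two-sided DPMO for a centered process with a given Cpk.
    The defect rate is the tail probability beyond 3*Cpk sigma on each
    side, computed from the standard normal CDF via erfc."""
    tail = 0.5 * math.erfc(3 * cpk / math.sqrt(2))
    return 2 * tail * 1_000_000

for cpk in (2.0, 1.67, 1.33, 1.0, 0.67):
    print(f"Cpk {cpk:.2f} -> {cpk_to_dpmo(cpk):,.1f} DPMO")
```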

Step 3: Validate Test Repeatability (Gauge R&R)

Test the same unit multiple times to measure your test system's repeatability.

repeatability_test.py
from tofupilot import TofuPilotClient

client = TofuPilotClient()

# Test the same golden unit 30 times
serial = "GRR-GOLDEN-001"
LIMIT_LOW, LIMIT_HIGH = 3.25, 3.35

for i in range(30):
    vcc = measure_voltage()  # placeholder: your instrument read
    passed = LIMIT_LOW <= vcc <= LIMIT_HIGH
    client.create_run(
        procedure_id="GRR-FUNCTIONAL-V3",
        unit_under_test={"serial_number": serial},
        run_passed=passed,
        steps=[{
            "name": "Power Rail",
            "step_type": "measurement",
            "status": passed,
            "measurements": [{
                "name": "vcc_3v3",
                "value": vcc,
                "unit": "V",
                "limit_low": LIMIT_LOW,
                "limit_high": LIMIT_HIGH,
            }],
        }],
    )

After 30 runs, analyze the spread in TofuPilot:

| Metric | Target | Meaning |
| --- | --- | --- |
| GR&R % of tolerance | < 10% | Excellent measurement system |
| GR&R % of tolerance | 10-30% | Acceptable, monitor |
| GR&R % of tolerance | > 30% | Measurement system needs improvement |

If your test measurement varies by 0.04V on the same unit and your tolerance is 0.10V, that's 40% GR&R. Your test is too noisy to reliably distinguish good from bad units.

Step 4: Optimize Test Limits

Use production data to optimize limits. The goal: catch real defects without rejecting good units.

Tightening Limits

If Cpk > 2.0 and you're seeing no false failures, your limits might be too loose. Tighter limits catch marginal units before they become field failures.

Widening Limits

If Cpk < 1.0 and you're seeing false failures (units that fail test but work fine in the field), your limits are too tight for your current process capability.

Dynamic Limits

Some teams use TofuPilot data to set limits based on the production distribution:

dynamic_limits.py
# Calculate limits from production data
mean = 3.310
std = 0.015

# 4-sigma limits for Cpk = 1.33
dynamic_low = mean - 4 * std   # 3.250
dynamic_high = mean + 4 * std  # 3.370
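
Since a drifting process could push mean ± 4-sigma past what the product must actually guarantee, it is worth clamping dynamic limits to the datasheet spec. A sketch (the clamping policy is a suggestion, not a TofuPilot feature):

clamped_limits.py
```python
def dynamic_limits(mean, std, spec_low, spec_high, k=4.0):
    """Mean +/- k*sigma limits, clamped to the datasheet spec so a
    drifting process can never widen the test limits past what the
    product must guarantee."""
    low = max(mean - k * std, spec_low)
    high = min(mean + k * std, spec_high)
    return low, high

low, high = dynamic_limits(mean=3.310, std=0.015, spec_low=3.25, spec_high=3.35)
print(f"{low:.3f} .. {high:.3f}")  # 3.250 .. 3.350
```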

Step 5: Monitor at Scale

Once your test is validated, monitor it continuously. Scale introduces new variables:

  • Different operators
  • Different component lots across months
  • Fixture wear over thousands of cycles
  • Environmental changes (season, humidity)
  • Equipment calibration drift

TofuPilot's trend dashboards surface these changes. Set up monitoring for:

  1. FPY trend: Catch yield drops within hours
  2. Cpk trend: Catch process capability degradation within days
  3. Measurement mean shift: Catch drift before it causes failures
  4. Failure pareto changes: Catch new failure modes early
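
The FPY check reduces to a few lines once you have run records. A sketch with synthetic records; it assumes each record exposes the same `run_passed` boolean that `create_run` accepts, and the 95% threshold is only an example:

fpy_monitor.py
```python
def first_pass_yield(runs):
    """Share of run records that passed, as a percentage.
    Assumes each record carries a boolean "run_passed" field, mirroring
    the argument passed to create_run. Returns None for an empty window."""
    if not runs:
        return None
    passed = sum(1 for r in runs if r.get("run_passed"))
    return 100.0 * passed / len(runs)

# With live data, feed this the result of
# client.get_runs(procedure_id="FINAL-FUNCTIONAL-V3", limit=100)
sample = [{"run_passed": True}] * 97 + [{"run_passed": False}] * 3
fpy = first_pass_yield(sample)
if fpy < 95.0:  # example alert threshold
    print(f"ALERT: FPY dropped to {fpy:.1f}%")
else:
    print(f"FPY OK: {fpy:.1f}%")
```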

Validation Checklist

Before approving a test for production volume:

  • Measurement distributions are normal (or expected shape)
  • Cpk > 1.33 for all critical measurements
  • GR&R < 30% for all measurements
  • No false failures in the last 200 units
  • Test catches known defect modes (verified with known-bad units)
  • Test cycle time meets throughput requirements
  • All stations produce equivalent results (cross-station correlation)
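
The last item, cross-station correlation, can be scripted by grouping one measurement's values per station. A sketch with hypothetical station names; comparing against the median station mean keeps a single offset station from dragging the reference with it, and the 10%-of-tolerance threshold is an example, not a standard:

station_correlation.py
```python
import numpy as np

def station_offsets(values_by_station, lsl, usl, max_fraction=0.1):
    """Flag stations whose mean deviates from the median station mean
    by more than a fraction of the tolerance band. Returns a dict of
    {station: offset} for the flagged stations only."""
    means = {s: float(np.mean(v)) for s, v in values_by_station.items()}
    reference = float(np.median(list(means.values())))
    tol = usl - lsl
    return {s: round(m - reference, 4)
            for s, m in means.items()
            if abs(m - reference) > max_fraction * tol}

data = {
    "ST-01": [3.300, 3.302, 3.301],
    "ST-02": [3.301, 3.299, 3.300],
    "ST-03": [3.330, 3.332, 3.331],  # reads ~30 mV high
}
print(station_offsets(data, lsl=3.25, usl=3.35))  # flags ST-03 only
```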
