Tool-change auto-detection, and the case for asking the controller instead of guessing from the signal

The feature issue 06 left for last

Issue 06 compared a vendor condition-monitoring quote against the open stack feature by feature. Four model capabilities were on the quote. Three are now built: the single-asset Isolation Forest from issue 06, the cross-asset common-mode gate from issue 07, and the ISO 10816 threshold alerts from issue 05. The fourth was tool-change auto-detection.

The need for it is concrete and was logged, not predicted. Issue 06 documented an anomaly the model flagged that turned out to be a planned tool change. The transient was real. A tool change drives a large excursion in spindle drive current as the spindle stops, the automatic tool changer cycles, and the spindle restarts. From inside the model's feature space the excursion is genuinely anomalous, because it is. It is just not a fault. The model had no way to know the machine was in the middle of an operation it performs hundreds of times a shift on purpose.

The fix is to tell the model when a tool change is happening so it stops scoring for the duration. There are two ways to know, and the difference between them is the whole point of this issue.

Guessing from the signal, and asking the machine

The signal-only approach learns what a tool change looks like and watches for it. The transient has a repeatable shape: spindle decel, the tool changer cycle, spindle accel back to the commanded speed. That shape is a template, and matching a known template against a noisy stream is a solved problem. The matched filter is the optimal linear detector for a known signal in additive white noise, and it is one cross-correlation away in scipy.signal. When the template's deformation in time matters more than its amplitude, dynamic time warping handles the case where the same tool change runs slightly faster or slower, and dtaidistance or tslearn provide it. A change-point detector such as ruptures can find the boundaries of the transient without a template at all.
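
The time-warping point is easier to see on a toy example. What follows is a minimal pure-Python sketch of the classic dynamic time warping distance, not the dtaidistance or tslearn API, and the three current traces are invented for illustration. It only shows that a tool change with a longer dwell still lands close to the template while steady cutting does not.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance with |x - y| cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# Hypothetical normalized drive-current shapes: decel, changer dwell, accel.
template = [1.0, 0.5, 0.0, 0.0, 0.5, 1.0]
slow_change = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0]  # same shape, longer dwell
steady_cut = [1.0] * 8                                   # no tool change at all

d_slow = dtw_distance(template, slow_change)
d_steady = dtw_distance(template, steady_cut)
```

The warp absorbs the extra dwell samples, so d_slow stays far below d_steady. A sample-by-sample comparison would not even be defined here, because the two traces have different lengths.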

All of this works. It also inherits the failure mode of every blind detector: it fires on whatever resembles the template and stays silent on the tool change that does not match, the one with a sticky tool pot or a longer reach that runs half a second long. The detector is inferring an event the machine could simply report.

Because the machine can report it. The CNC issues the tool change. It executes the M06 block, it knows the active tool number, and it knows whether the program is executing or in a tool-change dwell. M-codes are the program's miscellaneous (machine) functions, distinct from the G-code motion commands, and M06 is the one that commands a tool change. The controller holds the ground truth the signal-only detector is trying to reconstruct. The open standard for getting that ground truth out of the controller is MTConnect.

What MTConnect exposes

MTConnect is an open, royalty-free standard for machine data. An adapter translates a controller's native interface into the standard vocabulary, and an agent aggregates the data items and serves them over HTTP as structured documents. The device information model defines the data items that matter for this build: EXECUTION reports whether the program is ACTIVE, READY, or in a wait state, BLOCK reports the block of code currently executing, PROGRAM reports the running program, and the tool entities track which tool the automatic tool changer has loaded. A tool change shows up as the loaded tool number transitioning and the execution state passing through its tool-change wait. Either is a clean event boundary, and neither requires parsing G-code on the edge.
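
As a rough illustration of how little parsing this takes, the sketch below reads the Execution and ToolNumber observations out of a streams document and flags a tool-number transition between two snapshots. The XML is a hand-written, trimmed stand-in for an agent's current response, and the device name, ids, timestamps, and the helper names read_state and tool_changed are all hypothetical. A real document carries a header and many more data items.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed stand-in for an agent "current" response (illustrative only).
SAMPLE = """<MTConnectStreams xmlns="urn:mtconnect.org:MTConnectStreams:2.0">
  <Streams>
    <DeviceStream name="vmc1" uuid="hypothetical-uuid">
      <ComponentStream component="Path" name="path" componentId="p1">
        <Events>
          <Execution dataItemId="exec" sequence="101" timestamp="2024-01-01T00:00:00Z">{execution}</Execution>
          <ToolNumber dataItemId="tool" sequence="102" timestamp="2024-01-01T00:00:00Z">{tool}</ToolNumber>
        </Events>
      </ComponentStream>
    </DeviceStream>
  </Streams>
</MTConnectStreams>"""

def read_state(xml_text):
    """Collect the Execution and ToolNumber observations by local tag name."""
    state = {}
    for el in ET.fromstring(xml_text).iter():
        name = el.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
        if name in ("Execution", "ToolNumber"):
            state[name] = (el.text or "").strip()
    return state

def tool_changed(previous, current):
    """True when both snapshots report a known tool number and the numbers differ."""
    before = previous.get("ToolNumber")
    after = current.get("ToolNumber")
    unknown = (None, "UNAVAILABLE")  # MTConnect reports UNAVAILABLE when it has no value
    return before not in unknown and after not in unknown and before != after

before = read_state(SAMPLE.format(execution="ACTIVE", tool="12"))
after = read_state(SAMPLE.format(execution="ACTIVE", tool="7"))
```

Here tool_changed(before, after) is true and tool_changed(before, before) is false. Matching on the local tag name keeps the sketch independent of which schema version the agent declares in its namespace.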

The agent runs as a Docker image on the same edge tier as the Variscite i.MX 8M Plus carriers from issue 03, or on the Hetzner CX22 VM if the controller is reachable from there. The agent itself can publish to MQTT, which is the integration that matters: the tool-change event lands on the same broker as the Sparkplug B sensor stream. MTConnect does not emit Sparkplug B natively, so the edge republishes the relevant data items as a Sparkplug B context channel under the asset's machine_id, alongside the five process variables the carrier already publishes. The model reads one broker.

When the controller will not talk

Not every controller exposes MTConnect. An older FANUC without an MTConnect adapter still answers FOCAS calls: cnc_rdseqnum returns the sequence number of the executing block, cnc_rdprogline2 returns the block text so the M06 is visible, and the tool offset reads identify the active tool. Siemens SINUMERIK exposes program and tool state over OPC UA, and HEIDENHAIN DNC reports tool changes with the old and new tool number directly. A small adapter on any of these produces the same context channel the MTConnect agent does.
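
Once an adapter has the block text in hand, spotting the tool change is a small parsing job. The sketch below assumes Fanuc-style ISO block text with parenthesized comments. The function name parse_block and the sample blocks are hypothetical, and a real adapter would need to handle the controller's own dialect, including machines that preselect the tool with a T word on an earlier block.

```python
import re

COMMENT = re.compile(r"\([^)]*\)")  # parenthesized operator comments
M06 = re.compile(r"M0*6(?!\d)")     # matches M6 or M06, but not M60 or M61
T_WORD = re.compile(r"T(\d+)")      # tool word, e.g. T12

def parse_block(block_text):
    """Report whether a block commands a tool change and which tool it names."""
    text = COMMENT.sub("", block_text.upper())  # comments may contain stray letters
    tool = T_WORD.search(text)
    return {
        "tool_change": bool(M06.search(text)),
        "tool_number": int(tool.group(1)) if tool else None,
    }

change = parse_block("N120 T12 M06 (DRILL 8MM)")
cutting = parse_block("N130 G01 X10. F200 M60")
```

The negative lookahead is the part that matters: without it, M60 and similar codes would be misread as tool changes and the gate would close on blocks that have nothing to do with the tool changer.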

The machine that exposes nothing, an air-gapped controller with no open interface and no spare I/O point to wire a tool-change relay into, is the only case where the signal-only detector earns its place. On that machine the matched filter is not the inferior choice. It is the available choice.

The build

The context channel adds one Sparkplug B metric per asset, tool_change_active, written 1 while the controller reports a tool change in progress and 0 otherwise, plus tool_number for the loaded tool. The scoring loop from issue 07 gates on it. The change to joint_score.py is small, because the architecture from issue 07 already keys alerts off an adjusted score rather than the raw score.

# joint_score.py, gating addition (deltas to the issue 07 loop)
# A controller-reported tool change zeros the adjusted score and holds
# the gate open for GATE_TAIL seconds after the change clears, to cover
# the spindle ramp back to commanded speed.

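import time  # add to the module imports

GATE_TAIL = 8  # seconds to keep scoring suppressed after tool_change_active -> 0

# Per asset: monotonic time at which the post-change tail expires.
gate_until = {}

def tool_change_state(client, asset):
    """Return True if the latest context sample reports a tool change in progress."""
    q = f'''
      from(bucket: "{BUCKET}") |> range(start: -15s)
        |> filter(fn: (r) => r._measurement == "context" and r.machine_id == "{asset}"
                              and r._field == "tool_change_active")
        |> last()
    '''
    df = client.query_api().query_data_frame(q)
    if df is None or df.empty:
        return False
    return bool(int(df["_value"].iloc[-1]))

def _within_tail(asset):
    """Return True while the tail after the asset's last tool change is still running."""
    return time.monotonic() < gate_until.get(asset, float("-inf"))

# inside score_loop, replacing the adjusted-score assignment:
for a, raw in scores.items():
    others = [scores[o] for o in ASSETS if o != a and o in scores]
    common_mode = any(o >= CM_FACTOR * raw for o in others) and raw > 0
    in_tool_change = tool_change_state(client, a)
    if in_tool_change:
        # Push the expiry forward on every active sample, so the tail is
        # measured from the last moment the controller reported the change.
        gate_until[a] = time.monotonic() + GATE_TAIL
    gated = in_tool_change or _within_tail(a)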
    adjusted = 0.0 if (common_mode or gated) else raw
    p = Point("anomaly_joint").tag("machine_id", a) \
        .field("raw_score", raw) \
        .field("adjusted_score", adjusted) \
        .field("common_mode", int(common_mode)) \
        .field("tool_change_gate", int(gated)) \
        .time(datetime.now(timezone.utc), WritePrecision.NS)
    write_api.write(bucket=BUCKET, org=INFLUX_ORG, record=p)

The raw score is still written, so the Grafana panel shows the tool-change transient in full with a shaded gate band underneath it. The transient is visible for anyone reviewing the data, and the page does not go out. This is the same display choice issue 07 made for common-mode suppression, and for the same reason: suppressing an alert is not the same as hiding the data.

The matched-filter fallback, for the controller that will not talk, is a separate detector that runs only on those assets. A reference tool-change transient is captured once from the drive-current channel, normalized, and correlated against the live stream with scipy.signal.correlate. A correlation peak above a fitted threshold sets the same tool_change_active flag the controller would have set, and the gate logic downstream is identical. The detector does not know it is inferring rather than reading. The rest of the pipeline does not need to.
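
A minimal sketch of that fallback follows, written in pure Python so it runs without dependencies. It normalizes each window as well as the template, which makes it a normalized cross-correlation rather than the plain matched filter, and the current values, the 0.9 threshold, and the function names are placeholders rather than fitted or measured figures. For the raw sliding dot product on a real stream, scipy.signal.correlate(stream, template, mode="valid") does the same job much faster.

```python
import math

def normalize(x):
    """Zero-mean, unit-norm copy of x; all zeros if x is constant."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered] if norm > 0 else [0.0] * len(x)

def match_scores(stream, template):
    """Normalized cross-correlation of the template with each stream window, in [-1, 1]."""
    t = normalize(template)
    n = len(t)
    scores = []
    for k in range(len(stream) - n + 1):
        window = normalize(stream[k:k + n])
        scores.append(sum(a * b for a, b in zip(window, t)))
    return scores

def detect(stream, template, threshold):
    """Return (start_index, score) of the best match if it clears the threshold, else None."""
    scores = match_scores(stream, template)
    best = max(range(len(scores)), key=scores.__getitem__)
    return (best, scores[best]) if scores[best] >= threshold else None

# Hypothetical drive current in amps: steady cut, one tool change, steady cut.
template = [10.0, 6.0, 2.0, 2.0, 6.0, 10.0]
stream = [10.0] * 5 + [10.2, 6.1, 1.9, 2.0, 6.2, 9.9] + [10.0] * 5

hit = detect(stream, template, threshold=0.9)    # placeholder threshold
miss = detect([10.0] * 20, template, threshold=0.9)
```

On this toy stream hit points at index 5, where the transient starts, and miss is None because a flat current has no shape to correlate with. The per-window normalization trades some optimality for amplitude invariance, which may or may not be the right trade on a given spindle.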

Where gating any planned event fails

Gating is a deliberate blind spot, and the honest failure mode is the one common to every planned-event suppression: a real fault that begins inside the gate window is suppressed until the window closes. If a bearing starts to fail in the second the tool changer is cycling, the model scores nothing during the gate or in the GATE_TAIL seconds after it. The fault persists past the tool change, so the next clean scoring cycle catches it, and the latency penalty is the gate duration plus the tail. With a tool change of a few seconds and the eight-second tail from the test cell, that is on the order of ten seconds. The penalty is bounded and small, but it is real, and a plant that changes tools constantly is gating a meaningful fraction of its runtime.
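
The size of that blind spot is simple arithmetic. The sketch below works it through with invented numbers: 40 tool changes an hour and a 5-second change are assumptions chosen for illustration, not measurements from the test cell, and only the 8-second tail comes from the build above. It also assumes gate windows never overlap.

```python
def gated_fraction(changes_per_hour, change_seconds, tail_seconds):
    """Fraction of runtime with scoring suppressed, assuming gates never overlap."""
    gated_seconds = changes_per_hour * (change_seconds + tail_seconds)
    return min(gated_seconds / 3600.0, 1.0)

def worst_case_latency(change_seconds, tail_seconds):
    """Longest a fault that starts as the gate closes can go unscored, in seconds."""
    return change_seconds + tail_seconds

# Hypothetical duty: 40 changes per hour, 5 s per change, 8 s tail (GATE_TAIL).
fraction = gated_fraction(40, 5.0, 8.0)
latency = worst_case_latency(5.0, 8.0)
```

Under those assumptions roughly 14 percent of runtime is gated and the worst-case added latency is 13 seconds, plus however long the scoring loop takes to come around again. Halving the tail on a low-inertia spindle moves the fraction more than shaving the tool change itself would.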

The GATE_TAIL exists because the spindle ramp back to commanded speed is itself a transient the model would flag, so the gate has to outlast the controller's tool-change-complete signal by enough to cover the spindle accel. Eight seconds is tuned to the test cell's spindle. A higher-inertia spindle needs a longer tail, and the tail is the second parameter a different machine adjusts after the matched-filter threshold.

Clock alignment is the quiet failure. The controller's tool-change timestamp and the sensor stream's timestamp come from different clocks, and if they drift the gate opens early or late relative to the transient it is meant to cover. The edge stamps both channels from the same carrier where possible, which removes the skew. Where the MTConnect agent runs on the VM and the sensor on the carrier, the two clocks need NTP discipline, and a drift of more than a second starts to leak transient into the scored window.

The vendor line item, repriced

Tool-change auto-detection was the last open line on the issue 06 quote, where it was bundled into the per-asset annual fee as part of the intelligent event classification the SaaS markets. On the open stack the controller-fed path is free software end to end: the MTConnect agent is open source, it runs in the Docker tier already on the edge, and the gate is a dozen lines added to a loop that already existed. There is no new hardware, because reading the controller's existing tool-change report costs nothing the machine was not already doing.

The only real cost is access. A controller with MTConnect or FOCAS or OPC UA already on the network is a configuration afternoon. A controller that exposes nothing forces the matched-filter fallback, which is more engineering and a worse detector, and that cost is a property of the machine, not of the stack. The vendor faces the same wall: their box cannot read a controller that refuses to be read either, and on those machines they are running a signature detector of their own behind the marketing.

What this stack costs, eight issues in

No hardware was added this issue. The cumulative bill is unchanged from issue 07:

HiveMQ Community Edition, InfluxDB OSS, Grafana OSS, Telegraf, scikit-learn, PyOD, MTConnect agent, Caddy + Let's Encrypt: $0
Hetzner CX22 VM: $5.50/mo
i.MX 8M Plus carrier, spindle A (issue 03): $240 one-time
i.MX 8M Plus carrier, spindle B (issue 07): $240 one-time
Pushover device fee: $5 once

Total monthly recurring: $5.50. Total one-time across the two-asset test deployment: $485.

All four model features from the issue 06 vendor comparison are now built on that bill. The vendor quote, at two assets, remains $2,400 per year recurring with carrier hardware not included.

The reader retrofit, still open

Issue 07 said the 30-day reader retrofit case study would land here if the window had closed. It has not. As of issue 07 the reader's issue 03 carrier install was on day 9 of the planned run, which puts the run near its midpoint now, not its end. Publishing a half-finished window as a case study would misrepresent it. The full retrofit, with the reader's photographs and the complete comparison against the plant's existing condition-monitoring contract, publishes when the 30 days close and not before.

What lands in issue 09

The series has built every model feature the issue 06 quote charged for. Issue 09 turns from the model to the thing the model has been writing to since issue 05: retention and downsampling on InfluxDB OSS, and what happens to a year of full-rate vibration data on a 40 GB disk. The open question the series has deferred five times is finally forced, because the test cell's bucket is filling. Plus the completed reader retrofit if the window has closed.
