Cross-asset correlation on the open stack, and what it actually buys: fewer false alarms, not more catches

What issue 06 left unbuilt

Issue 06 put a single-asset Isolation Forest on the $5.50/mo Hetzner VM that already ran the Grafana, InfluxDB OSS, and Telegraf stack from issue 05. The model scored one spindle, wrote the anomaly score back to InfluxDB as its own measurement, and Grafana rendered it next to the ISO 10816 threshold alerts.

The same issue priced a vendor condition-monitoring quote against the open implementation, line by line. Three of the four claimed model features mapped onto something the open stack already did or could do cheaply. One did not: cross-asset correlation. The quote lists it inside a package described as "ensemble-learning anomaly detection with cross-asset correlation," and issue 06 estimated the open-stack build at roughly one engineer-week without committing to what the feature would actually deliver.

This issue builds it, and the build changes the editorial read on what the feature is.

The wrong mental model, and the right one

The marketing framing of cross-asset correlation implies a wider net: a model that watches many machines and catches faults that single-asset monitoring misses because the fault only shows up as a relationship between two assets. That class of fault exists. A shared lubrication loop, a common power bus, a mechanical coupling between two machines on one line can produce a fault that lives in the correlation and nowhere else. It is also rare on a plant floor where assets are mechanically independent, which describes most discrete-manufacturing cells.

The far more common cross-asset signal is the opposite of a fault. It is the plant-wide disturbance that hits every asset at once and means nothing about any single machine. The two environmental false positives that issue 06 documented were both of this kind. A cooling-system cycle on an adjacent machine showed up on the spindle's vibration sensor as a low-frequency floor vibration. A building-wide event moved both the monitored spindle and its neighbor together. The single-asset model saw an anomaly because, from inside one asset's feature space, the disturbance is genuinely anomalous. It had no way to know the same disturbance hit the asset next to it at the same instant.

Cross-asset correlation, built correctly, is a common-mode rejection filter. It is the same principle a differential amplifier uses to reject noise present on both inputs. When both assets' scores spike together inside a short window, the joint model treats the excursion as environmental and suppresses the alert. When one asset's score spikes and its neighbor's does not, the excursion survives as asset-specific and the alert fires. The feature's value is fewer false alarms rather than more catches. That is worth building, but it is not the capability the quote describes when it sells the feature.

The second asset

The test cell got a second Variscite i.MX 8M Plus carrier on a second spindle, identical to the issue 03 build, publishing the same five Sparkplug B variables to the same broker under a distinct machine_id tag. Telegraf parses both streams without change. InfluxDB stores both under the spindle measurement, separated by tag. Nothing in the pipeline below the model layer needed modification, which is the payoff of the Sparkplug B topic namespace doing the addressing.

The second spindle is a real machine on the same line as the first, roughly four meters away, on the same power distribution panel and the same building HVAC zone. That physical relationship is what makes it useful as a common-mode reference. A second asset on a different floor of a different building would share no environmental disturbances and would tell the model nothing about the first asset's false positives.

The joint model

The model layer moves from scikit-learn's single Isolation Forest to PyOD, the Python Outlier Detection library, which provides a uniform interface across detectors and a documented way to combine them. The detector is ECOD (Empirical-Cumulative-distribution-based Outlier Detection), chosen because it is parameter-free, deterministic, and fast enough to retrain nightly on the joint window without tuning a contamination rate per asset.
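
To make the detector's score concrete, here is a simplified sketch of the idea behind ECOD: per-feature empirical tail probabilities, turned into negative logs and summed. It is not PyOD's implementation, which also uses each feature's skewness to pick a tail. The function names and the floor value are assumptions made for illustration.

```python
import math
from bisect import bisect_left, bisect_right

def tail_probs(train_col, x):
    """Empirical left- and right-tail probabilities of x within one training feature."""
    s = sorted(train_col)
    n = len(s)
    left = bisect_right(s, x) / n          # fraction of training values <= x
    right = (n - bisect_left(s, x)) / n    # fraction of training values >= x
    floor = 1.0 / (n + 1)                  # assumed floor so log(0) cannot occur
    return max(left, floor), max(right, floor)

def ecdf_score(train_rows, row):
    """Simplified ECOD-style score: the larger of the summed left and right -log tails."""
    cols = list(zip(*train_rows))
    left_sum = right_sum = 0.0
    for col, x in zip(cols, row):
        left, right = tail_probs(col, x)
        left_sum += -math.log(left)
        right_sum += -math.log(right)
    return max(left_sum, right_sum)

# Hypothetical two-feature training window: 100 evenly spaced samples.
train = [(float(i), float(i)) for i in range(100)]
typical = ecdf_score(train, (50.0, 50.0))
extreme = ecdf_score(train, (1000.0, 1000.0))
```

A score built this way is a sum of negative logs, so it grows with the number of features and has no fixed upper bound. The sketch makes no assumption about how scores are scaled before they reach an alert line.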

The key design choice is where the comparison happens. Rather than stacking both assets into one feature frame and fitting a single joint model, the implementation scores each asset against its own baseline and lets a separate common-mode gate check whether the two scores move together. It keeps two per-asset ECOD models and one correlation gate:

# /opt/anomaly/joint_score.py
# Per-asset ECOD models plus a common-mode rejection gate.

import os, time
from datetime import datetime, timezone
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS
from pyod.models.ecod import ECOD
import joblib
import pandas as pd

INFLUX_URL = os.environ["INFLUX_URL"]
INFLUX_TOKEN = os.environ["INFLUX_TOKEN"]
INFLUX_ORG = os.environ["INFLUX_ORG"]
BUCKET = "telemetry"
FEATURES = ["rms_velocity", "drive_current", "spindle_speed", "bearing_temp", "envelope_band_3"]
ASSETS = ["spindle_a", "spindle_b"]
MODEL_DIR = "/opt/anomaly"
CM_WINDOW = 15          # seconds: both assets must spike inside this window to be common-mode
CM_FACTOR = 0.7         # if neighbor score is at least this fraction of self score, treat as common-mode

def train(client):
    for asset in ASSETS:
        q = f'''
          from(bucket: "{BUCKET}") |> range(start: -14d)
            |> filter(fn: (r) => r._measurement == "spindle" and r.machine_id == "{asset}")
            |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
        '''
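        df = client.query_api().query_data_frame(q)
        if isinstance(df, list):  # the client returns a list when result tables differ in schema
            df = pd.concat(df, ignore_index=True)
        df = df.dropna(subset=FEATURES)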
        model = ECOD()
        model.fit(df[FEATURES])
        joblib.dump(model, f"{MODEL_DIR}/ecod_{asset}.joblib")

def latest(client, asset):
    q = f'''
      from(bucket: "{BUCKET}") |> range(start: -{CM_WINDOW}s)
        |> filter(fn: (r) => r._measurement == "spindle" and r.machine_id == "{asset}")
        |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
    '''
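    df = client.query_api().query_data_frame(q)
    if isinstance(df, list):  # the client returns a list when result tables differ in schema
        df = pd.concat(df, ignore_index=True) if df else None
    if df is None or df.empty:
        return None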
    df = df.dropna(subset=FEATURES)
    return None if df.empty else df.iloc[[-1]]

def score_loop(client, models):
    write_api = client.write_api(write_options=SYNCHRONOUS)
    while True:
        frames = {a: latest(client, a) for a in ASSETS}
        scores = {}
        for a in ASSETS:
            if frames[a] is None:
                continue
            scores[a] = float(models[a].decision_function(frames[a][FEATURES])[0])
        for a, raw in scores.items():
            others = [scores[o] for o in ASSETS if o != a and o in scores]
            common_mode = any(o >= CM_FACTOR * raw for o in others) and raw > 0
            adjusted = 0.0 if common_mode else raw
            p = Point("anomaly_joint").tag("machine_id", a) \
                .field("raw_score", raw) \
                .field("adjusted_score", adjusted) \
                .field("common_mode", int(common_mode)) \
                .time(datetime.now(timezone.utc), WritePrecision.NS)
            write_api.write(bucket=BUCKET, org=INFLUX_ORG, record=p)
        time.sleep(10)

if __name__ == "__main__":
    client = InfluxDBClient(url=INFLUX_URL, token=INFLUX_TOKEN, org=INFLUX_ORG)
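    # Training runs only at process start, when a model file is missing or older than 24 h.
    # A nightly retrain therefore depends on the process being restarted at least daily.
    need_train = not all(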
        os.path.exists(f"{MODEL_DIR}/ecod_{a}.joblib") and
        (time.time() - os.path.getmtime(f"{MODEL_DIR}/ecod_{a}.joblib")) < 86400
        for a in ASSETS
    )
    if need_train:
        train(client)
    models = {a: joblib.load(f"{MODEL_DIR}/ecod_{a}.joblib") for a in ASSETS}
    score_loop(client, models)

The alert path now keys on adjusted_score rather than the raw score. The raw score is still written, so the Grafana panel can show both the per-asset anomaly and the moments the common-mode gate suppressed an alert. The gate is deliberately simple: a score that the neighbor matches at a ratio of 0.7 or higher inside the same scoring cycle is zeroed. The 0.7 ratio is a starting point, tuned on the test cell, and is the first parameter a different plant would adjust.

Memory footprint of the joint process on the Hetzner CX22 instance: 168 MB, up from the 142 MB the single model used. CPU under load stays under 6% of one of the two vCPUs. The second carrier adds nothing to the VM bill.

What the joint model did to the issue 06 false positives

The two environmental false positives from issue 06 were re-created on the two-asset cell.

The cooling-system cycle was reproduced by running the adjacent machine's cooling pump through a duty cycle while both spindles ran nominal. In the single-asset configuration from issue 06, the monitored spindle's score crossed the 0.65 alert line and, because the excursion lasted over the 30-second persistence window, fired an alert. On the two-asset cell, both spindles registered the floor-vibration bleed-through within the same scoring cycle, the neighbor's raw score sat at 0.81 of the monitored spindle's, the common-mode gate engaged, the adjusted score went to zero, and no alert fired. The Grafana panel shows the raw-score excursion and a shaded common-mode band underneath it, which is the right outcome: the signal is visible for review, the page does not go out.

The building-wide event, reproduced as a forklift impact transmitted through the slab, produced the same result. Both assets spiked together, the gate suppressed both, no alert fired.

A single-asset fault was then injected on spindle A alone, the subharmonic sideband from issue 06's Fault A. Spindle A's raw score rose to 0.71. Spindle B stayed at its 0.40 baseline. The neighbor ratio was 0.56, below the 0.7 common-mode threshold, the gate did not engage, the adjusted score carried the full 0.71, and the alert fired 14 seconds into the injection, the same latency the single-asset model produced. Cross-asset correlation cost the genuine fault nothing and removed both false positives.
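
The gate decision in these runs can be replayed as a pure function. This is a minimal sketch that mirrors the comparison in the script's scoring loop. The function name is an assumption, and the 0.70 self score used for the cooling cycle is hypothetical, since only the 0.81 ratio was reported.

```python
CM_FACTOR = 0.7  # same ratio as the scoring script

def gate(self_score, neighbor_scores, factor=CM_FACTOR):
    """Return (common_mode, adjusted_score) for one asset in one scoring cycle."""
    common_mode = self_score > 0 and any(
        n >= factor * self_score for n in neighbor_scores
    )
    return common_mode, (0.0 if common_mode else self_score)

# Injected fault: spindle A at 0.71, spindle B at its 0.40 baseline.
fault = gate(0.71, [0.40])

# Cooling cycle: neighbor at 0.81 of the monitored score (self score is hypothetical).
cooling = gate(0.70, [0.81 * 0.70])

fault_ratio = round(0.40 / 0.71, 2)
print(fault, cooling, fault_ratio)
```

The fault case keeps its full score because 0.40 is below 0.7 × 0.71, while the cooling case is zeroed because the neighbor clears the 0.7 ratio.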

Where the common-mode gate fails

The gate is not free of failure modes, and the honest ones are worth naming.

A genuine fault that happens to coincide with a plant-wide disturbance gets suppressed. If a real bearing fault on spindle A begins in the same 15-second window as a forklift impact that moves both assets, the gate reads the coincidence as common-mode and zeroes the real fault's score for that cycle. The probability is low and the fault persists past the disturbance, so the next cycle catches it, but the latency penalty is real during the overlap.
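
The overlap penalty can be sketched cycle by cycle. The scores below are illustrative rather than measured, apart from the 0.71 and 0.40 pair, and the sketch assumes the 0.65 alert line from issue 06 applies to adjusted_score.

```python
CM_FACTOR = 0.7
ALERT_LINE = 0.65  # assumed to carry over from the issue 06 single-asset alert

def adjusted(self_score, neighbor_score, factor=CM_FACTOR):
    """Zero the score when the neighbor mirrors it at or above the gate ratio."""
    common = self_score > 0 and neighbor_score >= factor * self_score
    return 0.0 if common else self_score

# (spindle_a, spindle_b) raw scores per scoring cycle; values are illustrative.
cycles = [
    (0.90, 0.80),  # fault on A begins during a slab impact that also moves B
    (0.71, 0.40),  # disturbance has passed, fault on A persists
    (0.71, 0.40),
]

# Index of the first cycle whose adjusted score for spindle A crosses the alert line.
first_alert = next(
    (i for i, (a, b) in enumerate(cycles) if adjusted(a, b) > ALERT_LINE),
    None,
)
```

In this toy run the first cycle is suppressed and the alert lands one cycle later, which is the latency cost described above.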

A correlated fault, the rare case the marketing emphasizes, is actively suppressed by this gate rather than caught. If two mechanically coupled assets fail together because of the shared coupling, the gate reads the joint movement as environmental and suppresses it. A plant with genuinely coupled assets needs the opposite logic, a model that scores the relationship and alerts when the assets move together in a way they normally do not. That is a different build, PyOD's feature-bagging or a copula-based detector on a stacked two-asset frame, and it is out of scope here because the test cell's assets are mechanically independent. A plant should pick the gate that matches its physical reality, and the two are not interchangeable.

The 0.7 ratio and the 15-second window are tuned on one cell. A plant with more environmental cross-talk would lower the ratio and widen the window, buying more false-positive suppression at the cost of a higher risk of suppressing a coincident real fault. There is no universal setting.
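
A small sweep shows the trade-off on the two neighbor ratios observed on the test cell. The alternative factors 0.5 and 0.9 are hypothetical settings chosen for illustration, not values tested on the cell.

```python
# Neighbor-to-self score ratios reported for the two-asset runs.
observed = {"cooling_cycle": 0.81, "injected_fault": 0.56}

def suppressed(neighbor_ratio, factor):
    """The gate engages when the neighbor ratio reaches the configured factor."""
    return neighbor_ratio >= factor

# For each candidate factor, record which events the gate would have zeroed.
table = {
    factor: {name: suppressed(ratio, factor) for name, ratio in observed.items()}
    for factor in (0.5, 0.7, 0.9)
}

for factor, row in table.items():
    print(factor, row)
```

At 0.5 the gate would also have zeroed the injected fault, and at 0.9 it would have let the cooling-cycle false positive through. Only the middle setting separates the two on this cell.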

The vendor line item, repriced

Issue 06 estimated cross-asset correlation at roughly one engineer-week and left the deliverable undefined. With the build done, the estimate holds and the deliverable is now specific. The per-asset ECOD models and the common-mode gate are under 90 lines of Python. The engineering time went almost entirely to deciding what the feature should do, not to writing it, which is the usual shape of this kind of work.

The second carrier is a $240 one-time cost, but it is not a cost of the correlation feature. The second asset needed monitoring regardless. The correlation capability is a free byproduct of monitoring two assets that share an environment, and it costs only the configuration time to set the gate. The vendor quote billed cross-asset correlation as a recurring premium inside the per-asset annual fee. On the open stack it is the cheapest feature in the series so far, because the data it needs is already being collected for another reason.

What this stack costs at month five

The cumulative bill, five issues in, on the same Hetzner CX22 instance:

HiveMQ Community Edition, InfluxDB OSS, Grafana OSS, Telegraf, scikit-learn, PyOD, Caddy + Let's Encrypt: $0
Hetzner VM: $5.50/mo
i.MX 8M Plus carrier, spindle A (issue 03): $240 one-time
i.MX 8M Plus carrier, spindle B (this issue): $240 one-time
Pushover device fee: $5 once

Total monthly recurring: $5.50. Total one-time across the two-asset test deployment: $485.

The vendor quote, at two assets, is $2,400 per year recurring, carrier hardware not included.
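
The annualized comparison is simple arithmetic on the figures above. This sketch assumes twelve months at the current VM price and leaves out engineering time, which the bill above does not price.

```python
# Figures taken from the bill above, in US dollars.
monthly_recurring = 5.50
one_time = {"carrier_a": 240, "carrier_b": 240, "pushover": 5}
vendor_annual = 2400  # two assets, carrier hardware not included

one_time_total = sum(one_time.values())
open_annual = monthly_recurring * 12            # recurring cost per year
open_first_year = open_annual + one_time_total  # year one including hardware

print(one_time_total, open_annual, open_first_year, vendor_annual)
```

On these numbers the open stack's first year, hardware included, comes to $551 against the vendor's $2,400 recurring, and later years drop to $66.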

The reader retrofit case study

Issue 06 told readers the 30-day retrofit case study would land here. The retrofit, a reader running the issue 03 carrier and the issue 04 broker on a production spindle, is mid-window. The reader installed the carrier on day 9 of the planned 30-day run and the data collection is not complete. Publishing a partial run with the first nine days of telemetry would misrepresent it as a finished case study, so the full retrofit, with the reader's photographs and the complete 30-day comparison against the plant's existing condition-monitoring contract, publishes when the window closes. The intermediate state is not interesting enough to summarize and not complete enough to conclude from.

What lands in issue 08

Issue 08 builds the tool-change auto-detection line item, the last unbuilt feature from the issue 06 vendor comparison. It is a signature detector that matches the asset stream against the CNC's M-code feed, so the model stops scoring during a known tool change instead of flagging it, as it did in issue 06 with the false positive that turned out to be a real, miscategorized event. The completed reader retrofit follows if the 30-day window has closed by then.
