Where issue 05 left off
Issue 05 put Grafana OSS, InfluxDB OSS 2.7, and Telegraf on the $5.50/mo Hetzner VM that already ran the HiveMQ broker from issue 04. Five spindle variables from the Variscite i.MX 8M Plus carrier of issue 03 land in the broker as Sparkplug B frames, Telegraf parses them, InfluxDB stores them with 90-day retention, Grafana renders two dashboards and fires three threshold alerts.
The threshold alerts work. They also leave the question that every plant engineer asks of every observability stack: what about the failures that do not cross a fixed line, but show up as a pattern shift?
This issue puts a model layer on top. Not the Transformer-class IIoT model that the trade press writes about. A nearly two-decade-old, single-file, scikit-learn Isolation Forest, trained on the same data the rest of the stack already collects, deployed on the same VM, and rendered on the same Grafana dashboard.
The model choice
Isolation Forest, introduced by Liu, Ting, and Zhou in 2008, is an unsupervised anomaly-detection algorithm that builds an ensemble of randomly partitioned trees and assigns each observation an anomaly score based on the average path length required to isolate it. Short paths indicate outliers. The algorithm trains and scores fast, requires no labeled data, and tolerates the moderate dimensionality of plant-floor telemetry without complaint.
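To make the path-length idea concrete, here is a toy, standard-library-only sketch of the scoring rule. It is not scikit-learn's implementation, and the data, tree count, and sample size are made up for illustration. Each tree splits on a random feature at a random cut point, and the score is 2 raised to the negative mean path length, normalized by the expected path length c(n) for a sample of size n.

```python
import math
import random

def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Split on a random feature at a random cut until isolated or depth-limited."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    cut = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return {"dim": dim, "cut": cut,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, x, depth=0):
    """Depth at which x lands in a leaf, plus a correction for unsplit leaf points."""
    if "size" in tree:
        return depth + c_factor(tree["size"])
    child = tree["left"] if x[tree["dim"]] < tree["cut"] else tree["right"]
    return path_length(child, x, depth + 1)

def anomaly_score(forest, x, sample_size):
    """Score in (0, 1); shorter average paths push the score toward 1."""
    mean_depth = sum(path_length(t, x) for t in forest) / len(forest)
    return 2.0 ** (-mean_depth / c_factor(sample_size))

rng = random.Random(42)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(512)]  # synthetic "healthy" data
sample_size = 128
max_depth = math.ceil(math.log2(sample_size))
forest = [build_tree(rng.sample(normal, sample_size), 0, max_depth, rng)
          for _ in range(100)]

inlier_score = anomaly_score(forest, (0.0, 0.0), sample_size)
outlier_score = anomaly_score(forest, (6.0, 6.0), sample_size)
print(f"inlier  {inlier_score:.2f}")
print(f"outlier {outlier_score:.2f}")
```

The point far from the cloud reaches a leaf in fewer splits, so its score is higher. The negated `score_samples` value used later in this issue follows the same convention: higher means more anomalous.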
The reasons to pick it over the alternatives on a plant-floor budget:
- Against One-Class SVM: trains in seconds rather than minutes on the 14-day window. Cost-relevant when nightly retraining is part of the loop.
- Against Autoencoder-based anomaly detection: no GPU, no TensorFlow install, no version-pinning fights. Runs inside the existing scikit-learn footprint.
- Against LSTM-based sequence models: orders of magnitude less data needed to converge. A 14-day window is enough for a spindle.
- Against the vendor "AI machine health" black box: the model is in 80 lines of Python on disk. The plant engineer can read the code, change the contamination parameter, and retrain the model in 6 seconds without a support ticket.
The honest case against Isolation Forest, also worth naming: it scores each reading on its own and does not respect time order, so it cannot tell a sustained slow drift from an isolated noisy reading, and a drift that unfolds over weeks is gradually absorbed into the rolling training window. For drift, a complementary seasonal-decomposition layer (scoring STL residuals, for example) is the right answer. Out of scope for issue 06, on the list for a later one.
The implementation
The training and scoring loop is a single Python file. Pulled from the test deployment, with site-identifying tags renamed:
# /opt/anomaly/score.py
# Trains nightly, scores live frames every 10 seconds.
import os
import time

import joblib
import pandas as pd
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS
from sklearn.ensemble import IsolationForest

INFLUX_URL = os.environ["INFLUX_URL"]
INFLUX_TOKEN = os.environ["INFLUX_TOKEN"]
INFLUX_ORG = os.environ["INFLUX_ORG"]
BUCKET = "telemetry"
MODEL_PATH = "/opt/anomaly/model.joblib"
FEATURES = ["rms_velocity", "drive_current", "spindle_speed", "bearing_temp", "envelope_band_3"]
ASSET_TAG = "machine_id"


def query_spindle(client, start):
    """Fetch spindle frames since `start`, one row per timestamp, one column per field."""
    query = f'''
    from(bucket: "{BUCKET}")
      |> range(start: {start})
      |> filter(fn: (r) => r._measurement == "spindle")
      |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
    '''
    df = client.query_api().query_data_frame(query)
    # query_data_frame returns a list of frames when result tables differ in schema.
    if isinstance(df, list):
        df = pd.concat(df, ignore_index=True)
    return df


def train_model(client):
    """Fit a fresh forest on the trailing 14 days and persist it."""
    df = query_spindle(client, "-14d")
    df = df.dropna(subset=FEATURES)
    model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42, n_jobs=2)
    model.fit(df[FEATURES])
    joblib.dump(model, MODEL_PATH)
    return model


def score_loop(client, model):
    """Score the most recent frames every 10 seconds and write the result back."""
    write_api = client.write_api(write_options=SYNCHRONOUS)
    while True:
        # The 15 s window overlaps the 10 s cadence on purpose; a re-scored frame
        # carries the same tags and timestamp, so InfluxDB overwrites it.
        df = query_spindle(client, "-15s")
        if df is None or df.empty:
            time.sleep(10)
            continue
        df = df.dropna(subset=FEATURES)
        if df.empty:
            time.sleep(10)
            continue
        # score_samples is lower for anomalies; negate so that higher means more anomalous.
        scores = model.score_samples(df[FEATURES])
        points = []
        # Pair scores with rows by position: after dropna the index labels
        # no longer line up with positions in the scores array.
        for score, (_, row) in zip(scores, df.iterrows()):
            points.append(
                Point("anomaly")
                .tag(ASSET_TAG, row.get(ASSET_TAG, "unknown"))
                .field("score", float(-score))
                .time(row["_time"], WritePrecision.NS)
            )
        write_api.write(bucket=BUCKET, org=INFLUX_ORG, record=points)
        time.sleep(10)


if __name__ == "__main__":
    client = InfluxDBClient(url=INFLUX_URL, token=INFLUX_TOKEN, org=INFLUX_ORG)
    # Reuse the saved model if it is less than a day old; otherwise train now.
    if os.path.exists(MODEL_PATH) and (time.time() - os.path.getmtime(MODEL_PATH)) < 86400:
        model = joblib.load(MODEL_PATH)
    else:
        model = train_model(client)
    score_loop(client, model)
A systemd unit runs the script as a long-lived service. A separate systemd timer calls the same script with a --retrain flag at 02:00 UTC nightly. The training pulls the prior 14 days, fits a fresh 200-tree forest at contamination 0.01, and saves the joblib artifact. The scoring loop reloads the artifact on the next iteration. Neither the --retrain handling nor the reload step appears in the excerpt above.
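One way the retrain flag and the reload could be wired is sketched below. This is an assumption about the shape, not the code from the test deployment: `parse_args` and `ReloadingModel` are hypothetical names, and the loader is passed in so the sketch runs without joblib. In the real script the loader would be `joblib.load`.

```python
import argparse
import os
import tempfile

def parse_args(argv=None):
    """Hypothetical CLI: --retrain fits a fresh model, saves it, and exits."""
    parser = argparse.ArgumentParser(description="Spindle anomaly scorer")
    parser.add_argument("--retrain", action="store_true",
                        help="fit a fresh model, save it, and exit")
    return parser.parse_args(argv)

class ReloadingModel:
    """Hands back the cached model, re-reading the artifact when its mtime changes."""

    def __init__(self, path, loader):
        self.path = path
        self.loader = loader   # e.g. joblib.load in the real script
        self.mtime = None
        self.model = None

    def get(self):
        mtime = os.path.getmtime(self.path)
        if mtime != self.mtime:
            self.model = self.loader(self.path)
            self.mtime = mtime
        return self.model

def read_text(path):
    """Stand-in loader so the demo needs no model file."""
    with open(path) as handle:
        return handle.read()

# Demo with a text file standing in for model.joblib.
with tempfile.TemporaryDirectory() as tmp:
    artifact = os.path.join(tmp, "model.txt")
    with open(artifact, "w") as handle:
        handle.write("v1")
    holder = ReloadingModel(artifact, read_text)
    first = holder.get()
    with open(artifact, "w") as handle:
        handle.write("v2")
    # Force a visibly newer mtime, as a nightly retrain would produce.
    os.utime(artifact, (0, os.path.getmtime(artifact) + 60))
    second = holder.get()

print(first, second)
```

Inside the scoring loop, calling `holder.get()` at the top of each iteration would pick up the nightly artifact without a service restart. If the retrain job writes to a temporary file and then swaps it in with `os.replace`, the scorer never reads a half-written artifact.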
Memory footprint of the running process on the Hetzner CX22 instance: 142 MB. CPU under load: under 4% of one of the two vCPUs. Disk for the model artifact: 1.8 MB.
What the model catches that the thresholds do not
The 72-hour test window from issue 05 was re-run with the model layer in place. Two intentional fault injections were used, both representative of failure patterns the threshold rules from issue 05 cannot reach.
Fault A: subharmonic at 0.4x running speed, injected as a 6 mm/s sideband on a 30-Hz spindle for 90 seconds. The RMS velocity stays at 5.8 mm/s, under the 7.1 mm/s ISO 10816 Class II alarm threshold. Drive current shifts by 2.1% of nameplate, under the 10% threshold. Spindle speed and bearing temp are nominal. All three threshold alerts are silent.
The Isolation Forest score during the injection window rose from a baseline mean of 0.42 to a peak of 0.71. The score crosses the 0.65 line 12 seconds into the injection, and the score-above-0.65-for-30-seconds alert fires once that persistence window is met. The subharmonic is a recognized precursor of bearing cage fault and looseness, and is the kind of pattern shift that a single-channel threshold rule cannot reach without a custom rule per channel per failure mode.
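The persistence rule is configured in Grafana in the test deployment. Its logic can still be sketched in a few lines of Python, which is useful for replaying recorded scores against a candidate threshold. The function name and the sample values below are made up for illustration; only the 0.65 line and the 30-second hold come from the text.

```python
def first_alert_time(samples, threshold=0.65, hold_seconds=30):
    """Return the first time the score has stayed above threshold for hold_seconds.

    samples: iterable of (t_seconds, score) pairs in time order.
    Returns None if the rule never fires.
    """
    above_since = None
    for t, score in samples:
        if score > threshold:
            if above_since is None:
                above_since = t          # start of the current excursion
            if t - above_since >= hold_seconds:
                return t
        else:
            above_since = None           # excursion ended, reset the clock
    return None

# One sample per 10 seconds, matching the scoring cadence.
short_blip = [(0, 0.42), (10, 0.70), (20, 0.68), (30, 0.41)]
sustained = [(0, 0.42), (10, 0.66), (20, 0.70), (30, 0.71), (40, 0.69), (50, 0.45)]

print(first_alert_time(short_blip))   # None
print(first_alert_time(sustained))    # 40
```

The short excursion resets before the hold elapses, so it never pages anyone. The sustained one fires 30 seconds after the score first crosses the line.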
Fault B: drive-current ramp without RMS velocity change, injected as a 3.5% drive-current creep over 40 minutes on an otherwise nominal spindle. Drive current peaks at 109.4% of nameplate, below the 110% threshold. The threshold rule does not fire.
The model score climbs from 0.40 to 0.58 over the same 40-minute window. The 0.65 alert threshold is not crossed. The rise is plainly visible on the Grafana time-series panel, but the alert path stays silent. This is the failure mode of the model layer that practitioners report most often. The fix is not a different model, it is a trend-based alerting rule on the score itself: alert on a 30-minute moving average of the score crossing a band. With that rule added, the model layer catches Fault B at minute 27.
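A minimal sketch of that trend rule follows, replayed against a synthetic ramp shaped like Fault B. The band value of 0.45 and the class name are assumptions for illustration, since the text does not state the band used. The sketch is not meant to reproduce the minute-27 result, only to show the mechanism: the moving average crosses its band while the raw score is still well below 0.65.

```python
from collections import deque

class MovingAverageAlert:
    """Time-windowed moving average of the score with a fixed alert band."""

    def __init__(self, window_seconds=1800, band=0.45):
        self.window_seconds = window_seconds
        self.band = band               # hypothetical band, tune per asset
        self.samples = deque()         # (t_seconds, score) pairs inside the window
        self.total = 0.0

    def update(self, t, score):
        """Add a sample, evict expired ones, return (mean, is_firing)."""
        self.samples.append((t, score))
        self.total += score
        while self.samples[0][0] <= t - self.window_seconds:
            _, old = self.samples.popleft()
            self.total -= old
        mean = self.total / len(self.samples)
        return mean, mean > self.band

alert = MovingAverageAlert()
fired_at = None
score_at_fire = None
for step in range(70 * 6 + 1):                      # 70 minutes, one sample per 10 s
    t = step * 10
    minutes_into_ramp = max(0.0, t / 60 - 30)       # 30 min of baseline, then the ramp
    score = 0.40 + 0.18 * min(minutes_into_ramp, 40) / 40
    mean, firing = alert.update(t, score)
    if firing and fired_at is None:
        fired_at, score_at_fire = t, score

print(f"trend alert fired {fired_at / 60 - 30:.1f} min into the ramp, "
      f"raw score {score_at_fire:.2f}")
```

The trade-off is latency. A longer window suppresses more noise but reacts later, so the window and band are worth tuning against recorded data for each asset class.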
False positives, the part the trade press skips
The same 72-hour window, with no fault injection, produced six score excursions above the 0.65 line. Four were under 20 seconds and did not trip the 30-second persistence requirement, so no alert fired. Two were over 30 seconds and tripped the alert.
Of those two:
- One was a true tool-change event that the asset metadata had not flagged. The spindle decelerated, the cutting load dropped to zero, the bearing temp drifted three degrees. Real signal, wrong category. The fix is a tag on the asset stream indicating "tool change in progress," and a Telegraf processor that keeps frames inside that tag window out of scoring.
- One was a cooling-system cycle on the adjacent machine that the vibration sensor picked up as a low-frequency floor vibration. Real signal, wrong asset. The fix is a high-pass filter at the carrier, before the data reaches the broker.
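For the second fix, the simplest form of a high-pass filter is the first-order discrete RC filter sketched below. The sample rate, cutoff, and test tones are assumptions for illustration. The right cutoff depends on the floor-vibration frequency actually measured on site, and a production carrier would more likely use a higher-order filter.

```python
import math

def high_pass(samples, sample_rate_hz, cutoff_hz):
    """First-order RC high-pass: y[i] = alpha * (y[i-1] + x[i] - x[i-1])."""
    if not samples:
        return []
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + dt)
    out = [0.0]
    for i in range(1, len(samples)):
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

def tone(freq_hz, sample_rate_hz, seconds):
    """Unit-amplitude sine wave, used here as a synthetic test signal."""
    n = int(sample_rate_hz * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]

RATE = 1000      # Hz, hypothetical sensor sample rate
CUTOFF = 10      # Hz, hypothetical cutoff

floor = tone(2, RATE, 2.0)       # low-frequency floor vibration from a neighbor
spindle = tone(100, RATE, 2.0)   # content in the band of interest

floor_gain = rms(high_pass(floor, RATE, CUTOFF)) / rms(floor)
spindle_gain = rms(high_pass(spindle, RATE, CUTOFF)) / rms(spindle)
print(f"2 Hz gain   {floor_gain:.2f}")
print(f"100 Hz gain {spindle_gain:.2f}")
```

A first-order filter rolls off gently, so the low-frequency tone is attenuated rather than removed. Whether that is enough depends on how strong the neighbor's vibration is relative to the spindle signal.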
Both false positives are environmental, not algorithmic. The vendor SaaS that quoted $1,200 per asset per year claims a "tool change auto-detection" feature and a "cross-asset interference filter" in its marketing. Those features are not free to build. They are a real expense column on the open stack, paid in Telegraf processor plugin configuration time and floor-walking time, not in dollars per asset per year.
The vendor 'AI/ML' line item, priced honestly
The same vendor quote from issue 05 itemized the "AI-driven machine health" line at $480 per asset per year, inside the $1,200 total. The quote describes the line as "ensemble-learning anomaly detection with cross-asset correlation, automated tool-change detection, and adaptive thresholds."
The open implementation in this issue:
- Ensemble anomaly detection: the Isolation Forest is, by construction, an ensemble of 200 trees. Built.
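- Adaptive thresholds: nightly retraining on the rolling 14-day window re-fits the forest, so the baseline that scores are measured against tracks gradual load shifts. The contamination parameter itself stays at 0.01; what adapts is the data the forest is fit on. Built.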
- Cross-asset correlation: not built. Would require either a multivariate model spanning assets or a graph-aware approach. Realistic build cost on the same VM: one engineer-week.
- Automated tool-change detection: not built. Pattern recognition on the asset stream against a tool-change signature. Realistic build cost: two engineer-days, plus integration with the CNC's M-code stream.
At a fully loaded internal engineering rate of $200 per hour, one engineer-week plus two engineer-days (56 hours, counting a 40-hour week) is approximately $11,200 of one-time build, spread across a fleet of assets. The vendor's $480 per asset per year, multiplied across a fleet of 100 assets, is $48,000 per year recurring. Break-even on the build versus the SaaS subscription is roughly 85 days.
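The break-even arithmetic is easy to rerun with local numbers. The sketch below assumes an 8-hour day and a 5-day engineer-week, and the function names are illustrative. It covers the build cost only and leaves out ongoing maintenance time on the open stack.

```python
def build_cost(engineer_days, hours_per_day=8, rate_per_hour=200):
    """One-time build cost in dollars."""
    return engineer_days * hours_per_day * rate_per_hour

def break_even_days(one_time_cost, saas_per_asset_year, assets):
    """Days of SaaS subscription that cost as much as the one-time build."""
    return one_time_cost / (saas_per_asset_year * assets) * 365

# One engineer-week (5 days) for cross-asset correlation,
# plus two engineer-days for tool-change detection.
cost = build_cost(engineer_days=5 + 2)
days = break_even_days(cost, saas_per_asset_year=480, assets=100)

print(f"build cost: ${cost:,}")
print(f"break-even: {days:.0f} days")
```

The result scales linearly with the hours assumed, so a site that counts more hours per engineer-week gets a proportionally larger build cost and a later break-even. Fleet size pushes the other way: at 10 assets instead of 100, the same build takes ten times as long to pay back.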
That math is the actual buying decision, and it is not the math the SaaS quote presents.
When the SaaS still wins
The open stack does not win every case. Four conditions where the vendor quote is the rational choice, in this issue's editorial view:
- The plant has no Python-literate engineering capacity and no plan to acquire it. The 80-line script is not maintenance-free. Someone has to read it when a retrain fails, when the InfluxDB schema changes, when a feature gets added.
- The maintenance organization has SLA-bound on-call rotation expectations that the open stack's "the engineer who built it" model cannot meet at 02:00 on a Saturday.
- The SaaS includes a vibration-sensor hardware leasing program that subsidizes the recurring fee. Two of the three vendors in the quoting round did. The hardware lease, evaluated separately, changes the math.
- The plant insurer requires a third-party-certified condition-monitoring program for premium reduction. The open stack is not certified to any of the named programs.
None of those conditions applied to the test deployment. In a different plant, any one of them would.
What this stack costs at month four
The cumulative bill, four issues in, on the same $5.50/mo Hetzner CX22 instance:
- HiveMQ Community Edition: $0
- InfluxDB OSS 2.7: $0
- Grafana OSS 10.4: $0
- Telegraf: $0
- Scikit-learn + joblib: $0
- Caddy + Let's Encrypt: $0
- Hetzner VM: $5.50/mo
- Pushover one-time device fee: $5 once
- SMTP relay (existing): $0 marginal
- i.MX 8M Plus carrier (from issue 03): $240 one-time
Total monthly recurring: $5.50. Total one-time across the test deployment: $245.
Reference: the same workload, on the vendor quote, at one asset: $1,200 per year, no carrier hardware included.
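For a like-for-like first-year figure at one asset, the line items above can be summed as below. This is a back-of-envelope sketch that counts cash outlay only and excludes engineering time on the open stack.

```python
MONTHLY_RECURRING = 5.50                 # Hetzner VM
ONE_TIME = {
    "Pushover device fee": 5.00,
    "i.MX 8M Plus carrier": 240.00,
}
VENDOR_PER_ASSET_YEAR = 1200.00          # vendor quote, carrier hardware not included

first_year_open = MONTHLY_RECURRING * 12 + sum(ONE_TIME.values())
later_years_open = MONTHLY_RECURRING * 12

print(f"open stack, year one:   ${first_year_open:.2f}")
print(f"open stack, later years: ${later_years_open:.2f}")
print(f"vendor quote, per year:  ${VENDOR_PER_ASSET_YEAR:.2f}")
```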
What lands in issue 07
The cross-asset correlation that the SaaS quote claims, built on the open stack. Two assets, one shared dashboard, a PyOD ensemble that scores them jointly. Plus the retrofit case study from a reader who ran the issue 03 carrier and the issue 04 broker on a real spindle for 30 days, with photographs.