Predictive maintenance is one of the highest-impact applications of data science
in industry — and one of the least saturated. As a mechatronics engineer with
real experience in industrial plant maintenance, I wanted to build a project that
reflects what actual failure analysis looks like, not just a generic ML exercise.
This is the EDA (Exploratory Data Analysis) phase of a full predictive maintenance
pipeline I'm building as part of my Master's in Data Science & AI.
The Dataset
I used the AI4I 2020 Predictive Maintenance dataset from the UCI Machine Learning
Repository — 10,000 records of synthetic industrial sensor data with 5 labeled
failure types: Tool Wear Failure (TWF), Heat Dissipation Failure (HDF), Power
Failure (PWF), Overstrain Failure (OSF), and Random Failure (RNF).
The dataset is highly imbalanced: only 3.39% of records are failures (339 out of
10,000). That's realistic — in real plants, failures are rare events, and that
imbalance has direct implications for modeling (SMOTE or class_weight will be
required).
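Not part of the EDA itself, but for concreteness, here is a minimal sketch of both options using scikit-learn and imbalanced-learn. The toy arrays stand in for the real feature matrix and labels, and the model choice is a placeholder:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Toy stand-in for the real feature matrix and failure labels (~3.4% positives).
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.034).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Option 1: reweight classes inversely to their frequency at fit time.
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

# Option 2: oversample the minority (failure) class with SMOTE,
# applied to the training split only to avoid leakage into evaluation.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```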
The Process
Beyond standard EDA, I engineered three physically grounded features (sketched in pandas after the list):
- temp_delta: process temperature minus air temperature — a proxy for thermal stress and heat dissipation capacity
- power_W: mechanical power estimated as P = τ × ω — captures the actual operating regime
- wear_rate: tool wear normalized by power — flags degraded tools running under demanding conditions
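In pandas, all three reduce to a few vectorized lines. This is a sketch assuming the raw AI4I 2020 column names; the only conversion needed is rpm to rad/s for the power estimate:

```python
import numpy as np
import pandas as pd

def add_physical_features(df: pd.DataFrame) -> pd.DataFrame:
    """Append the three engineered features (AI4I 2020 column names assumed)."""
    out = df.copy()
    # Thermal stress proxy: gap between process and ambient temperature.
    out["temp_delta"] = out["Process temperature [K]"] - out["Air temperature [K]"]
    # Mechanical power P = torque x angular velocity, converting rpm to rad/s.
    omega_rad_s = out["Rotational speed [rpm]"] * 2 * np.pi / 60
    out["power_W"] = out["Torque [Nm]"] * omega_rad_s
    # Wear per watt: flags worn tools operating under demanding loads.
    out["wear_rate"] = out["Tool wear [min]"] / out["power_W"]
    return out
```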
The full pipeline (load → clean → validate → features → Parquet export →
profiling report) runs reproducibly via a single command: `uv run python main.py`.
Stack: Python 3.11, pandas, plotly, seaborn, pandera, loguru, pyarrow,
ydata-profiling, ruff, pytest.
Key Findings
Each failure type has a different dominant predictor — which is exactly what you'd
expect from a real industrial system:
HDF (Heat Dissipation Failure) is predicted by temp_delta. Normal operation
sits at a median of 9.8 K; HDF cases drop to 8.3 K (a 1.5 K reduction). The low
variance in the failure group (std = 0.28 K) suggests a near-deterministic
activation threshold around 8.5 K.
TWF (Tool Wear Failure) shows near-perfect separation: 100% of TWF failures
occur above 198 minutes of accumulated wear, while 75% of normal operation stays
below 162 min. This isn't gradual degradation — it's a threshold behavior, which
means a rule-based alert could catch most TWF cases before they happen.
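To make that concrete, such a rule is just two boolean columns on the engineered frame (here `df` is the output of `add_physical_features` from the earlier sketch; the 198 min and 8.5 K cutoffs are the values observed above, not tuned decision boundaries):

```python
# Illustrative rule-based alerts derived from the EDA thresholds.
df["twf_alert"] = df["Tool wear [min]"] >= 198  # all observed TWF cases sit above this
df["hdf_alert"] = df["temp_delta"] <= 8.5       # near-deterministic HDF activation band
```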
PWF/OSF cluster in a danger zone at low RPM and high torque (1,401–1,700 rpm ×
61–80 Nm), where failure rates reach 71.4%. The Torque↔RPM correlation of -0.88
confirms the machine operates at approximately constant power, validating power_W
as a representative feature of the operating regime.
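Both numbers are easy to reproduce on the same frame (note that 71.4% is the peak rate reached inside this window, so the window-wide average will be lower):

```python
# Torque vs. RPM: strong negative correlation, i.e. near-constant power.
print(df["Torque [Nm]"].corr(df["Rotational speed [rpm]"]))  # ~ -0.88

# Overall failure rate inside the low-RPM / high-torque danger zone.
zone = (
    df["Rotational speed [rpm]"].between(1401, 1700)
    & df["Torque [Nm]"].between(61, 80)
)
print(df.loc[zone, "Machine failure"].mean())
```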
Product type also matters: type L fails at 3.92% vs 2.09% for type H (1.88× the
rate), independent of tool wear distribution.
What I Learned
The most important insight wasn't statistical — it was structural: failure
subtypes sum to 373 events while total Machine failure = 339, confirming
simultaneous multi-mode failures exist in the dataset. This has direct implications
for modeling: a single binary classifier isn't enough; a multi-label approach will
be needed.
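As a hedged sketch of where that leads, scikit-learn's MultiOutputClassifier can wrap one binary model per failure mode (one option among several; the feature set here is trimmed to the engineered columns for brevity):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

FAILURE_COLS = ["TWF", "HDF", "PWF", "OSF", "RNF"]

# One binary target per failure mode, so simultaneous multi-mode failures
# (373 subtype events across 339 failed records) remain visible to the model.
X = df[["temp_delta", "power_W", "wear_rate"]]
Y = df[FAILURE_COLS]

multi_clf = MultiOutputClassifier(
    RandomForestClassifier(class_weight="balanced", random_state=42)
)
multi_clf.fit(X, Y)
```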
I also learned that physically meaningful features outperform arbitrary
transformations. Being able to explain temp_delta or power_W in a technical
interview — with the physics behind them — is a stronger signal than a feature
with better correlation but no domain justification.
What's Next
This EDA is Phase 1 of a full PdM (Predictive Maintenance) system:
→ Advanced feature engineering
→ Multi-class failure classifier (Random Forest / XGBoost + SHAP)
→ RUL (Remaining Useful Life) prediction
→ FastAPI deployment + Docker
→ MLflow monitoring + drift detection
Full repo: GitHub
Project developed as part of the Master's in Data Science & AI at Evolve.