Sense Reply
Sense Reply · Fall 2025 · BDD competition

Securing Grid Reliability Through Data Integrity

Data Preparation · Anomaly Detection · Time-Series · Feature Engineering

Context

As part of Sense Reply's Business Decision Day at Albert School (December 2025), our team tackled a critical data quality challenge: 2017 substation IoT data from a field network was unfit for anomaly detection modeling due to systematic failures in the SCADA and IoT Gateway architecture. Only 86.7% of readings were usable.

Task

Execute the data preparation phase: clean, reconstruct, and enrich raw substation energy consumption data to produce a complete, model-ready dataset that can support high-precision anomaly detection.

My Contribution

Diagnosed four root causes tied to the SCADA/IoT Gateway architecture: 1,493 duplicate timestamps from re-transmitted packets; 2,620 missing intervals from network outages with no store-and-forward buffer; 4,230 unrealistic zero-kilowatt readings from Zone 2 sensor failures; and roughly 6,800 missing sensor values. Led the five-step cleaning pipeline: deduplication → master timeline reconstruction → zero-value correction → linear interpolation → feature engineering. Proposed architectural fixes, including edge-level deduplication, store-and-forward capability, and proactive sensor health monitoring.
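The first four pipeline steps can be sketched in pandas roughly as follows. This is a minimal illustration, not the team's actual code: the column names (`timestamp`, `zone2_kw`), the 10-minute sampling interval, and the `clean_readings` helper are all assumptions made for the example.

```python
import pandas as pd


def clean_readings(raw, freq="10min", zero_invalid_cols=("zone2_kw",)):
    """Deduplicate, rebuild the timeline, null implausible zeros, interpolate.

    Column names and the sampling interval are hypothetical.
    """
    df = raw.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])

    # 1. Deduplication: keep the first packet received for each timestamp.
    df = df.drop_duplicates(subset="timestamp", keep="first")
    df = df.set_index("timestamp").sort_index()

    # 2. Master timeline: reindex onto a gap-free grid so that
    #    intervals lost during outages show up as NaN rows.
    timeline = pd.date_range(df.index.min(), df.index.max(),
                             freq=freq, name="timestamp")
    df = df.reindex(timeline)

    # 3. Zero-value correction: treat 0 kW readings as missing
    #    rather than as real measurements.
    for col in zero_invalid_cols:
        df[col] = df[col].mask(df[col] == 0)

    # 4. Linear interpolation across all remaining gaps.
    return df.interpolate(method="linear")


# Tiny demo: one duplicate, two missing intervals, one zero reading.
raw = pd.DataFrame({
    "timestamp": ["2017-01-01 00:00", "2017-01-01 00:10", "2017-01-01 00:10",
                  "2017-01-01 00:40", "2017-01-01 00:50"],
    "zone2_kw": [10.0, 12.0, 12.0, 0.0, 20.0],
})
cleaned = clean_readings(raw)
```

Nulling the zeros before interpolating matters: otherwise the interpolation would treat 0 kW as a valid anchor point and pull neighbouring estimates toward it.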

Outcome

Delivered a 100% complete, 52,416-record dataset (up from 51,289 at 86.7% completeness) with 22+ engineered features, ready for time-series anomaly detection model training. Identified Zone 3 as the primary anomaly source and proposed three architectural improvements to prevent future data quality failures.
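The record counts reported here reconcile with the duplicate and gap counts from the diagnosis, as a quick check shows. The last two lines rest on an assumption not stated in the write-up: that readings arrive every 10 minutes.

```python
raw_rows = 51_289           # rows in the raw export
duplicates = 1_493          # re-transmitted packets removed
missing_intervals = 2_620   # gaps restored on the master timeline

unique_rows = raw_rows - duplicates             # 49,796
final_rows = unique_rows + missing_intervals    # 52,416
gap_share = missing_intervals / final_rows      # about 0.05, i.e. the 5% data loss

# Assumption: 10-minute sampling, so 6 readings per hour.
slots_per_day = 24 * 6
implied_days = final_rows / slots_per_day       # 364.0
```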

Key Insights

  • 1,493 duplicate timestamps traced to a single architectural flaw: the IoT Gateway lacked deduplication logic and re-transmitted packets during network instability
  • 2,620 missing intervals (5% data loss) stemmed from the Gateway's absence of store-and-forward capability during network outages
  • 4,230 zero-kilowatt readings in Zone 2 are physically impossible for an active substation — treating them as normal would corrupt the anomaly model's baseline of 'normal' behavior
  • Zone 3 was identified as the primary source of behavioral anomalies across the full 2017 dataset
  • Feature engineering expanded the dataset from 9 raw columns to 22+ predictive features, including lag variables (10-minute, 1-hour, 24-hour) and weekday/weekend behavioral splits
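The lag and weekday/weekend features mentioned above might be built along these lines. This is a sketch under stated assumptions: the `zone3_kw` column name, the `add_features` helper, and the 10-minute sampling interval (6 rows per hour) are hypothetical, and the actual feature set was larger.

```python
import pandas as pd


def add_features(df, target="zone3_kw", steps_per_hour=6):
    """Add lag and calendar features to a gap-free, time-indexed frame.

    Assumes one row per 10 minutes, so a shift of 1 row is a 10-minute lag.
    """
    out = df.copy()
    out[f"{target}_lag_10m"] = out[target].shift(1)
    out[f"{target}_lag_1h"] = out[target].shift(steps_per_hour)
    out[f"{target}_lag_24h"] = out[target].shift(24 * steps_per_hour)
    # Monday is 0, so 5 and 6 are Saturday and Sunday.
    out["is_weekend"] = out.index.dayofweek >= 5
    return out


# Demo: 24 hours of synthetic readings starting on a Friday.
idx = pd.date_range("2017-01-06", periods=145, freq="10min")
demo = pd.DataFrame({"zone3_kw": [float(i) for i in range(145)]}, index=idx)
feats = add_features(demo)
```

Row-based shifts only correspond to fixed time lags because the master timeline has no gaps; on the raw data, with missing intervals, the same shifts would silently point at the wrong timestamps.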

Skills Applied

Data Cleaning · Feature Engineering · Time-Series Analysis · Python · Excel · Anomaly Detection · IoT Architecture · SCADA Systems

Presentation Deck