Building an IoT Platform That Handles 20k Messages/min
The Problem
When I joined Wacker Neuson, the IoT platform was already creaking under its own weight. We had construction equipment spread across Europe sending telemetry data — GPS pings, engine hours, fuel levels, fault codes — and the system processing it was struggling to keep up.
The goal: migrate 150,000 machines to a new platform without losing a single message.
Designing for Throughput
The first decision was architecture. We were pushing upward of 20,000 messages per minute during peak hours. A naive "write every message to SQL" approach would have buckled immediately.
The solution was a multi-stage pipeline:
- Ingest layer — Azure Event Hubs acting as a durable buffer
- Processing layer — Java microservices consuming from the event stream, normalising payloads
- Storage layer — time-series data into partitioned Azure SQL tables, enriched state into a document store
The key insight was separating the hot path (real-time alerts, live map) from the cold path (analytics, reports). Different SLAs, different optimisations.
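To make the hot/cold split concrete, here is a minimal sketch of a router, not the production code. The names `TelemetryMessage` and `PathRouter`, and the rule that only messages carrying a fault code go to the hot path, are assumptions made for the example.

```java
import java.util.function.Consumer;

// Hypothetical normalised message; field names are illustrative only.
record TelemetryMessage(String machineId, long epochMillis, String faultCode) {
    boolean hasFault() {
        return faultCode != null && !faultCode.isEmpty();
    }
}

// Sends every message to the cold path and, when it needs real-time
// attention, also to the hot path.
final class PathRouter {
    private final Consumer<TelemetryMessage> hotPath;
    private final Consumer<TelemetryMessage> coldPath;

    PathRouter(Consumer<TelemetryMessage> hotPath, Consumer<TelemetryMessage> coldPath) {
        this.hotPath = hotPath;
        this.coldPath = coldPath;
    }

    void route(TelemetryMessage message) {
        coldPath.accept(message);        // analytics and reports see everything
        if (message.hasFault()) {
            hotPath.accept(message);     // assumed rule: faults need real-time handling
        }
    }
}
```

Because the two paths are just consumers, each one can be scaled, batched, or throttled on its own terms without the other noticing.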
The Migration Strategy
We couldn't just flip a switch. Machines in the field don't care about your deployment schedule.
Our approach was a shadow period: the old and new platforms ran in parallel for six weeks. Every incoming message was written to both systems, and automated comparison jobs ran nightly to diff the outputs.
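A nightly comparison of this kind can be sketched in a few lines. This is an illustration under assumptions: it pretends each platform's output has already been exported as a map from message ID to a canonical string, and `ShadowDiff` is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Compares what the old and new platform stored for the same message IDs.
final class ShadowDiff {
    // Returns one human-readable line per discrepancy; an empty list means parity.
    static List<String> diff(Map<String, String> oldOutput, Map<String, String> newOutput) {
        // Union of IDs, sorted so the report is stable from run to run.
        TreeSet<String> ids = new TreeSet<>(oldOutput.keySet());
        ids.addAll(newOutput.keySet());

        List<String> discrepancies = new ArrayList<>();
        for (String id : ids) {
            String before = oldOutput.get(id);
            String after = newOutput.get(id);
            if (before == null) {
                discrepancies.add(id + ": missing in old platform");
            } else if (after == null) {
                discrepancies.add(id + ": missing in new platform");
            } else if (!before.equals(after)) {
                discrepancies.add(id + ": old=" + before + " new=" + after);
            }
        }
        return discrepancies;
    }
}
```

Reporting "missing" separately from "different" matters: a missing record points at ingestion or deduplication, while a differing one points at parsing or normalisation.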
This caught three critical bugs:
- Timezone handling for machines in UTC+2 vs UTC+0
- A message deduplication edge case during cellular reconnects
- An off-by-one in fuel percentage parsing (vendor firmware quirk)
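The deduplication case is the easiest of the three to illustrate. The sketch below assumes each machine stamps its messages with a sequence number, so a replay after a reconnect carries a key that has been seen before. That assumption, and the class name `RecentMessageFilter`, are mine for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Remembers the most recent message keys so replays after a reconnect are dropped.
final class RecentMessageFilter {
    private final Map<String, Boolean> seen;

    RecentMessageFilter(int capacity) {
        // Insertion-ordered map that evicts the oldest key once capacity is exceeded,
        // which keeps memory bounded no matter how long the service runs.
        this.seen = new LinkedHashMap<>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    // Returns true the first time a (machineId, sequence) pair is offered.
    boolean firstTimeSeen(String machineId, long sequence) {
        String key = machineId + "#" + sequence;
        return seen.putIfAbsent(key, Boolean.TRUE) == null;
    }
}
```

The capacity is the trade-off to watch: too small and a late replay slips through as "new", too large and the filter eats memory. This version is also not thread-safe, so it would need one instance per consumer thread or external synchronisation.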
Performance Gains
After optimising the pipeline — connection pooling, batch inserts, smarter indexing — we cut processing latency by 70%.
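Batching is the most mechanical of those three, so here is a minimal sketch of the idea. `BatchBuffer` is a hypothetical helper, and the sink is left abstract rather than tied to a specific database call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Collects rows and hands them to a sink in fixed-size batches.
final class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    void add(T row) {
        pending.add(row);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Call on shutdown or on a timer so a partial batch is not stranded.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(List.copyOf(pending)); // hand over an immutable snapshot
        pending.clear();
    }
}
```

In a JDBC-backed service the sink would typically loop over the batch calling `PreparedStatement.addBatch()` and finish with `executeBatch()`, turning many round trips into one.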
The biggest win wasn't clever code; it was rethinking the data model. Instead of storing raw payloads and transforming on read, we normalised at write time. Reads became cheap. The dashboard went from 4-second loads to sub-200ms.
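As a sketch of what "normalise at write time" can look like, the example below converts a raw reading into a typed, UTC-based record before it is stored. The raw format (local wall-clock time plus a UTC offset, fuel as text) and the names `RawReading`, `StoredReading`, and `WriteTimeNormaliser` are assumptions for illustration, not the real payload.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Hypothetical raw reading as a device might send it: local time plus offset.
record RawReading(String machineId, String localTime, int utcOffsetHours, String fuelPercent) {}

// Hypothetical stored form: UTC instant and typed, range-checked values.
record StoredReading(String machineId, Instant recordedAt, int fuelPercent) {}

final class WriteTimeNormaliser {
    static StoredReading normalise(RawReading raw) {
        // Interpret the local timestamp at the device's offset, then store it as UTC.
        Instant recordedAt = LocalDateTime.parse(raw.localTime())
                .toInstant(ZoneOffset.ofHours(raw.utcOffsetHours()));

        int fuel = Integer.parseInt(raw.fuelPercent().trim());
        if (fuel < 0 || fuel > 100) {
            throw new IllegalArgumentException("fuel percent out of range: " + fuel);
        }
        return new StoredReading(raw.machineId(), recordedAt, fuel);
    }
}
```

Doing the parsing and validation once, on the write side, means every reader gets the same answer and bad payloads are rejected before they reach a dashboard query.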
What I'd Do Differently
Schema versioning from day one. We had to retrofit this halfway through and it was painful. Machine firmware updates mean message formats change. Plan for it early.
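One way to plan for it is to carry an explicit schema version with every message and dispatch on it. The sketch below assumes such a version number exists in the envelope; `VersionedParsers` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Picks a parser based on an explicit schema version carried with each message.
final class VersionedParsers<T> {
    private final Map<Integer, Function<String, T>> parsers = new HashMap<>();

    VersionedParsers<T> register(int schemaVersion, Function<String, T> parser) {
        parsers.put(schemaVersion, parser);
        return this;
    }

    T parse(int schemaVersion, String payload) {
        Function<String, T> parser = parsers.get(schemaVersion);
        if (parser == null) {
            // Fail loudly: silently guessing a format is how data gets corrupted.
            throw new IllegalArgumentException("unsupported schema version " + schemaVersion);
        }
        return parser.apply(payload);
    }
}
```

With this in place a firmware update that changes the payload becomes one more `register(...)` call, and old machines that have not updated yet keep working against their original parser.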
Chaos engineering. We discovered that our retry logic had a subtle bug that only surfaced under a network partition, something a properly configured fault-injection test would have caught in week one.
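A fault-injection test does not need heavy tooling to be useful. The sketch below pairs a generic retry helper with a test double that fails on purpose. Both `Retry` and `FlakyCall` are illustrative names, and this is not the retry logic that had the bug.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Retries a call a bounded number of times, doubling the wait between attempts.
final class Retry {
    static <T> T withBackoff(Callable<T> call, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    throw e;             // out of attempts: surface the failure
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}

// Test double that simulates a network partition for the first N calls.
final class FlakyCall implements Callable<String> {
    private int failuresLeft;
    int calls = 0;

    FlakyCall(int failures) {
        this.failuresLeft = failures;
    }

    @Override
    public String call() throws IOException {
        calls++;
        if (failuresLeft-- > 0) {
            throw new IOException("injected partition");
        }
        return "ok";
    }
}
```

Running the helper against `new FlakyCall(2)` with three attempts should succeed on the third call, and against `new FlakyCall(5)` it should give up after exactly three. Asserting on the call count is what catches retry loops that spin forever or quit one attempt early.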
The migration completed without incident. 150,000 machines, live, with zero data loss. Worth every late night.