When Theory Meets Scale: Mimir and Tempo in Production
Grafana Mimir, Grafana Tempo, PromQL, Distributed Tracing, Time Series Database, Query Optimization, Parquet, Apache Kafka, Observability, Go, Performance, Production Incidents, Cloud Monitoring, System Architecture
Grafana engineers Marty and Marco present three production incidents from Grafana Cloud. The first is the Mimir 'Query of Death', in which 40 KB PromQL regex queries caused 15-minute CPU spikes until regex unrolling achieved 97% skip rates. The second is Tempo Parquet dictionary bloat, which caused OOMs on 500-span traces with high-cardinality JSON over 7 days and was solved with per-attribute dictionary control for a 95% memory reduction. The third is Mimir queue starvation, in which slow store-gateway queries blocked fast ingester queries. The talk also covers Mimir 3.0's Kafka-based read/write decoupling using WarpStream, and it is aimed at engineers running observability infrastructure at scale.
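The summary does not spell out what "regex unrolling" involves. One plausible reading, offered here only as an assumption, is that a large alternation of plain literals such as `api|web|db` is expanded into a set of strings, so each label value is checked with a hash lookup instead of a regex engine run. The Go sketch below illustrates that idea with the standard library only; `Matcher`, `NewMatcher`, and `literalSet` are hypothetical names for illustration, not Mimir's API or implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// literalSet tries to "unroll" a pattern of the form a|b|c into a set of
// strings. It reports ok=false if any alternative is empty or contains
// regex metacharacters, in which case the caller must use a real regex.
// Hypothetical helper: an illustration, not Mimir's actual code.
func literalSet(pattern string) (map[string]struct{}, bool) {
	alts := strings.Split(pattern, "|")
	set := make(map[string]struct{}, len(alts))
	for _, alt := range alts {
		// QuoteMeta returns its input unchanged only when the input
		// contains no regex metacharacters.
		if alt == "" || regexp.QuoteMeta(alt) != alt {
			return nil, false
		}
		set[alt] = struct{}{}
	}
	return set, true
}

// Matcher answers "does this label value match?" using either a set
// lookup (unrolled) or a compiled regex (fallback).
type Matcher struct {
	set map[string]struct{}
	re  *regexp.Regexp
}

// NewMatcher builds a Matcher. The fallback regex is fully anchored,
// mirroring how PromQL regex matchers treat patterns.
func NewMatcher(pattern string) (*Matcher, error) {
	if set, ok := literalSet(pattern); ok {
		return &Matcher{set: set}, nil
	}
	re, err := regexp.Compile("^(?:" + pattern + ")$")
	if err != nil {
		return nil, err
	}
	return &Matcher{re: re}, nil
}

// Unrolled reports whether the pattern was converted to a set lookup.
func (m *Matcher) Unrolled() bool { return m.set != nil }

// Match reports whether v matches the whole pattern.
func (m *Matcher) Match(v string) bool {
	if m.set != nil {
		_, ok := m.set[v]
		return ok
	}
	return m.re.MatchString(v)
}

func main() {
	for _, p := range []string{"api|web|db", "api-.*|web"} {
		m, err := NewMatcher(p)
		if err != nil {
			panic(err)
		}
		fmt.Println(p, m.Unrolled(), m.Match("web"), m.Match("api-7"))
	}
}
```

The design choice in this sketch is to fall back to the regex engine whenever any alternative is not a plain literal, so the fast path never changes which values match. It says nothing about how the 97% skip rate reported in the talk was measured or achieved.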