When Theory Meets Scale: Mimir and Tempo in Production

Grafana
AI summary

Grafana engineers Marty and Marco present three production incidents from Grafana Cloud. The first is the Mimir "Query of Death", in which 40KB PromQL regex queries caused 15-minute CPU spikes until regex unrolling achieved 97% skip rates. The second is Tempo Parquet dictionary bloat, which caused OOMs on 500-span traces with high-cardinality JSON over 7 days and was solved with per-attribute dictionary control for a 95% memory reduction. The third is Mimir queue starvation, in which slow store-gateway queries blocked fast ingester queries. The talk also covers Mimir 3.0's Kafka-based read/write decoupling using WarpStream, and is most relevant to engineers running observability infrastructure at scale.
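
The summary does not say how the regex unrolling works. One plausible reading, offered here only as an assumption, is that a matcher made purely of literal alternatives (such as `pod-1|pod-2|pod-3`) is expanded into a set, so that most label values are checked with a hash lookup and skip the regex engine. The Python sketch below illustrates that idea; the names `unroll_alternation` and `build_matcher` are hypothetical and are not Mimir APIs. It relies on the fact that PromQL regex matchers are fully anchored, which is why the fallback uses `fullmatch`.

```python
import re
from typing import Callable, FrozenSet, Optional

# Characters that give a pattern meaning beyond a plain list of literals.
_META = set(".^$*+?{}[]\\()")


def unroll_alternation(pattern: str) -> Optional[FrozenSet[str]]:
    """Return the literal alternatives of `a|b|c`, or None if a regex engine is needed.

    Hypothetical helper for illustration; not taken from Mimir.
    """
    alternatives = pattern.split("|")
    if any(ch in _META for alt in alternatives for ch in alt):
        return None
    return frozenset(alternatives)


def build_matcher(pattern: str) -> Callable[[str], bool]:
    """Build a label-value matcher that avoids the regex engine when it can."""
    literals = unroll_alternation(pattern)
    if literals is not None:
        # Fast path: set membership replaces regex evaluation entirely.
        return lambda value: value in literals
    # Slow path: PromQL regex matchers are anchored, hence fullmatch.
    compiled = re.compile(pattern)
    return lambda value: compiled.fullmatch(value) is not None


if __name__ == "__main__":
    fast = build_matcher("pod-1|pod-2|pod-3")
    slow = build_matcher("pod-[0-9]+")
    print(fast("pod-2"), fast("pod-22"), slow("pod-42"))
```

With a 40KB pattern consisting of thousands of literal alternatives, the fast path costs one hash lookup per label value instead of a regex evaluation, which is the kind of saving a high skip rate would reflect under this reading.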