- Since 2024-09-04 we have seen drops in pageview data in Turnilo: https://w.wiki/BV7M
- The same drops appear in the sampled webrequest data in Druid (we don't have 90 days of data there, so the chart shows only the last 30 days): https://w.wiki/BV7R
- Both the webrequest text and upload sources are affected, but upload much less so: https://w.wiki/BV8R
- The problem comes from neither the source nor the stream transport layer (Kafka): the streaming data (streaming sampled webrequest in Druid) shows no drop: https://w.wiki/BV7X
- Airflow jobs show no failures, and neither do Spark jobs.
- For the past few weeks we have had regular alerts about the HDFS RPC queue being overwhelmed: https://grafana.wikimedia.org/goto/PcT_YCzNg?orgId=1
Incident report drafting: https://docs.google.com/document/d/1sJ8f1FHB-gLom5Po0vLjoTdIY5KlaJUwPdClrBY7DtE/edit?tab=t.0#heading=h.2ro4l2vgyh3x
(Will eventually be moved to wikitech)