Background
Prior to T392880 and T351117, Varnish normalized request hostnames. This normalization was presumably effective, and its result was presumably observable in varnishkafka output:
- Kafkacat: webrequest_frontend_text
- Hadoop: wmf.webrequest and wmf.pageview_actor datasets
- Turnilo: webrequest_sampled_live
Problem
This is no longer the case: hostnames that are not normalized now show up in these outputs. I noticed this during T405429, when some numbers did not add up.
Turnilo: https://w.wiki/Fkf7
Hadoop:
SELECT COUNT(*) _count, uri_host
FROM wmf.pageview_actor
WHERE year=2025 AND month=10 AND day=19 AND uri_host!=LOWER(uri_host)
GROUP BY uri_host
ORDER BY _count DESC;
Commons.wikimedia.org
EN.m.wikipedia.org
commons.Wikimedia.org
EN.Wiktionary.org
En.Wiktionary.org
…
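To gauge how these variants split the counts for a single site, a follow-up query could group on the lowercased hostname and report how many distinct spellings fold into it. This is only a sketch: it assumes that normalization amounts to lowercasing (which is what the query above tests for), and the aliases `normalized_host`, `variants` and `row_count`, as well as the `LIMIT`, are illustrative choices rather than anything defined in the dataset.

```sql
-- Sketch: fold case variants of uri_host together for one day of data.
-- Assumes lowercasing is the only normalization of interest.
SELECT
  LOWER(uri_host)          AS normalized_host, -- hypothetical alias
  COUNT(DISTINCT uri_host) AS variants,        -- distinct spellings seen
  COUNT(*)                 AS row_count        -- rows across all spellings
FROM wmf.pageview_actor
WHERE year = 2025 AND month = 10 AND day = 19
GROUP BY LOWER(uri_host)
HAVING COUNT(DISTINCT uri_host) > 1            -- keep hosts with more than one spelling
ORDER BY row_count DESC
LIMIT 50;
```

Each returned row would correspond to one site whose traffic is currently spread over several `uri_host` values, which is the kind of split that can make per-host totals fail to add up.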
