
Haproxy kafka and varnishkafka produce compatible datasets
Open, Needs Triage, Public

Description

We need to get a validation plan signed off.

The validation and analysis have been implemented in Jupyter notebooks.

Through December 2024 we ran validation on multiple hourly haproxykafka samples generated on ulsfo. For all fields, the haproxykafka and varnishkafka datasets match with >99.9% coverage.
When we do see differences, it is typically haproxykafka reporting more records (as expected). Traffic volumes (distributions of responses by status code) also match across the two systems.
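
For readers who want a feel for the kind of check described above, here is a minimal sketch in plain Python. It is not the code from the notebooks. The join key `request_id`, the field names, and the sample records are all made up for illustration, and "coverage" is assumed here to mean the share of varnishkafka records that have a haproxykafka record with the same value for a given field.

```python
from collections import Counter


def field_coverage(reference, candidate, key, field):
    """Fraction of reference records whose `field` value is matched by the
    candidate record sharing the same `key` (assumed definition of coverage)."""
    if not reference:
        return 1.0
    candidate_by_key = {rec[key]: rec for rec in candidate}
    matched = sum(
        1
        for rec in reference
        if rec[key] in candidate_by_key
        and candidate_by_key[rec[key]].get(field) == rec.get(field)
    )
    return matched / len(reference)


def status_distribution(records, status_field="http_status"):
    """Share of responses per status code, for comparing traffic volumes."""
    counts = Counter(rec[status_field] for rec in records)
    total = sum(counts.values())
    return {status: n / total for status, n in counts.items()} if total else {}


# Synthetic records only; `request_id` is a hypothetical join key.
varnish_sample = [
    {"request_id": "a", "http_status": "200", "uri_host": "example.org"},
    {"request_id": "b", "http_status": "404", "uri_host": "example.org"},
]
haproxy_sample = varnish_sample + [
    # Extra record seen only by haproxykafka.
    {"request_id": "c", "http_status": "200", "uri_host": "example.org"},
]

for field in ("http_status", "uri_host"):
    print(field, field_coverage(varnish_sample, haproxy_sample, "request_id", field))

print("extra haproxykafka records:", len(haproxy_sample) - len(varnish_sample))
print("varnishkafka status shares:", status_distribution(varnish_sample))
```

In this sketch varnishkafka is treated as the reference side, so extra records on the haproxykafka side do not lower coverage; they show up only in the record-count difference.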

Validation covered both the refined dataset and the Kafka topics. Details and steps to reproduce the analysis are available at https://gitlab.wikimedia.org/repos/data-engineering/webrequest-haproxy-logs

Note that, due to the sensitive nature of this data (PII), specific output has been omitted.