[HAProxy migration] HAProxy and VarnishKafka should produce compatible datasets
Closed, Resolved (Public)

Description

We need to get sign-off on a validation plan.

The validation and analysis have been implemented in Jupyter notebooks.

Throughout December 2024 we ran validation on multiple hourly haproxykafka samples generated on ulsfo, and we report that the datasets match for all fields with >99.9% coverage.
When we see differences, it's typically haproxykafka reporting more records (as expected). Traffic volumes (distributions of responses by status code) also match across the two systems.
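As a rough illustration of the two checks described above (per-field match coverage and status-code distributions), here is a minimal sketch in plain Python. It is not the code from the notebooks. The join key `request_id`, the row layout, and the sample values are assumptions made only for this example.

```python
from collections import Counter


def field_coverage(vk_rows, ha_rows, key, fields):
    """Share of joined row pairs whose value matches, per field."""
    ha_by_key = {row[key]: row for row in ha_rows}
    matched = Counter()
    total = 0
    for vk in vk_rows:
        ha = ha_by_key.get(vk[key])
        if ha is None:
            continue  # no HAProxy counterpart for this VK row; not compared here
        total += 1
        for field in fields:
            if vk.get(field) == ha.get(field):
                matched[field] += 1
    return {f: (matched[f] / total if total else 0.0) for f in fields}


def status_distribution(rows):
    """Relative frequency of each response status code."""
    counts = Counter(row["http_status"] for row in rows)
    n = sum(counts.values())
    return {status: c / n for status, c in counts.items()}


# Made-up sample rows; "request_id" is a hypothetical join key.
vk = [
    {"request_id": "a", "http_status": "200", "uri_host": "x"},
    {"request_id": "b", "http_status": "404", "uri_host": "y"},
]
ha = [
    {"request_id": "a", "http_status": "200", "uri_host": "x"},
    {"request_id": "b", "http_status": "404", "uri_host": "Y"},
    {"request_id": "c", "http_status": "301", "uri_host": "z"},  # extra HAProxy row
]

coverage = field_coverage(vk, ha, "request_id", ["http_status", "uri_host"])
vk_statuses = status_distribution(vk)
```

Extra rows on the haproxykafka side (such as row "c") do not lower the coverage in this sketch, which fits the observation that haproxykafka typically reports more records.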

Datasets were validated both for the refined dataset and for the Kafka topics. Details and steps to reproduce the analysis are available at https://gitlab.wikimedia.org/repos/data-engineering/webrequest-haproxy-logs

Note that, due to the sensitive nature of this data (PII), specific output has been omitted.

Details

Other Assignee
JAllemandou
Related Changes in Gerrit:

Event Timeline

I've gone through more data validation (webrequest_text only, 2025-02-06T06:00) and found some things to discuss with @Fabfur.

What's expected:

  • The number of requests per host: HAProxy reports between 1% and 10% more requests than VK, for all hosts.
  • The number of 301 responses is a lot bigger from HAProxy than from VK - HAProxy sends us tls-redirect rows.
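The per-host expectation above can be turned into a small check. This is only a sketch under assumptions: the host names and counts are made up, and the 1% to 10% band is taken directly from the bullet above rather than from any agreed threshold.

```python
from collections import Counter


def per_host_excess(vk_hosts, ha_hosts):
    """Relative excess of HAProxy request counts over VK, per cache host."""
    vk_counts = Counter(vk_hosts)
    ha_counts = Counter(ha_hosts)
    return {host: (ha_counts[host] - n) / n for host, n in vk_counts.items()}


def hosts_outside_band(excess, low=0.01, high=0.10):
    """Hosts whose excess falls outside the expected [low, high] band."""
    return sorted(h for h, e in excess.items() if not (low <= e <= high))


# Made-up data: one hostname entry per logged request.
vk_hosts = ["host-a"] * 100 + ["host-b"] * 100
ha_hosts = ["host-a"] * 105 + ["host-b"] * 120

excess = per_host_excess(vk_hosts, ha_hosts)
suspicious = hosts_outside_band(excess)
```

With this sample, `host-a` sits inside the band (5% more requests) while `host-b` is flagged (20% more).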

What's not expected:

  • The number of requests with 20X and 30X response codes (except 301) is lower on HAProxy than on VK. The difference is less than 0.5%, but that still represents almost 1M requests. It is stable across hosts and time.
  • Some request fields don't match for a fairly large share of the volume (~10%). The two culprits are uri_host and accept_language, both of which seem to be normalized/changed in VK.
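One way to test whether normalization explains the uri_host mismatches is to compare the mismatch rate before and after applying a candidate rule to both sides. The sketch below is hypothetical. The rule (lowercase, drop a trailing `:port`) is an assumption for illustration, not a statement of what VK actually does, and the value pairs are made up.

```python
def mismatch_rate(pairs, normalize=lambda v: v):
    """Fraction of (vk_value, ha_value) pairs that differ after normalize()."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    differing = sum(1 for vk, ha in pairs if normalize(vk) != normalize(ha))
    return differing / len(pairs)


def normalize_host(value):
    """Hypothetical candidate rule: lowercase and drop a trailing ':port'."""
    host = value.strip().lower()
    name, sep, port = host.rpartition(":")
    return name if sep and port.isdigit() else host


# Made-up (VK, HAProxy) uri_host pairs for already-joined requests.
pairs = [
    ("en.wikipedia.org", "EN.wikipedia.org:443"),
    ("a.org", "a.org"),
    ("a.org", "b.org"),
    ("c.org", "c.org"),
]

raw_rate = mismatch_rate(pairs)
normalized_rate = mismatch_rate(pairs, normalize_host)
```

If the rate drops sharply under a candidate rule, that rule likely accounts for most of the difference. The pairs that still differ are the ones worth inspecting by hand.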

Other questions:

  • The response_size returned by HAProxy is consistently lower than the one from VK - Could it be compression?
  • When HAProxy logs a 400, the request is still passed to Varnish and we can get a different response - What does the client see? A Bad Request response, or the response Varnish has generated?
Ahoelzl renamed this task from Haproxy kafka and varnishkafka produce compatible datasets to [HAProxy migration] HAProxy and VarnishKafka should produce compatible datasets. Feb 20 2025, 6:07 PM
Ahoelzl updated Other Assignee, added: JAllemandou; removed: Antoine_Quhen.

Change #1136679 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] cache: use fqdn in syslog hostname

https://gerrit.wikimedia.org/r/1136679

Change #1136679 merged by Fabfur:

[operations/puppet@production] cache: use fqdn in haproxykafka hostname

https://gerrit.wikimedia.org/r/1136679