Since 2026-03-31T13:00, the webrequest validation check reports warning or error for every hour for the text cache.
When looking at hosts, the problem seems to be coming from the drmrs, ulsfo and magru datacenters (codfw also has problematic percentages, but the traffic is very small).
The problem seems related to bad requests being received by the caches.
I have gathered data in this spreadsheet: https://docs.google.com/spreadsheets/d/1aDdhoeruzzWXiIe1gHwy0spj1Odmqd6P1hZQe8TmyFY/edit?usp=sharing
Description
Details
| Subject | Repo | Branch | Lines +/- | |
|---|---|---|---|---|
| Update webrequest validation algorithm | analytics/refinery | master | +2 -3 | |
| cache::haproxy: rename deprecated instructions in haproxy 3.2 | operations/puppet | production | +5 -0 |
| Title | Reference | Author | Source Branch | Dest Branch | |
|---|---|---|---|---|---|
| Update webrequest validation job thresholds | repos/data-engineering/airflow-dags!2192 | joal | update_webrequest_validation_thresholds | main |
Related Objects
- Mentioned In
- T422033: Investigate raise in Invalid HAProxyKafka messages in esams
- Mentioned Here
- T421402: Upgrade HAProxy to version 3.2
Event Timeline
This could be related to upgrade to HAProxy 3.2 (T421402) that started on the drmrs datacenter, we'll investigate if the sequence-id is now generated differently
Change #1266301 had a related patch set uploaded (by Fabfur; author: Fabfur):
[operations/puppet@production] cache::haproxy: rename deprecated instructions in haproxy 3.2
This is most probably due to a deprecation in haproxy configuration directives https://www.haproxy.com/blog/reviewing-every-new-feature-in-haproxy-3-1#deprecation, especially
option accept-invalid-http-request option accept-invalid-http-response
Change #1266301 merged by Fabfur:
[operations/puppet@production] cache::haproxy: rename deprecated instructions in haproxy 3.2
The patch has been applied to all impacted hosts, thanks @JAllemandou for spotting this!
This is not accurate, applying this config change didn't solve the issue. An interesting change in behavior from 3.0 to 3.2 and that could be related is that after the upgrade, haproxykafka number of invalid messages dropped to 0: https://grafana.wikimedia.org/goto/afi95ztecv20wd?orgId=1
An interesting change in behavior from 3.0 to 3.2 and that could be related is that after the upgrade, haproxykafka number of invalid messages dropped to 0: https://grafana.wikimedia.org/goto/afi95ztecv20wd?orgId=1
This confirms another change we're experiencing with the new version:
- Version 3.0: the number of problematic sequence-id rows is ~2x the number of BADRED rows
- Version 3.2: the number of problematic sequence-id rows is equvalent to the number of BADREQ rows
--> The "lost" sequence-id rows were going in Invalid messages!
We talked last Thursday with @Fabfur, and he told me we should adapt our validation mechanism, because the change you experience was valid from your point of view.
Can we get a formal confirmation this is the way to go please ? :)
It looks like the root cause is MEDIUM: log/session: handle embryonic session log within sess_log(). A change introduced in HAProxy 3.1 as part of the work done to introduce the log profiles feature.
From the commit message:
Also, thanks to this change, log-format-sd will now be taken into account
for legacy embryonic session logging.
All these new BADREQ rows are actually TLS handshake errors that under 3.0 were being discarded by haproxykafka cause the log message wasn't a valid RFC5424 message.
Nice finding, thanks!
My hypotheses about the following bugs being backported to 3.0.12 is now obsolete but I'll write it down anyway for future reference:
MINOR: http: add a function to validate characters of :authority BUG/MEDIUM: h2/h3: reject some forbidden chars in :authority before reassembly BUG/MEDIUM: h1/h2/h3: reject forbidden chars in the Host header field
(These 3 patches has been introduced in 3.2 and backported also to 3.0.12. Considering official Debian Trixie repositories are on 3.0.11, it's plausible that this behavior isn't showed on the non-upgraded hosts)
I would say that we can proceed with upgrading HAProxy to 3.2 on the remaining datacenters
Thanks for confirming the invalid-events change @Vgutierrez.
There still is something I don't understand:
- The pattern we see in v3.0 seems to show that haproxykafka generates a sequence-id even when it discards a log as invalid, as we see 2x the number of invalid seqId per BADREQ.
- With v3,2 we see a big increase in BADREQ that don't correlate with the number of invalid-seqenceId we were reporting previsouly: there are way more.
Could it be possible that my assumption that haproxykafka generates a seqId even for invalid messages is partially true, as in, only a small subset of invalid messages generate seqId?
Yes, sequence numbers are enerated by haproxy itself, even if it results in a SSL handshake error where the sequence number doesn't reach haproxykafka when using haproxy 3.0 because the log format is ignored for that kind of error.
With 3.2, the log format is honored and haproxykafa is seeing all these sequence numbers
This perfectly make sense.
But: Since in v3.0 we had 2x more invalid-seqId than BADREQ, I would have expected with v3.2 to get roughly 2x more BADREQ, assuming the newly valid SSL handshake logs are generated as such.
The concern I have is that we see 10x more, not 2x more.
I've replicated locally a SSL handshake failure using haproxy with log-format-sd %{+E}o\ [haproxykafka@0\ %[capture.req.hdr(0),json(ascii)]|%HPO|%HQ|%rt]
the sequence number (%rt) is logged as the last field before ],
with haproxy 3.2 it gets logged like this:
haproxy | <134>1 2026-04-08T09:21:07.178911+00:00 - haproxy 7 - [haproxykafka@0 -|<BADREQ>|<BADREQ>|0] 172.24.0.1:45128 [08/Apr/2026:09:21:04.950] https-in/1: Connection closed during SSL handshake haproxy | <134>1 2026-04-08T09:21:12.764427+00:00 - haproxy 7 - [haproxykafka@0 -|<BADREQ>|<BADREQ>|1] 172.24.0.1:44526 [08/Apr/2026:09:21:12.251] https-in/1: Connection closed during SSL handshake haproxy | <134>1 2026-04-08T09:21:15.979792+00:00 - haproxy 7 - [haproxykafka@0 -|<BADREQ>|<BADREQ>|2] 172.24.0.1:44536 [08/Apr/2026:09:21:15.705] https-in/1: Connection closed during SSL handshake haproxy | <134>1 2026-04-08T09:21:17.221063+00:00 - haproxy 7 - [haproxykafka@0 -|<BADREQ>|<BADREQ>|3] 172.24.0.1:44542 [08/Apr/2026:09:21:16.375] https-in/1: Connection closed during SSL handshake
If I use the very same configuration against haproxy 3.0 I get this:
haproxy | <134>1 2026-04-08T09:37:03.868168+00:00 - haproxy 8 - - 172.24.0.1:45190 [08/Apr/2026:09:37:03.165] https-in/1: Connection closed during SSL handshake haproxy | <134>1 2026-04-08T09:37:06.778075+00:00 - haproxy 8 - - 172.24.0.1:45200 [08/Apr/2026:09:37:06.472] https-in/1: Connection closed during SSL handshake haproxy | <134>1 2026-04-08T09:37:07.783383+00:00 - haproxy 8 - - 172.24.0.1:45216 [08/Apr/2026:09:37:07.339] https-in/1: Connection closed during SSL handshake
as explained before, with 3.0 log-format-sd gets completely ignored
Summarizing here a talk we had on slack with @Vgutierrez and @Fabfur :
- In v3.0 we were experiencing unexpected sequence-id increment. This is fixed with v3.2 as of today.
- We were not-logging a lot of lines seen as invalid-messages in HAProxyKafka, and this is fixed with v3.2.
The important thing to remember from this investigation is, from Valentin:
"it's now clear that under certain configurations HAProxy will increase %rt more than once per request"
For the data-engineering team, we need to change our validation mechanism:
- Remove the filter that prevents counting BADREQ in the sequenceIds to only look for actual holes
- Either stop alerting on BADREQ number, or add a dedicated threshold (we currently use the same threshold for sequenceId and BADREQ problems) (https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/main/dags/webrequest/refine_webrequest_hourly_dag_factory.py?ref_type=heads#L208)
Change #1269389 had a related patch set uploaded (by Fabfur; author: Fabfur):
[operations/puppet@production] Revert "haproxy: temporary removing haproxy3.2 specific conf"
Removing myself as the task assignee so that someone else take it while I'm on holidays.
Reopening and tagging Data-Engineering since we do still need to make the pipeline changes that @JAllemandou suggests:
CC @Ahoelzl
Change #1275815 had a related patch set uploaded (by Joal; author: Joal):
[analytics/refinery@master] Update webrequest validation algorithm
joal opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2192
Update webrequest validation job thresholds
Change #1275815 merged by Xcollazo:
[analytics/refinery@master] Update webrequest validation algorithm
xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2192
Update webrequest validation job thresholds
Even though the patch applied cleanly and DAG refine_webrequest_hourly_text continues to move forward, the error email continues to be generated? CC @JAllemandou
The problem is due to a forgotten step in deploying the refinery repo. I have just fixed that. Let's validate that the errors stop before moving to done.
From current Ops Week:
- The ERROR emails have stopped.
- We are still getting WARNING emails quite frequently:
- refine_webrequest_hourly_text, pretty much every hour.
- refine_webrequest_hourly_upload ~10 in the last 2 days.
We need to tweak the warning thresholds...
@xcollazo Yes, we still have very frequent warnings.
https://airflow.wikimedia.org/dags/refine_webrequest_hourly_text/grid?search=refine_webrequest_hourly_text
Almost all the warning emails were sent here.
Summarizing a slack thread from @mforns here:
I've been looking at the "Data problem WARNING/ERROR" alerts.
We thought that after Joseph's fix to the webrequest_sequence_stats tables, the alerts would cease. But they are still there.
That said, I've checked most of them, and found that the missing rows are not big missing chunks of the sequence, but rather loose single rows.
Also, most of them coincide with peaks in HAProxyKafka message count. The clearer and higher the peak, the most percent lost of webrequests in the data lake.
Also, the peaks that happen when the daily volume is already high (and the message count is already substantial in HAProxyKafka) seem more likely to cause data loss.
From what I could see in several hours of troubleshooting, it seems to me that the sequence_stats are correct, and that we are actually losing that much percent of webrequests.
That said, almost always, there seems to be an non-organic spike that justifies the glitch. Meaning, it's probably an issue to be solved at the HAProxyKafka level.
The question is now, do we want to relax our alert thresholds so to receive less alerts?
