Page MenuHomePhabricator

Surge in webrequest validation check
Open, MediumPublic

Description

Since 2026-03-31T13:00, the webrequest validation check reports warning or error for every hour for the text cache.
When looking at hosts, the problem seems to be coming from the drmrs, ulsfo and magru datacenters (codfw also has problematic percentages, but the traffic is very small).
The problem seems related to bad requests being received by the caches.
I have gathered data in this spreadsheet: https://docs.google.com/spreadsheets/d/1aDdhoeruzzWXiIe1gHwy0spj1Odmqd6P1hZQe8TmyFY/edit?usp=sharing

Details

Related Changes in Gerrit:
Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
Update webrequest validation job thresholdsrepos/data-engineering/airflow-dags!2192joalupdate_webrequest_validation_thresholdsmain
Customize query in GitLab

Event Timeline

This could be related to upgrade to HAProxy 3.2 (T421402) that started on the drmrs datacenter, we'll investigate if the sequence-id is now generated differently

Change #1266301 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] cache::haproxy: rename deprecated instructions in haproxy 3.2

https://gerrit.wikimedia.org/r/1266301

This is most probably due to a deprecation in haproxy configuration directives https://www.haproxy.com/blog/reviewing-every-new-feature-in-haproxy-3-1#deprecation, especially

option     accept-invalid-http-request
option     accept-invalid-http-response

Change #1266301 merged by Fabfur:

[operations/puppet@production] cache::haproxy: rename deprecated instructions in haproxy 3.2

https://gerrit.wikimedia.org/r/1266301

The patch has been applied to all impacted hosts, thanks @JAllemandou for spotting this!

JAllemandou renamed this task from Surge in webrequest sequence-id validation check to Surge in webrequest validation check.Apr 2 2026, 6:42 AM
JAllemandou updated the task description. (Show Details)

This is most probably due to a deprecation in haproxy configuration directives https://www.haproxy.com/blog/reviewing-every-new-feature-in-haproxy-3-1#deprecation, especially

option     accept-invalid-http-request
option     accept-invalid-http-response

This is not accurate, applying this config change didn't solve the issue. An interesting change in behavior from 3.0 to 3.2 and that could be related is that after the upgrade, haproxykafka number of invalid messages dropped to 0: https://grafana.wikimedia.org/goto/afi95ztecv20wd?orgId=1

An interesting change in behavior from 3.0 to 3.2 and that could be related is that after the upgrade, haproxykafka number of invalid messages dropped to 0: https://grafana.wikimedia.org/goto/afi95ztecv20wd?orgId=1

This confirms another change we're experiencing with the new version:

  • Version 3.0: the number of problematic sequence-id rows is ~2x the number of BADRED rows
  • Version 3.2: the number of problematic sequence-id rows is equvalent to the number of BADREQ rows

--> The "lost" sequence-id rows were going in Invalid messages!

We talked last Thursday with @Fabfur, and he told me we should adapt our validation mechanism, because the change you experience was valid from your point of view.
Can we get a formal confirmation this is the way to go please ? :)

It looks like the root cause is MEDIUM: log/session: handle embryonic session log within sess_log(). A change introduced in HAProxy 3.1 as part of the work done to introduce the log profiles feature.

From the commit message:

Also, thanks to this change, log-format-sd will now be taken into account
for legacy embryonic session logging.

All these new BADREQ rows are actually TLS handshake errors that under 3.0 were being discarded by haproxykafka cause the log message wasn't a valid RFC5424 message.

It looks like the root cause is MEDIUM: log/session: handle embryonic session log within sess_log(). A change introduced in HAProxy 3.1 as part of the work done to introduce the log profiles feature.

From the commit message:

Also, thanks to this change, log-format-sd will now be taken into account
for legacy embryonic session logging.

All these new BADREQ rows are actually TLS handshake errors that under 3.0 were being discarded by haproxykafka cause the log message wasn't a valid RFC5424 message.

Nice finding, thanks!

My hypotheses about the following bugs being backported to 3.0.12 is now obsolete but I'll write it down anyway for future reference:

MINOR: http: add a function to validate characters of :authority
BUG/MEDIUM: h2/h3: reject some forbidden chars in :authority before reassembly
BUG/MEDIUM: h1/h2/h3: reject forbidden chars in the Host header field

(These 3 patches has been introduced in 3.2 and backported also to 3.0.12. Considering official Debian Trixie repositories are on 3.0.11, it's plausible that this behavior isn't showed on the non-upgraded hosts)

I would say that we can proceed with upgrading HAProxy to 3.2 on the remaining datacenters

Thanks for confirming the invalid-events change @Vgutierrez.
There still is something I don't understand:

  • The pattern we see in v3.0 seems to show that haproxykafka generates a sequence-id even when it discards a log as invalid, as we see 2x the number of invalid seqId per BADREQ.
  • With v3,2 we see a big increase in BADREQ that don't correlate with the number of invalid-seqenceId we were reporting previsouly: there are way more.

Could it be possible that my assumption that haproxykafka generates a seqId even for invalid messages is partially true, as in, only a small subset of invalid messages generate seqId?

Yes, sequence numbers are enerated by haproxy itself, even if it results in a SSL handshake error where the sequence number doesn't reach haproxykafka when using haproxy 3.0 because the log format is ignored for that kind of error.

With 3.2, the log format is honored and haproxykafa is seeing all these sequence numbers

Yes, sequence numbers are enerated by haproxy itself, even if it results in a SSL handshake error where the sequence number doesn't reach haproxykafka when using haproxy 3.0 because the log format is ignored for that kind of error.

With 3.2, the log format is honored and haproxykafa is seeing all these sequence numbers

This perfectly make sense.
But: Since in v3.0 we had 2x more invalid-seqId than BADREQ, I would have expected with v3.2 to get roughly 2x more BADREQ, assuming the newly valid SSL handshake logs are generated as such.
The concern I have is that we see 10x more, not 2x more.

image.png (461×741 px, 53 KB)

I've replicated locally a SSL handshake failure using haproxy with log-format-sd %{+E}o\ [haproxykafka@0\ %[capture.req.hdr(0),json(ascii)]|%HPO|%HQ|%rt]

the sequence number (%rt) is logged as the last field before ],
with haproxy 3.2 it gets logged like this:

haproxy  | <134>1 2026-04-08T09:21:07.178911+00:00 - haproxy 7 - [haproxykafka@0 -|<BADREQ>|<BADREQ>|0] 172.24.0.1:45128 [08/Apr/2026:09:21:04.950] https-in/1: Connection closed during SSL handshake
haproxy  | <134>1 2026-04-08T09:21:12.764427+00:00 - haproxy 7 - [haproxykafka@0 -|<BADREQ>|<BADREQ>|1] 172.24.0.1:44526 [08/Apr/2026:09:21:12.251] https-in/1: Connection closed during SSL handshake
haproxy  | <134>1 2026-04-08T09:21:15.979792+00:00 - haproxy 7 - [haproxykafka@0 -|<BADREQ>|<BADREQ>|2] 172.24.0.1:44536 [08/Apr/2026:09:21:15.705] https-in/1: Connection closed during SSL handshake
haproxy  | <134>1 2026-04-08T09:21:17.221063+00:00 - haproxy 7 - [haproxykafka@0 -|<BADREQ>|<BADREQ>|3] 172.24.0.1:44542 [08/Apr/2026:09:21:16.375] https-in/1: Connection closed during SSL handshake

If I use the very same configuration against haproxy 3.0 I get this:

haproxy  | <134>1 2026-04-08T09:37:03.868168+00:00 - haproxy 8 - - 172.24.0.1:45190 [08/Apr/2026:09:37:03.165] https-in/1: Connection closed during SSL handshake
haproxy  | <134>1 2026-04-08T09:37:06.778075+00:00 - haproxy 8 - - 172.24.0.1:45200 [08/Apr/2026:09:37:06.472] https-in/1: Connection closed during SSL handshake
haproxy  | <134>1 2026-04-08T09:37:07.783383+00:00 - haproxy 8 - - 172.24.0.1:45216 [08/Apr/2026:09:37:07.339] https-in/1: Connection closed during SSL handshake

as explained before, with 3.0 log-format-sd gets completely ignored

Summarizing here a talk we had on slack with @Vgutierrez and @Fabfur :

  • In v3.0 we were experiencing unexpected sequence-id increment. This is fixed with v3.2 as of today.
  • We were not-logging a lot of lines seen as invalid-messages in HAProxyKafka, and this is fixed with v3.2.

The important thing to remember from this investigation is, from Valentin:
"it's now clear that under certain configurations HAProxy will increase %rt more than once per request"

For the data-engineering team, we need to change our validation mechanism:

Change #1269389 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] Revert "haproxy: temporary removing haproxy3.2 specific conf"

https://gerrit.wikimedia.org/r/1269389

Removing myself as the task assignee so that someone else take it while I'm on holidays.

Vgutierrez claimed this task.

Closing the task as this has been investigated

xcollazo reopened this task as Open.EditedApr 10 2026, 2:53 PM

Reopening and tagging Data-Engineering since we do still need to make the pipeline changes that @JAllemandou suggests:

...
For the data-engineering team, we need to change our validation mechanism:

CC @Ahoelzl

xcollazo triaged this task as Medium priority.
xcollazo added a subscriber: Ahoelzl.

Change #1275815 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Update webrequest validation algorithm

https://gerrit.wikimedia.org/r/1275815

Change #1275815 merged by Xcollazo:

[analytics/refinery@master] Update webrequest validation algorithm

https://gerrit.wikimedia.org/r/1275815

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2192

Update webrequest validation job thresholds

Now that the MR is merged in, will wait till next couple hourly runs to validate.

Even though the patch applied cleanly and DAG refine_webrequest_hourly_text continues to move forward, the error email continues to be generated? CC @JAllemandou

The problem is due to a forgotten step in deploying the refinery repo. I have just fixed that. Let's validate that the errors stop before moving to done.

From current Ops Week:

  • The ERROR emails have stopped.
  • We are still getting WARNING emails quite frequently:
    • refine_webrequest_hourly_text, pretty much every hour.
    • refine_webrequest_hourly_upload ~10 in the last 2 days.

We need to tweak the warning thresholds...

Maybe we should double all thresholds.

Summarizing a slack thread from @mforns here:

I've been looking at the "Data problem WARNING/ERROR" alerts.
We thought that after Joseph's fix to the webrequest_sequence_stats tables, the alerts would cease. But they are still there.
That said, I've checked most of them, and found that the missing rows are not big missing chunks of the sequence, but rather loose single rows.
Also, most of them coincide with peaks in HAProxyKafka message count. The clearer and higher the peak, the most percent lost of webrequests in the data lake.
Also, the peaks that happen when the daily volume is already high (and the message count is already substantial in HAProxyKafka) seem more likely to cause data loss.
From what I could see in several hours of troubleshooting, it seems to me that the sequence_stats are correct, and that we are actually losing that much percent of webrequests.
That said, almost always, there seems to be an non-organic spike that justifies the glitch. Meaning, it's probably an issue to be solved at the HAProxyKafka level.
The question is now, do we want to relax our alert thresholds so to receive less alerts?