User Details
- User Since
- Nov 4 2014, 4:29 PM (605 w, 1 d)
- Availability
- Available
- IRC Nick
- bblack
- LDAP User
- BBlack
- MediaWiki User
- BBlack (WMF)
Tue, May 12
I mean, bytes are bytes, and saving them is always nice. But at the average size of our general response headers, a few bytes in this one case don't really move the needle much. I'd say just use whatever distinguishes and/or labels it clearly enough, within reason,
May 7 2026
May 6 2026
I'd suggest something less generic, given our headers have global context (in all the senses you can think of). A lot of different software has different notions of what a "Revision ID" is or what it might mean. Maybe something to do with the software (Hoarde?) or the project (Linked Artifact Cache?)? X-LAC-RevID? X-Hoarde-Rev? Just anything to make it a little more specific and less likely to collide with something else now or in the future.
Apr 24 2026
Has the lazyPush thing from https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1277135 been tried yet? (I don't think so). Is it worth trying?
Apr 17 2026
Mar 24 2026
- On the technical front:
Mar 20 2026
@ssingh figured out where we went wrong. At the TLS terminator level, it is enabled, but at the OS level (Linux sysctl settings), it is not. We did have it enabled at that level sometime in the distant past, but the setting was lost in some past reconfiguration. We'll fix it, but probably early next week!
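A rough sketch of how the OS-level half could be checked by hand, assuming the sysctl in question is `net.ipv4.tcp_fastopen` (the comment above does not name it). On Linux that value is a bitmask in which `0x1` enables TFO for outgoing connections and `0x2` enables it for listening sockets. The helper names `decode_tfo` and `read_tfo` are made up for illustration and are not part of any existing tooling.

```python
from pathlib import Path

TFO_CLIENT = 0x1  # TFO allowed on outgoing (client) connections
TFO_SERVER = 0x2  # TFO allowed on listening (server) sockets


def decode_tfo(value: int) -> dict:
    """Decode the client/server bits of a tcp_fastopen sysctl value."""
    return {
        "client": bool(value & TFO_CLIENT),
        "server": bool(value & TFO_SERVER),
    }


def read_tfo(path: str = "/proc/sys/net/ipv4/tcp_fastopen") -> dict:
    """Read and decode the live setting (Linux only; path is the assumed sysctl)."""
    return decode_tfo(int(Path(path).read_text().strip()))
```

A TLS terminator with TFO configured still needs the server bit set here; a value of `1`, for example, decodes to client-only, which would match the symptom described.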
TFO is still configured in our TLS terminators. We'll have to investigate to figure out what has gone wrong here. Possibly this is being stripped by our load balancers.
Mar 19 2026
This should be active everywhere now and fixed in live traffic. Re-open if not!
Yes, our VCL has this wrong, thanks for finding this! Patches incoming for the VCL examples/tests in the vmod, and for the production VCL.
Trying to pull these threads together for a bit of context. The question that isn't explicitly answered fully is this:
Mar 11 2026
Separately, drifting off into "thoughts for the future when we have time to pursue it": none of the options we have here have been rigorously compared under adversarial conditions.
Reviewing the current situation, using haproxy-3.0 documentation as a guide:
Mar 10 2026
I've been staring at cp2043 and backup2015 to try to find some explanation for why the latter doesn't show the same drops. Both are trixie and, according to LLDP, are connected to the same switch (lsw1-a2-codfw.mgmt).
Mar 3 2026
Feb 17 2026
While I'm on these esoteric subjects - another bonus thing that some operators do is place their nameserver *hostnames* in distinct TLDs operated by distinct operators. For example, having our 3x nameserver hostnames be ns1.wikimedia.org, ns2.wikimedia.net, ns3.wikimedia.org - because the .net TLD servers are distinct from the .org ones, and if either one were to have an (admittedly, very rare!) outage in servicing delegations towards some part of the internet, this might allow affected recursors to still reach at least one of them even if they've lost caching on the NS hostnames. Note that .com and .net share the gtld-servers so they're not really an independent pair.
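To make the "independent pair" point concrete, here is a minimal sketch that counts how many distinct TLD infrastructures a set of nameserver hostnames depends on. The `SHARED_INFRA` map is a hand-maintained assumption: only the .com/.net pairing comes from the comment above, and the function names are hypothetical.

```python
# Assumption: TLDs listed here share registry nameservers. Only the
# .com/.net pairing is taken from the discussion; extend by hand as needed.
SHARED_INFRA = {"com": "gtld-servers", "net": "gtld-servers"}


def tld_of(hostname: str) -> str:
    """Return the lowercased top-level label of a hostname."""
    return hostname.rstrip(".").rsplit(".", 1)[-1].lower()


def independent_groups(ns_hostnames) -> set:
    """Collapse TLDs that share infrastructure into a single group."""
    return {SHARED_INFRA.get(tld_of(h), tld_of(h)) for h in ns_hostnames}
```

With the three hostnames from the example, this yields two groups (`org` and `gtld-servers`), whereas a .com/.net pair alone collapses to one.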
Re: anycast catchments, diversity, resilience, etc (some of this is re-treading things said above, but bear with me):
Jan 16 2026
[...]
Jan 13 2026
[In fact, on that point, I'd note a quick survey of a handful of other major sites on the Internet shows a common pattern of 2 days for the NS records and 2-4 days on the matching address records. In any case, getting us to 1 day everywhere consistently would be a good start!]
We have to take this plunge someday, and that someday probably should've been years ago; there were just too many other pressing things to focus on for anyone to remember to come back here and look! A few notes:
Nov 12 2025
Oct 16 2025
Aug 12 2025
@Krinkle - Diving into the specifics a bit (I think all of this will be clearer if you make a prototype VCL patch for phase 1 maybe), and re-stating/asking things for clarity:
Jul 15 2025
Jun 3 2025
Jun 2 2025
That doesn't sound right to me.
May 27 2025
FYI from the edge TLS stats POV (the first graph in the description), if we ignore the null results and just look at TLSv1.2 / (TLSv1.2 + TLSv1.3), in the ~year since this ticket was created: the TLSv1.2 percentage has dropped from ~3.76% to ~2.78%. We still have quite a ways to go before we reach a comfortable number even in that simplistic analysis.
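The simplistic ratio above can be sketched as follows. The function name and the handshake counts in the usage note are hypothetical, chosen only to reproduce the ~2.78% figure; they are not real edge statistics.

```python
def tls12_share(tls12: int, tls13: int) -> float:
    """Percentage of TLSv1.2 among TLSv1.2 + TLSv1.3, ignoring null results."""
    total = tls12 + tls13
    if total == 0:
        raise ValueError("no TLSv1.2 or TLSv1.3 handshakes counted")
    return 100.0 * tls12 / total
```

For instance, `tls12_share(278, 9722)` gives roughly 2.78. Going from ~3.76% to ~2.78% is a drop of about 0.98 percentage points over the year.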
May 20 2025
May 7 2025
Apr 14 2025
Thank you!
Apr 11 2025
@bd808 - Can you review the proposed fixups in the MR above? Thank you!
Apr 8 2025
Apr 7 2025
I don't think this seems very problematic from our perspective. If anything, it will reduce traffic from agents which formerly followed the whole 307 chain, and now may stop earlier on a 200.
Jan 8 2025
Dec 20 2024
Opened L3SC request for Legal review after the holidays: https://app.asana.com/0/1202858056100648/1209024900407685/f
Dec 13 2024
Dec 10 2024
Dec 9 2024
Dec 4 2024
Also, probably the way to standardize this for sanity (avoiding ORIGIN mistakes on both ends) is to follow some simple rules that:
Seems like a net win to me. Reduces some error-prone process stuff and makes life simpler!
Oct 25 2024
Ah interesting! We should confirm that and perhaps avoid the set-cookie entirely on cookies that are (or at least are intended to be) ~ SameSite=Lax|Strict then, I guess?
I don't think it's necessarily always up to us to be able to know it's cross-origin, though, right? It would depend on the $random_other_site's CORS whether they tell us about a referrer at all?
Oct 24 2024
Do we have a specific example of a URL and which cookies triggered the rejections? In my own quick repro attempt, I only saw them failing on actually cross-domain traffic (in my case, an enwiki page was loading Math SVG content from https://wikimedia.org/api/..., and it was the cookies coming with that response that were rejected).
Seems like all of these Varnish-level cookies mentioned at the top should at least gain appropriate, explicit SameSite= attributes, in addition to perhaps Partitioned as appropriate (only NetworkProbeLimit currently carries a SameSite attribute at all).
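One way to audit this would be to scan the `Set-Cookie` values on a response and list the cookies that carry no explicit `SameSite` attribute. The sketch below is a plain string check under that assumption; the function names are hypothetical and it is not a full cookie parser.

```python
def cookie_name(set_cookie: str) -> str:
    """Name of the cookie in a single Set-Cookie header value."""
    return set_cookie.split("=", 1)[0].strip()


def has_samesite(set_cookie: str) -> bool:
    """True if any attribute after the first ';' is SameSite (case-insensitive)."""
    attrs = set_cookie.split(";")[1:]
    return any(a.strip().lower().split("=", 1)[0] == "samesite" for a in attrs)


def missing_samesite(headers) -> list:
    """Names of cookies whose Set-Cookie value lacks a SameSite attribute."""
    return [cookie_name(h) for h in headers if not has_samesite(h)]
```

Fed a response's `Set-Cookie` values, `missing_samesite` returns the cookie names that would still need an explicit attribute added at the Varnish level.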
Oct 11 2024
Sep 20 2024
Jul 23 2024
Note also Digicert's annual renewal is coming soon in T368560. We should maybe look at whether the OCSP URI is optional in the form for making the cert, and turn it off (assuming they also have CRLs working fine). Or if they're not ready for this, I guess Digicert waits another year.
Firefox has historically been the reason we've been stapling OCSP for the past many years. If our certificate has an OCSP URI in its metadata, then Firefox will check OCSP in realtime (which is a privacy risk) unless our servers staple the OCSP to the TLS negotiation (which we do!). This applies to both our Digicert and LE unified certs (and I'm sure some other lesser cases as well!).
Jul 2 2024
^ While we can maintain the VCL-level hacks for now, it would be best to both dig into how this actually happened (most likely, we ourselves emitted donate.m links from a wiki, probably donatewiki itself?), and to come up with a permanent solution at the application layer (fix the wiki to support these links properly and directly). We don't want to keep accumulating hacks like these in our already-overly-complex VCL code if unwarranted.
Jun 27 2024
Note there was some phab/brain lag here: I wrote this before I saw joe's last response above, so they overlap a bunch.
Jun 7 2024
Jun 4 2024
Jun 3 2024
Re: "same logic" - they're different protocols, different hierarchies, and much different on the client behavior front as well. It doesn't make sense to share a strategy between the two.
May 31 2024
Yes, from a resiliency POV, in some senses keeping unicasts in the mix is an answer (and it's the answer we currently rely on). In a world with only very smart and capable resolvers, the simplest answer probably is the current setup. And indeed, not-advertising ns2 from the core DCs would be a very slight resiliency win over that.
Yeah, my general thinking was to get ns1-anycast going first, and then figure out any of the above about better resiliency before we consider withdrawing ns0-unicast completely.
Yeah, I've looked at this from the deep-ntp-details POV and it's all pretty sane. We're in alignment with the recommendations in https://www.rfc-editor.org/rfc/rfc8633.html#page-17 and it should result in good time sync stability.
May 30 2024
On that future discussion topic (sorry I'm getting nerdsniped!) - Yeah, I had thought about prepending (vs the hard A/B cutoff) as well, but I tend to think it doesn't offer as much resiliency as the clean split.
Re: anycast-ns1 and future plans, etc (I won't quote all the relevant bits from both msgs above):
May 25 2024
There are brand/identity dilution and confusion issues with using any of *.wiki in an official capacity, especially as canonical redirectors for Wikipedia itself, which is why we didn't start using these many years ago when they were first offered for free.
May 23 2024
May 22 2024
I'm a little leery of dropping the TTL really-short. I get the argument for the normal case, but we also have to consider the possibility that something out there on the Internet could cause traffic surges to some of these URLs and we'd lose some amount of caching defenses against it with a short TTL (esp if we're also no longer pregenerating them, making such traffic more-expensive on the inside). Re-routing sounds better? Or perhaps even-better would be a full-on redirect to the new parsoid URL paths?
May 17 2024
What a fun deep-dive! :)
May 16 2024
May 15 2024
May 14 2024
Also similarly T214998
T215071 <- throwing this in here for semi-related context. Maybe we can align on a potential common future URI scheme anyways, while not actually yet tackling that one.
May 10 2024
Should be all set, may take up to ~30 minutes for changes to propagate.
May 8 2024
The patch should fix things up. Let me know if there are still problems after ~half an hour, to let the change propagate through the systems.
May 3 2024
We could choose to use subdivision-level mapping in cases where it makes sense.
Mar 27 2024
Mar 26 2024
Jan 19 2024
We discussed this in Traffic earlier this week, and I ended up implementing what I think is a reasonable solution already, so now I've made this ticket for the paper trail and to cover the followup work to debianize and usefully-deploy it. The core code for it is published at https://github.com/blblack/tofurkey .
Dec 5 2023
The perf issues are definitely relevant for traffic's use of haproxy (in a couple of different roles). Your option (making a libssl1.1-dev for bookworm that tracks the sec fixes that are still done for the bullseye case, and packaging our haproxy to build against it) would be the easiest path from our POV, for these cases.
