While I'm on these esoteric subjects: another bonus thing some operators do is place their nameserver *hostnames* in distinct TLDs run by distinct operators. For example, our 3x nameserver hostnames could be ns1.wikimedia.org, ns2.wikimedia.net, ns3.wikimedia.org. Because the .net TLD servers are distinct from the .org ones, if either set were to have an (admittedly very rare!) outage in servicing delegations towards some part of the Internet, affected recursors might still reach at least one of our nameservers even if they've lost their caches of the NS hostnames. Note that .com and .net share the gtld-servers, so they're not really an independent pair.
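To make the idea concrete, here's a small illustrative sketch (not anything we actually run; the hostname list and the com/net grouping are just taken from the example above) of checking that a nameserver set spans more than one independent TLD operator:

```python
# Illustrative only: map each NS hostname to its TLD "operator group",
# treating .com and .net as a single group since they share the gtld-servers.
SHARED_GROUPS = {"com": "com/net", "net": "com/net"}

def tld_groups(ns_hostnames):
    groups = set()
    for name in ns_hostnames:
        tld = name.rstrip(".").rsplit(".", 1)[-1].lower()
        groups.add(SHARED_GROUPS.get(tld, tld))
    return groups

# The hypothetical layout from above spans two independent groups,
# returning the 2-element set {'org', 'com/net'}:
tld_groups(["ns1.wikimedia.org", "ns2.wikimedia.net", "ns3.wikimedia.org"])
```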
Yesterday
Re: anycast catchments, diversity, resilience, etc (some of this is re-treading things said above, but bear with me):
Jan 16 2026
In T81605#11530397, @cmooney wrote:But it seems Bind does not cache the glue records / additional that comes back from the .org authdns. At least for any length of time. It immediately makes separate A record queries for the nameservers .org returns, sent to the IPs from the glue records. It's the responses from those direct queries to us that it caches, with a TTL of 86400.
In T81605#11529143, @cmooney wrote:In T81605#11518553, @ssingh wrote:Our glue records also have a disparity.
I was interested to know what effect this would have. One data-point for Bind (at least my local instance) is that it is caching the TTL that comes back from our servers, not the glue records in the ORG zone:
cathal@officepc:~$ dig A ns0.wikimedia.org @192.168.240.1
[...]
Jan 13 2026
[In fact, on that point, I'd note a quick survey of a handful of other major sites on the Internet shows a common pattern of 2 days for the NS records and 2-4 days on the matching address records. In any case, getting us to 1 day everywhere consistently would be a good start!]
We have to take this plunge someday, and that someday probably should've been years ago; there were just too many other pressing things to focus on for anyone to remember to come back here and look! A few notes:
Nov 12 2025
In T368344#11366800, @HCoplin-WMF wrote:Question about:
When authenticating with cookies, all cookies use the Secure flag already, so a reasonable client would not send them over HTTP.
However, it's unclear where/when the redirect is actually happening.
Aug 12 2025
@Krinkle - Diving into the specifics a bit (I think all of this will be clearer if you make a prototype VCL patch for phase 1 maybe), and re-stating/asking things for clarity:
Jun 3 2025
In T395827#10879292, @Milimetric wrote:So 178877 / 20294660 = 0.0088, which means 0.88%
Pretty close.
Jun 2 2025
That doesn't sound right to me.
May 27 2025
FYI from the edge TLS stats POV (the first graph in the description), if we ignore the null results and just look at TLSv1.2 / (TLSv1.2 + TLSv1.3), in the ~year since this ticket was created: the TLSv1.2 percentage has dropped from ~3.76% to ~2.78%. We still have quite a ways to go before we reach a comfortable number even in that simplistic analysis.
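For reference, the simplistic ratio above is just the following (the request counts here are made-up placeholders purely to illustrate the arithmetic, not the real traffic numbers):

```python
# TLSv1.2 / (TLSv1.2 + TLSv1.3), as a percentage, with null results
# already excluded. Counts are illustrative placeholders.
def tls12_share_pct(tls12_reqs, tls13_reqs):
    return 100.0 * tls12_reqs / (tls12_reqs + tls13_reqs)

tls12_share_pct(278, 9722)  # -> 2.78
```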
May 7 2025
In T391959#10774171, @dr0ptp4kt wrote:In T391959#10769333, @tchin wrote:subject_id should be base64-decodable and its length prior to decoding should be at least 22 characters
Isn't the length of a base64 string always supposed to be a multiple of 4? Where does the 22 come from?
I think the intent is to apply sodium_base64_VARIANT_URLSAFE_NO_PADDING to a hash output that is a uint8_t[16], although better to verify - good question! @BBlack can you confirm that we may see the value without = symbol padding when it comes across to EventGate? And is 22 the correct shortest length in the initial implementation, or might we expect a longer one (e.g. 44 characters, which coincidentally would cleanly divide by 4)?
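For what it's worth, the 22 does check out arithmetically for a 16-byte input: base64 emits 4 output characters per 3 input bytes, so 16 bytes pad out to 24 characters, the last two of which are '=' padding. A quick sketch in Python (standing in for the PHP sodium call, on the assumption that the two encodings are equivalent):

```python
import base64
import os

raw = os.urandom(16)  # stand-in for the uint8_t[16] hash output
encoded = base64.urlsafe_b64encode(raw)  # 24 chars, ending in '=='
unpadded = encoded.rstrip(b"=")          # 22 chars, URL-safe, no padding
len(unpadded)  # -> 22
```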
Apr 14 2025
Thank you!
Apr 11 2025
@bd808 - Can you review the proposed fixups in the MR above? Thank you!
Apr 7 2025
I don't think this is very problematic from our perspective. If anything, it will reduce traffic from agents which formerly followed the whole 307 chain and now may stop earlier on a 200.
Dec 20 2024
Opened L3SC request for Legal review after the holidays: https://app.asana.com/0/1202858056100648/1209024900407685/f
Dec 4 2024
Also, probably the way to standardize this for sanity (avoiding ORIGIN mistakes on both ends) is to follow some simple rules that:
Seems like a net win to me. Reduces some error-prone process stuff and makes life simpler!
Oct 25 2024
Ah interesting! We should confirm that and perhaps avoid the set-cookie entirely on cookies that are (or at least are intended to be) ~ SameSite=Lax|Strict then, I guess?
I don't think it's necessarily always up to us to be able to know it's cross-origin, though, right? It would depend on the $random_other_site's CORS whether they tell us about a referrer at all?
Oct 24 2024
Do we have a specific example of a URL and which cookies triggered the rejections? In my own quick repro attempt, I only saw them failing on actually cross-domain traffic (in my case, an enwiki page was loading Math SVG content from https://wikimedia.org/api/..., and it was the cookies coming with that response that were rejected).
Seems like all of these Varnish-level cookies mentioned at the top should at least gain appropriate, explicit SameSite= attributes, in addition to perhaps Partitioned as appropriate (only NetworkProbeLimit currently carries a SameSite attribute at all).
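As a concrete illustration of the explicit attributes (the cookie value is hypothetical, and Python's http.cookies is used purely for demonstration — the real headers would be emitted from the Varnish/VCL side; note also that http.cookies has no Partitioned support, so that attribute would have to be appended by hand):

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["NetworkProbeLimit"] = "250"  # hypothetical value for illustration
cookie["NetworkProbeLimit"]["secure"] = True
cookie["NetworkProbeLimit"]["samesite"] = "Lax"
# Emits a Set-Cookie line carrying explicit Secure and SameSite=Lax attributes:
print(cookie.output())
```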
Oct 11 2024
In T327286#10194390, @ZauberViolino wrote:Recent days, WMF somehow changed their GeoDNS so that requests from China would get a not yet blocked IPv4 address.
I have to mention that on Oct 1 WMF projects were resolved back to the blocked servers in San Francisco (ulsfo) with the IPv4 address 198.35.26.96.
Jul 23 2024
Note also Digicert's annual renewal is coming soon in T368560. We should maybe look at whether the OCSP URI is optional in the form for making the cert, and turn it off (assuming they also have CRLs working fine). Or if they're not ready for this, I guess Digicert waits another year.
Firefox has historically been the reason we've been stapling OCSP for the past many years. If our certificate has an OCSP URI in its metadata, then Firefox will check OCSP in realtime (which is a privacy risk) unless our servers staple the OCSP to the TLS negotiation (which we do!). This applies to both our Digicert and LE unified certs (and I'm sure some other lesser cases as well!).
Jul 2 2024
^ While we can maintain the VCL-level hacks for now, it would be best to both dig into how this actually happened (most likely, we ourselves emitted donate.m links from a wiki, probably donatewiki itself?), and to come up with a permanent solution at the application layer (fix the wiki to support these links properly and directly). We don't want to keep accumulating hacks like these in our already-overly-complex VCL code if unwarranted.
Jun 27 2024
Note there was some phab/brain lag here: I wrote this before I saw joe's last response above, so they overlap a bunch.
Jun 7 2024
In T365690#9865523, @Abbe98 wrote:One problem that arises from that approach is the inability of the service’s owner to be able to know if there are individual users who are using a disproportionate amount of resources, or even if there is an abuser in the mix. This is the classic “tragedy of the commons” situation. Also use of authentication here is not to keep an eye on who is using the API and for what reason. Anyone can use anonymous email accounts to create an account and use that without disclosing any personal information. It is mostly meant to keep users of the APIs updated on changes as well and to give warnings for deprecations as well. This all assumes that the user of the APIs has not abandoned the Open source project and someone is still maintaining it.
This could be said for all Wikimedia APIs.
Jun 3 2024
Re: "same logic" - they're different protocols, different hierarchies, and much different on the client behavior front as well. It doesn't make sense to share a strategy between the two.
In T366193#9853619, @ayounsi wrote:I think the difficult part is where to stop the overengineering, for example it could make sense to use Liberica to healthcheck/advertise one of the NS anycast IP, but it might not be worth using a different AuthDNS software on half the servers, or a different Puppet infra.
May 31 2024
Yes, from a resiliency POV, in some senses keeping unicasts in the mix is an answer (and it's the answer we currently rely on). In a world with only very smart and capable resolvers, the simplest answer probably is the current setup. And indeed, not-advertising ns2 from the core DCs would be a very slight resiliency win over that.
Yeah my general thinking was get ns1-anycast going first, and then figure out any of the above about better resiliency before we consider withdrawing ns0-unicast completely.
Yeah, I've looked at this from the deep-ntp-details POV and it's all pretty sane. We're in alignment with the recommendations in https://www.rfc-editor.org/rfc/rfc8633.html#page-17 and it should result in good time sync stability.
May 30 2024
On that future discussion topic (sorry I'm getting nerdsniped!) - Yeah, I had thought about prepending (vs the hard A/B cutoff) as well, but I tend to think it doesn't offer as much resiliency as the clean split.
Re: anycast-ns1 and future plans, etc (I won't quote all the relevant bits from both msgs above):
May 25 2024
There are brand/identity dilution and confusion issues with using any of *.wiki in an official capacity, especially as canonical redirectors for Wikipedia itself, which is why we didn't start using these many years ago when they were first offered for free.
May 23 2024
In T365630#9822721, @daniel wrote:In T365630#9822532, @BBlack wrote:Re-routing sounds better? Or perhaps even-better would be a full-on redirect to the new parsoid URL paths?
The idea is that we don't implement long term caching with active purging for the new endpoints at all. If we don't need it for the old endpoints, we don't need it for the new endpoints. This would make the architecture and the migration a whole lot simpler.
It seems to me like a couple of minutes would be good enough at least for organic spikes. Do you think the edge cache is effective protection against DDoS?
May 22 2024
I'm a little leery of dropping the TTL really-short. I get the argument for the normal case, but we also have to consider the possibility that something out there on the Internet could cause traffic surges to some of these URLs and we'd lose some amount of caching defenses against it with a short TTL (esp if we're also no longer pregenerating them, making such traffic more-expensive on the inside). Re-routing sounds better? Or perhaps even-better would be a full-on redirect to the new parsoid URL paths?
May 17 2024
What a fun deep-dive! :)
May 16 2024
In T364126#9805638, @akosiaris wrote:
- They do have a lot of presence all over the world. Presence we don't have currently and would take us decades to obtain. And via CP3, they effectively end up being a CDN. If we can take advantage of that, we might be able to reap some of the benefits we reap by setting up our own PoPs, all without having to pay the cost of setting up our own PoPs. It is conceivable that we end up giving people in underrepresented areas of the world with a better performance for our sites, effectively furthering the mission.
In T364126#9785855, @akosiaris wrote:Media (images, video, etc) are served from upload.wikimedia.org and are requested without Cookies.
May 14 2024
Also similarly T214998
T215071 <- throwing this in here for semi-related context. Maybe we can align on a potential common future URI scheme anyways, while not actually yet tackling that one.
May 10 2024
Should be all set, may take up to ~30 minutes for changes to propagate.
May 8 2024
The patch should fix things up; let me know if there are still problems after ~half an hour to let the change propagate through the systems.
In T364400#9779996, @hnowlan wrote:Could we implement this remapping at the ATS layer rather than the Apache one, in a manner that would mean that when we need to cache we only need to store each effective URL once?
May 3 2024
We could choose to use subdivision-level mapping in cases where it makes sense.
Jan 19 2024
We discussed this in Traffic earlier this week, and I ended up implementing what I think is a reasonable solution already, so now I've made this ticket for the paper trail and to cover the followup work to debianize and usefully-deploy it. The core code for it is published at https://github.com/blblack/tofurkey .
Dec 5 2023
The perf issues are definitely relevant for traffic's use of haproxy (in a couple of different roles). Your option (making a libssl1.1-dev for bookworm that tracks the sec fixes that are still done for the bullseye case, and packaging our haproxy to build against it) would be the easiest path from our POV, for these cases.
Nov 29 2023
Followup: did a 3-minute test of the same pair of parameter changes on cp3066 for a higher-traffic case. No write failures detected via strace in this case (we don't have the error log outputs to go by in 9.1 builds). mtail CPU usage at 10ms polling interval was significantly higher than it was in ulsfo, but still seems within reason overall and not saturating anything.
I went on a different tangent with this problem, and tried to figure out why we're having ATS fail writes to the notpurge log pipe in the first place. After some hours of digging around this problem (I'll spare you endless details of temporary test configs and strace outputs of various daemons' behavior, etc), these are the basic issues I see:
Nov 7 2023
I don't suspect it serves any real purpose at present, unless it was to avoid some filtering that exists elsewhere to avoid cross-site sharing of /32 routes or something.
Oct 16 2023
One potential issue with relying solely on MSS reduction is that, obviously, it only affects TCP. For now this is fine, as long as we're only using LVS (or future liberica) for TCP traffic (I think that's currently the case for LVS anyways!), but we could add UDP-based things in the future (e.g. DNS and QUIC/HTTP3), at which point we'll have to solve these problems differently.
In T348837#9253720, @cmooney wrote:The one thing you may not be able to control with mtu/advmss on a route is traffic to the local subnet, as that route is added by the kernel when the IP is added to the interface. Not sure if that can be modified to differ from interface MTU.
Could we take the opposite approach with the MTU fixup for the tunneling, and arrange the host/interface settings on both sides (the LBs and the target hosts) such that they only use a >1500 MTU on the specific unicast routes for the tunnels, but default to their current 1500 for all other traffic? If per-route MTU can usefully be set higher than base interface MTU, this seems trivial, but even if not, surely with some set of ip commands we could set the iface MTU to the higher value, while clamping it back down to 1500 for all cases except the tunnel.
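One possible shape of that second variant, as an untested sketch (interface names, addresses, and the 1600-byte tunnel MTU are all hypothetical placeholders):

```
# Raise the base interface MTU so tunnel traffic can exceed 1500:
ip link set dev eno1 mtu 1600
# Per-route MTU keeps the larger value only toward the tunnel endpoint:
ip route add 10.64.0.5/32 dev eno1 mtu 1600
# Clamp the local subnet route (the kernel-added one) and the default
# route back down to 1500 for everything else:
ip route change 10.64.0.0/22 dev eno1 mtu 1500
ip route change default via 10.64.0.1 dev eno1 mtu 1500
```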
Oct 3 2023
Looks about right to me!
We could add some normalization function at the ferm or puppet-dns-lookup layer perhaps (lowercase and do the zeros in a consistent way)?
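For the normalization itself, something like Python's ipaddress module would cover both rules in one call (the example address below is from the IPv6 documentation prefix; whether this belongs at the ferm or puppet-dns-lookup layer is the open question):

```python
import ipaddress

def normalize_ip(addr: str) -> str:
    # RFC 5952 canonical form: lowercase hex digits, with the longest
    # zero-run compressed to '::' in a consistent way.
    return ipaddress.ip_address(addr).compressed

normalize_ip("2001:0DB8:0:0:0:0:0:1")  # -> '2001:db8::1'
```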