Wed, Jun 19
Implementing a blanket redirect to the legacy blog URI for ^/20(0[7-9]|1[0-8])/ should be feasible in VCL or Lua at the edge. Alternatively, we could just leave it alone and pick another hostname.
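For illustration only, a rough VCL sketch of what that blanket redirect could look like (the hostnames below are placeholder assumptions, not our actual config):

```
sub vcl_recv {
    # Hypothetical: catch year-prefixed legacy blog paths (2007-2018) and redirect them.
    if (req.http.Host == "blog.wikimedia.org" && req.url ~ "^/20(0[7-9]|1[0-8])/") {
        return (synth(301, "Moved Permanently"));
    }
}

sub vcl_synth {
    if (resp.status == 301) {
        # Hypothetical legacy hostname; preserve the original path/query.
        set resp.http.Location = "https://blog-archive.wikimedia.org" + req.url;
        return (deliver);
    }
}
```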
Sat, Jun 8
The TLS-level error is just complaining that, at the end of the transaction, the connection was aborted abruptly instead of torn down cleanly. It would be better if gerrit's TLS stack cleanly closed the connection on 500s when it can, but the real issue here is probably the 500 error, not the TLS error. At a glance, the GET request headers look identical in the two cases, so I'm at a loss as to what's happening on gerrit's side here. Is there perhaps a request difference in some HTTP-level authentication or cookie stuff that's not shown in the trace?
Thu, Jun 6
@leila and @Miriam - Thanks for all the hard work here, it's truly outstanding the depth to which this analysis already goes, and it puts some useful numbers on the impact of expanding our edge network into under-served regions.
Thu, May 30
That alert basically means that a varnish frontend daemon crashed (and as usual was auto-restarted by a manager process). These are pretty rare and usually worth some investigation.
Wed, May 29
The failed reimage was finished up manually (probably not the reimager's fault)
Done. Are we ready to deploy it now, or still blocked on other MW-level deploys?
Tue, May 28
Plan seems reasonable based on the info in the description! Maybe wait longer than 2h after the linecard is restarted? Or do we suspect that any recurrence is much less likely with no traffic?
May 24 2019
That cloud rebranding link above also mentions wikimediacloud.org, which is yet another option nobody's exploiting yet. So even without getting into the over-long wikimediacloudservices.org, we have sufficient names to cover all the cases here (feel free to re-arrange, esp the latter two):
Ok, @aborrero caught me up on all the context on IRC so I can stop asking dumb questions (Thanks!).
May 23 2019
These are reimaged to role(spare::system) now. Over to @ayounsi for getting rid of all the special cases related to these hosts in the eqiad routers and switches (BGP stuff, fw filters, the special public-vlan LVS-balancer port groups, etc), and then we can move this on to dcops -level decom stuff.
Do these belong in wikimedia.org at all? It seems this has already been discussed, but I guess I lack some context.
A few thoughts:
May 22 2019
Either is fine. I assume you won't be able to do anything else with this (e.g. make https://gsuite-test.wikimedia.org/ work) without some followup records added on our side.
So we've reduced query volume by ~32% in T208263 . Since the last significant updates here, we've also deployed newer versions of our authdns software which perform even better, and refreshed some hardware as well. We're still in the basic scenario that we only have 3x singular authdns hosts in the world, but they're running with plenty of headroom in terms of handling query rate spikes and server outages. There are really two things holding us up on experimenting with lower TTLs for faster failover:
Scheme has been stable for ~1w now and seems to be working out fine. The net reduction in total authdns requests is ~32%. I suspect the drop in public requests for wiki hostnames is greater, as the total includes all of our internal/infrastructure lookups as well, but either way we should be seeing far fewer DNS cache misses out there in the world, especially for longer-tail / less-popular project and language combinations.
The above is deployed. I'd wait a full 10 minutes from the time of this comment (in case they've negative-cached the previous lookup), then re-test and let's see what happens.
The context of the second token is that all of our canonical wiki domains, including wikimedia.org, already have persistent Google Site Verification TXT tokens so that we can manage Google Search stuff for our own domains on a different Google system.
@HMarcus - The record is live, can you try the validation and let me know how it goes?
May 21 2019
Nevermind, apparently it was already repooled, looking at the wrong thing here...
It's been up for ~15 days now without incident, but depooled for frontend traffic. Re-pooling it today to see if we can get a recurrence or not.
FWIW, lvs1016 came back with correct settings after the single additional reboot above.
Current status of transition:
May 19 2019
Note https://gerrit.wikimedia.org/r/c/operations/puppet/+/511118 - I had to switch the lvs1015 cross-row ports for rows A and B (enp4s0f1 and enp5s0f0) backwards at the software level to match the physical reality shown by lldpcli show neighbors, which was backwards from the documented table of ports at the top of this task.

The current config works and we can keep it if we want. Note that I didn't make any other related changes, so if we keep this config, we probably need to edit the software port labels in the switch configurations to match, and possibly any physical labeling in the DC, to avoid future confusion.

Alternatively, before we put this machine in service, we could physically swap the cables back to the intended config at the rear of lvs1015, revert the mentioned puppet patch, and reimage the server again. Either way, there's probably some followup to do on this.
May 18 2019
May 16 2019
Outside of immediate emergency situations, resolving any blockers to get the remaining two LVSes into service should be a very high priority at this point.
Any chance this is interrelated with T222994 ?
May 14 2019
@kostajh - The OONI article you linked ( https://ooni.torproject.org/post/2019-china-wikipedia-blocking/ ) is accurate, and it's outside of our scope to communicate publicly and officially on that situation, if at all (that's more a matter for our Legal and Communications teams at a high level, and they are aware!).
May 9 2019
Our analytics seems to indicate the changes above had the intended effect in restoring normal levels of traffic from CN for affected projects:
Yeah, it's mostly just blocked on us making some time to deal with it, and time has been in extremely short supply lately, so we tend not to prioritize anything that doesn't have imminent impact. There are some subtleties to backing out that stuff in stages without breaking things.
@Cwek @lilydjwg - Thanks for the reports! I apologize; this time around the fallout should've been predictable given what we know about the mechanisms from https://ooni.io/post/2019-china-wikipedia-blocking/, but we just didn't think it through. I've pushed some changes above to move the CNAME target over to a new hostname, dyna.wikimedia.org, which should fix things assuming CN's censorship tactics remain otherwise stable. It will take up to roughly an hour for global DNS caches to catch up with the change, and then we can continue investigations from there.
May 8 2019
Putting this here for lack of a better place, for future reference:
May 5 2019
May 3 2019
@Miriam - sorry for the slowness!
May 2 2019
The current iteration of the proposed broadly-applied production version is in PS3 of the patch @ https://gerrit.wikimedia.org/r/c/operations/dns/+/507399/3 (and then a followup switch from 1H to 1D CNAME TTLs to go out shortly afterwards); we'll likely shoot for deployment early next week.
May 1 2019
Apr 25 2019
Re: wikibase.org, adding it as a non-canonical redirection to catch confusion from those who manually type URLs is fine, but we should make sure everyone is clear on which domainname is canonical for this project (I assume https://wikiba.se/) and make sure that's the only one that's published, promoted, and used for links we control and such. It's an important notion that one name is canonical!
Apr 24 2019
@Cwek - Thanks for the reports! Have you tried other Wikimedia projects (e.g. wikiversity, wikiquote, wiktionary, etc) for SNI testing and/or DNS lookups from within? That may provide some level of insight as well. Currently we suspect the DNS changes here were not related to the new blockage, but obviously we'd like to gather all the data we can. The initial deployment date of the structural change was actually Apr 18th; the changes on the 20th merely extended the TTLs of that scheme from 10 minutes to 4 hours. Our own analytics seems to confirm that the dropoff of CN traffic was actually on the 23rd (same as when the community noticed).
Apr 20 2019
Status update on the experiments above:
Apr 19 2019
Apr 9 2019
It's not ideal, but the part that was stripped was the most-predictable part of the name (the en prefix), so it's not all that confusing.
Apr 8 2019
The wiktionary CNAME experiment is going out today, and I'm intending to keep it running for at least a week, assuming no issues arise.
- 0100-dynamic-tls-records.patch - I don't think we ever managed to prove a significant benefit from this on initial deploy, but it's just one of those things that seemed like a "good idea" so long as it remained simple to leave it in. I'd be happy with dropping this initially and putting the ideas behind that patch (or even more-generalized than that patch) on the back burner for the future when we have more time.
- 0660-version-too-low.patch - This was a very nginx-specific thing about keeping nginx from spamming error messages; we shouldn't need to port it at all.
Apr 5 2019
We may try the wiktionary patch early next week. The goal with that test is just to see if we get any user complaints about wiktionary.org resolution being broken, so we'll leave it in place for a week or so if we don't get complaints, or revert if we do. Either way it will eventually get reverted, and if it's successful then we'll start patching for the "real" version where everything centralizes into a wikipedia.org hostname, so that's probably still at least a couple weeks out.
Mar 29 2019
There are some complexities here that I've been stewing on for a while, mostly noted in the original description, but I like this general direction. Most of the concerns briefly mentioned earlier aren't actually a big deal in practice, but there remains a key issue around CNAME + edns-client-subnet, and the decision between putting the terminal DYNA record in wikipedia.org or in some other domain (preferably one not used by current canonicals at all, e.g. maybe this variant would be a good use for wikimedia.net?). Where I'm at now in thinking on these two paths:
Mar 12 2019
I think it would be better, from my perspective, to really understand the use-cases (which I currently don't). Why do these remote clients need "realtime" (no staleness) fetches of Q items? What I hear sounds like all clients expect everything to be perfectly synchronous, but I don't understand why they need to be perfectly synchronous. In the case that led to this ticket, it was a remote client at Orange issuing a very high rate of these uncacheable queries, which seems like a bulk data load/update process, not an "I just edited this thing and need to see my own edits reflected" sort of case.
Mar 8 2019
Looking at an internal version of the flavor=dump outputs of an entity, related observations:
Mar 4 2019
The raw data should be accurate. I had thought we were already sending the summarized X-Cache-Status to hadoop as well, but apparently not. It might be useful to get that going in another ticket, because it saves dealing with some of the complexity below. In the meantime:
Feb 27 2019
Circa 2019-02-21, eqsin was depooled to install a new router, and most of the users normally mapped to eqsin had fallen back to ulsfo temporarily, which would distort the stats of "ulsfo users" considerably.
Feb 26 2019
Feb 25 2019
The VCL looks good, please give us some notice (~24h would be ideal?) on when you need it actually deployed once you've decided on a date. Any news on the Desktop-denial regression?
Feb 21 2019
Seems to be working fine after replacement!
Feb 20 2019
There are different layers of "handing off" DNS management which are being conflated, but to run through them in order:
Feb 15 2019
Correct me if I'm wrong, but I would think all VE traffic would already be uncacheable at the Varnish level anyways, since it happens in the context of a session (although in the future we might fix this with content composition work). As for the rest of this discussion, I don't think I understand the context enough to say anything about its sanity or whether it increases any attack surface in a way that matters.
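To illustrate the kind of rule I mean (a sketch only; the cookie-matching regex is an assumption for illustration, not our actual VCL):

```
sub vcl_recv {
    # Requests carrying a login/session cookie bypass the cache entirely,
    # which is why session-bound editor traffic ends up uncacheable.
    if (req.http.Cookie ~ "([sS]ession|Token)=") {
        return (pass);
    }
}
```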
Feb 14 2019
Feb 13 2019
Feb 10 2019
Feb 8 2019
Feb 7 2019
Sounds good to me!
Expounding on the lamentations above in a more realistic triage sort of sense:
The linked ESNI ticket is kind of a random user question ticket, and not actually one created for working on it (which is still off in the future, but obviously we'll implement ESNI as soon as we realistically can, as part of planned work).
Jan 23 2019
See also T178011 for last time. Why didn't the icinga EDAC check catch this?
https://wikitech.wikimedia.org/wiki/Cache_servers#Depool_and_downtime is correct, it just needs to be depooled (it will auto-depool on shutdown, but a manual depool is preferable).
It's the same basic rationale as moving WMCS out of 10.68.0.0/16. We could obviously leave them there and just manage our ACLs better with more automation, but it pays some pretty big dividends when address spaces are clearly split on such a big security and functional boundary as Prod-v-WMCS. Humans will always look at IPs as well in various debugging and configuration tasks. Having similar/shared/adjacent numbering for these two realms invites confusion and mistakes.
Jan 16 2019
See also this email thread where Michael Chan (broadcom driver dev) asks for firmware level output, sees the same numbers we have on cp1088, and tells them to upgrade: https://www.spinics.net/lists/netdev/msg519478.html
Jan 14 2019
I don't have any suggestions, no. Develop a straw-patch which at least serves in code terms to document the intent (e.g. the explicit header and URI values matched/transformed, etc) and we'll poke at it!
Jan 11 2019
It's a confusing set of things going on here, and it's going to need fixups on both the network/data/data.yaml side and the VCL side. Just to recap the historical situation for clarity:
Jan 10 2019
Right. I'm not up to speed on where all related changes are, but from VCL's point of view its definition of wikimedia_nets was meant to include labs, whereas its nearly identical wikimedia_trust is meant to exclude labs.
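Roughly speaking, the intended distinction would look something like this in VCL terms (the netblocks below are illustrative placeholders, not the real definitions):

```
# wikimedia_nets was meant to include labs:
acl wikimedia_nets {
    "10.0.0.0"/8;       # placeholder: prod private space (labs range falls inside it)
}

# wikimedia_trust is nearly identical, but deliberately excludes labs:
acl wikimedia_trust {
    "10.0.0.0"/8;       # placeholder: prod private space
    ! "10.68.0.0"/16;   # but not the labs/WMCS range
}
```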
Yeah. It's hard to "prove" whether we have this bug fixed other than running a supposed fix on the bnxt_en cp10 fleet for a while as a statistical test, but probably the sooner we start on that the better.
Actually, it is already in the 4.9.y LTS/stable branch, here: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.9.y&id=b2be15bb02b961146177d49204de22df3dddd415
I suspect our bug is fixed by:
Dec 31 2018
Dec 21 2018
Dec 20 2018
Update for the record: with recent changes to authdns CI and deployment scripts, this scenario should no longer be possible and workarounds shouldn't be necessary! (see also related distant past incident T103915)