Tue, Aug 20
Just stalling this so that anyone following it doesn't try to pick it up or move forward with it yet. There's an ongoing email thread about clarifying this task, and we're waiting for at least one person to return from vacation and provide guidance before we proceed here.
Mon, Aug 19
There's perhaps a faulty implicit assumption here that we desire to use one cert for the world and that we'd just "switch" everything to LE. We're currently using the Globalsign cert at all edges due to various problems earlier in the year, but what we were doing in the past and would like to continue doing in the future is using two certs simultaneously from unrelated CAs, and making the split on a per-datacenter basis (with the US sites using GlobalSign, and the non-US sites using LE, in this case).
I'd start with the conftool stuff before moving on to anything driven by gdnsd's admin_state. That whole mechanism is likely to be replaced on the gdnsd side in the next quarter or two, and I wouldn't be surprised if we end up driving the new mechanism from conftool by default.
(Also, is the specific TMH fix actually deployed to all groups yet?)
Unassigning for now; the actual ask here is unclear in terms of technical details.
Thu, Aug 15
General status update and planning for this very old ticket, which is still on the radar!
Wed, Aug 14
As noted in T155359, WMDE has moved the hosting of this to some other platform, including the DNS hosting (and we never had the whois entry). So I think this task can be resolved as Declined (or whatever), but we should first use it to track the various revert patches before we close it up (reverting the DNS repo changes and whatever else we've got going on in other repos supporting the wikiba.se site).
Tue, Aug 13
Looks like it to me :)
Mon, Aug 5
May as well link in an earlier related ticket from late last year for more backstory, too: https://phabricator.wikimedia.org/T205609
Again today, causing a small spike of esams-specific 503s and icinga alerts:
So, yes, cloudelastic is correct in DNS for normal lookups. The issue is that the icinga check for cloudelastic monitoring defines its virtual host entry with an explicit IP in its configuration, and that IP ends up being the IP of icinga1001, not of cloudelastic. This probably has to do with the puppet host context in which the resource is evaluated.
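For reference, a quick way to double-check the "DNS is fine" half of this — a minimal Python sketch, assuming the public service name is cloudelastic.wikimedia.org and using a placeholder for whatever literal IP ended up in the icinga check config:

```python
import socket

# Hypothetical values for illustration; substitute the real service name
# and whatever literal IP the generated icinga check actually contains.
SERVICE_NAME = "cloudelastic.wikimedia.org"
IP_IN_ICINGA_CHECK = "198.51.100.10"  # placeholder, not a real address

# Resolve the service name the way a normal client would.
resolved = sorted({ai[4][0] for ai in socket.getaddrinfo(SERVICE_NAME, 443)})
print(f"{SERVICE_NAME} resolves to: {resolved}")

# If the literal IP baked into the check isn't one of the addresses the name
# resolves to, the check is pointed at the wrong host (e.g. icinga1001).
if IP_IN_ICINGA_CHECK not in resolved:
    print(f"Mismatch: check uses {IP_IN_ICINGA_CHECK}, "
          f"which is not an address of {SERVICE_NAME}")
```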
Fri, Aug 2
Re: transitioning away from SLAAC for the current fleet/setup (which I think is probably a good incremental idea, and could happen ahead of the planned netbox work, making that transition easier later). Some thoughts on accomplishing that:
Thu, Aug 1
These are ready to go for dcops-level work!
Decom in T229586
We had a quick discussion and a small informal vote and decided we don't really need this functionality (pinkunicorn) anymore, so we're going to retire it and not replace it.
Wed, Jul 31
Heh, apparently I can't even remember things I read and said before even when they're right above me in the same ticket!
Rollout status update: things that are using anycast recdns resolv.conf in production as of 2019-07-31 (a quick spot-check sketch follows the list):
- All hosts in edge DCs (esams, ulsfo, eqsin)
- All cp edge cache hosts globally
- All LVS hosts globally
- Canary MediaWiki API and appserver hosts in both core DCs
- Network devices
- Install-time stuff (DHCP settings and the Debian installer)
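For spot-checking whether a given host has actually picked up the anycast resolver, something like this minimal sketch works; the anycast VIP shown is a placeholder, not necessarily the real production address:

```python
import re

# Placeholder for the anycast recdns VIP; substitute the real production address.
ANYCAST_RECDNS_IP = "10.3.0.1"

with open("/etc/resolv.conf") as f:
    nameservers = re.findall(r"^nameserver\s+(\S+)", f.read(), re.MULTILINE)

print("configured nameservers:", nameservers)
if nameservers == [ANYCAST_RECDNS_IP]:
    print("host is on anycast recdns only")
elif ANYCAST_RECDNS_IP in nameservers:
    print("host lists the anycast VIP alongside legacy resolvers")
else:
    print("host is still on the legacy per-site resolvers")
```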
Replying to my own earlier comment: apparently they're datestamped URIs beginning with /yyyy/mm/, examples being:
TODO list here from my POV, as best I understand things:
The 421 code is deployed and seems to be working correctly, with a fairly small global average rate of somewhere under 1 req/sec. This is the most legitimate thing we can do with these misdirected requests, and it may actually fix some of them if the UA's own confusion is truly at fault, but it may not help if some kind of DNS or HTTPS proxy interference is causing persistent issues. Maybe it will at least reduce error reporting and debugging confusion in such cases, though, as 421 is very specific to this issue (vs a generic 404).
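To illustrate the basic idea (a minimal sketch, not our actual edge implementation; the hostnames are placeholders), the 421-vs-404 decision boils down to this:

```python
# Sketch of the decision made for misdirected requests. The hostname set here
# is an illustrative placeholder, not the real edge configuration.
HOSTS_SERVED_ON_THIS_CONNECTION = {"en.wikipedia.org", "www.wikipedia.org"}

def status_for(host_header: str, path_exists: bool) -> int:
    # The request arrived on a connection (TLS cert / SNI) that doesn't cover
    # the requested Host at all: that's a misdirected request, and 421 tells
    # the UA to retry on a fresh connection, rather than a generic 404.
    if host_header not in HOSTS_SERVED_ON_THIS_CONNECTION:
        return 421
    # Otherwise it's an ordinary routing question for this site.
    return 200 if path_exists else 404

assert status_for("commons.wikimedia.org", True) == 421
assert status_for("en.wikipedia.org", False) == 404
```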
Thu, Jul 25
All the LVSes are now using the anycasted recdns, which gets rid of the LVS<->recdns dependency loop and simplifies recdns server downtime processes: https://wikitech.wikimedia.org/w/index.php?title=Service_restarts&type=revision&diff=1833705&oldid=1832671
Jul 23 2019
If we need this to work ASAP, probably the most expedient thing to do would be to patch our puppetization to exclude the patched features from the config on buster only, and use the vendor package. Traffic is in the process of moving away from nginx, hopefully by EOQ-ish, after which we won't need the problematic custom package, and the stock vendor package should work fine for other uses of the tlsproxy module (but we're not quite ready yet to mess with our current solution by removing the WMF package from stretch!).
Jul 22 2019
cp1079 and cp1080 just need the normal depooling process here.
lvs1014 here will need special care: Traffic should stop puppet and pybal and monitor failover to lvs1016 ahead of the work, then revert afterwards. cp1081 and cp1082 here can be depooled as normal.
(task desc edited for correct cp nodes: this rack has 77/78, not 76/77)
The Traffic nodes cp1077 + cp1078 can be depooled the usual way, but lvs1013 needs some special care. Someone from Traffic should handle and monitor that just in case (basically we need to manually disable puppet and stop pybal a few minutes in advance of the work, verify that traffic moves correctly to lvs1016, and then put everything back to normal afterwards).
cp1076 - Can depool ahead of work and repool later, with the local commands "depool" and "pool"
lvs100 - Not in use and should be decommed, but this ticket made me realize we haven't made an lvs1001-6 decom ticket yet (will do shortly!)
These have been in-service for a while now, closing!
Jul 19 2019
Jul 18 2019
Oh one more thing that should've been (3) on that list:
I like the end result here, and I don't think it's problematic from the Traffic perspective in the long view, but I think the initial rollout isn't so trivial:
Jul 17 2019
Jul 16 2019
Jul 3 2019
Thanks for chasing this down! After fixing up any further sources of the extra sessions: do we have to do something about clearing out the excess sessions from storage (redis?), or is this mostly an ephemeral sort of problem?
Jul 2 2019
@Anomie / @Legoktm - Can you take a look at this? We're out of our depth over here trying to figure out this bug. TL;DR is that some logged-in sessions are getting excess (in the example above, ~50) Set-Cookie headers for auth sessions, with many repeats for the same wiki under different session IDs, to the point where it's causing us real problems.
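In case it helps with reproduction, a minimal sketch of how one could count the duplicates from the client side (the URL and session cookie below are placeholders, not what we actually tested with; you'd need a real logged-in session):

```python
import urllib.request
from collections import Counter

# Placeholders: point this at a real wiki URL with a real logged-in session cookie.
URL = "https://en.wikipedia.org/wiki/Special:BlankPage"
SESSION_COOKIE = "REDACTED"

req = urllib.request.Request(URL, headers={"Cookie": SESSION_COOKIE})
with urllib.request.urlopen(req) as resp:
    set_cookies = resp.headers.get_all("Set-Cookie") or []

# Group by cookie name so repeats for the same wiki/session cookie stand out.
by_name = Counter(sc.split("=", 1)[0] for sc in set_cookies)
print(f"total Set-Cookie headers: {len(set_cookies)}")
for name, count in by_name.most_common():
    if count > 1:
        print(f"  {name}: {count} copies")
```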
Jul 1 2019
Re-setting this to at least High for now, given the criticality of the component involved and the production impacts.
Jun 28 2019
Jun 26 2019
I don't think anyone's 100% sure how we're handling this project, but probably Traffic will figure out the setup for these and ask Alex if we need help. We probably won't get around to it very quickly; they can stay in role::spare for now until we get to it.
Jun 19 2019
Implementing a blanket redirect to the legacy blog URI for ^/20(0[7-9]|1[0-8])/ should be feasible in VCL or Lua at the edge. Alternatively, we could just leave it alone and pick another hostname.
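Not the actual VCL, but as a sketch of the matching logic (the legacy blog hostname below is a placeholder assumption):

```python
import re

# Placeholder target; substitute wherever the legacy blog actually lives.
LEGACY_BLOG = "https://blog.wikimedia.org"

# Matches the old datestamped post URIs, /2007/... through /2018/...
DATESTAMPED = re.compile(r"^/20(0[7-9]|1[0-8])/")

def redirect_target(path: str):
    """Return a redirect Location for legacy post paths, or None to serve normally."""
    if DATESTAMPED.match(path):
        return LEGACY_BLOG + path
    return None

assert redirect_target("/2015/06/some-post/") == LEGACY_BLOG + "/2015/06/some-post/"
assert redirect_target("/about/") is None
```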
Jun 8 2019
The TLS-level error is just complaining that, at the end of the transaction, the connection was aborted abruptly instead of torn down cleanly. It would probably be better if gerrit's TLS stack closed the connection cleanly on 500s when it can, but the real issue here is probably the 500 error, not the TLS error. At a glance, the GET request headers look identical in the two cases, so I'm at a loss as to what's happening on gerrit's side here. Is there perhaps a request difference in some HTTP-level authentication or cookie stuff that's not shown in the trace?
Jun 6 2019
@leila and @Miriam - Thanks for all the hard work here, it's truly outstanding the depth to which this analysis already goes, and it puts some useful numbers on the impact of expanding our edge network into under-served regions.
May 30 2019
That alert basically means that a varnish frontend daemon crashed (and as usual was auto-restarted by a manager process). These are pretty rare and usually worth some investigation.
May 29 2019
The failed reimage was finished up manually (probably not the reimager's fault)
Done. Are we ready to deploy it now, or still blocked on other MW-level deploys?
May 28 2019
Plan seems reasonable based on the info in the description! Maybe wait longer than 2h after the linecard is restarted? Or do we suspect that any recurrence is much less likely with no traffic?
May 24 2019
That cloud rebranding link above also mentions wikimediacloud.org, which is yet another option nobody's exploiting yet. So even without getting into the over-long wikimediacloudservices.org, we have sufficient names to cover all the cases here (feel free to re-arrange, especially the latter two):
Ok, @aborrero caught me up on all the context on IRC so I can stop asking dumb questions (Thanks!).
May 23 2019
These are reimaged to role(spare::system) now. Over to @ayounsi for getting rid of all the special cases related to these hosts in the eqiad routers and switches (BGP stuff, fw filters, the special public-vlan LVS-balancer port groups, etc), and then we can move this on to dcops-level decom stuff.
Do these belong in wikimedia.org at all? It seems this has already been discussed, but I guess I lack some context.
A few thoughts:
May 22 2019
Either is fine. I assume you won't be able to do anything else with this (e.g. make https://gsuite-test.wikimedia.org/ work) without some followup records added on our side.
So we've reduced query volume by ~32% in T208263. Since the last significant updates here, we've also deployed newer versions of our authdns software which perform even better, and refreshed some hardware as well. We're still in the basic scenario that we only have 3x singular authdns hosts in the world, but they're running with plenty of headroom in terms of handling query rate spikes and server outages. There are really two things holding us up on experimenting with lower TTLs for faster failover:
The scheme has been stable for ~1w now and seems to be working out fine. The net reduction in total authdns requests is ~32%. I suspect the drop in public requests for wiki hostnames is greater, since the total includes all of our internal/infrastructure lookups as well, but either way we should be seeing far fewer DNS cache misses out there in the world, especially for longer-tail / less-popular project and language combinations.
The above is deployed. I'd wait a full 10 minutes from the time of this comment, in case they've negative-cached the previous lookup, then re-test and let's see what happens.
The context of the second token is that all of our canonical wiki domains, including wikimedia.org, already have persistent Google Site Verification TXT tokens so that we can manage Google Search stuff for our own domains on a different Google system.
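For checking what's actually live before retrying the validation, something like this works (a sketch assuming the dnspython package is available; the token matching is illustrative):

```python
import dns.resolver  # dnspython >= 2.0 (older versions use dns.resolver.query)

DOMAIN = "wikimedia.org"

# Fetch all TXT records at the apex and pick out Google site-verification tokens.
tokens = []
for rdata in dns.resolver.resolve(DOMAIN, "TXT"):
    txt = b"".join(rdata.strings).decode()
    if txt.startswith("google-site-verification="):
        tokens.append(txt)

print(f"{DOMAIN} has {len(tokens)} google-site-verification TXT token(s):")
for t in tokens:
    print(" ", t)
```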
@HMarcus - The record is live, can you try the validation and let me know how it goes?
May 21 2019
Never mind, apparently it was already repooled; I was looking at the wrong thing here...
It's been up for ~15 days now without incident, but depooled for frontend traffic. Re-pooling it today to see if we can get a recurrence or not.
FWIW, lvs1016 came back with correct settings after the single additional reboot above.
Current status of transition:
May 19 2019
Note https://gerrit.wikimedia.org/r/c/operations/puppet/+/511118 - I had to swap the lvs1015 cross-row ports for rows A and B (enp4s0f1 and enp5s0f0) at the software level to match the physical reality shown by lldpcli show neighbors, which is the reverse of the documented table of ports at the top of this task. The current config works and we can keep it if we want. Note that I didn't make any other related changes, so if we keep this config, we probably need to edit the software port labels in the switch configurations to match, and possibly any physical labeling in the DC, to avoid future confusion. Alternatively, before we put this machine in service, we could physically swap the cables back to the intended config at the rear of lvs1015, revert the mentioned puppet patch, and reimage the server again. Either way, there's probably some followup to do here.
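For anyone re-checking the cabling later, a rough sketch of comparing lldpcli output against the documented table (the interface names are from this task; the expected switch ports and the exact lldpd text format are assumptions, so adjust to what the table at the top actually says):

```python
import re
import subprocess

# Expected mapping per the table at the top of this task; switch port names
# here are placeholders, fill in the documented values.
EXPECTED = {
    "enp4s0f1": "PLACEHOLDER-ROW-A-PORT",
    "enp5s0f0": "PLACEHOLDER-ROW-B-PORT",
}

out = subprocess.run(["lldpcli", "show", "neighbors"],
                     capture_output=True, text=True, check=True).stdout

# Walk the plain-text output, tracking which local interface each block describes.
actual = {}
iface = None
for line in out.splitlines():
    m = re.match(r"\s*Interface:\s+(\S+?),", line)
    if m:
        iface = m.group(1)
    m = re.match(r"\s*PortID:\s+ifname\s+(\S+)", line)
    if m and iface:
        actual[iface] = m.group(1)

for nic, expected_port in EXPECTED.items():
    seen = actual.get(nic, "<no neighbor seen>")
    flag = "OK" if seen == expected_port else "MISMATCH"
    print(f"{nic}: expected {expected_port}, lldp says {seen} [{flag}]")
```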