Fri, Jan 17
Most of this has been configured now; the remaining slightly tricky bit is configuring an alternate SNI cert for the domain on our new ats-tls termination.
Tue, Jan 14
Yes, more or less. The major caveat is some of our caches still have non-redirecting copies of various pages in https://transparency.wikimedia.org/ , but this will sort itself out over the next day or so at the most. To save anyone from trawling through the list of commits above, the changes in effect now are:
+1 from me, this was one of the many things we made the ganeti clusters for :)
Mon, Jan 13
Seems like a good plan!
Thu, Jan 9
Wed, Jan 8
So long as the registry's responses do all the standards-based things correctly (they contain Vary: Accept, and the matching Accept values also match the Content-Type values in the responses), this should Just Work on a functional level.
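As a sanity check on what "standards-based" means here, this is roughly how a cache or server picks a variant from the `Accept` header (a minimal illustrative sketch, not code from our config; it ignores `*/*` wildcards):

```python
def pick_variant(accept_header, variants):
    """Pick the best response variant for an Accept header.

    variants: dict mapping content-type -> body. A response chosen this
    way must be sent with 'Vary: Accept' and a Content-Type matching the
    chosen type, which is what lets intermediary caches store multiple
    variants safely. Wildcards ('*/*') are ignored in this sketch.
    """
    # Parse "type/subtype;q=0.9, ..." into (type, q) pairs.
    prefs = []
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        mtype = fields[0].strip()
        q = 1.0
        for f in fields[1:]:
            k, _, v = f.strip().partition("=")
            if k == "q":
                try:
                    q = float(v)
                except ValueError:
                    q = 0.0
        prefs.append((mtype, q))
    # Highest-q acceptable type that we actually have wins.
    for mtype, q in sorted(prefs, key=lambda p: -p[1]):
        if q > 0 and mtype in variants:
            return mtype
    return None
```

If the registry honors exactly this contract (plus `Vary: Accept` on the response), the cache layer doesn't need any special-casing.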
Dec 18 2019
The wording issues here are actually a bit tricky. We've done several TLS standards upgrades over time, and there are still a few to go:
Or a patch to template this in. The problem is it's implemented from a standard template for the top 30-40 lines, which isn't specific to this case, in an attempt to standardize our error output templates.
Dec 16 2019
External queries now working (note they all return a codfw IP without edns-client-subnet in play, because codfw is closest to my laptop and PROXYv2 is working for sending the "real" client IP from haproxy to gdnsd).
Actually we can't realistically do global monitoring from icinga either, because icinga isn't on Buster and so it doesn't have the right library/tool access to check a TLSv1.3-only service, so we'll have to settle for the per-server NRPE checks for now.
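For reference, a per-server check of this kind only needs a modern TLS stack on the checking host; a minimal sketch (assuming Python 3.7+ with OpenSSL 1.1.1, i.e. a Buster-era host — this is not our actual NRPE plugin):

```python
import socket
import ssl

def check_tls13_only(host, port=443, timeout=5):
    """Return True if a TLSv1.3 handshake with host:port succeeds.

    Requires a client stack that supports TLS 1.3 (OpenSSL 1.1.1+, as
    shipped on Buster) -- exactly what the older icinga host lacks,
    hence per-server NRPE checks instead of central ones.
    """
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version() == "TLSv1.3"
    except (OSError, ssl.SSLError):
        return False
```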
Refactoring the dependencies a little here: really, the sub-point of (2) above about shared ticket key rotation won't matter until we're anycasting, so I've made a separate task (+subtask) in T240863 to look at that stuff later, blocking the anycast work.
Dec 13 2019
All of this is irregular and outside the policies we like to adhere to, but I'll push a zonefile to our nameservers which supports the bare minimum (the existing Stanford-hosted IPs for the insecure site http://wikiworkshop.org and the same IP for redirects from http://www.wikiworkshop.org , and nothing else). At some point after the holidays are over, I'd like to find out what the overall intent and/or plan is here so that we can provide some additional guidance and get this onto a more acceptable path, though.
Dec 12 2019
This is now mostly working, with a hiera flag controlling test deployment (currently only on dns4002, which doesn't have any public authserver IPs routed into it at this time).
P9867 <- First internal test query on a prod dns box :)
I'm not even sure what the task is asking for, but yeah in general we're not going to make the sec-warning mechanism comply with all expected valid outputs from all possible APIs/URIs it's covering. It's designed to break things, in a way that at least provides some level of human info on what's going on if someone digs in and looks. The next step in the transition process after this is that whatever agent they're using which is getting the sec-warning output won't be able to establish a connection to our infrastructure at all, which is way more broken than this.
The way it works is that if the connection isn't using TLSv1.2, the user is served a 302 redirect to /sec-warning on the same domain, which in turn returns a cacheable 200 OK with the HTML warning content and the CT header as text/html; charset=utf-8. There are a lot of gory details in the compromises being made by that solution (vs. eg. we could have returned some kind of 4xx error immediately rather than 302->200), but we've learned this is the best pattern to avoid misbehavior of certain bots and scrapers out there in the world which spam-retry xx return codes.
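The control flow described above can be sketched roughly like this (an illustrative Python sketch of the decision logic, not the actual termination-layer config we ship):

```python
def sec_warning_response(tls_version, path, warning_html):
    """Sketch of the redirect-then-200 pattern described above.

    tls_version is the negotiated protocol string, e.g. "TLSv1.2".
    Clients below TLSv1.2 get a 302 to /sec-warning on the same domain;
    the warning page itself is a cacheable 200 text/html response, so
    retry-happy bots and scrapers see success rather than a 4xx they
    would spam-retry.
    """
    acceptable = ("TLSv1.2", "TLSv1.3")
    if tls_version in acceptable:
        return None  # normal request processing continues
    if path == "/sec-warning":
        return (200,
                {"Content-Type": "text/html; charset=utf-8"},
                warning_html)
    return (302, {"Location": "/sec-warning"}, "")
```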
Yes, it's about that $notrack default. My hypothesis is that setting it to true wouldn't break any traffic, wouldn't change the security situation much, but would eliminate a bunch of potential for conntrack table size issues when various services get overwhelmed. Some thoughts about why that hypothesis might be false:
Dec 10 2019
Status: The actual LVS portion of this is now completely removed globally. The IP addresses themselves are also completely unconfigured and removed from service at all the edge sites, but not the core ones. What remains is that the legacy LVS recdns IPs 184.108.40.206 (eqiad) and 220.127.116.11 (codfw) are still statically-configured to avoid breaking any of the leftover dependencies on these IPs. Sniffer monitoring has shown that at least the ircd instance on kraz is still using outdated resolv.conf data and hitting these IPs, that several hardware PDUs are using them as well, and that there are possibly other such cases which are rarer and thus harder to observe in short samples (I've done up to 1h samples).
I'm assuming that, for now, the hosting of the web service (and email?) is not moving, just the whois ownership and DNS service? We usually need a fair bit more information than this to handle such a case smoothly. At a glance it looks like there's potentially more to this (e.g. they have MX and SPF records; are there also DMARC and similar records we need to copy?). Also, basic TLS doesn't seem to work on the target site, either. Is there a project-level task or something for whatever transition is happening here?
Dec 9 2019
Dec 6 2019
Dug into the odd cases from install2002 and kraz - the common pattern here is that there are some daemons in the world which both (a) parse /etc/resolv.conf for themselves because they use their own custom DNS client code and (b) don't ever re-read that file if it changes. A few of those are daemons we actually use, which happen to have not had their daemon (or the host) restarted since our resolv.conf was switched to the new recdns IP a few months ago (~Aug-Sept timeframe, it was rolled out at different times to different places).
In a sample I just took across all recdns for a little over 15 minutes of sniffer time looking for requests to the legacy LVS-based recdns IPs:
- ulsfo, eqsin, and esams had no traffic to them at all (yay! and makes basic sense)
- eqiad had a handful of requests from:
- codfw had more-interesting traffic from:
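The file-level half of this (hosts whose resolv.conf is still outdated, as opposed to daemons holding stale parsed copies) is easy to audit; a small illustrative sketch, where the legacy IPs come from above and `10.3.0.1` is a stand-in for a current recdns IP:

```python
# The legacy LVS recdns IPs named above.
LEGACY_RECDNS = {"184.108.40.206", "220.127.116.11"}

def stale_nameservers(resolv_conf_text, legacy_ips=LEGACY_RECDNS):
    """Return the legacy recdns IPs still listed in a resolv.conf.

    This only catches hosts whose file is outdated; daemons that parsed
    an older copy at startup (the install2002/kraz pattern) can only be
    found by sniffing traffic or checking daemon start times.
    """
    found = set()
    for line in resolv_conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments
        if line.startswith("nameserver"):
            parts = line.split()
            if len(parts) == 2 and parts[1] in legacy_ips:
                found.add(parts[1])
    return found
```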
This is still something we want to pursue, but we really need to get past the smooth repooling issue first, so I've added that as a subtask (consider it blocking this one).
Since we haven't updated this in two years, I figured I should post again:
While we'll work on improvements that make this less-likely in the first place in various DNS infra tickets like T171498 , and we're happy to help debugging this with someone, ultimately this is an applayer problem independent of our DNS infra, so I'm moving it over to the Watching column.
Declined in favor of netbox integration ( T233183 ? ) making this problem go away.
Thoughts from the main text of the merged ticket: ------------
Sorry, I hadn't remembered we had this existing ticket. I'll merge it into the other, newer one since it has patches already and some deeper context, and copy the main text over.
These are still present AFAIK, and we're fairly certain it's just due to pybal healthchecks using blank/broken TCP connections to monitor them. That will be cleaned up in T239993 when we get rid of LVS-based recdns.
In these past couple of weeks we've had a real about-face on this issue, and I think there's a pretty strong consensus and rationale to pursue some kind of host-level caching, but there are details to sort out. Some of the data points to bring this argument up to speed:
I'm not sure how long it's been fixed in our infra, but it definitely works correctly now in our new buster 4.1 installs:
Dec 5 2019
(12) authdns[1001,2001].wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org ----- OUTPUT of 'cat /etc/debian_version' ----- 10.2
Dec 4 2019
In general, usually applayer DNS caching is a Bad Idea unless it's done very carefully (e.g. cap it at something like 5s max, or actually use a full-featured resolver library and get the real TTLs from upstream, or both).
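To illustrate the "done carefully" version: a minimal TTL-capped cache wrapper (illustrative sketch only; `resolve()` is a stand-in for whatever lookup the app actually does, and the 5s cap is the example figure above):

```python
import time

def make_cached_resolver(resolve, max_ttl=5.0, now=time.monotonic):
    """Wrap a resolver function with a TTL-capped cache.

    resolve(name) must return (address, upstream_ttl_seconds). We honor
    the upstream TTL but never cache longer than max_ttl, which keeps an
    applayer cache from serving stale data long after a DNS change.
    """
    cache = {}  # name -> (address, expiry_timestamp)

    def cached(name):
        entry = cache.get(name)
        if entry is not None:
            addr, expires = entry
            if now() < expires:
                return addr
        addr, ttl = resolve(name)
        cache[name] = (addr, now() + min(ttl, max_ttl))
        return addr

    return cached
```

The alternative (a full-featured resolver library honoring real upstream TTLs) is strictly better; this cap is the cheap fallback when that's not available.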
With T236479 closed, ganeti3003 is no longer special and everyone can ignore the IMPORTANT NOTE earlier.
Our ns2 service address is now re-routed to dns3001, and ganeti3003 is reimaged back to spare::system.
Dec 3 2019
Dec 2 2019
pdns-rec-exporter fixups in: https://gerrit.wikimedia.org/r/#/c/operations/debs/prometheus-pdns-rec-exporter/+/554155/ + https://gerrit.wikimedia.org/r/#/c/operations/debs/prometheus-pdns-rec-exporter/+/554156/
gdnsd rebuilt again using a combo of https://github.com/gdnsd/gdnsd/tree/v3.2.1 + https://github.com/paravoid/gdnsd/tree/experimental + minor debian/systemd hacks, as there is no 3.x debian package outside the WMF yet 📦
Where we're at now:
- There are 13x authdns servers participating in authdns-update:
- The 3 traditional ones (authdns1001, authdns1002, ganeti3003) which are role::authdns
- The ten dnsbox hosts (role::dnsbox) that also do recdns and ntp (dns00)
- cumin's A:dns-auth alias targets all 13, whereas A:dns-rec targets only the ten recursors (these aliases currently target underlying profiles, not roles)
- Public authdns service routing is unchanged, with each of the ns IPv4s routed into their usual 3 traditional role::authdns boxes
- The recursors (the 10x role::dnsbox) now use their machine-local authdns instance over the loopback to look up our own names, rather than talking to the "real" ns machines.
- At least some of https://wikitech.wikimedia.org/wiki/DNS has been caught up with reality a bit.
Nov 27 2019
I will float the opinion that while I may have many opinions on code style, bikeshedding between reasonable options for a shared standard is a waste. If there's a standard upstream/outside-world set of common style rules for python3, we should just adopt them and be done with it.
Nov 26 2019
Status update: the blended authdns+recdns(+ntp) role is now nearly complete in role::dnsbox. There's a hiera flag, profile::dnsbox::include_auth, currently set only for dns4002, which causes the authdns functionality to be included in the role. dns4002 is running this way now, and is currently the 4th member of our authdns set for authdns-update and similar purposes (including authdns healthchecks and prometheus stats), but only the local recdns daemon on that box is using it for lookups; there's no public service routed into it. authdns-update has been parallelized with clush for now, to ensure it doesn't get really slow as the count of servers expands.
I think you ran into a temporary blip from some unrelated DNS work (already dealt with), not this bug (502 errors can happen for real infra-failure reasons, too!).
I think we'll keep them private-vlan only and no tagging, and for the rare cases of "public" service instances we'll use LVS to route the traffic (same for all the edge-site ganeti).
I could go either way on the subject of explicit langlist vs wildcard, really, so long as we're confident the MediaWiki layer handles all unknown language codes (really, unknown random hostname labels...) sanely, including crazy ones like :ffq384f9q8f9qj9j-/\.wikipedia.org or whatever. It would even make some things simpler at the DNS and Caching layers if we could assume that. I'd have to go do a quick audit of our DNS data for all the canonical domains to see what kind of exceptions there are, though.
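On the DNS/cache side, "handling unknown labels sanely" mostly reduces to rejecting anything that isn't even a valid DNS label before the applayer sees it; a conservative illustrative sketch (not our actual validation code):

```python
import re

# Conservative LDH (letters-digits-hyphen) label check: max 63 octets,
# no leading or trailing hyphen. Labels failing this can never appear
# in a resolvable hostname, so the applayer never sees them.
LDH_LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")

def plausible_lang_label(label):
    """True if label could be a hostname label at all (not whether it's
    a real language code -- that check would live at the MW layer)."""
    return bool(LDH_LABEL.match(label.lower()))
```

The point being: the truly "crazy" inputs are filtered by DNS syntax itself; the wildcard question is only about which *syntactically valid* labels resolve.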
Seems good so far, has been up a few days and in full service for about a day, without incident. Calling this resolved until anything changes!
@Vgutierrez - I really think, reading the Lua plugin code, that __reload__ in 8.0.x might not do what you'd sanely expect (although it is undocumented). I think the __reload__ hook is actually more like a destructor hook for any custom destruct actions you want to happen before the reload into a fresh Lua context. And reload also definitely doesn't hit __init__ either, which leaves the whole model seeming a little broken if you've got a global initialized to nil in its declaration and then initialized to a real value in __init__. Clearly we're missing something here...
Nov 25 2019
It was observed earlier in the traffic meeting that we're fairly certain none of our R440 hosts have had this problem more than once, so this may be a "once per server" phenomenon. If so, it's also quite likely the issue can be pre-empted on the hosts that haven't crashed yet by giving them a reboot (i.e. something deep changed while the servers were live, and they stabilize once they've done a fresh boot with it in place — possibly a live update of some microcode or firmware?).
Nov 22 2019
So far so good - it has completed all the initial puppetization stuff, which is much further than it got before.
Attempting reimage (see above). If it fails like before, it won't get very far (certainly not into production use).
IMPORTANT NOTE ganeti3003 is temporarily repurposed as a critical authdns server and is in live production use for that role (see also: T236479 ). Do not reimage or touch ganeti3003. The other two (ganeti3001 and ganeti3002) are free to image and set up as a 2-node ganeti cluster, with the third node to join later when its temporary duties are complete.
Nov 20 2019
There were two TLS-level changes to the certificate output for esams specifically, each of which bumped the output size (the number of bytes we send the client during the TLS handshake) by a small amount, but either could've pushed us over the boundary for an extra packet (although I wouldn't think we'd reach IW10). They were https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550463/ and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550564/ . They would've taken effect on the servers shortly after merge in each case (within 30 mins or so, anyways), although only for new TLS sessions going forward from that point. The first was ~14:00 UTC and the second was ~22:30 UTC, both on Nov 12.
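The packet math here is simple enough to sketch (an MSS of 1460 is an assumption for a typical 1500-MTU path; the byte figures in the test are hypothetical examples, not our measured sizes):

```python
import math

def handshake_segments(server_bytes, mss=1460):
    """Rough count of TCP segments for the server's handshake flight.

    With IW10 the server can send ~10 MSS before the first client ACK;
    a small bump in handshake bytes only matters when it crosses a
    segment boundary (one more packet), and only costs an extra RTT if
    the flight exceeds the initial congestion window.
    """
    return math.ceil(server_bytes / mss)
```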
Nov 19 2019
Adding @RLazarus in hopes of nerd-sniping him further on this topic...
So the patch above adds it to the queue distribution logic in interface-rps, but there's another piece of the puzzle here, which is setting the hardware's queue count for the interface itself, which is where a little bit of a rabbithole develops...
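The hardware side boils down to clamping the desired queue count to what the NIC supports and applying it with `ethtool -L` (a sketch of the logic only; some drivers expose separate `rx`/`tx` counts instead of `combined`, and the function names here are hypothetical, not from interface-rps):

```python
def ethtool_channels_cmd(iface, wanted, hw_max):
    """Build the ethtool invocation to set the combined queue count.

    The driver caps us at hw_max (the 'Combined' maximum reported by
    'ethtool -l'), so the effective count is min(wanted, hw_max);
    interface-rps can then spread exactly that many queues across CPUs.
    """
    count = min(wanted, hw_max)
    return ["ethtool", "-L", iface, "combined", str(count)], count
```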
Nov 18 2019
@Gilles This could also be related to TLS certificate changes that were happening around the same dates, and could be inflating the bytes transferred in handshakes. We have a couple of different ongoing things there (renewals, revocations, vendor changes), and as a result we've seen a handshake bytes-sent increase in the EU for sure (which is temporary, but also unavoidable. Probably in the next week or two we'll see that go back to normal and can then confirm if the stats shift with it again).