Thu, Oct 17
Digicert-2019 is now in live use at the esams edge and we have full normal redundancy (for now) among commercial cert vendors.
It's switched to Rebase-if-necessary now
IRC says the meeting was mostly consumed by OKR discussion. It may have been talked about a little, but nobody remembers any big new blocker being raised.
Notes from IRC, etc:
Thu, Oct 3
I've been pushing this to my back burner for a few days because it's complicated. My current $0.03 on all related things:
Mon, Sep 30
Fri, Sep 27
Awesome, thank you!
Ping @herron can we move on this? Any current blockers?
Thu, Sep 26
Wed, Sep 25
Sep 21 2019
[removed - someone linked this during an ongoing incident and I assumed it was fresh. These reports are from days ago and my comment was not relevant]
Sep 19 2019
We'll also need to normalize the incoming Accept headers up in the edge cache layer to avoid pointless vary explosions. Ideally the normalization should exactly match the application-layer logic that chooses the output content type. Do you have some pseudo-code (or real code link is fine too) description of how accept is parsed to select content-types?
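To make the normalization idea concrete, here's a rough sketch of collapsing arbitrary Accept headers down to a tiny fixed set of values that the cache could vary on. The SUPPORTED list, the default, and the function name are all made up for illustration, and the matching is deliberately simpler than full RFC content negotiation (it ranks by q-value and header order only). The real version would have to mirror whatever the application layer actually does.

```python
# Hypothetical output types, in application preference order. The real
# list would have to come from the application-layer selection logic.
SUPPORTED = ("text/html", "application/json")
DEFAULT = "text/html"  # assumed fallback when nothing usable is sent


def normalize_accept(header):
    """Collapse an Accept header to exactly one supported content type."""
    if not header:
        return DEFAULT
    candidates = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # unparseable weight: treat as not acceptable
        if q > 0:
            # Sort key: highest q first, then original header order.
            candidates.append((-q, position, media_range))
    for _, _, media_range in sorted(candidates):
        for supported in SUPPORTED:
            type_wildcard = supported.split("/")[0] + "/*"
            if media_range in ("*/*", supported, type_wildcard):
                return supported
    return DEFAULT
```

With something like this in front of the cache, the stored variants are bounded by len(SUPPORTED) no matter what clients send.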
Sep 17 2019
Still TODO here before resolving: remove the ferm puppetization on the MW hosts that was allowing LVS ssh access
Sep 13 2019
The problem stems from the "Trust" in "Trusted Proxy". The user-agent string isn't a reliable source (can be set to anything by anyone), and ditto for the contents of X-Forwarded-For. So we can't decide to trust XFF contents in the absence of something reliable, and the UA string isn't it. This is why we need a list of source IPs / networks (and a way to keep them updated) to know who we can trust XFF data from.
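As a minimal sketch of what "trust XFF only from known sources" looks like in practice: the decision keys off the connecting peer's IP, never the UA string. The networks below are documentation ranges standing in for a real trusted-proxy list, and the function names are hypothetical.

```python
import ipaddress

# Hypothetical trusted-proxy networks (documentation ranges, not real
# proxies). Keeping this list current is the hard part.
TRUSTED_PROXIES = [
    ipaddress.ip_network(n) for n in ("192.0.2.0/24", "2001:db8::/32")
]


def is_trusted(addr):
    """True if addr falls inside one of the trusted proxy networks."""
    ip = ipaddress.ip_address(addr)  # raises ValueError on garbage
    return any(ip in net for net in TRUSTED_PROXIES)


def client_ip(peer_addr, xff_header):
    """Pick the address to treat as the client for this request."""
    # An untrusted peer can put anything in XFF, so ignore it entirely.
    if not is_trusted(peer_addr):
        return peer_addr
    hops = [h.strip() for h in (xff_header or "").split(",") if h.strip()]
    # Walk right to left: the first hop that isn't one of our trusted
    # proxies is the best-known client address.
    for hop in reversed(hops):
        try:
            if not is_trusted(hop):
                return hop
        except ValueError:
            # Unparseable entry: stop and fall back to the direct peer.
            return peer_addr
    # Every listed hop was a trusted proxy (or there was no XFF at all).
    return hops[0] if hops else peer_addr
```

Note that entries to the left of the first untrusted hop are never consulted, since the client could have forged them.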
Right, that would cover cases like install1002 and archiva (and probably many other minor cases we've missed which haven't set off big alarm bells), but we'll still need direct mitigation on the hosts where it matters for inbound (the cpNNNN, gerrit, etc, which probably also has a long tail of cases we haven't really noticed yet).
The URL mentioned at the top isn't a media URL, it actually is HTML content and is a pageview. Try it in your browser: https://commons.wikimedia.org//wiki/File:Arm_muscles_back_numbers.png
Sep 12 2019
T180069 - Ticket from the feature add for pybal itself
What's missing here is turning on BGP peering with all local routers, which is available in our current 1.15 pybal releases. Will fix that up here and then resolve (the rest has been live for a while for all new LVS deploys).
Sep 11 2019
Re-open as this isn't really complete yet, the battery came in and replacement is proceeding. Since @jijiki did this before and claims it's just a depool command, we'll go with that again :)
I've made a temporary MTU-related fixup on the affected eqiad and esams cache hosts. Assuming we understand the issue correctly, it should resolve the issue for fresh connections (worst case, restart your browser). Can any previous reporters confirm the same continued breakage, or new success?
Sep 9 2019
@ema would know better about how difficult such things are with ATS in particular. I tend not to like this idea in general, though. In the case of some failure causing lots of temporary pointless 404s, it might double up traffic, and it seems like a hacky crutch which we'd come to rely on instead of fixing the real underlying issues. If others feel strongly about it and it's feasible and reasonably-temporary, I can be convinced, though!
Sep 8 2019
Note we don't actually use phabricator for the actual incident response on something like this. There's no need to mess with priorities or send notifications here :)
Sep 7 2019
It was definitely the attack, not a device failure. We won't generally release fine-grained details about an attack publicly, at least not this early and while threats and mitigations continue to be an ongoing concern. While attempting to investigate and mitigate various phases and variants of the attack during various windows of time yesterday, we did take various network engineering steps which shifted global traffic around between our edges, some of which can lead to the confusing analysis results above.
Aug 30 2019
On the broader meta-topics: Long-lived canonical URLs are important, and I think that transparency.wikimedia.org seems like a more-natural fit for that (and to continue printing and publishing it). IMHO, the ideal end-game here* would be to move transparency.wikimedia.org to Automattic hosting completely and have it serve the new content directly, as well as the historical parts, and have the blog's links link into it. The currently-outlined (interim?) setup sends confusing social and technical signals (e.g. to search engines) about which of https://transparency.wikimedia.org/ or https://wikimediafoundation.org/about/transparency/ is the canonical location of the content.
There are two separate things to do here:
Some clarifying points:
Aug 29 2019
4 years later, lots of things have changed for the better, and we're starting to get near the end of this.
Bump - Whoever's in charge of Shopify on our end, can we check if they've added support for includeSubDomains and preload now in some site setting?
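For whoever ends up checking: a quick sketch of testing a Strict-Transport-Security header value for those two directives. The function names are made up, and this only checks that the directives are present. It says nothing about max-age or any other preload-list requirement.

```python
def parse_hsts(value):
    """Split a Strict-Transport-Security value into {directive: argument}."""
    directives = {}
    for part in value.split(";"):
        name, _, arg = part.strip().partition("=")
        if name:
            # Directive names are case-insensitive, so fold to lowercase.
            directives[name.strip().lower()] = arg.strip().strip('"')
    return directives


def has_subdomains_and_preload(value):
    """True if both includeSubDomains and preload directives are present."""
    directives = parse_hsts(value)
    return "includesubdomains" in directives and "preload" in directives
```

Example: has_subdomains_and_preload("max-age=31536000; includeSubDomains; preload") is True, while a bare "max-age=31536000" is False.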
It does work fine now, thanks to the new non-canonical redirect service!
Aug 27 2019
Please leave this open for now so @ema can look at a more-permanent fixup tomorrow!
Depooled cp1075 ats-be service via confctl, can someone retry and confirm mitigated?
Assigning to @ema to investigate (yes, this is the live test server for ATS backends for these servers). Most likely the problem is specific to ATS<->docker-registry, probably because the underlying service TLS certificate's SAN list doesn't match the public name docker-registry.wikimedia.org.
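For reference, the kind of check I suspect is failing looks roughly like this. It's a simplified sketch of SAN matching (exact names plus single-label wildcards only), not ATS's actual verification code, and the internal SAN name in the example is hypothetical.

```python
def san_matches(hostname, san_list):
    """True if hostname is covered by any DNS name in san_list."""
    host = hostname.lower().rstrip(".")
    for san in san_list:
        pattern = san.lower().rstrip(".")
        if pattern == host:
            return True
        if pattern.startswith("*."):
            # A wildcard stands in for exactly one leftmost label.
            head, _, rest = host.partition(".")
            if head and rest == pattern[2:]:
                return True
    return False
```

If the backend cert only lists an internal name (say, registry.internal.example), then san_matches("docker-registry.wikimedia.org", ["registry.internal.example"]) is False and a verifying client would refuse the connection.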
@Varnent: For the redirects: just the main https://transparency.wikimedia.org/ URL? Or also the sub-pages like https://transparency.wikimedia.org/content.html ? I haven't yet looked at the content for the move to /historical/, but I assume it's relatively-simple.
Holding on this until early next week, as we have too many decision-makers on vacation this week, and there are policy and security implications to granting DKIM for @wikimedia.org to a third party via Amazon SES.
Aug 21 2019
Aug 20 2019
Just stalling this so that anyone following it doesn't try to pick this up or move with it yet. There's an ongoing email thread about clarifying this task, and we're waiting for at least one person to return from a vacation and provide guidance before we move forward here.
Aug 19 2019
There's perhaps a faulty implicit assumption here that we desire to use one cert for the world and that we'd just "switch" everything to LE. We're currently using the GlobalSign cert at all edges due to various problems earlier in the year, but what we were doing in the past and would like to continue doing in the future is using two certs simultaneously from unrelated CAs, and making the split on a per-datacenter basis (with the US sites using GlobalSign, and the non-US sites using LE, in this case).
I'd start with the conftool stuff before moving on to anything that tracks gdnsd's admin_state-driven things. That whole mechanism is likely to be replaced in the next quarter or two on the gdnsd side, and I wouldn't be surprised if we end up driving the new mechanism from conftool by default.
(Also, is the specific TMH fix actually deployed to all groups yet?)
Unassign for now. The actual ask here is unclear in terms of technical details.
Aug 15 2019
General status updates and planning, for this very old ticket which is still on the radar!
Aug 14 2019
As noted in T155359 - WMDE has moved the hosting of this to some other platform, including the DNS hosting (and we never had the whois entry). So this task can resolve as Decline I think (or whatever), but we should use it to track down various revert patches first before we close it up (revert the DNS repo stuff and whatever else we've got going on in various other repos supporting the wikiba.se site).
Aug 13 2019
Looks like it to me :)
Aug 5 2019
May as well link in an earlier related ticket from late last year for more backstory, too: https://phabricator.wikimedia.org/T205609
Again today, causing a small spike of esams-specific 503s and icinga alerts:
So, yes, cloudelastic is correct in DNS for normal lookups. The issue is that the icinga check defines the virtual host entry for cloudelastic monitoring with an explicit IP in its configuration, and that IP ends up being the IP of icinga1001, not of cloudelastic. This probably has to do with the puppet host context in which the resource is evaluated.
Aug 2 2019
Re: transitioning away from SLAAC for the current fleet/setup (which I think is probably a good incremental idea, and could happen ahead of the future netbox work to make that transition easier in the future). Some thoughts on accomplishing that:
Aug 1 2019
These are ready to go for dcops-level work!
Decom in T229586
We had a quick discussion and a small informal vote and decided we don't really need this functionality (pinkunicorn) anymore, so we're going to retire it and not replace it.
Jul 31 2019
Heh, apparently I can't even remember things I read and said before even when they're right above me in the same ticket!
Rollout status update: things that are using anycast recdns resolv.conf in production as of 2019-07-31:
- All hosts in edge DCs (esams, ulsfo, eqsin)
- All cp edge cache hosts globally
- All LVS hosts globally
- Canary Mediawiki API and Appserver hosts in both core DCs
- Network devices
- Install-time stuff (as in dhcp settings and Debian installer)
Replying to myself earlier: apparently they're datestamped URIs beginning with /yyyy/mm/, examples being:
TODO list here from my POV, as best I understand things:
The 421 code is deployed and seems to be working correctly, with a fairly small global average rate of somewhere <1 req/sec. This is the most-legitimate thing we can do with these misdirected requests, and it may actually fix some of them if the UA's own confusion is truly at fault, but it may not be able to help if some kind of DNS or HTTPS proxy interference is causing persistent issues. Maybe it will at least reduce error reporting and debugging confusion in such cases, though, as 421 is very specific to this issue (vs generic 404).
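Conceptually, the distinction being drawn is something like the sketch below. The served-host set and function name are hypothetical, and how the deployed code actually decides a request is misdirected isn't described here, so treat this as an illustration of 421 vs 404 only.

```python
# Hypothetical set of hostnames this server is authoritative for.
SERVED_HOSTS = frozenset({"www.example.org", "example.org"})


def status_for_request(host_header, resource_exists):
    """Pick a response status for a request that reached this server."""
    host = (host_header or "").strip().lower().rstrip(".")
    if host not in SERVED_HOSTS:
        # The request arrived at a server that can't answer for this
        # host at all: 421 Misdirected Request, not a generic 404.
        return 421
    return 200 if resource_exists else 404
```

A UA that gets a 421 is allowed to retry the request over a different connection, which is why it may fix things when the UA's own connection reuse is at fault.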
Jul 25 2019
All the LVSes are now using the anycasted recdns, which gets rid of the LVS<->recdns dependency loop and simplifies recdns server downtime processes: https://wikitech.wikimedia.org/w/index.php?title=Service_restarts&type=revision&diff=1833705&oldid=1832671
Jul 23 2019
If we need this to work ASAP, probably the most-expedient thing to do would be to patch our puppetization to exclude the patched features from config on buster only, and use the vendor package. Traffic is in the process of moving away from nginx, hopefully by EOQ-ish, after which we won't need the problematic custom package, and the stock vendor package should work fine for other uses of the tlsproxy module (but we're not quite ready enough, yet, to mess with our current solution by removing the WMF package from stretch!).
Jul 22 2019
cp1079 and cp1080 just need normal depooling process here.
lvs1014 here will need special care: Traffic should stop puppet and pybal and monitor failover to lvs1016 ahead of work, then revert afterwards. cp1081 and cp1082 here can be depooled as normal.
(task desc edited for correct cp nodes: this rack has 77/78, not 76/77)
The Traffic nodes cp1077 + cp1078 can be depooled the usual way, but lvs1013 needs some special care. Someone from Traffic should handle and monitor that just in case (basically we need to manually disable puppet and stop pybal a few minutes in advance of the work, verify traffic moving correctly to lvs1016, and then put everything back to normal afterwards).
cp1076 - Can depool ahead of work and repool later, with the local commands "depool" and "pool"
lvs100 - Not in use and should be decommed, but this ticket made me realize we haven't made an lvs1001-6 decom ticket yet (will do shortly!)