We ended up deciding some time back that the size increase here was both within reason (in network packet / IW10 sorts of senses), and that the compatibility level it affords was necessary. Belatedly closing out this task for now!
This is also being covered (for the public-facing side of things) in https://wikitech.wikimedia.org/wiki/HTTPS/2021_Let%27s_Encrypt_root_expiry , which Johan has kindly copied out to the upcoming Monday Tech News and to https://meta.wikimedia.org/wiki/HTTPS/2021_Let%27s_Encrypt_root_expiry for potential translations!
Mon, Sep 20
Thanks for the clarity, makes a lot of sense!
Thu, Sep 16
The solution to this in the icinga version of this check was to include an additional term in the prometheus query that would cause a null result if the absolute traffic level (before the drop) is below 15K rps. The AM version doesn't have that term (perhaps because it can't handle that scheme and needs a real value?). I'm not sure if that's the best solution or if the cutoff is exactly where it should be, but I'm looking into it as I go through the AM stuff. A sketch of that kind of query guard is below.
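For reference, a minimal hypothetical sketch of the guard term in PromQL - the metric name, windows, and thresholds here are made up, not our actual check. The `and on()` clause makes the whole expression return no data when the pre-drop traffic level is under the threshold, so the drop check simply can't fire at low absolute volumes:

```
# Hypothetical sketch, not the real check: alert on a large traffic drop only
# when the pre-drop level was at least 15k rps. The "and on()" clause returns
# an empty (null) result otherwise, which suppresses the check entirely.
(
  sum(rate(requests_total[10m] offset 30m))
  -
  sum(rate(requests_total[10m]))
) / sum(rate(requests_total[10m] offset 30m)) > 0.6
and on()
sum(rate(requests_total[10m] offset 30m)) > 15000
```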
Wed, Sep 1
Traffic-Icebox now exists as a new tag with a process-informative description (click it and read!). As a first easy step, I've bulk (+silent) moved all open Traffic tickets which had no activity for >= 6 months over to it, which moved 232 of the 379 open tasks (~61%). Each move included an automated comment which I hope helps reduce any confusion. See example here: T81605#7327085 .
Thu, Aug 26
Aug 11 2021
Jul 2 2021
- dns1001 will need a manual depool so that it doesn't have knock-on effects on all of the other clusters/software in eqiad in other rows. Depool instructions are at: https://wikitech.wikimedia.org/wiki/Anycast#How_to_temporarily_depool_a_server
- The cp servers should be fine (as we have no user traffic flowing in and most other services that would loop through it internally are running in codfw-only now), but they can easily be preemptively depooled to make things smoother and safer. The simplest way to remember how is just to execute "depool" as root on the affected cp nodes themselves before the switch changes, and "pool" after it's complete.
- lvs1013 should probably be taken offline by disabling puppet and stopping pybal. All the LVSes connect to all services in all rows as routers, and to some degree any true impact should be covered by other services dealing with things at their own layer(s), but explicitly depooling it before it loses its primary host interface is probably a smart idea! (A rough sketch of these commands is below.)
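Sketch of the cp and lvs steps above, under some assumptions: the puppet/pybal commands are the generic forms rather than a verified runbook, and the ticket reference is a placeholder.

```
# cp nodes: run on each affected cp host before the switch change...
sudo depool
# ...and re-pool once the maintenance is complete:
sudo pool

# lvs1013: take it out of service before it loses its primary interface
sudo puppet agent --disable "row maintenance Txxxxxx"
sudo systemctl stop pybal
# afterwards:
sudo puppet agent --enable
sudo systemctl start pybal
```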
Jun 29 2021
Jun 8 2021
When we looked into this for the Bird-based anycast stuff, we found that the combination you want for strong service binding is both BindsTo= + After= on the same underlying service (cf https://www.freedesktop.org/software/systemd/man/systemd.unit.html ).
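For the record, a minimal sketch of that combination in a unit file - the service names here are placeholders, not our actual units:

```
# hypothetical-consumer.service - strongly bound to bird.service
[Unit]
# BindsTo= stops this unit whenever bird.service stops or goes away...
BindsTo=bird.service
# ...and After= additionally orders startup so this unit only starts once
# bird.service is up; both directives together give the strong binding.
After=bird.service

[Service]
ExecStart=/usr/bin/hypothetical-consumer
```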
Jun 4 2021
rsyslogd was down because it was repeatedly segfaulting on startup. I was able to strace the failure and see that it kept segfaulting while reading one of its own files in /var/spool/rsyslog/ on startup, which was probably corrupted somehow during a prior crash. Deleting the spool files let rsyslog start up properly, but I think at this point we're better off reimaging instead of waiting to find (or never find) some other more subtle corruption.
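Roughly what the diagnosis and workaround looked like - a sketch from memory, not an exact transcript:

```
# watch the startup crash to see what rsyslogd was touching when it segfaulted
strace -f -o /tmp/rsyslog.strace rsyslogd -n
grep -B5 SIGSEGV /tmp/rsyslog.strace    # last reads were files under /var/spool/rsyslog/

# workaround: clear the (likely corrupted) spool files and restart
systemctl stop rsyslog
rm /var/spool/rsyslog/*
systemctl start rsyslog
```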
Jun 3 2021
@RBrounley_WMF I think he's waiting on me, sorry! Will sync up with him
May 20 2021
bump for testing purposes
May 19 2021
May 13 2021
May 11 2021
A lot of what's in that zonefile will of course change for the new DNS setup, or is irrelevant to a smooth transition, etc. The key parts to highlight:
May 6 2021
May 5 2021
These are all pooled now and slowly filling their caches. Optimistically closing this task for now!
The others were in the same state. All are fixed and rebooted now, icinga downtimes are removed, netbox status is set to Active, and confctl weights are set correctly, but the pooled attribute is still set to inactive.
I checked the BIOS/iDRAC settings on cp5013 against https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_Documentation#Initial_System_Setup (+ the one custom setting we use on these modern cps, which is to disable the unused onboard NICs), and ended up making these 3 changes to bring it into conformance:
These are just about ready and running correct puppetization, but don't pool these yet. I think they may have some bad BIOS settings or something, at least related to power management; the cpufreq puppetization keeps attempting to reset the governor on every puppet run. Will check tomorrow.
Apr 29 2021
Note - https://gerrit.wikimedia.org/r/c/operations/puppet/+/683026 has the production roles and config, but we'll need to reimage them into this rather than just applying it, in order to get the nvme storage and partman set up consistently.
Apr 28 2021
Continuing the thought above: the varnishlog data suggests that most of the perf impact could be restored just by extending grace to something like 5 minutes.
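If we go that route, the VCL side of it would be tiny - a sketch only, with the 5-minute value as the ballpark from above rather than a decided number:

```
sub vcl_backend_response {
    # keep serving slightly-stale objects for up to 5 minutes while a
    # background fetch refreshes them, instead of the current shorter grace
    set beresp.grace = 5m;
}
```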
Apr 27 2021
Traffic lvs/cp/dns are all repooled, un-downtimed, and green.
Note to our future selves: we forgot to consider the cross-row LVS connections in this downtime: lvs2008 and lvs2010 do not live in row C at all, but had cross-row connections via C2 to reach all the rest of the service hosts in row C!
Traffic stuff (lvs/cp/dns) is depooled, downtimed, and ready for the network fixups.
Apr 20 2021
Apr 7 2021
Mar 31 2021
Mar 30 2021
Seems ok for the ~14h it's been back online so far. I'm going to re-pool this and tentatively resolve the ticket, hoping it was a fluke event, but I won't clear the SEL. If we get a recurrence, we'll re-open and kick this over to dcops.
Mar 16 2021
@bcampbell - Updated with the new record, try again?
Mar 15 2021
Mar 12 2021
@bcampbell sorry for the delays, this has repeatedly fallen through the cracks, but it's reviewed + merged now and should verify!
Mar 10 2021
@crusnov - This has been followed up offline from phab with some meetings, the output of which isn't (yet) reflected in phab - in case you're just checking whether it's being ignored or not! :)
Yeah, I tend to prefer option 2 as well. The other option could work in the short term, although maybe we'd want to evaluate whether it's sufficient in all respects (e.g. how early it renews, how it retries on failures, whether it burns up ratelimits aggressively in a failure scenario in a way that could impact acme-chief's use of the same ratelimits, how we monitor for renewal failures, etc.). It seems easier to just use our standardized solution, where we're solving these problems centrally.
Mar 8 2021
^ There was a last-minute change of plans, so we made a last-minute call to expend a little bit of our overcautious-ness budget and do all 7 wikis at once (as opposed to ptwiki separately from the other 6).
Mar 2 2021
@ovasileva Yes, that plan seems reasonable!
Feb 26 2021
Following up a bit on other paths through this problem:
Feb 25 2021
So, to expand a little bit on the text quoted at the top with some initial insights about cutoff vs nuke-limit tradeoffs and some of my current thinking and/or assumptions:
I've spun out T275809 to go into some depth on the #1 part about large_objects_cutoff
Updates on where we're at on some of the pain points above, in terms of solution analysis:
Feb 24 2021
@Joe yeah I'm not sure which layer is causing the logstash appearance there. It's from restbase1019 as a client towards something, maybe parsoid?
Hi serviceops - I've run into some of the effects of this recently and tracked down this ticket, which seems a relevant/recent reference point.
Feb 18 2021
This seems pretty straightforward operationally; I think we can replicate the techniques used in T256750 for more wikis in general.
Feb 16 2021
Feb 11 2021
I'm probably not up to date on concrete plans built on top of this, but it seems like having the numeric vlan id might be useful metadata here in addition to the abstract name of the vlan (e.g. scenarios where we might do vlan trunking on the main interface of the host and need to see or match that primary-vlan number in some interface setup scripts?)
Feb 1 2021
The more interesting Netbox question here is what the correct way is to define a new tagged virtual interface that doesn't exist yet (the loop I had with the interface-name dropdown and puppetdb and homer, etc.).
It looks like the /27 is what I manually created, and then the /32 was probably patched in later from puppetdb after everything was configured and running.
Jan 29 2021
Jan 28 2021
FTR: Yes it was a netbox thing, and these are now created as:
Jan 20 2021
All appears healthy now and downtimes are removed, and librenms isn't showing those errors on the interface anymore, either. Thanks!
@Cmjohnson - we'll have it ready then.
@Cmjohnson Yes, just let me know a timeframe and we'll get it ready
Jan 19 2021
@Cmjohnson - let me know when you're ready to deal with this, and we'll stop service on the node and fail things over to lvs1016.
Jan 15 2021
We actually do have some upcoming projects which might necessitate more Ganeti capacity. In general the plan is to move all the non-ganeti DNS boxes into ganeti if possible, and to spin up DoH instances in ganeti everywhere as well (which may turn out to need multiple instances and have real scaling issues). But we don't need more capacity there just yet, and so long as they're kept powered up as online spares, we can always deal with the decision to move them into the cluster at a later time.
Jan 14 2021
@Cmjohnson - Please do it at your earliest convenience. It's not in the flow of live traffic and doesn't need any "depool" AFAIK (but it is problematic that we don't have it as a reliable backup option!).
Jan 12 2021
There are some anomalies in the network graphs on authdns1001 that I hadn't noticed until today, which go all the way back to Oct 26, which is probably around when this started. I'm not sure if they're artificial or not (nothing seems to be wrong), but I'm going to do a precautionary reboot anyway. More likely than not it's something to do with stats reporting itself, which may have become a bit confused with the root disk out of space and never truly recovered, since we never rebooted.
Dec 16 2020
There's probably a lot of context missing here, although we can gather some from https://www.mediawiki.org/wiki/Okapi and https://meta.wikimedia.org/wiki/Okapi . Perhaps we could get a primer on where the project is at, what temporary purpose these names will be put to, where the IPs will be hosted, what kind of software stack is deployed, and the processes around deployment and management?
Dec 14 2020
We probably should reach out to them and push on this, though. We do have standards that apply ( https://wikitech.wikimedia.org/wiki/HTTPS ); it's just been a while since we've manually audited everything like in https://wikitech.wikimedia.org/wiki/HTTPS/Domains
Dec 7 2020
(I'm guessing they should probably be updated to the correct file, and also to mention that it has to be in state: production before deploying the DNS mock_etc part of things, but I'm not sure as I didn't change that stuff....)
There are comments at the top of the DNS repo's utils/mock_etc/discovery-geo-resources and utils/mock_etc/discovery-metafo-resources about avoiding this scenario by updating things in the correct order. I think the comments themselves are outdated now, as they don't know about the monitoring_setup state and they point at a hieradata file that doesn't exist anymore...
Nov 25 2020
(and to throw another dimension into the matrix of possibilities above - also whether the client is sending a session cookie to Vary on in either or both requests)
I think especially if you start considering how Vary: Cookie works in all the above (both for MW on the related 200 and 304 outputs, and in the caches and our VCL), it's quite murky to me whether all of this works sanely in this case. For a given URI, I think we can assume (or at least hope) that Vary: Cookie would be consistently either emitted or not emitted with all outputs for that URI (even 304s). But whether we're tracing a V:C or non-V:C case probably changes how all of the above plays out with the bgfetch as well, due to vary-slotting, if the original was supposedly cacheable and the follow-up response has a Set-Cookie (which is hopefully uncacheable).
Nov 24 2020
@Gilles - please excuse the extremely long response! :)
Nov 23 2020
Various related gdnsd fixes were deployed to production with version 3.4.1 of upstream.
Nov 19 2020
@ema - Reminder to both of us - can you take a peek at this on Monday, please?
Nov 18 2020
No reports of the PDF truncations in NEL for ~8 hours now, which is a significant break from recent trends. Can anyone else still repro this in any way?
This should be fixed now!
The proposed changes are live now. It may take a few hours to confirm that via NEL at our current sample rate. At least my own artificial reproductions seem to have gone away though, for whatever that's worth!
I'm not exactly sure as to why the pattern above emerged, but now I don't think it's relevant at all, just an artifact of the global distribution of various kinds of traffic.
I haven't been able to repro this on a public endpoint from my own home connection, even using the random-fetcher script, but that would all be against one cache in codfw.
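For anyone else who wants to try, here's the general shape of the loop I've been using - a minimal hypothetical version, not the actual random-fetcher script, and the URL is just a placeholder:

```
#!/bin/bash
# Hypothetical repro loop: fetch a PDF repeatedly and rely on curl's exit
# code 18 ("partial file" / transfer closed early) to flag truncated bodies.
url="https://upload.wikimedia.org/wikipedia/commons/example.pdf"   # placeholder URL
for i in $(seq 1 100); do
    curl -s -o /dev/null "$url"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "fetch $i failed with curl exit code $rc (18 = truncated transfer)"
    fi
done
```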