We actually do have some upcoming projects which might necessitate more Ganeti capacity. In general the plan is to move all the non-Ganeti DNS boxes into Ganeti if possible, and also to spin up DoH instances in Ganeti everywhere (which may turn out to need multiple instances and have real scaling issues). But we don't need more capacity there just yet, and so long as they're kept powered up as online spares, we can always deal with the decision to move them into the cluster at a later time.
Thu, Jan 14
@Cmjohnson - Please do it at your earliest convenience. It's not in the flow of live traffic and doesn't need any "depool" AFAIK (but it is problematic that we don't have it as a reliable backup option!).
Tue, Jan 12
There are some anomalies in network graphs on authdns1001 that I hadn't noticed until today, which go all the way back to Oct 26, which is probably around when this started. I'm not sure if they're artificial or not (nothing seems to be wrong), but I'm going to do a precautionary reboot anyway. More likely than not it's something to do with stats reporting itself, which may have become a bit confused while the root disk was out of space and never truly recovered, since we never rebooted.
Dec 16 2020
There's probably a lot of context missing here, although we can gather some from https://www.mediawiki.org/wiki/Okapi and https://meta.wikimedia.org/wiki/Okapi . Perhaps we could get a primer on where the project is at, what temporary purpose these names will be put to, where the IPs will be hosted, what kind of software stack is deployed, and the processes around deployment and management?
Dec 14 2020
We probably should reach out to them and push on this, though. We do have standards that apply ( https://wikitech.wikimedia.org/wiki/HTTPS ); it's just been a while since we've manually audited everything like in https://wikitech.wikimedia.org/wiki/HTTPS/Domains
Dec 7 2020
(I'm guessing they should probably be updated to point at the correct file, and also to mention that it has to be in state: production before deploying the DNS mock_etc part of things, but I'm not sure, as I didn't change that stuff...)
There are comments at the top of the DNS repo's utils/mock_etc/discovery-geo-resources and utils/mock_etc/discovery-metafo-resources about avoiding this scenario by updating things in the correct order. I think the comments themselves are outdated now, as they don't know about the monitoring_setup state and they point at a hieradata file that doesn't exist anymore...
Nov 25 2020
(and to throw another dimension into the matrix of possibilities above - also whether the client is sending a session cookie to Vary on in either or both requests)
I think especially once you consider how Vary: Cookie works in all the above (both for MW on the related 200 and 304 outputs, and in the caches and our VCL), it's quite murky to me whether all of this works sanely in this case. For a given URI, I think we can assume (or at least hope) that Vary: Cookie is consistently either emitted or not-emitted with all outputs for that URI (even 304s). But whether we're tracing a V:C or non-V:C case probably changes how all the above plays out with the bgfetch as well, due to vary-slotting, if the original was supposedly cacheable and the followup response has a Set-Cookie (which is hopefully uncacheable).
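To make the vary-slotting part concrete, here's a minimal model of how a Vary: Cookie response splits a single URI into separate cache slots. This is a sketch only, with made-up names; it's not Varnish's actual internal implementation:

```python
# Sketch: model of vary-slotting, not Varnish's real cache-key logic.
def vary_key(url, vary_header, request_headers):
    """Build the cache key that a response's Vary header implies."""
    if not vary_header:
        return (url,)
    # One slot per distinct combination of the varied request headers.
    varied = tuple(
        (name, request_headers.get(name, ""))
        for name in sorted(h.strip().lower() for h in vary_header.split(","))
    )
    return (url,) + varied

# With Vary: Cookie, the same URI lands in different slots depending on
# whether the client sent a cookie, so the original 200 and the later
# 304/bgfetch can be traced through different slots:
assert vary_key("/wiki/X", "Cookie", {}) != \
       vary_key("/wiki/X", "Cookie", {"cookie": "session=abc"})
# Without Vary, cookie presence doesn't matter:
assert vary_key("/wiki/X", None, {}) == vary_key("/wiki/X", None, {"cookie": "s"})
```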
Nov 24 2020
@Gilles - please excuse the extremely long response! :)
Nov 23 2020
Various related gdnsd fixes were deployed to production with version 3.4.1 of upstream.
Nov 19 2020
@ema - Reminder to me and you both - Can you take a peek at this Monday please?
Nov 18 2020
No reports of the PDF truncations in NEL for ~8 hours now, which is a significant break from recent trends. Can anyone else still repro this in any way?
This should be fixed now!
The proposed changes are live now. It may take a few hours to confirm that via NEL at our current sample rate. At least my own artificial reproductions seem to have gone away, though, for whatever that's worth!
I'm not exactly sure why the pattern above emerged, but now I don't think it's relevant at all, just an artifact of the global distribution of various kinds of traffic.
I haven't been able to repro this on a public endpoint from my own home connection, even using the random-fetcher script, but that would all be against one cache in codfw.
Nov 8 2020
Nov 5 2020
Nov 1 2020
Oct 29 2020
All the authdns are restarted with the infinite limit applied. There's been some IRC discussion about a few possible spinoff tickets here:
We can route different URI subspaces differently at the edge layer, based on URI regexes, as shown here for the split of the API namespace of the primary wiki sites:
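The original config snippet isn't reproduced above. As a rough illustration of the routing idea only (the real thing is VCL at the edge; the patterns and cluster names here are hypothetical):

```python
import re

# Hypothetical URI-subspace routing table: first matching regex wins.
ROUTES = [
    (re.compile(r"^/w/api\.php"), "api"),    # action API subspace
    (re.compile(r"^/api/rest_v1/"), "api"),  # REST API subspace
]
DEFAULT_CLUSTER = "text"  # everything else stays on the default cluster

def route(uri: str) -> str:
    """Pick a backend cluster for a request URI."""
    for pattern, cluster in ROUTES:
        if pattern.search(uri):
            return cluster
    return DEFAULT_CLUSTER

assert route("/w/api.php?action=query") == "api"
assert route("/wiki/Main_Page") == "text"
```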
Oct 26 2020
Notes on the large increase in large_objects_cutoff from late last week:
Oct 21 2020
Recording from IRC for posterity:
11:07 < bblack> so I was checking out https://gerrit.wikimedia.org/r/c/operations/puppet/+/633704 (which is one of the partman cleanup commits, this one affecting our cacheproxy disk layout), and I've fallen back down the rabbithole of swap space considerations
11:07 < bblack> because the new defaults basically take the partman defaults without any specifics, which is apparently going to create a 1G swap partition (at least, until some future change of defaults?)
11:08 < bblack> in the recent past we didn't have swap partitions on the cache boxes
11:08 < bblack> (and of course, modules/base for better or worse has vm.swappiness = 0, along with some other sometimes-questionable tunables)
11:10 < bblack> I get all the arguments for why disabling swap is probably a dumb idea in most common scenarios
11:11 < bblack> but, I think, if we're taking that angle as a reason to configure swap, we'd probably also relax that swapiness=0 setting as well to make it more useful, and perhaps configure something other than a fixed 1G value for wildly varying workloads and phys memory sizes, too
11:11 < bblack> the current setup (yes, do swap, but at a small fixed sizes with swapiness=0) seems to be in some less-ideal intersection of competing ideas
11:12 < bblack> but in any case, rewinding to the cacheproxy case in particular
11:14 < bblack> these boxes have: 384GB of RAM, the bulk of which is hopefully a fast ram cache for http objects, and a 1.6TB super-fast nvme that's used as an http object disk cache as well (it's the backing for the earlier ram cache), and then a mere ~300G of standard-issue (slower) SSDs for this rootfs stuff
11:16 < bblack> we really don't have any hope of a substantially-useful swap config (where e.g. some mostly-idle or inefficient meta-daemons related to monitoring or something might swap out significant RAM that matters to us), the disks we'd potentially use for swap are actually-smaller than the real RAM, and we're very sensitive to the fact that we'd rather risk OOM than have the kernel make a dumb
11:16 < bblack> decision on swap (e.g. algorithmically make the mistake of swapping out some critical varnishd cache memory and then need to swap it back in during an cache hit for users)
11:17 < bblack> but I don't think our new regime of standard configs allows for a swapless setup?
11:23 < bblack> or: we could make some noswap variants so that it's semi-standardized?
11:24 < bblack> I don't want to derail standardization efforts, and I feel like allowing exceptions can turn into a lot of exceptions over time
11:24 < bblack> really we should revisit the swap question in general, but I think that goes well out of the partman-cleanup scope
11:25 < bblack> (but affects partman, too)
11:30 < bblack> for future sorting out of swap questions in general: we could also make the argument to just never configure swap *partitions*, and have some base/standard puppetization create/manage swap *files* on the rootfs instead, which makes tuning and runtime changes simpler, etc. I doubt there's a perf diff we'd care about in that particular case.
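For the swap-file alternative in that last IRC line, the runtime steps are roughly the following. This is a sketch only: the path and size are illustrative, and real management would live in puppet rather than an ad-hoc script:

```python
import subprocess

def ensure_swapfile(path="/swapfile", size="1G"):
    """Create and enable a swap file on the rootfs (requires root).
    Path and size are illustrative defaults, not our standards."""
    subprocess.run(["fallocate", "-l", size, path], check=True)  # preallocate file
    subprocess.run(["chmod", "0600", path], check=True)          # lock down perms
    subprocess.run(["mkswap", path], check=True)                 # format as swap
    subprocess.run(["swapon", path], check=True)                 # enable it
```

Resizing or removing is then just swapoff plus recreating the file, which is the "tuning and runtime changes simpler" point above.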
That is what I was thinking too, but I'm not sure whether the VCL state diagram lets us see that at the right point in time to make the decision. We'd have to store some state with the pass object somehow, at least? The flow in the particular case I was observing is like this:
Oct 20 2020
I stumbled on T266040 while looking at something unrelated, but now I'm remembering that earlier in this ticket there was some mention of a possible correlation with larger response sizes in this report as well. The stuff in T266040 would have some statistical negative effect primarily on objects >= 256KB (the hieradata-controlled large_objects_cutoff), and a secondary effect on overall backend caching efficiency for everything else (by wasting space on pointless duplication of content). AFAIK what we're looking at in that ticket isn't something that changed from V5 to V6, though... but I could be wrong, or it could be that some other subtle change in V6 behavior exacerbated the impact.
Oct 19 2020
Oct 16 2020
Oct 8 2020
FWIW, I am in general a fan of REJECT over DROP, especially when there's not even a great obscurity argument, as is the case here. It will be a change for us internally regarding the debugging woes mentioned, but it's more correct overall, and letting things that will eventually fail do so faster seems like a net win :)
Oct 7 2020
Oct 6 2020
For IPv4 you'll have to break up the <24 cases into /24 containers somehow, because the DNS zones themselves are /24 in that case. Perhaps we need to make some new containers in netbox to represent the DNS-level abstraction in this case?
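A quick sketch of the splitting part (illustrative only; the netbox-container side is the open question above):

```python
import ipaddress

def to_slash24s(prefix: str):
    """Split a supernet (prefix shorter than /24) into the /24 chunks
    that map one-to-one onto in-addr.arpa reverse zones."""
    net = ipaddress.ip_network(prefix)
    if net.version != 4 or net.prefixlen >= 24:
        return [net]  # already fits within a single /24 zone
    return list(net.subnets(new_prefix=24))

# A /22 allocation spans four separate /24 reverse-DNS zones:
print(to_slash24s("198.51.100.0/22"))
# [IPv4Network('198.51.100.0/24'), ..., IPv4Network('198.51.103.0/24')]
```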
Oct 5 2020
With the dupe merger, maybe we owe a status update here:
Oct 2 2020
Eh maybe a few more to think about too:
Just throwing in some random points/counterpoints to ponder:
Oct 1 2020
@Cmjohnson replaced the SFPs on both ends of this link before my reboot above. Since the reboot, we don't seem to have any abnormal rate of interface failures, nor do we yet observe ProxyFetch failures in pybal's logs.
The link has gotten worse and begun flapping up and down rapidly since the last update, causing a loss of routing to the row. I've now downtimed the whole host in icinga, disabled puppet on the host, and manually downed the interface to stop the flapping.
Sep 30 2020
Sep 29 2020
This should've been closed back when T250781 closed - all purge traffic now goes via kafka queues and multicast purging is no more. We might have more to do on rate reduction separately in T250205, but I don't think that needs to hold this ancient, epic, somewhat ambiguous task open.
This is too stale now; a lot of these bits have been replaced over time and are known to have their deps correct.
Added a subtask for the one Traffic case I can find here, and removing our tag from this.
All the perf tradeoffs and relatively-trivial work aside, the major blocker we still face here is the problems likely to be created by either of the simplest methods of handling this at the tls / varnish-fe layers:
lvs100 don't exist anymore, removing Traffic from this.
Sep 24 2020
Sep 23 2020
Probably related to the transient memory issues discussed in various tickets: T164768 T165063 T249809. In any case, this is almost a year old with no investigation, and at least needs a fresher report at this point.
Declining for now, as multiple implicated parts of the software stack have changed significantly since this report, and nothing was conclusively found back then. Please file a new one if there's more to look at with current errors!