The perf issues are definitely relevant for traffic's use of haproxy (in a couple of different roles). Your option (making a libssl1.1-dev for bookworm that tracks the sec fixes that are still done for the bullseye case, and packaging our haproxy to build against it) would be the easiest path from our POV, for these cases.
Wed, Nov 29
Followup: did a 3-minute test of the same pair of parameter changes on cp3066 for a higher-traffic case. No write failures detected via strace in this case (we don't have the error log outputs to go by in 9.1 builds). mtail CPU usage at 10ms polling interval was significantly higher than it was in ulsfo, but still seems within reason overall and not saturating anything.
I went on a different tangent with this problem, and tried to figure out why we're having ATS fail writes to the notpurge log pipe in the first place. After some hours of digging into this problem (I'll spare you the endless details of temporary test configs and strace outputs of various daemons' behavior, etc.), these are the basic issues I see:
Wed, Nov 22
Thu, Nov 9
Tue, Nov 7
I don't suspect it serves any real purpose at present, unless it was to avoid some filtering that exists elsewhere to avoid cross-site sharing of /32 routes or something.
Oct 19 2023
Oct 16 2023
One potential issue with relying solely on MSS reduction is that, obviously, it only affects TCP. For now this is fine, as long as we're only using LVS (or future liberica) for TCP traffic (I think that's currently the case for LVS anyways!), but we could add UDP-based things in the future (e.g. DNS and QUIC/HTTP3), at which point we'll have to solve these problems differently.
Could we take the opposite approach with the MTU fixup for the tunneling, and arrange the host/interface settings on both sides (the LBs and the target hosts) such that they only use a >1500 MTU on the specific unicast routes for the tunnels, but default to their current 1500 for all other traffic? If per-route MTU can usefully be set higher than base interface MTU, this seems trivial, but even if not, surely with some set of ip commands we could set the iface MTU to the higher value, while clamping it back down to 1500 for all cases except the tunnel.
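To make the idea concrete, a rough sketch of what that might look like with ip commands (interface names and addresses here are entirely made up, and IIRC Linux won't honor a per-route mtu larger than the link MTU, hence the raise-then-clamp dance):

```
# Raise the link MTU so jumbo tunnel frames are possible at all:
ip link set dev eno1 mtu 9000

# Default route (and everything else) stays clamped at 1500:
ip route replace default via 10.0.0.1 dev eno1 mtu 1500

# Only the specific unicast route to the tunnel endpoint uses the big MTU:
ip route replace 10.64.0.42/32 via 10.0.0.1 dev eno1 mtu 9000
```

This is just a sketch of the concept, not a tested config; the real answer depends on how our interface/routing puppetization lays things out.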
Oct 11 2023
Oct 3 2023
Looks about right to me!
We could add some normalization function at the ferm or puppet-dns-lookup layer perhaps (lowercase and do the zeros in a consistent way)?
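For the normalization piece, something like Python's stdlib ipaddress module already gives us a canonical form (lowercase hex, consistent zero-compression); a minimal sketch, with the function name and its placement in the ferm/puppet-dns-lookup layer being hypothetical:

```python
import ipaddress

def normalize_ip(addr: str) -> str:
    """Return a canonical spelling of an IP address: lowercase hex,
    zeros compressed the same way regardless of input formatting."""
    return ipaddress.ip_address(addr).compressed

# All of these spellings collapse to the same canonical string:
print(normalize_ip("2620:0:860:0102:0:0:0:1"))  # 2620:0:860:102::1
print(normalize_ip("2620:0:860:102::A"))        # 2620:0:860:102::a
```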
Sep 27 2023
Sep 25 2023
To clarify and expand on my position about this thread count parameter (which is really just a side-issue related to this ticket, which is fundamentally complete):
Sep 24 2023
Sep 22 2023
Adding to the confusion: historically, we once used the hostname cp1099 back in 2015 for a one-off host: T96873 - therefore that name already exists in both phab and git history, confusingly.
Reading a little deeper on this, I think we still have a hostnames issue, if those other 8 hosts are indeed being brought over from ulsfo+eqsin. Those 8 hosts, I presume, would be 1091-8, and so these new hosts should start at 1099, not 1098?
@VRiley-WMF - Sukhbir's out right now, but I've updated the racking plan on his behalf!
Sep 18 2023
Sep 15 2023
There's a followup commit that was never merged, to re-enable pybal health monitoring on all the wikireplicas: https://gerrit.wikimedia.org/r/c/operations/puppet/+/924508/1/hieradata/common/service.yaml
Sep 14 2023
https://grafana.wikimedia.org/d/000000513/ping-offload might be a good starting point (might need some updates/tweaking to get the exact data you want, though)
some sort of rate-limiting configured on the switch-side for ICMP echo, which was IP-aware and didn't count packets from our own internal systems
Sep 8 2023
Reading into the code above and the history more and self-correcting: the ratelimiter doesn't apply to PTB packets, just some other informational packets. Apparently we bumped the ratelimiter first as a short-term mitigation (for all the sites), I guess primarily to avoid what looks like ping loss to our monitoring and/or users, then deployed the ping offloader in some places as well as a better way to deal with it (and I guess at thousands per second, the pps reduction probably is useful, although I don't know to what degree).
The current puppetized tuneables are at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/8ed59718c7a7603b61d7d42e05726fd11dae5eaa/modules/lvs/manifests/kernel_config.pp#49
to reduce load on LVS hosts
Sep 5 2023
This topic probably deserves a ~1-hour meeting w/ Traffic to hash out some of the potential solutions and tradeoffs, but I'm gonna try to bullet-point my way through a few points for now anyways to seed further discussion:
Jun 12 2023
The more I've thought about this issue, I think we should probably stick with the (very approximate) latency mapping we have, and not try to have a second setup to optimize for the codfw-primary case. I do think we should swap the core DCs at the front of the global default entry on switchover, though, and the patch above makes that spot a little more visible. There shouldn't be any hard dependencies between this and other steps, but it could be done around the start of the switchover process asynchronously.
May 31 2023
We've got a pair of patches to review now which configure this on the pybal and safe-service-restart sides. We could especially use serviceops input on the latter. None of it's particularly pretty, but at least it's fairly succinct and seems to do the job!
May 30 2023
Note: I restored+amended https://gerrit.wikimedia.org/r/c/operations/puppet/+/924342 and merged+deployed it on lvs1018+lvs1020. This seems to work and disable the problematic monitoring that impacts LVS itself.
May 27 2023
Apr 27 2023
I like this direction (etcd). It's not super-trivial, but we've complained a lot even internally about the lack of etcd support for depooling whole sites at the public edge.
Apr 26 2023
Probably needs subtasks for two things:
- Fix "safe-service-restart.py" being unsafe (either it or its caller is failing to propagate an error upstream to stop the carnage, and it's also leaving a node depooled when the error happens between the depool and repool operations. At least one of those needs fixing, if not both).
- The whole 'template the local appservers.svc IP into the "instrumentation_ips"' thing at the pybal level, plus whatever changes are needed to use it from the scap side of things (so that it only checks one local pybal, and it's the correct one by current pooling).
The patch has been rolled out everywhere for a little while at this point, so we should be able to confirm success.
We had a brief meeting on this, and I think the actual problem and immediate workaround is actually much simpler than we imagined. We're going to apply the same workaround we did for MediaWiki traffic in T238285 ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/882663/ ) to the Restbase traffic for now. Patch incoming shortly!
Apr 25 2023
I think we need to rewind a step here. We do want mh, but we want it for the current public sh cases (basically: text and upload ports 80+443), and maybe the other three sh cases (kibana + thanos), although we can start with text+upload first and then talk about those others with the respective teams.

The current ticket description and patches seem to be going after the opposite: switching the current wrr services to mh via hieradata and spicerack changes. I think this would be actively harmful. sh and mh choose the destination based on hashes of the source address, which is great for public-facing traffic, but would be hashing on our very limited set of internal cache exit IPs (or other internal service clusters, for internal LVS'd traffic), and so it wouldn't balance very well at all. One could potentially address that by including the source port in the hash, but it still seems like it would be more-complicated and less-optimal than just sticking with wrr for these cases.
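A quick way to see the balancing concern: hash a handful of internal source IPs into 8 backends. This is a simplified stand-in, not the real IPVS schedulers (mh is Maglev-based internally), but the distribution argument is the same:

```python
import hashlib

BACKENDS = 8

def bucket(key: str) -> int:
    # Simplified stand-in for source-hash scheduling; real IPVS sh/mh
    # use different hash functions, but the pigeonhole argument holds.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % BACKENDS

# A small internal client set (e.g. a few cache exit IPs, addresses
# made up here) can only ever land on at most 4 of the 8 backends:
internal = [f"10.2.1.{i}" for i in range(4)]
print(len({bucket(ip) for ip in internal}))  # <= 4 backends used

# Mixing the source port into the hash key spreads the same clients
# over many more backends:
with_ports = {bucket(f"{ip}:{p}") for ip in internal for p in range(20000, 20100)}
print(len(with_ports))
```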
Apr 17 2023
So, the solution quoted from my IRC chat above: that's about making the depool verification code actually track the currently-live "low-traffic" (applayer/internal) LVS routing, as opposed to what it's doing now (which I think checks the primary+secondary for the role as-configured in puppet, which doesn't account for any failure/depool/etc at the LVS layer).
Apr 14 2023
It's awesome to see this moving along! One minor point:
Apr 13 2023
Mar 30 2023
Mar 13 2023
Resilient hashing indeed sounds much better (it seems like that's their codeword for some internal "consistent hashing" implementation), but it doesn't look like our current router OS have it, at least not when I looked at cr1-eqiad.
Mar 6 2023
The redirects are neither good nor bad, they're instead both necessary (although that necessity is waning) and insecure. We thought we had standardized on all canonical URIs being of the secure variant ~8 years ago, and this oversight has flown under the radar since then, only to be exposed recently when we intentionally (for unrelated operational reasons) partially degraded our port 80 services.
Mar 2 2023
Is there a reasonable shopify alternative that meets policy? That would be my question. If there isn't, we're stuck with this policy violation, but shouldn't stop calling it out as a violation. If there is, we should take a look at whether they can also meet our tech policies for our domains. Another viable resolution is simply to move the service to a novel domainname that isn't entangled with our infrastructure (wikipedia-shop.com or whatever), and manage it entirely separately from our production policies and practices.
Mar 1 2023
The intermittent availability of port 80 is part of ongoing operational work, which is why it worked briefly earlier. However, the correct fix from the user POV is to not use port 80 in the first place (use HTTPS, not HTTP).
Please use HTTPS rather than unencrypted HTTP, in any URIs referencing Wikimedia sites, e.g. https://www.wikidata.org/entity/Q42
Feb 23 2023
Maybe worth pointing out (I had an old stale link to this years ago earlier in the ticket), if nothing else because it may cause whomever at shopify to actually reach out to an engineer:
I assume "makes sense" here probably covers cases where shopify knows of or has configured actual subdomains of the domain in question, or something like that. In either case, yes, pointing out that "preload" requires "includeSubDomains" should help with that part. preload is what we really want.
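For reference, the header shape the preload list requires (max-age of at least one year, plus both directives) is:

```
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
```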
Looks good to me, and appropriate at the Varnish layer in this case.
Feb 21 2023
Feb 6 2023
Jan 25 2023
With the merge above, I think this issue is at least mitigated for now. It's not a great long-term solution, but it should alleviate the user-facing side of this in practice.
Jan 24 2023
Jan 18 2023
I fixed all these cases noted above for now. Note that in the lvs1017 case, this could've potentially caused a public service outage for IPv6 text-lb. This is because @ipaddress6 was also templated into pybal.conf as the BGP next-hop address. After the fixup, the puppet agent fixed that:
Bump - these issues continue to affect us sometimes. There seem to be some cases where Juniper can mis-route an RA to an interface it doesn't belong on (interface is on vlanX, but gets an RA that should only ever be seen on vlanY). During this past week/weekend's switch issues in codfw, this caused all hosts in rack B2 (which are all in the private1-b-codfw vlan, 2620:0:860:102:) to receive RAs from the cloud-hosts1-codfw vlan (2620:0:860:118:).
Jan 17 2023
Jan 12 2023
https://gerrit.wikimedia.org/r/c/operations/puppet/+/875897/ ! (apparently someone was already working on this!)
Jan 9 2023
We have the patched package on cp5032 (bullseye). Did some manual testing on it today:
Jan 5 2023
Summarizing some of the lengthy IRC discussion and investigation on this topic (most of which was @Vgutierrez !):
Dec 14 2022
Dec 9 2022
Dec 8 2022
This is completed now. AFAIK all relevant scripts/automations/etc were updated to match. The conftool service keys for cacheproxy nodes are now just cdn, which controls pooling of the front-edge port 80/443 services towards pybal, and ats-be, which controls pooling of the varnish->ats-be cross-node chashing (except in ulsfo and eqsin, which have switched to a single-backend model and don't do this part anymore).
Dec 7 2022
Dec 2 2022
Confirmed the same. That was a simple and elegant fix, so I doubt there's any reason to pursue more-complex options! Thank you!
Comparison point: operations/dns is quite a bit different: total byte size is ~1/4 the size (~6MB vs the ~22MB size of netbox-exports), but has ~4x more commit history (~6000 for ops/dns vs ~1500 for netbox-exports). A fresh clone of this from eqsin (from the gerrit server over https) takes ~6s.
Another thing I failed to mention above: for cases like this (automated git clones just for functional data) we could also potentially speed up the initial fetch by doing shallow checkouts (--depth=N), however, the non-smart HTTP protocol doesn't support this.
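To illustrate what a shallow checkout buys us, a self-contained demo against a throwaway local repo (paths made up; the real case would be the netbox-exports remote, and --depth only works over smart transports, which is exactly why the dumb HTTP protocol rules it out):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/origin"
cd "$tmp/origin"
git config user.email t@example.org
git config user.name t
# Build a tiny history of 3 commits:
for i in 1 2 3; do echo "$i" > data && git add data && git commit -qm "c$i"; done
cd "$tmp"
# file:// uses the smart protocol, so shallow clones work here:
git clone -q --depth=1 "file://$tmp/origin" shallow
git -C shallow rev-list --count HEAD          # prints 1
git -C "$tmp/origin" rev-list --count HEAD    # prints 3
```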
Nov 4 2022
Arguably we want this down-server cache time to be very low or even disabled in the general case. It's not likely that caching the origin outage is going to help us more than hurt us (unless it's the only thing that prevents ats-be meltdown due to dead origin A having a bigger effect on unrelated origins B, C, D...)
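If I'm reading the ATS docs right, the knob in question is down_server.cache_time in records.config; disabling the caching of origin-down state would be a sketch like:

```
CONFIG proxy.config.http.down_server.cache_time INT 0
```

(Value illustrative; whether we want 0 vs. some small number is exactly the tradeoff above.)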
Nov 3 2022
Bump - we should revisit this, but perhaps after finishing the cache role name cleanup (text vs text_envoy vs text_haproxy...).
Should've been resolved a while back!
Oct 31 2022
Update - ulsfo is repooled this morning, with all new hardware on the new configuration, and has the "single-backend" mode enabled for both clusters at ulsfo. We'll be keeping an eye on hitrates here, and then trying to follow the same pattern in the upcoming eqsin hardware transition.
Oct 27 2022
Oct 26 2022
Updates! Since this ticket was last active, there's been progress on various fronts with this issue:
Oct 13 2022
Is it possible to fake this out with a bunch of trivially-built empty udebs that are in our repo? Or does it have to come straight from Debian?
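For illustration: an empty udeb is basically just a control file in a package tree, built with `dpkg-deb --build`. Something like this (package name, version, and maintainer entirely made up; whether d-i actually accepts it depends on the repo metadata, not just the file):

```
Package: fake-missing-di-module
Version: 0.0~placeholder1
Architecture: all
Maintainer: Example SRE <noc@example.org>
Section: debian-installer
Priority: optional
Description: empty placeholder udeb to satisfy d-i dependencies
```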
Oct 6 2022
So, we have a need to move on this pretty quickly, as we have 16 new cache hosts in ulsfo pending installs on this, and then 16 more in eqsin right on their heels. In both cases we're facing limited rack space (have to decom-before-install as we go) and we can't store this stuff or put off the hardware swaps for very long. Trying to summarize our easiest path forward based on the above, I see a few things to tackle and some dependencies that kinda put them in this logical order:
Oct 5 2022
Oct 4 2022
Oct 3 2022
I've also found some other breadcrumbs. Runtime buster + 5.10 support is puppetized in modules/profile/manifests/base/linux510.pp. There are instructions about updating installer images for newer kernels at https://wikitech.wikimedia.org/wiki/Updating_netboot_image_with_newer_kernel , but I'm not sure any of that can be followed verbatim. We probably don't want to impact all buster installs, and instead want to create a new "os" named something like buster510 that can be passed as the --os argument to the reimage script.
I see our buster actually has linux-image-5.10.0-0.deb10.17-amd64 available in its repos. It may just be a matter of figuring out how to launch an installer on that kernel, and have it installed as the runtime one as well.