The above took care of it from the Varnish side (which seems easier for the case we care about!). It's been live for a bit over 24h now, and stats reflect that it's doing the right things: no obvious errors caused by it, mw*->cp* network bandwidth increased, cp* CPU% went up a bit, and cp* network output stayed about the same. It has probably also cleared up some of the corner-case errors and/or inefficiencies described in various places in this ticket.
Thu, Mar 22
We've talked about this a bit this week. Basic initial steps of the plan at this point are:
Yeah I do have concerns here. It's going to take some time before I can loop back and explain them, but I just wanted to put the note in now that this is concerning on multiple levels...
Tue, Mar 20
Yes, we've had this on the discussion list for ops Q4 goals (the elimination of the need for ipsec for caches<->kafka-brokers), and Traffic signed up to guarantee the time for it. Goals not final yet, of course.
Some notes from digging around in related things:
- The Varnish docs claim (near the bottom of https://varnish-cache.org/docs/5.1/users-guide/vcl-backends.html#health-checks ) that duplicate probes (e.g. due to VCL reloads) are coalesced into a single probe. This appears to be BS, as we can see the multiple probes running pretty clearly (see the sketch below).
- Setting a backend of an old VCL to an explicit administrative health (e.g. varnishadm backend.set_health e3032122-59c9-4490-8006-8fbc6ab00198.be_cp1065_eqiad_wmnet sick) doesn't stop the probing. The explicit admin health of sick overrides the probe's results, but does not stop the probe from executing.
- There doesn't appear to be a VCL equivalent of the CLI's set_health command.
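For the record, both behaviors above are easy to observe from the CLI. A minimal sketch (the backend pattern here is hypothetical; adjust to taste):

  # every loaded warm VCL keeps its own probe running; duplicates show up
  # as multiple Backend_health lines per interval for the same backend
  varnishadm vcl.list
  varnishlog -g raw -i Backend_health
  # admin override changes the effective health, but the probe keeps firing
  varnishadm backend.set_health '*.be_cp1065_eqiad_wmnet' sick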
Wed, Mar 14
See also the peering information tracked in T186835
Tue, Mar 13
So, recapping this ticket that's been stale for quite a while:
Mon, Mar 12
(I acked those with a ref to this ticket for now, to reduce overall icinga redness)
Fri, Mar 9
Also, depooled for now, since we can't trust the uncorrected memory errors not to cause production issues:
16:07 <+logmsgbot> !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp3034.esams.wmnet
See also T183177 (why aren't we getting runtime icinga alerts when these happen, via EDAC?)
Updated with actual target country lists above. Process and batching of this for actual turn-up work still TODO :)
Oh, that makes sense. Maybe the initial image install just has the v4 address, and RIPE has to configure the v6 during their bringup process?
Thu, Mar 8
(other than DOA cp5006, tracked separately for repair in T187157)
With the last merges above, all the known issues that actually belong here are resolved other than 3 cases from the previous list which are now exported to other more-appropriate tickets in T187157#4035174 , T162684#4035182 , and T179042#4035164 .
We're still missing rancid definitions in puppet's modules/rancid/files/core/router.db, ping @ayounsi
Reminder: after hardware level is fixed and the host is installed, we'll need to uncomment its entry in hieradata/common/cache/upload.yaml before it will successfully puppetize and join the cluster.
Ping monitoring for this anchor merged in with: https://gerrit.wikimedia.org/r/#/c/417267/1/modules/netops/manifests/monitoring.pp
Wed, Mar 7
Tue, Mar 6
@Papaul re-seated the mgmt console cable; it seems to be working now
DHCP works, so interface is fixed, thanks!
Mon, Mar 5
The bottom line is that the value of uri_host is entirely up to the client, and therefore subject to client-side stupidity. It's legal (in all protocol senses) for a client to connect to our public IP address over HTTP or HTTPS (in the latter case, legitimately matching an SSL certificate for e.g. en.wikipedia.org), and then send an HTTP(S) request that looks like...
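As a hypothetical illustration with curl (the Host value is made up): this connects to en.wikipedia.org's IP, completes a TLS handshake that legitimately matches the cert, and then asserts an arbitrary uri_host at the HTTP layer:

  curl -sv https://en.wikipedia.org/ -H 'Host: not-a-wiki.example'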
The shipping company has updated: 05-Mar-2018 18:34:00 SGT Proof of Delivery Rcvd
Sun, Mar 4
Note that successful non-HTTPS requests evading our standard HTTPS redirect code are still possible under some circumstances. The circumstances are:
Fri, Mar 2
https://lwn.net/Articles/747551/ has some interesting discussion on related topics, too.
And just to put the nail in the coffin of the LVS/IPVS-level concerns raised in this ticket: if we were to look at replacing IPVS as the underlying (kernel-level) mechanism for our loadbalancers, nftables would probably not be our target. The most promising avenue for a long-term IPVS replacement that fits our feature-needs / perf / flexibility targets, and might stabilize by the time we're ready for such a transition, is the eBPF-based capabilities of XDP ( http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/introduction.html ).
Thu, Mar 1
Wed, Feb 28
(to clarify: that we haven't needed an LVS-monitoring-ips ferm rule before implies that all the other services behind internal LVS (including those behind cache_misc) just have open ports. I'm wondering why this one is special-er).
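For context, the rule being discussed would just be a small ferm fragment, roughly along these lines (addresses and port are hypothetical placeholders, not our actual ranges):

  @def $LVS_MONITOR_IPS = (10.2.1.1 10.2.2.1);
  chain INPUT {
      proto tcp dport 443 saddr ($LVS_MONITOR_IPS) ACCEPT;
  }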
Some of this conversation is confusing! :) I take it the situation is that it's a public service behind cache_misc, and our internal LVS ranges are used *behind* cache_misc to route to multiple backends with healthchecking.
Tue, Feb 27
The hard part here is mapping out the necessary network ports correctly:
Gave up on these machines!
Fri, Feb 23
I've tested setting the HTTPS_PROXY environment variable before a manual script run from eqsin, causing the request to be proxied through a generic proxy server in eqiad, and this makes the script succeed from eqsin. That pretty much rules out all other possible causes except some form of network address whitelist/blacklist going on somewhere on the zerowiki side.
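Roughly what the test looked like (the proxy hostname and script name here are hypothetical):

  HTTPS_PROXY=http://webproxy.eqiad.wmnet:8080 ./fetch-zero-config.sh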
Merged your patch (thanks). New failure in eqsin is:
Assuming it is a whitelist of the private networks containing prod caches, the new additions to the list for ipv6+ipv4 would be:
https://stackoverflow.com/questions/13578428/duplicate-headers-received-from-server is old, but it discusses this exact issue with commas in filenames and the Content-Disposition header used for attachments. Note the example URLs above contain commas encoded as %2C, which probably points at something similar.
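For illustration, the failure mode with a hypothetical filename: an unquoted comma in the Content-Disposition parameter can be parsed as if two comma-joined header values had been sent, while quoting removes the ambiguity:

  Content-Disposition: attachment; filename=a,b.jpg      <- ambiguous; some clients reject this as duplicate headers
  Content-Disposition: attachment; filename="a,b.jpg"    <- quoted; unambiguous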
Feb 21 2018
Should we do something here? The same crash can exist at remote DCs as well (the frontends would crash if all local backends are depooled). Clearly there should be a depool_threshold sort of behavior here for backend depooling, but I'm not sure which layer we should inject it at. Perhaps confctl? Perhaps the confd VCL go template (/shudder)?
I was the one arguing for cron, on what I think was the faulty assumption that a VCL had to go cold before it could be discarded. However, apparently that's not the case: you can discard a warm VCL, and it will defer the actual discarding until it naturally transitions to cold. Thus, we probably could/should do the discarding as part of the reload operation.
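i.e. the reload wrapper could finish with something like this (the label name is hypothetical):

  # fine even while old_vcl is still warm; the actual discard is
  # deferred until it naturally transitions to cold
  varnishadm vcl.discard old_vcl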
TL;DR - The current solution is a fixed 2s load->use delay. I think we should probably do more here at this point, especially in light of eqsin and other thinking. Probably the probes should be enhanced, and the sleep parameterized to match.
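A sketch of what that would look like in the reload sequence (variable and label names hypothetical):

  varnishadm vcl.load new_vcl /etc/varnish/wikimedia.vcl
  sleep "${VCL_USE_DELAY:-2}"   # currently a fixed 2s; should track probe window/threshold timing
  varnishadm vcl.use new_vcl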
Feb 20 2018
This could potentially be a large contributor to memory pressure issues we run into elsewhere, as well (and to the inconsistencies around them, which may have to do with average reload rates vs restart timings, etc.), depending on how much memory they end up holding.
In light of: https://blog.wikimedia.org/2018/02/16/partnerships-new-approach/ , we're not going to restructure public subnets around this, as that has long-time-horizon implications. We'll deal with these kinds of issues ad-hoc for the remaining lifetime of Zero.
We actually do use the same cert for both, so we don't need the secondary certs bit.
Feb 15 2018
Sounds about right to me. But let's do the other two in T187158 and T187157 as well and maybe get more value out of the time. cp5006 and cp5010 both have "working" management consoles, but one needs a hard power reset to fix up host power-control issues, and the other needs its primary ethernet (to asw1) checked out (possibly an SFP re-seat, or a faulty SFP).
Feb 14 2018
Remaining known stuff, paring down the earlier list: