A certificate warning was reported in https://social.imirhil.fr/@aeris/103126273693383568. The user was still seeing the 2018 cert at 2019-11-12 17:07 UTC even though the sites were serving the 2019 cert at that time; I'm guessing because of some cache on their side.
Possibly it would be better to have a longer transition, where the new cert is served and preferred, but the old one is still valid and served.
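If it's useful for comparing against user reports, this is a quick way to see which cert a given edge is serving at any moment (hostnames here are just illustrative):

    # Show issuer and validity dates of the cert currently served by an edge node
    echo | openssl s_client -connect text-lb.esams.wikimedia.org:443 -servername en.wikipedia.org 2>/dev/null \
      | openssl x509 -noout -issuer -dates

That at least confirms what our side is serving when someone reports seeing a stale cert.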
In live use now.
Thu, Nov 7
Maybe this is closer to a Lua replacement for all of it, although it still has issues!
Reading up on the debug_proxy stuff a bit more... currently hassium/hassaleh are proxies into mwdebug00, use the header to select the destination host, and also have some backwards compatibility for older values. We could potentially skip/eliminate the debug proxy layer and handle this directly as well. The underlying mwdebug hosts actually do have TLS configured already (like the non-debug appservers).
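For reference (not proposing anything new here), the destination selection is just the request header, so something like this exercises it end to end (backend hostname illustrative):

    # Route a request to a specific debug backend via the X-Wikimedia-Debug header
    curl -sv -o /dev/null -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' https://en.wikipedia.org/wiki/Main_Page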
Sorry I missed that you already had a patch! But in any case, we only need commenting from cache::nodes to fix up this case (there's no good reason to e.g. churn it out of conftool or the various iptables rules defined from the other stuff).
Wed, Nov 6
Tue, Nov 5
I think this is actually fairly orthogonal to some of the other improvements. Not sure what current/modern thinking is on this either; it probably needs re-evaluation. My gut feeling is to lean against bothering with this right now.
With anycast recdns deployed at all sites with fallback routing towards the cores (or to the opposite core, as the case may be), I think we're in pretty good shape here at this point. If there are other specific improvements we want to make, they should probably be re-evaluated in current context and considered in smaller-scoped tickets like T171498.
I don't think we'll go the LVS route here.
Yes, this task was long-ago completed. See also https://phabricator.wikimedia.org/phame/post/view/111/wikipedia_goes_100_forward_secret/
We're not going down this road at all. cache::route_table will just go away when all cache backends have converted to ATS in T227432, which doesn't use a tiered setup to reach the origins.
@BBlack: once we deploy the VCL/varnish-kafka changes, we need to change our refine pipeline to read these values. When we deploy those changes, the values will be available in the webrequest table, and after that we will re-do the indexing of the webrequest dataset into Turnilo.
Agreed, let's not go down that road right here (because we have a burning need for this data pronto), but side note to keep in mind: "one day" is really really soon (like, we'll probably be migrating varnishkafka stuff to ATS next quarter).
Probably this is not the best place to talk about this, but next quarter seems really close :) Who is going to replace Varnishkafka? Is there any plan from Traffic or should we (as Analytics) schedule time for it? I know that with fifo-log-demux the job of the new "atskafka" should be relatively easy (read from a socket and push to kafka) but we really need metrics (like we have now) to build monitoring on top of them. If the migration is so close, can we open a task (if there is not one) and start the discussion? :)
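Just to illustrate how small the core plumbing is (socket path, broker, and topic below are all illustrative), the read-from-the-socket-and-produce-to-Kafka part is essentially:

    # Stream newline-delimited log records from the fifo-log-demux socket into a Kafka topic
    socat -u UNIX-CONNECT:/var/run/trafficserver/fifo-log-demux.sock STDOUT \
      | kafkacat -P -b kafka-jumbo1001.eqiad.wmnet:9092 -t webrequest_text

Everything hard is in the metrics, batching, and delivery guarantees, which a pipe like that obviously doesn't give you.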
Mon, Nov 4
Nevermind, I see it in the gerrit comments
Hm, would be OK with me, but likely whatever we choose we'll be stuck with forever. I tend to prefer descriptive names in general, but Joseph might be more concerned with the efficiency.
Patches above look sane? I went ahead and shortened the key names down to the minimum to prevent bloat at these layers. We can always give them better descriptive names when they're pulled back to e.g. Turnilo in queries. V is Version, K is Key Exchange, A is Auth, C is Cipher. Too short?
Fri, Nov 1
https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status probably needs some cleanup (some of the graphs are empty, there's a note there to ignore icinga errors, etc). Also fix missing doc link on the alert?
We probably don't need to send the reused value (it's not that useful for analysis at this level, IMHO), and we don't need to send the full-cipher value either (that's the original string from which some of these fields are extracted, but it doesn't cover the Key-Exchange part fully or the Version at all). All we need is the 4 derived fields: Version, Key-Exchange, Auth, and Cipher. The string "CP-" isn't really descriptive and is just an implementation detail. Also, I'm assuming from how X-Analytics is set up that the format is k1=v1;k2=v2;.... (equal sign rather than colon).
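A minimal sketch of how that ends up looking on the VCL side (not the actual patch; the values are illustrative, and the exact hook follows wherever X-Analytics is already being assembled):

    # Append the four derived TLS fields to X-Analytics in k=v;k=v form
    if (req.http.X-Analytics) {
        set req.http.X-Analytics = req.http.X-Analytics + ";V=1.2;K=X25519;A=ECDSA;C=CHACHA20POLY1305";
    } else {
        set req.http.X-Analytics = "V=1.2;K=X25519;A=ECDSA;C=CHACHA20POLY1305";
    }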
@Nuria - what you're asking for is something like a combined TLS field with separators? e.g. we construct a 4-part semicolon-delimited string like: V=1.2;K=X25519;A=ECDSA;C=CHACHA20POLY1305, and then set that in webrequest as the single field TLS?
Thu, Oct 31
Digging a little deeper on the Net::DNS side of things and the issues with how options parsing in /etc/resolv.conf affects behavior, looking at the actual version of it deployed on db1119 (latest stretch):
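For anyone following along, checking which Net::DNS version a host actually has installed is straightforward (commands illustrative):

    perl -MNet::DNS -e 'print "$Net::DNS::VERSION\n"'
    dpkg -l libnet-dns-perl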
Copying over from IRC: looking at the ferm code itself, a couple of things are notable:
I had already manually updated cr-esams, mr1-esams, and asw2-esams, as appropriate for NTP (DNS should've been already-correct on those), I believe.
Wed, Oct 30
~15m delays should be OK for the GeoIP stuff; it was already being sync'd to various consuming cache and DNS nodes over the ~30-minute splay window of puppet runs without issue.
It sounds like this particular problem is fixed. Was TMH the only main offender? If so, can @BBlack revert the header limit?
Tried again this morning, but the kernel panics happen too fast to make much progress once the agent starts actually using the NIC (I've only ever had one agent run complete successfully before a crash, out of many attempts). The crashes (and the preceding dmesg output) are consistently issues with the card and/or driver for the 10G NIC. I'd say this sounds like our familiar firmware-level issue, but I was able to look at ethtool earlier and it seems to be the same firmware version that is stable on the rest of the new esams cache nodes. Given the history, perhaps it really is some kind of actual system board error (which was first affecting the PCIe NVMe drive, and is now affecting the PCIe NIC)? I'm at a loss on causes, but if it consistently can't make it through a few puppet runs without crashing, something's wrong.
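For reference, the firmware comparison is just the driver info from ethtool, e.g. (interface name illustrative):

    ethtool -i eno1    # reports driver, version, firmware-version, bus-info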
Tue, Oct 29
I've tried imaging, and things mostly work, but I have a hard time keeping it online long enough to get through an initial puppet agent run (or two or three), as the kernel keeps panic-ing somewhere related to the NIC, e.g.
As a batch, these servers are generally complete. Note cp3056 had an early hardware issue that prevented progress, but this is tracked separately in T236497.
Mon, Oct 28
Turns out it was simpler than I thought! Should be done here, re-open if it's still not working.
I'll poke at this today since Arzhel's not here (may take a couple hours, squeezing it around meetings)
Fri, Oct 25
Wed, Oct 23
confirming above - @Papaul is correct. The total set of new esams Linux boxes AFAIK is: 16x caches, 3x LVS, 2x DNS, 1x Bastion, 3x Ganeti.
Tue, Oct 22
Thu, Oct 17
Digicert-2019 is now in live use at the esams edge and we have full normal redundancy (for now) among commercial cert vendors.
It's switched to Rebase-if-necessary now
Per IRC, the meeting was mostly consumed by OKR discussion; this may have been talked about a little, but nobody remembers any new big blocker being raised.
@Vgutierrez may have some ideas about how to tackle these, but it's behind other priorities at present (we could manually switch in the LE certs globally in an OCSP service emergency, if that were necessary before this puppetization work were done). We'll probably wait to tackle this until our TLS termination has finished switching over to our new ATS implementation, since that's close on the horizon and the existing puppetization is nginx-based (T221594 and related).