Confirmed from UnitedLayer email:
Assad Kermanshahi, Sep 20, 21:13 PDT
Wed, Sep 20
Thank you all (and especially @Whatamidoing-WMF for spearheading this) and sorry for not being very responsive!
beta, CI and other WMCS VPS projects are not environments that either TechOps or WMCS operate, and as such we hadn't incorporated them into our plans for the Salt deprecation (which is also why they're not listed in our goals). To be honest, I wasn't even aware of this use of Salt, but even if I had known about it, I'm not sure how we could have reasonably done anything about it other than give you a heads-up, given our unfamiliarity with this environment. Due to the dependency on Trebuchet, this was a quarterly goal that was planned and coordinated with Release-Engineering-Team, so I don't think this was a surprise to you regardless? I'm being a little defensive because I see that you made this a subtask of T164780, tagged it as Goal and Operations etc., so I guess you disagree and/or this may all be a surprise to you after all? If not, then feel free to ignore this whole paragraph :)
Tue, Sep 19
Mon, Sep 18
@chasemp mentioned this odd issue at the meeting today. If there are no (useful?) logs, are there perhaps any hosts that exhibit the non-working behavior, or that can easily be triggered into it? Let me know (here or on IRC) if you reboot and get the broken behavior, and I can attempt to debug or gather more information from the live (broken) system.
All of them? Wasn't the plan to only do it for the few hosts that are important SPOFs? Again, I fear that this gives a false sense of redundancy -- plus, LAGs (and especially multi-chassis or virtual-chassis ones) are not without their own risks.
I think @elukey and @Ottomata have some plans around the librdkafka version that needs to be deployed fleet-wide, since there's an implicit dependency on the Kafka TLS work. Is node-rdkafka's dependency on 0.9.5 specifically, or on >= 0.9? Can we use 0.11.0?
Thu, Sep 14
I think as of today, with the latest compiler run (#7882) plus another hotfix (28111a9), all manifests are compatible with the future parser and we can (and should!) migrate all hosts to the future environment, plus CI and the compiler.
@jcrespo, fully agreed that alerts should be actionable and I don't particularly disagree with your alert definitions. This task exists precisely because a long-running forgotten screen caused a real, user-facing outage (we discussed it at an ops meeting at the time).
Wed, Sep 13
I pushed and merged a bunch of changes under Gerrit's topic:future-parser today. I also switched a couple of other patchsets to that topic, for referencing them easily. For the record, @ema used topic:varnish-future-parser for the Varnish work, but all of that has been merged.
Fri, Sep 8
Wed, Sep 6
I agree -- this doesn't look very loaded. That said, investigating whether our check intervals make sense (in either direction) is still worthwhile. @herron, is your investigation (the one that resulted in those three subtasks above) done?
Am I right to understand that the current plan is 2 VMs? If so, yeah, that sounds absolutely fine :)
I know a bunch of work happened during the Wikimania hackathon, but what's the status of this?
Mon, Sep 4
@mark assigned asset tag WMF4203 to this device. The image has also been generated (for AS43821) and can be found on install1002.
Wed, Aug 30
Also see T47827, T47828, T47829 and T61142. This task is supposed to be for the smarthost, which sounds like a good first step. I'd also recommend keeping separate instances for inbound and outbound email, for configuration simplicity (something we really should do for production as well).
Indeed! Note that ToolForge already has something like that for tool authors, doing LDAP calls etc. if I recall correctly, so perhaps these two efforts could complement each other or even be coalesced. Let's split into separate relays first; then all kinds of possibilities exist for how to route WMCS emails :)
This is a WMCS task, but since this use case is currently supported by the production mailservers and that's a long-standing problem (and risk) for us, perhaps it's still worth it for prod ops to spend the time setting it up. @herron, is that something you could help with?
Honestly... I'm not exactly sure what you're proposing :) Is there a design document or something that describes the architecture of the system you're thinking of implementing?
For the history side of it :), mx1002/mx2002 never existed; it was just me hoping to eventually get around to building additional MXes (and possibly splitting roles, e.g. inbound and outbound), and since adding SANs later costs money, I added them there to be on the safe side. As for mail.wikimedia.org... that was just a made-up subject, to avoid picking one out of the four hostnames/SANs as the subject.
JFTR, since I didn't see it mentioned either here or in T142807: how imminent is that decomm? Days/weeks/months?
Tue, Aug 29
Is there any progress on this not captured here? I saw that on the recent 5.0 announcement, someone asked about the timeline for Electron support, only to get a response that there isn't one.
Mon, Aug 28
Fri, Aug 25
1 ping is going to be too error-prone though :/ A single packet may be dropped for whatever reason, on either side or in transit. Especially for cross-DC checks, this isn't too uncommon. We monitor levels of packet loss with smokeping, but we wouldn't want a large number of random hosts alerting when such events happen.
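To illustrate, here's a minimal sketch (hostname and thresholds are made up for illustration) of a check that alerts on a loss percentage over several probes, rather than on a single dropped packet:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: alert on a packet-loss percentage over several
probes, instead of on a single dropped ping. Hostname and thresholds
are made up for illustration."""
import re
import subprocess
import sys

HOST = "host1001.example.org"  # hypothetical target
PROBES = 10                    # several probes instead of one
WARN, CRIT = 20, 50            # % loss thresholds, made up

out = subprocess.run(
    ["ping", "-c", str(PROBES), "-W", "1", HOST],
    capture_output=True, text=True,
).stdout
m = re.search(r"([\d.]+)% packet loss", out)
loss = float(m.group(1)) if m else 100.0

# Nagios-style exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL
if loss >= CRIT:
    print(f"CRITICAL: {loss}% packet loss to {HOST}")
    sys.exit(2)
if loss >= WARN:
    print(f"WARNING: {loss}% packet loss to {HOST}")
    sys.exit(1)
print(f"OK: {loss}% packet loss to {HOST}")
sys.exit(0)
```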
Indeed, 1 minute may be a bit excessive. I'm not sure of the point of doing 3 checks spaced 1 minute apart before alerting either -- that feels useless, unless I'm missing something.
It really depends on the server. For some of them (e.g. databases, and especially masters, cc @jcrespo @Marostegui) it's probably best to know as soon as possible, in order to depool, fix or take some other action. Furthermore, 4h-8h could cost us a day if, say, it falls at the beginning or middle of Chris/Papaul's work day.
Sounds good to me, feel free to go ahead :)
Thu, Aug 24
The reported (by dmidecode etc.) serial number for the system changed from MXQ62300TQ to HZ6BNV8315. I changed Racktables to reflect that. I'm not sure what our policy is supposed to be -- I know that some BIOSes allow you to override the reported serial number, but I don't recall us ever doing that. Maybe @RobH knows more?
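For reference, a quick sketch (needs root, assumes dmidecode is installed; the expected value is just the one recorded above) of comparing the DMI-reported serial against our records:

```python
#!/usr/bin/env python3
"""Quick sketch (needs root, assumes dmidecode is installed): compare
the DMI-reported serial against what our records say."""
import subprocess

EXPECTED = "HZ6BNV8315"  # the serial now recorded in Racktables

reported = subprocess.run(
    ["dmidecode", "-s", "system-serial-number"],
    capture_output=True, text=True, check=True,
).stdout.strip()

if reported == EXPECTED:
    print(f"OK: serial {reported} matches our records")
else:
    print(f"MISMATCH: DMI reports {reported}, records say {EXPECTED}")
```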
Aug 23 2017
@EBernhardson this is all incredibly impressive, kudos!
Aug 22 2017
I was looking for PDU power usage metrics. Since we don't have a Grafana dashboard yet, I tried to query Graphite manually, with e.g. this query: librenms.ps*eqiad*.sensor.sensor.current.*.*.sensor (actually, what we really need is the sum() of that, but then it's less obvious to see what's happening).
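Roughly what I mean, as a sketch against Graphite's render API (the Graphite hostname here is a placeholder):

```python
#!/usr/bin/env python3
"""Sketch of the sum() idea via Graphite's render API; the Graphite
hostname is a placeholder."""
import requests

GRAPHITE = "https://graphite.example.org"  # placeholder
TARGET = "sumSeries(librenms.ps*eqiad*.sensor.sensor.current.*.*.sensor)"

resp = requests.get(
    f"{GRAPHITE}/render",
    params={"target": TARGET, "from": "-1h", "format": "json"},
)
resp.raise_for_status()
for series in resp.json():
    # each datapoint is a [value, timestamp] pair; values can be null
    values = [v for v, _ in series["datapoints"] if v is not None]
    if values:
        print(f"{series['target']}: latest={values[-1]}")
```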
Aug 21 2017
I don't really mind who owns the service (Services or Readers), as long as it's owned by someone :)
Setting to stalled until we decide what to actually do with the internal CA, as we're considering dropping it entirely in favour of other options.
Fixed for our purposes; we can follow up on upstream's/Debian's bug reports for the long-term fixes.
Aug 1 2017
(T119654 is a restricted task, I have no access to it)
Jul 27 2017
This has been open for a while :) What new kernel features do we need, and on which systems? Are these a priority now, or can they wait until we upgrade that particular set of systems to stretch?
stretch has 1.28, so perhaps it's just simpler to upgrade the LVS systems to stretch, which we'll need to do anyway at some point? We're already running the stretch kernel on them, and they don't have much of a userspace apart from Pybal and its dependencies.
So I thought about it a little bit, and I think we can resolve this after all. I don't know of any case where temperature is an issue that the current IPMI check doesn't catch. Writing yet another thermal check is more work for dubious gains at this point -- and it also means that we'd be checking the same values twice, from two different places, and getting unnecessarily spammed on failure.
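For reference, these are the same readings the IPMI check already consumes; a trivial sketch (assumes ipmitool is installed and root privileges) of listing them:

```python
#!/usr/bin/env python3
"""Trivial sketch (assumes ipmitool and root): list the temperature
readings the existing IPMI check already consumes."""
import subprocess

# "sdr type Temperature" lists every temperature sensor the BMC exposes
print(subprocess.run(
    ["ipmitool", "sdr", "type", "Temperature"],
    capture_output=True, text=True, check=True,
).stdout)
```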
Jul 26 2017
Jul 25 2017
@Papaul, this needs to be fixed in the server labels and Racktables.
Yeah, I thought about it some more and I concur. 2.15's "SSL" is a joke, but in our case it doesn't matter much, as pretty much everything we send over NRPE is public anyway.
I think this is a good idea overall and that we should do it. A few points:
- I'm a little worried that this will sweep issues like the ones you mentioned under the carpet. The cases where services are especially latency/failure-sensitive are exactly the issues we should be fixing. With a local recursor, I'm worried we'll just make them manifest even less often and in even more corner cases :/
- For the other case of services flooding our recursors: we should probably be gathering statistics from the local recursors and monitoring them in a similar fashion to the "central" recursors, right?
- The glibc resolver issues with multiple recursors/timeouts are something we can't avoid addressing, I think :( The local recursor can fail (and will regularly fail, e.g. when restarting it), so the system needs to operate even without it...
- I think designing our DNS data in a way where we never need to flush caches is a bit too optimistic, but the proposed solution of just using cumin for this use case sounds like a perfect fit (see the sketch after this list). I wonder if we could get away with flushing the whole cache altogether, rather than flushing specific records, and thus potentially put systemd-resolved back on the table?
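A minimal sketch of that cumin idea (the host query and record name are placeholders, and it assumes the PowerDNS Recursor's rec_control utility on the target hosts):

```python
#!/usr/bin/env python3
"""Minimal sketch of the cumin idea: wipe a single record from the cache
of every local recursor. Host query and record are placeholders, and it
assumes the PowerDNS Recursor's rec_control utility."""
import subprocess

RECORD = "some-record.example.org."  # placeholder record to flush
HOSTS = "recursor*"                  # placeholder cumin host query

# cumin fans the command out to every matching host
subprocess.run(
    ["cumin", HOSTS, f"rec_control wipe-cache {RECORD}"],
    check=True,
)
```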
Jul 24 2017
This is basically an artifact of the CT logs failing to respond every now and then, which certspotter complains about. It doesn't happen often.