We've fixed so many issues over the past few months that I can't even count them :) Thanks all. I did another sweep today and found these that need fixing:
What's the status of this?
Tue, Nov 14
Do we really need all this for an endpoint marked as "experimental"?
Wed, Nov 8
RPKI is all done as far as I know. @mark said he'll create his account later, if at all. I think we can resolve.
Tue, Nov 7
I wouldn't recommend reviving MgOpen, for basically the reasons I described in #819026. TL;DR is that it had serious unresolved issues to begin with (hinting, missing Euro sign) and has been abandoned upstream for years. Meanwhile, there are plenty of good and free (as in OFL) fonts nowadays with Greek glyphs, including DejaVu, Liberation, the Google fonts (Droid, Roboto, CrOS), and the Adobe fonts (Source Sans/Serif).
Mon, Nov 6
Thu, Oct 26
Image has been downloaded to the install* servers.
Tue, Oct 24
Going one step further than the original assumptions:
there could be a temporary state in which /home isn't mounted yet, a user logs in, /home gets created, and then something wacky happens and the local directory ends up hidden under the NFS mount
The pam_nologin behavior you're reporting sounds very odd indeed. If that's actually the case, it would be CVE-worthy! It's an old, popular and well-audited piece of code though, so it'd surprise me if the root cause lies with pam_nologin rather than somewhere in our configuration. It's not impossible of course, bugs and CVEs do happen :)
Mon, Oct 23
That makes a lot of sense to me. Thanks for all the background work to support this :)
pmacct 1.7.0-1 (with GeoIP2 support too!) was uploaded to sid yesterday. This should be as easy as a backport-and-install now.
We get occasional rare failures depending on the availability of the CT log servers. I don't see a way around this unless we make our cronjobs quite a bit more sophisticated (e.g. ignore transient errors, but complain when we get more than X errors over a window of N hours).
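To make the "ignore transient errors" idea concrete, here's a minimal Python sketch of a wrapper such a cronjob could run under. The state-file path and thresholds are made up for illustration; this isn't an existing tool of ours.

```python
#!/usr/bin/env python3
"""Sketch: run a command, stay quiet on transient failures, and only
exit non-zero (so cron mails us) after sustained failures."""
import json
import subprocess
import sys
import time

STATE_FILE = "/var/tmp/ct-log-check.state"  # hypothetical path
MAX_FAILURES = 5           # "X errors"
WINDOW_SECONDS = 4 * 3600  # "N hours"

def main(argv):
    now = time.time()
    try:
        with open(STATE_FILE) as f:
            failures = json.load(f)
    except (OSError, ValueError):
        failures = []

    if subprocess.run(argv[1:]).returncode == 0:
        failures = []  # a success clears the streak
    else:
        # record this failure, forget the ones outside the window
        failures = [t for t in failures + [now] if now - t < WINDOW_SECONDS]

    with open(STATE_FILE, "w") as f:
        json.dump(failures, f)

    if len(failures) >= MAX_FAILURES:
        print(f"{len(failures)} failures in the last "
              f"{WINDOW_SECONDS // 3600}h", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Usage would be something like `transient-wrapper.py /usr/local/bin/ct-log-check` in the crontab (both names hypothetical).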
Fri, Oct 20
Sounds fine to me. Before we resolve this task, let's not forget that we'll need to clean up our RIPE objects by removing the old route(6) ones.
OK, APNIC fixed the "57 duplicate objects" situation, so I proceeded with the rest, specifically:
- Updated our objects for the new office address
- Updated to use the right mailbox per object and type (instead of abuse@ everywhere)
- Created route objects for the /24 and /48 with origin: AS14907
- Created domain objects for in-addr.arpa/ip6.arpa (reverse delegation)
- Added the zones (with just SOA) to operations/dns, and verified the delegation works
Thu, Oct 19
IIRC, @mark said that the rack in question doesn't have a secondary PDU. New PDUs for esams are in the budget this year, so I guess this is planned?
For switches/routers we have alerts on Juniper's system/chassis alarms, which we know trip when they lose PDU redundancy, or on any other kind of error. I don't think our disk shelves are connected to the network at all, so I don't see how we'd be able to monitor them. Resolving for now; if there is additional work to be done, feel free to reopen :)
@Cmjohnson, both analytics1036 and analytics1037 are still showing PSU redundancy errors. analytics1035 is fine now, though.
Yup, that's fine, as is creating the zones in the DNS and puppet repositories (but not doing the reverse delegation).
We now have an APNIC account, and today we were assigned this IP space:
Oct 18 2017
Ah! Yes, that all makes sense now, thanks!
nutcracker ships /usr/lib/tmpfiles.d/nutcracker.conf, which should create the file under (/var)/run. This has been working fine in production for months now. I'm not sure why it doesn't work in your case; could you troubleshoot a little more and provide more information?
Ideally uprightdiff would detect that at runtime and adjust as necessary. That'd be a little difficult with the plain Makefile we have; have you considered switching to autoconf/automake, or something fancier/newer than that (Meson, CMake, etc.)?
Oct 17 2017
Oct 16 2017
What's the status and what's left here? @herron?
Oct 12 2017
In production for about a week now.
This is all installed and in production for about a week now.
Yeah that's temporary and fine. The test in general is a bit flawed in that way, but we can ignore that for this particular host.
Oct 11 2017
Oct 9 2017
I saw some of these commits fly by. These are obviously well agreed in principle but I think it's important to not have regressions here -- if we remove a service from being monitored by Ganglia, we should have the equivalent metrics in Prometheus and Graphite, and these need to show up in a suitable Grafana dashboard. Has this been taken into account?
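If it helps with that audit, one low-tech way might be to dump the metric names a Prometheus server knows about and eyeball them against what Ganglia used to collect; a rough Python sketch, with a placeholder server URL:

```python
#!/usr/bin/env python3
"""List metric names from a Prometheus server via its HTTP API."""
import json
import urllib.request

PROM = "http://prometheus.example.org:9090"  # placeholder URL

# /api/v1/label/__name__/values returns every known metric name
with urllib.request.urlopen(f"{PROM}/api/v1/label/__name__/values") as r:
    names = json.load(r)["data"]

for name in sorted(names):
    print(name)
```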
Oct 4 2017
We have at least one other use: the Ganeti key (cf. modules/role/manifests/ganeti.pp). This was for legacy reasons -- Ganeti didn't support RSA, but I think it does now, at least in the version in stretch (also available in jessie-backports).
Oct 3 2017
Would it make sense to lower the interval for all role::mariadb::core, irrespective of mysql_role, to make this a simpler target? We can take the extra hit of more frequent checks for all of those hosts, I think.
What is the high-level/human description of the policy we want to enforce for database servers? (e.g. "HP servers in the active datacenter need the check to run every 5 minutes, the rest every 10 minutes") I ask because I'm wondering if/how any of this can be fixed with tools like role/profile classes and facts, without hardcoding specific hosts/hiera keys.
Oct 2 2017
Taking into account the lack of funding for appserver work, as well as the end-of-year fundraising and Christmas freezes, the (tentative!) timeline I proposed is:
- Upgrade the appserver fleet (w/ HHVM) to Debian stretch, including the ICU migration, in Q3 FY17-18 (circa February/March 2018)
- Begin PHP7 planning and initial implementation work in Q4 FY17-18, e.g. including a few test servers
- Fund the work in FY18-19 and complete it early in the year (Q1 or Q2 at the latest)
No, no per-process statistics that I know of :(
install1002 and install2002 should be identical, so why is one alerting and the other isn't? I think there's an rsync from one to the other to keep them in sync; perhaps we aren't passing --delete to it?
I think the proposal is to bump check interval from 1 minute to 5 minutes, right? Any other actionables here?
I'm not sure if a different implementation (like fping) is going to make a difference. check_ping is "slow" because it sends multiple packets at 1-second intervals -- that will always consume real time (but not CPU time) irrespective of implementation.
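A toy illustration of that point (not how check_ping is actually implemented): five probes spaced one second apart take ~5s of wall-clock time while using essentially no CPU, no matter the language or tool.

```python
import time

PACKETS = 5
INTERVAL = 1.0  # seconds between probes, as ping-style checks use

start_wall = time.monotonic()
start_cpu = time.process_time()
for _ in range(PACKETS):
    # an actual probe send/receive would go here; the pacing dominates
    time.sleep(INTERVAL)

print(f"wall: {time.monotonic() - start_wall:.1f}s, "
      f"cpu: {time.process_time() - start_cpu:.3f}s")
# prints roughly "wall: 5.0s, cpu: 0.000s"
```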
Given that we're not under load/pressure, and that increasing the interval has the potential of hiding issues during troubleshooting, I'd be inclined to leave it unchanged, at 10 minutes. Thoughts/disagreements?
Note that the storage shelves are only there temporarily, for 2-3 weeks. I'll leave the decision on whether to balance power in the meantime to you guys though, you know best :)
Sep 26 2017
@mark confirmed that S/N: TA3717090152 and S/N: TA3717090331 are the new QFX5100s that were delivered at esams a few weeks ago (WMF4201/asw-oe16-esams and WMF4202/asw-oe15-esams). I've updated Racktables to reflect that, although we're still unsure which one is which, so I assigned the S/Ns at random.
That isn't needed. We import the puppet CA into the host's certificate store in base, so it should be available as /etc/ssl/certs/Puppet_Internal_CA.pem. Instead of using that directly though, the preferred, future-proof way to support this would be to just use the (c_rehashed) /etc/ssl/certs directory as the CA path (in OpenSSL applications), or /etc/ssl/certs/ca-certificates.crt (in GnuTLS/NSS applications).
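As an illustration, here's what the capath variant looks like from Python (whose ssl module wraps OpenSSL); passing cafile=None mirrors passing NULL for the file argument of C's SSL_CTX_load_verify_locations():

```python
import ssl

# Trust the c_rehashed system CA directory, which includes the
# imported puppet CA alongside the usual public CAs.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_verify_locations(cafile=None, capath="/etc/ssl/certs")

# GnuTLS/NSS-style consumers would instead take the bundle file:
#   /etc/ssl/certs/ca-certificates.crt
```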
I may be missing something, but why do we need client certificates? Just setting the CA path to /etc/ssl/certs and the rest of the arguments to NULL should suffice, I think?
Sep 25 2017
Status update: back in April, APNIC had requested documentation supporting that we have, or are about to have, a presence in the Asia-Pacific region. We didn't have anything besides our internal documents to support that, so the request has been stalled ever since.
Sep 22 2017
Sep 21 2017
Confirmed from UnitedLayer email:
Assad Kermanshahi, Sep 20, 21:13 PDT
Sep 20 2017
Thank you all (and especially @Whatamidoing-WMF for spearheading this) and sorry for not being very responsive!
beta, CI and the other WMCS VPS projects are not environments that either TechOps or WMCS operate, and as such we hadn't incorporated them into our plans for the Salt deprecation (that's also why this isn't listed in our goals). To be honest, I wasn't even aware of this use of Salt, but even if I had known about it, I'm not sure how we could have reasonably done anything about it other than give you a heads-up, given our unfamiliarity with this environment. Due to the dependency on Trebuchet, this was a quarterly goal that was planned and coordinated with Release-Engineering-Team, so I don't think this was a surprise to you regardless? I'm being a little defensive because I see that you made this a subtask of T164780, tagged it as Goal and Operations etc., so I guess you disagree and/or this may all be a surprise to you after all? If not, feel free to ignore this whole paragraph :)
Sep 19 2017
Sep 18 2017
@chasemp mentioned this odd issue at the meeting today. If there are no (useful?) logs, are there perhaps any hosts that exhibit the non-working behavior, or that can be easily triggered into it? Let me know (here or on IRC) if you reboot and get the broken behavior, and I can attempt to debug or gather more information from the live (broken) system.
All of them? Wasn't the plan to only do it for the few hosts that are important SPOFs? Again, I fear that this gives a false sense of redundancy -- plus, LAGs (and especially multi-chassis or virtual-chassis ones) are not without their own risks.
I think @elukey and @Ottomata have some plans around the librdkafka version that needs to be deployed fleet-wide, since there's an implicit dependency on the Kafka TLS work. Is node-rdkafka's dependency on 0.9.5 specifically, or on >= 0.9? Can we use 0.11.0?
Sep 14 2017
I think as of today, with the latest compiler run (#7882) plus another hotfix (28111a9), all manifests are compatible with the future parser and we can (and should!) migrate all hosts to the future environment, plus CI and the compiler.
@jcrespo, fully agreed that alerts should be actionable and I don't particularly disagree with your alert definitions. This task exists precisely because a long-running forgotten screen caused a real, user-facing outage (we discussed it at an ops meeting at the time).
Sep 13 2017
I pushed and merged a bunch of changes under Gerrit's topic:future-parser today. I also switched a couple of other patchsets to that topic, to make them easy to reference. For the record, @ema used topic:varnish-future-parser for the Varnish work, but all of that has been merged.