Wed, Dec 12
Sure, this sounds like a sensible candidate for next Q's initial evaluation/migration of buster (we'll investigate/fix up the base layer and also migrate a few systems)
Fri, Dec 7
@colewhite Could you follow up with a patch to absent the collector from puppet? Other than that, this seems fully resolved; I couldn't find any dashboard where the former metrics were in use.
During the initial installation, d-i runs the clock-setup component, which synchronises the system clock using rdate. The time server is obtained via a DNS alias (for codfw that's ntp.codfw.wikimedia.org) and that alias was stale: it still pointed to acamar, which has since been replaced by dns1001.wikimedia.org. I've pushed a DNS change to fix this; all further installations should have the correct time.
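A stale alias like this could be spotted with a small check along the following lines. This is only a minimal sketch, not something that exists in our tooling: the function names are made up for illustration, it only looks at IPv4 addresses, and it assumes that "stale" means the alias shares no address with the host it is supposed to point to.

```python
import socket


def resolve_ipv4(name, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses a name currently resolves to."""
    # gethostbyname_ex returns (canonical_name, alias_list, ip_address_list)
    _canonical, _aliases, addresses = resolver(name)
    return set(addresses)


def alias_is_stale(alias, expected_target, resolver=socket.gethostbyname_ex):
    """True if the alias shares no IPv4 address with the expected target.

    `resolver` is injectable (hypothetical parameter) so the check can be
    exercised without touching real DNS.
    """
    alias_addrs = resolve_ipv4(alias, resolver)
    target_addrs = resolve_ipv4(expected_target, resolver)
    return alias_addrs.isdisjoint(target_addrs)


# Example call against live DNS, using the names from the comment above:
#   alias_is_stale("ntp.codfw.wikimedia.org", "dns1001.wikimedia.org")
```

Comparing resolved addresses rather than CNAME targets keeps the check independent of whether the alias is implemented as a CNAME or as an A record.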
elastic2050 now has a role assigned and thus the current time. Is there a server among the 27 you installed which is still in the pristine condition right after the base OS install?
See IRC/internal channel, during the setup of logstash2001.codfw.wmnet it accidentally reused the 10.192.0.104 A record.
Thu, Dec 6
Wed, Dec 5
Tue, Dec 4
Mon, Dec 3
As for spotting remaining Diamond metrics, https://phabricator.wikimedia.org/P7680 contains a Paste with remaining Diamond metric references (based on a script by Timo), the patch which lists "Matched" refers to the dashboard in question. This is what I used to create the task.
Fri, Nov 30
JFTR, it's better to use the wmf-decommission-host script, as it also removes the debmonitor host entry (I fixed that manually).
Thu, Nov 29
Wed, Nov 28
exfat-fuse itself is free software (GPL) and part of Debian main. Debian's approach on patents is written up at https://www.debian.org/reports/patent-faq (TLDR; unless patents are actively enforced, they're ignored. Debian has been shipping ffmpeg which implements patent-encumbered algorithms for a long time as well).
Tue, Nov 27
Fri, Nov 23
Status update: Faidon created a patch for Ferm to address this at https://github.com/MaxKellermann/ferm/pull/41. It's not yet reviewed/merged upstream and too intrusive to backport. We'll revisit when a new upstream release with the patches is available (and ideally it lands in buster as well).
The Icinga servers in production are now running 0.9-1~bpo9+1, but the cron job still needs to be reinstated.
Tue, Nov 20
See the task description, "For labvirt/cloudvirt I'll create a separate ticket as more steps are necessary." This needs a backport of SSBD support for the qemu version cloudvirt is using and some tests for the level of L1TF mitigation required.
Mon, Nov 19
Fri, Nov 16
Thu, Nov 15
IOW: Enabling profile::base::firewall for the role/hosts
Enabling ferm was blocked on some of the labstore hosts as it was difficult to enable it on an already running system. For the new cloudstore* hosts we should enable it from the start. All the ferm rules/services should be available at this point.
Nov 14 2018
@Krinkle You last edited https://grafana.wikimedia.org/dashboard/db/cluster-board-graphite back in October. That dashboard is increasingly obsolete as more and more hosts in production drop Diamond, and https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown should serve as a replacement. Can the old dashboard be removed?
Icinga is flagging broken memory on 1053, simply leaving a note here as that host is up for decom anyway.
That point release has happened and I upgraded our netinst images earlier today, so this should be fine to re-install now.
Leszek, you're now using the same SSH key in Cloud VPS as in the production cluster. This is a security risk, as WMCS/VPS allows SSH agent forwarding and a malicious privileged user in WMCS can connect to your forwarded agent socket and connect to production on your behalf.
Nov 13 2018
It's fine to upgrade the kernel. I installed what was recent when I created the task, and those versions are sufficient to fix L1TF, but it's good to move to a newer kernel for additional bugfixes in any case:
While nagios-nrpe-plugin has been upgraded on the Icinga hosts, the nagios-nrpe source package also builds the nagios-nrpe-server binary, which should also be upgraded for consistency: https://debmonitor.wikimedia.org/packages/nagios-nrpe-server
Nov 12 2018
I think there are also some inconsistencies in the application of the apt pin. e.g. on an-coord1001, stat100 there's _no_ upgrade candidate, while there's one for stat1007.
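To compare hosts more systematically, something like the following could be run over the `apt-cache policy <package>` output collected from each host. This is a rough sketch under assumptions: the function names are hypothetical, the output is assumed to have already been gathered per host, and "upgrade candidate" is approximated as "Candidate differs from Installed" (which would also flag a pinned downgrade).

```python
def parse_policy(output):
    """Extract the Installed and Candidate versions from `apt-cache policy` output."""
    fields = {}
    for line in output.splitlines():
        stripped = line.strip()
        for key in ("Installed", "Candidate"):
            prefix = key + ":"
            if stripped.startswith(prefix):
                fields[key] = stripped[len(prefix):].strip()
    return fields


def has_pending_candidate(output):
    """True if the package is installed and apt would pick a different version."""
    fields = parse_policy(output)
    installed = fields.get("Installed")
    candidate = fields.get("Candidate")
    return (
        bool(installed)
        and installed != "(none)"
        and candidate not in (None, "(none)", installed)
    )


def group_hosts(policy_by_host):
    """Group host names by whether they report a pending candidate."""
    groups = {True: [], False: []}
    for host, output in sorted(policy_by_host.items()):
        groups[has_pending_candidate(output)].append(host)
    return groups
```

Hosts that end up in different groups despite carrying the same pin would be the ones worth a closer look.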
@Cmjohnson: Per the procurement task, thermal paste is now available?
Indeed, thanks! I've tested this on mw2151 with the ferm upstream change applied on top of our current package and a backport of libnet-dns-perl 1.17 plus upstream revision r1717. Going to do a few more tests with libnet-dns-perl reverse dependencies and then I'll roll this out fleet-wide.
@hashar: Search is implemented in general: If you click any of "Hosts", "Kernels", "Packages" or "Source Packages" you can search in there.
Where did you search specifically? Does e.g. https://debmonitor.wikimedia.org/packages/jenkins list the installed packages for you?
This was closed prematurely.
I've uploaded 2.138.3 for jessie and stretch. The component names are correct, we've only introduced the concept of dedicated components for thirdparty with stretch. So for jessie it went into "thirdparty" and for stretch into "thirdparty/ci".
Nov 9 2018
Server is depooled for now
I've added you to pwstore, please see https://office.wikimedia.org/wiki/Pwstore for some docs. If you run into any issues, ping me on IRC :-)