Fri, Aug 9
Our admin module is in serious need of a revamp, and I don't trust it to properly handle a rename. Hence, I'd suggest you handle it in two steps: first absent hpham, then re-add phamhi in a subsequent step.
Thu, Aug 8
Status: php7.2 currently fails to build on boron due to some build-time hostname check failing there; I still need to get to the bottom of that.
Regarding multi-dc, we have four options I know of:
- Or: push back this problem and migrate from tungsten to webperf1002 first.
- No standby/failover, no backup.
- performance.wikimedia.org/xhgui will remain a SPOF.
Wed, Aug 7
The update has been accepted by the Debian stable release managers and was uploaded: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=932175#24, so the 9.10 point release for Stretch will contain the updated package.
It supports single-instance Cassandra clusters as well (for maps), so all it should take is adding "aqs" to the list of clusters.
@jbond added that a few days ago in https://gerrit.wikimedia.org/r/#/c/operations/cookbooks/+/528133/ :-)
This is blocking the removal of tungsten; what blockers/work remain to be done?
We now have the main pool counters running on Buster using the stock Debian package of poolcounter (poolcounter1004, poolcounter1005, poolcounter2003, poolcounter2004); the old Jessie instances have been removed.
The removal in Debmonitor has a similar race to the PuppetDB removal: I seem to be really lucky, hitting two different races in two consecutive decom runs :-)
There's more: next I ran the cookbook for a host on which the dry-run mode had not previously been used (to rule out that the incomplete dry-run skews the effective run):
(Started at 2019-08-07 08:37:09,859)
After running the deactivate step a second time, poolcounter1003 got correctly removed. Looking at the PuppetDB logs, there might be some kind of race in PuppetDB.
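If that's the case, a defensive retry around the deactivate step should work around it until the root cause is found. A minimal sketch, assuming hypothetical `deactivate`/`is_removed` callables rather than the actual cookbook API:

```python
import time

def deactivate_with_retry(host, deactivate, is_removed, attempts=3, delay=30):
    """Re-run the PuppetDB deactivate step until the host is really gone.

    `deactivate` and `is_removed` are placeholders for whatever the
    cookbook actually uses; they are not the real API.
    """
    for _ in range(attempts):
        deactivate(host)
        time.sleep(delay)  # give PuppetDB time to process the command queue
        if is_removed(host):
            return
    raise RuntimeError(f"{host} still present in PuppetDB after {attempts} attempts")
```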
Tue, Aug 6
From my PoV yes, I've used this multiple times successfully to create Ganeti instances; all further enhancements can be done via separate patches/tasks.
The nobarrier option was never supported in the Debian installer; partman-xfs only supports the following options, and the last change to that file was 12 years ago :-)
This is complete.
This can wait until HHVM is undeployed; removing myself for now.
This is completed and all services not requiring writes have been switched over.
Mon, Aug 5
Please comment, and if it's ready to start the decom process, check off the boxes and assign to me for follow-up. Thanks in advance!
Toolforge/Toollabs also uses tmpreaper (but not the puppetised version with the tmpreaper Puppet class). I'm adding @Andrew and @aborrero for comments whether we should keep it open for this or whether it's not worth tracking there.
The patch seems sane, but I'm wondering whether we actually need to pursue this further? tmpreaper is dead upstream (the Debian maintainer keeps it alive a little for security fixes, but the origin of the codebase is a 20-year-old tmpwatch RPM from Red Hat) and has significant bit rot on modern systems (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=881725). Notably, we only use it on app servers; it seems to have been added back in 2015 to address core dumps from HHVM clogging up /tmp.
Thu, Jul 25
The failed install might be due to https://phabricator.wikimedia.org/T222960#5327461 ?
@herron: You've added her to the wrong group; staff members need to be members of cn=wmf. cn=nda is for people who have access to PII-relevant data but are not staff members of the Foundation (i.e. community members or staff of Wikimedia Deutschland).
Wed, Jul 24
Is this limited to an-tool1006 or also other hosts?
Is this limited to the HDFS command or are other commands also affected? Do basic operations like klist work as expected?
@herron: If you add an account which does not have shell access to the production cluster to a PII-relevant LDAP group, it still needs to be added to modules/admin/data/data.yaml.
Tue, Jul 23
Duplicate of T224572
The following services have been converted to use the read-only replicas:
Fri, Jul 19
This server is still on Jessie; the best option might be to simply reimage it as Stretch and re-bootstrap?
Jul 18 2019
We could try rebooting the Thumbor hosts into the kernel version with the SACK fixes; they are currently running with SACK disabled.
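For reference, whether SACK is currently enabled on a given host can be checked via the standard net.ipv4.tcp_sack sysctl; a minimal sketch, assuming Linux and the stock /proc layout:

```python
from pathlib import Path

def sack_enabled():
    # net.ipv4.tcp_sack: "1" means SACK is enabled, "0" means disabled
    return Path("/proc/sys/net/ipv4/tcp_sack").read_text().strip() == "1"

print("SACK enabled" if sack_enabled() else "SACK disabled")
```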
There is thus a possibility for a package to fail to upgrade but be listed as having been upgraded.
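Given that, it seems safer to verify the installed version directly via dpkg-query rather than trusting the reported upgrade status; a hedged sketch:

```python
import subprocess

def installed_version(package):
    """Return the version dpkg actually has installed, or None."""
    result = subprocess.run(
        ["dpkg-query", "-W", "-f", "${Version}", package],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None  # package unknown or not installed
    return result.stdout.strip() or None

# Compare against the version the upgrade claimed to have installed, e.g.:
# assert installed_version("openssh-server") == expected_version
```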
Jul 17 2019
Also followed up on the codfw task, but adding here for completeness as well: This looks good to me!
Ack, this looks good to me!
Graphoid is based on NodeJS, so it should be migrated to Node 10 (and thus Stretch) either this or next quarter, see T210704.
On the Debian packaging level there are also no reverse dependencies on php-gd or php7.2-gd.
Jul 16 2019
Packages have been synced to thirdparty/ci for stretch-wikimedia.
I've also rebooted the remaining frontends, but with some more data it doesn't actually seem as if this is caused by the disabled SACKs. If one limits the dashboard to e.g. "stat1005" (the blank period is where the server was depooled for the reboot), it seems as spiky as before: https://grafana.wikimedia.org/d/SxmTH3IZk/arzhels-playground?orgId=1&panelId=3&fullscreen&from=now-3h&to=now
The effect is pretty visible for ms-be1005 on https://grafana.wikimedia.org/d/SxmTH3IZk/arzhels-playground?orgId=1&panelId=3&fullscreen&from=now-1h&to=now ; I'll also reboot the other frontends.
I've submitted a proposed update to fix the underlying OpenSSH bug in Debian Stretch: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=932175
I think we can close this; the error didn't recur with the subsequent reboots and might have just been a race condition at the OS level.
It's my understanding that restarting our recursors is now reduced to a simple depool/repool, and that the previous, complex approach from https://wikitech.wikimedia.org/wiki/Service_restarts#DNS_recursors_(in_production_and_labservices) is now obsolete, right?
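If so, the whole procedure would boil down to something like the following sketch (assuming the conftool depool/pool wrapper scripts and the stock pdns-recursor unit name; to be run on the recursor host itself):

```python
import subprocess
import time

def restart_recursor():
    subprocess.run(["depool"], check=True)   # drain traffic away from this host
    time.sleep(30)                           # let in-flight queries finish
    subprocess.run(["systemctl", "restart", "pdns-recursor"], check=True)
    subprocess.run(["pool"], check=True)     # put the host back into rotation
```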
Jul 15 2019
It got removed from all production hosts (i.e. including cloudstore*) in fcd6990165c7ec8922a531d11782e21f1a5de04f and made specific to Cloud VPS instances with 3afb8303f164ced695dd5977d70c14611d54be7d.
Diamond is now gone from production.
Cole fixed the remaining dashboards. Andrew, can you have a final look at whether everything works as expected? Then we can close the task.
Jul 12 2019
We don't use a lot of disk space on mw servers; let's go with option 2.
Jul 11 2019
VMs have been created.
These updates have been fully deployed:
Jul 10 2019
We discussed this in the SRE Infrastructure Foundations meeting; given that there are other issues with Servermon blocking the Buster migration of the Puppet masters, servermon/netmon1003 can go away now. An alternative solution will be found for the use case described by Alex when the need comes up again.