Wed, Jun 13
This turned out to be a rabbit hole which would also require upgrading urllib3, so in the end it was worked around in the debmonitor client.
Tue, Jun 12
Load was in the 120 ballpark and there were a total of 141 "/usr/bin/python -S /var/lib/mailman/scripts/driver listinfo" processes running.
Imported via Secure Apt (the release key is signed by Eric, with whom I've cross-signed keys) and added to component/cassandra311 on apt.wikimedia.org.
Mon, Jun 11
restbase-dev1004 and -1006 still have puppet disabled with a note about upgrading Cassandra, safe to re-enable? Ping me when you're around and I'll upload the new 3.11.2 debs to apt.wikimedia.org
No new errors have been logged in SEL and the server appears stable, closing the task.
Fri, Jun 8
dbus is installed by default, and while I've successfully tested restarts on a number of servers (sodium, db2093, dns4001, mw1318, ores1001), we're erring on the side of caution and not adding it to automated restarts.
Thu, Jun 7
Repooled, seems fine so far.
@Cmjohnson I'm not seeing a new CPU error logged in "racadm getsel", but the server is also still depooled and thus not receiving traffic (the error may only show up under load). Unless you wanna do additional tests, I would go ahead and repool it?
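For repeated checks like this one, the SEL scan can be scripted. A minimal sketch, reading "racadm getsel"-style text on stdin; the match pattern and the remote invocation in the comment are assumptions and should be adjusted to what the hardware actually logs:

```shell
# Returns success (0) if the SEL text on stdin contains a CPU-related
# record; the pattern is an assumption, tune it to your hardware's wording.
sel_has_cpu_errors() {
    grep -qiE 'cpu|processor|machine check'
}

# Hypothetical live usage via the management interface:
# ssh admin@HOST.mgmt racadm getsel | sel_has_cpu_errors && echo "CPU errors logged"
```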
Tue, Jun 5
There's one remaining bit we need to clean up, which I'll take care of this week: our puppet templates for sshd still configure a DSA host key, and that should also be removed.
Mon, Jun 4
@Marostegui : hpssaducli is present in the thirdparty/hwraid component for stretch already.
@MelodyKramer The list has been created, let me know if you need any additional tweaks.
Patch looks fine, but this request needs to pass the three day waiting/review period.
@Gilles You can now log into stat1004.eqiad.wmnet/stat1005.eqiad.wmnet.
Fri, Jun 1
@bmansurov I changed your group membership, please retry.
There's quite a bit of cron spam by planet2001:
Thu, May 31
Wed, May 30
Test servers for Parsoid: https://phabricator.wikimedia.org/P7187
Ack, the analysis and the proposed fix seem entirely correct.
Given that the change is now live, shall we close this ticket or do you expect another update soon for the detection of character-based languages?
Test servers for Hadoop cluster at https://phabricator.wikimedia.org/P7186
Test servers for the elastic clusters at https://phabricator.wikimedia.org/P7184
Tue, May 29
@Gilles: Since this is a non-sudo change, it only needs to pass the three day waiting period.
Thanks, I've repooled the server. I'll keep an eye on it throughout the week to see whether it now holds up fine.
Should we split this into three tickets, since the actionables (and the people acting on them) are fairly disjoint? (So one task to remove it from the Math extension, one to remove it from the production installation/repositories, and one to clean up the database.)
Mon, May 28
Status update: Half of our active data centre and the majority of servers in our backup DC have been upgraded to wikidiff 1.7. The rest will follow tomorrow.
The server has been out of warranty since January. @Papaul: Do we have any decommissioned servers from which we could swap the broken CPU?
Fri, May 25
We could also avoid downtime by temporarily reusing mw1298 (former image scaler) and reinstalling it as phab1002 with stretch. Then we can switch to phab1002 and reimage phab1001 (with an eventual switchback to phab1001/stretch) without a Phabricator downtime of > 2 hours. The specs are roughly the same: phab1001 has a slightly more powerful CPU than mw1298, but both have 64 GB RAM, and looking at Prometheus, CPU usage is usually ~ 25%, so that should be fine.
Thu, May 24
Six of the mw* servers have been switched to using the microcode updates.
The mwdebug servers have also been upgraded.
@Lea_WMDE, @WMDE-Fisch : The canary application servers have been upgraded and so far everything looks fine in the logs. I'll keep an eye on it, but I think we're good to proceed with the wider rollout next Monday.
@Lea_WMDE Ack, I'll start upgrading the mediawiki canaries later this afternoon (CEST).
PHP 7.2 packages for stretch have been available since early March via thirdparty/php72; let me know if anything is missing.
Wed, May 23
@Lea_WMDE : Sure, no problem!
Tue, May 22
May 16 2018
Another case I found: ms-be2013 to ms-be2021 were unable to install the systemd update released via stretch-updates, and it turned out that stretch-updates was missing from /etc/apt/sources.list. Those were among the first stretch hosts, so this was probably a one-off issue in early installations only. I've fixed up the apt config on the affected systems.
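The fix applied to those hosts can be sketched as a small idempotent check-and-append; the mirror URL and component list here are assumptions, substitute whatever the local standard entry is:

```shell
# Ensure a stretch-updates entry exists in the given apt sources file,
# appending one if missing. Mirror URL/components are assumptions.
ensure_stretch_updates() {
    local sources="$1"
    if ! grep -q '^deb .* stretch-updates ' "$sources"; then
        echo 'deb http://mirrors.wikimedia.org/debian/ stretch-updates main contrib non-free' >> "$sources"
        echo "added stretch-updates to $sources"
    fi
}

# Usage sketch: ensure_stretch_updates /etc/apt/sources.list && apt-get update
```

Because the grep guards the append, running it twice leaves a single entry, which makes it safe to push via automation.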
We could also simply avoid X-Powered-By at the source: our PHP configs already use "expose_php=off", and for HHVM, per https://github.com/facebook/hhvm/issues/2343, adding "expose_php = 0" to server.ini would be the equivalent.
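Concretely, the two settings would look like this; the file paths are assumptions based on typical Debian layouts:

```ini
; PHP (e.g. /etc/php/7.0/fpm/php.ini) - already in place per the above
expose_php = Off

; HHVM (e.g. /etc/hhvm/server.ini) - proposed equivalent, per the upstream issue
expose_php = 0
```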
May 15 2018
We have two clusters which need updated microcode to provide support for the new IBPB feature needed to secure KVM instances against Spectre. In addition, keeping the microcode updated is also recommended by upstream, since microcode updates can also address functional CPU issues.
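Whether the running microcode exposes a given mitigation can be checked from the CPU flags. A sketch that reads /proc/cpuinfo-style text on stdin so it is easy to test offline; note that flag names vary by kernel version and CPU vendor (e.g. "ibpb" vs. "spec_ctrl"), so the exact flag to look for is an assumption:

```shell
# Check whether a given flag appears on the "flags" line of
# /proc/cpuinfo-formatted input.
cpu_has_flag() {
    local flag="$1"
    grep -m1 '^flags' | tr ' ' '\n' | grep -qx "$flag"
}

# Live usage sketch:
# cpu_has_flag ibpb < /proc/cpuinfo && echo "IBPB available"
```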
May 14 2018
And let's do this on Friday, which leaves us until Monday's SWAT (if any).
I suggest we do the following:
- Pick a date/time frame of a few hours when no deployments are happening (or cancel existing ones)
- Switch the deployment server to deploy1001 and test deployments using the PHP 7 setup present there
- If anything breaks, revert to tin, fix whatever problem we found with PHP 7 and deploy1001, and re-attempt at a later stage
- If everything works fine, keep tin as a fallback for a few weeks and then decom it.
May 9 2018
Host was decommissioned in T187190