Recently a couple of VMs in the integration project showed up with filesystem corruption, most recently deployment-elastic07.deployment-prep.eqiad.wmflabs.
I ran a search to see if this is a widespread issue. It's not /that/ widespread, but also not totally isolated.
sudo cumin --force --timeout 500 "A:all" "dmesg | grep 'since last fsck'"
shows 11 hosts with 1 or more errors. They are:
commonsarchive-prod.commonsarchive.eqiad1.wikimedia.cloud
deployment-elastic05.deployment-prep.eqiad1.wikimedia.cloud
deployment-elastic06.deployment-prep.eqiad1.wikimedia.cloud
deployment-elastic07.deployment-prep.eqiad1.wikimedia.cloud
ores-lb-03.ores.eqiad1.wikimedia.cloud
parsing-qa-01.wikitextexp.eqiad1.wikimedia.cloud
pub2.wikiapiary.eqiad1.wikimedia.cloud
toolsbeta-sgeexec-0901.toolsbeta.eqiad1.wikimedia.cloud
toolsbeta-sgewebgrid-generic-0901.toolsbeta.eqiad1.wikimedia.cloud
toolsbeta-sgewebgrid-lighttpd-0901.toolsbeta.eqiad1.wikimedia.cloud
whgi.wikidumpparse.eqiad1.wikimedia.cloud
Two other instances had the same issue:
tools-sgeexec-0907.toolsbeta.eqiad1.wikimedia.cloud (T290798)
integration-agent-qemu-1001.integration.eqiad1.wikimedia.cloud (T290615)
Going forward, we can run sudo cumin --force --timeout 500 "A:all" "dmesg | grep -q -m 1 'since last fsck'" to check whether the list is growing. With that command, any host reported as "success" has the error in its dmesg output. As of 2021-10-04, that command reports 9 hosts (one affected host is down, so it does not report "success").
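
To make it clearer why "success" means "affected" in the second command: grep -q prints nothing and exits 0 as soon as it finds a match, and -m 1 stops at the first matching line. The following is a minimal local sketch of that check, not something that was run on the fleet. The function name classify_dmesg and the sample log lines are illustrative assumptions.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints
# "affected" if any line contains 'since last fsck', else "clean".
classify_dmesg() {
    # -q: no output, the exit status alone says whether a match was found
    # -m 1: stop reading after the first matching line
    if grep -q -m 1 'since last fsck'; then
        echo "affected"
    else
        echo "clean"
    fi
}

# Illustrative sample lines (assumed, not copied from an affected host)
printf '%s\n' 'EXT4-fs (vda3): error count since last fsck: 12' | classify_dmesg
printf '%s\n' 'EXT4-fs (vda3): mounted filesystem with ordered data mode' | classify_dmesg
```

On a single host the same check would be dmesg | classify_dmesg, which may need root depending on the kernel's dmesg restrictions.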