Page MenuHomePhabricator

Switching to a default of nfs_mount: false caused some stale fstab entries to clean up.
Closed, ResolvedPublic

Description

This may be the root cause of T289655: [Cloud VPS alert][paws] Puppet failure on paws-k8s-haproxy-1.paws.eqiad.wmflabs (172.16.0.191) and is related to T267082: Rebuild Toolforge servers that should not have NFS mounted (and with affinity)

The patterns for the VM names (straight out of cumin) are:

  • paws-acme-chief-01.paws.eqiad1.wikimedia.cloud
  • paws-k8s-haproxy-[1-2].paws.eqiad1.wikimedia.cloud
  • abogott-proxy-canary.testlabs.eqiad1.wikimedia.cloud
  • abogott-puppetmaster.testlabs.eqiad1.wikimedia.cloud
  • krenair-t252721-test.testlabs.eqiad1.wikimedia.cloud
  • paws-puppetmaster-01.paws.eqiad1.wikimedia.cloud
  • tools-acme-chief-[01-02].tools.eqiad1.wikimedia.cloud
  • tools-clushmaster-02.tools.eqiad1.wikimedia.cloud
  • tools-elastic-[1-2].tools.eqiad1.wikimedia.cloud
  • tools-legacy-redirector.tools.eqiad1.wikimedia.cloud
  • tools-prometheus-05.tools.eqiad1.wikimedia.cloud
  • tools-puppetmaster-02.tools.eqiad1.wikimedia.cloud
  • tools-redis-[1003-1004].tools.eqiad1.wikimedia.cloud
  • toolsbeta-acme-chief-01.toolsbeta.eqiad1.wikimedia.cloud
  • toolsbeta-docker-imagebuilder-01.toolsbeta.eqiad1.wikimedia.cloud
  • toolsbeta-docker-registry-01.toolsbeta.eqiad1.wikimedia.cloud
  • toolsbeta-legacy-redirector.toolsbeta.eqiad1.wikimedia.cloud
  • toolsbeta-mail-01.toolsbeta.eqiad1.wikimedia.cloud
  • toolsbeta-puppetdb-02.toolsbeta.eqiad1.wikimedia.cloud
  • toolsbeta-puppetmaster-04.toolsbeta.eqiad1.wikimedia.cloud
  • toolsbeta-services-01.toolsbeta.eqiad1.wikimedia.cloud
  • toolsbeta-sgecron-01.toolsbeta.eqiad1.wikimedia.cloud
  • toolsbeta-workflow-test.toolsbeta.eqiad1.wikimedia.cloud

So 27 VMs. They need a force unmount of the two incorrect scratch volumes with removal of them from /etc/fstab *or* they should have nfs_mount: true set on the VM, which will allow puppet to correct things for the scratch mount (such as for cron servers).

Event Timeline

This is really a combination of servers that were never intended to mount NFS suddenly using the default because of a puppet ENC failure event with the change in defaults AND the rebuild of the NFS system for cloudstore1008/9.

Mentioned in SAL (#wikimedia-cloud) [2021-08-30T23:23:15Z] <bstorm> deleting toolsbeta-workflow-test T289709

Bstorm updated the task description. (Show Details)

All done.

Just to be clear on what I did, most of these were a force unmount of the scratch volumes with hopes of them being rebuilt one day. Only where the VM was just a test machine or it really should have NFS did I set the mount_nfs: true hiera.