Rebuild Toolforge servers that should not have NFS mounted (and with affinity)
Open, Medium, Public

Description

Due to a puppet ENC failure at one point, all tools VMs ended up with NFS mounted. That wouldn't seem like a huge deal if not for the set of tc rules that come with the mounts (see modules/labstore/manifests/traffic_shaping.pp and the related script modules/labstore/templates/tc-setup.sh.erb). These tc settings dramatically limit network throughput on the VMs (about 240Mbps maximum possible egress for all processes combined), so they run well below what one would expect for a server that serves scores of other VMs simultaneously. See also T218338. Having NFS mounted also exposes the VMs to load and IO problems caused by NFS connectivity quirks.

The only way to be 100% sure all NFS-focused settings are removed is to rebuild the VM (you can manually increase tc limits and edit /etc/fstab by hand, but things are still left behind). Any VMs that have the mount_nfs: false hiera key applied should not have NFS now.
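
As a quick sanity check before and after a rebuild, something like the following could list any NFS mounts still present on a VM. This is a minimal sketch, not part of the puppet code: the function names are made up for illustration, and it assumes the standard /proc/mounts layout (device, mount point, filesystem type, ...).

```python
from typing import List


def nfs_mounts(mounts_text: str) -> List[str]:
    """Return the mount points whose filesystem type is nfs or nfs4."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fstype, options, dump, pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            found.append(fields[1])
    return found


def current_nfs_mounts(path: str = "/proc/mounts") -> List[str]:
    """Hypothetical helper: read the live mount table and filter it."""
    with open(path) as fh:
        return nfs_mounts(fh.read())
```

An empty list from current_nfs_mounts() only says nothing is mounted right now. It does not prove the tc rules or other leftovers are gone.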

Each cluster requires a different method for replication and failover to ensure that we don't lose state. The puppetmaster is standalone, and the commits in /var/lib/git/labs/private need to be copied to the new one before cutover or important state will be lost.

VMs that should not have NFS mounted currently include:

  • tools-k8s-etcd-*
    • Should be created with an anti-affinity server group
    • Has a special flavor from T267078: g2.cores1.ram2.disk20.4xiops
  • tools-k8s-haproxy-*
  • tools-redis-*
    • Beware, has a server group
  • tools-elastic-*
    • Beware, has a server group
  • tools-puppetmaster-02.tools.eqiad1.wikimedia.cloud
  • tools-k8s-ingress-1 and tools-k8s-ingress-2
    • Beware, has a server group
  • tools-prometheus-*
  • tools-proxy-*
    • Should be created with an anti-affinity server group

When any of these are rebuilt entirely with anti-affinity server groups, they can be removed from the spread alarm in puppet.
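
To confirm a rebuilt cluster really is spread out before dropping it from the alarm, a check along these lines might help. It is only a sketch: the VM-to-hypervisor mapping is assumed to come from wherever you normally look it up, the function name is hypothetical, and this is not how the puppet spread alarm itself works.

```python
from collections import defaultdict
from typing import Dict, List


def colocated(placement: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each hypervisor that hosts more than one of the given VMs to those VMs.

    `placement` maps VM name -> hypervisor name. An empty result means
    every VM sits on its own hypervisor, which is what anti-affinity asks for.
    """
    by_host = defaultdict(list)
    for vm, host in placement.items():
        by_host[host].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}
```

Feeding it the members of one server group at a time keeps the result easy to read.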

It may be a good idea to try rebuilding the tools-k8s-control-* servers without NFS as well. Those currently have NFS mounted, and do not yet have the mount_nfs: false setting.

Nobody has reviewed how the settings in the tc-setup script may affect the way calico and kube-proxy manage the iptables mangle table and similar, which is not good either. Unfortunately, that's where we are.

Event Timeline

Bstorm triaged this task as Medium priority.Nov 3 2020, 12:24 AM
Bstorm created this task.

Anywhere NFS is actually not mounted, things are probably ok. That can also be verified by the right tc commands.

By that I mean, if you have a clean server without tc rules and mangling enabled, then sudo /sbin/tc -s -d class show dev eth0 will show no output. If the limits are applied, you'll get a lot of output about them. If NFS isn't mounted, but there's still tc stuff that wasn't applied by calico or similar, it is probably best to rebuild.
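
If this needs to be checked across many VMs, the same test could be scripted roughly like this. It is a sketch under assumptions: the helper names are invented here, the interface is assumed to be eth0 as in the command above, and it has to run with enough privileges to call tc.

```python
import subprocess
from typing import List


def looks_clean(tc_output: str) -> bool:
    """A clean server prints nothing for `tc class show`."""
    return not tc_output.strip()


def tc_class_lines(tc_output: str) -> List[str]:
    """Return only the 'class ...' header lines, dropping the per-class stats."""
    return [
        line.strip()
        for line in tc_output.splitlines()
        if line.strip().startswith("class ")
    ]


def read_tc_classes(dev: str = "eth0") -> str:
    """Run the same command as above and return its stdout."""
    result = subprocess.run(
        ["/sbin/tc", "-s", "-d", "class", "show", "dev", dev],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

looks_clean(read_tc_classes()) returning False does not say who installed the classes, so the output still needs a human look to tell labstore leftovers from anything calico or similar set up.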

I guess we can do toolsbeta first, so we gain intel on how to do the rebuild without causing too much downtime.

Change 639297 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloud-vps: Change NFS mounts to default to false

https://gerrit.wikimedia.org/r/639297

Updated requirements a little since we are rebuilding. With the server groups updated, we can remove some of this from the spread alarms.

Change 639297 merged by Bstorm:
[operations/puppet@production] cloud-vps: Change NFS mounts to default to false

https://gerrit.wikimedia.org/r/639297

aborrero renamed this task from Rebuild Toolforge servers that should not have NFS mounted to Rebuild Toolforge servers that should not have NFS mounted (and with affinity).Nov 17 2020, 3:43 PM

I'm double checking, but I don't think the email server requires NFS.

The only open files on NFS are those related to my home directory.

How does the $HOME/.forward configuration work without NFS?

Change 676373 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/cookbooks@master] wmcs.toolforge.etcd: make sure the etcdctl node is not the new one

https://gerrit.wikimedia.org/r/676373

Mentioned in SAL (#wikimedia-cloud) [2021-04-01T15:18:57Z] <dcaro> adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-01T15:36:26Z] <dcaro> Added new etcd member tools-k8s-etcd-9.tools.eqiad1.wikimedia.cloud (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-01T15:43:29Z] <dcaro> Removing etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-01T15:53:30Z] <dcaro> Removed etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs, adding a new member (T267082)

Change 676373 merged by jenkins-bot:

[operations/cookbooks@master] wmcs.toolforge.etcd: make sure the etcdctl node is not the new one

https://gerrit.wikimedia.org/r/676373

Change 676409 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] toolforge.checker: Update list of etcd nodes

https://gerrit.wikimedia.org/r/676409

Mentioned in SAL (#wikimedia-cloud) [2021-04-06T08:55:12Z] <dcaro> adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-06T09:53:48Z] <dcaro> Removing etcd member tools-k8s-etcd-4.tools.eqiad.wmflabs (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-06T10:07:16Z] <dcaro> adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-06T10:31:34Z] <dcaro> Removing etcd member tools-k8s-etcd-6.tools.eqiad.wmflabs (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-06T12:59:37Z] <dcaro> Removing etcd member tools-k8s-etcd-7.tools.eqiad1.wikimedia.cloud to get an odd number (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-06T13:11:11Z] <dcaro> Removing etcd member toolsbeta-test-k8s-etcd-7.tools.eqiad1.wikimedia.cloud to get an odd number (T267082)
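
For context on "to get an odd number": etcd needs a majority of members to agree, so an even-sized cluster tolerates no more failures than the odd-sized cluster one member smaller. A quick sketch of the arithmetic (the function names are just for illustration):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(3, 7):
        print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Going from 3 to 4 members raises the quorum from 2 to 3 but still tolerates only one failure, so the extra member adds risk without adding resilience.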

Change 676409 merged by David Caro:

[operations/puppet@production] toolforge.checker: Update list of etcd nodes

https://gerrit.wikimedia.org/r/676409

  • tools-k8s-ingress-1 and tools-k8s-ingress-2
    • Beware, has a server group

Will be done as part of T264221: Upgrade the nginx ingress controller in Toolforge (and likely PAWS)

dcaro removed dcaro as the assignee of this task.Aug 10 2021, 5:01 PM