Rebuild Toolforge servers that should not have NFS mounted (and with affinity)
Closed, ResolvedPublic

Assigned To

Authored By

	• Bstorm
	Nov 3 2020, 12:24 AM

Description

Due to a puppet ENC failure at one point, all tools VMs ended up with NFS mounted. That wouldn't be a huge deal if not for the set of tc rules that come with the mounts (see modules/labstore/manifests/traffic_shaping.pp and the related script modules/labstore/templates/tc-setup.sh.erb). These tc settings dramatically limit network throughput on the VMs, so they run well below what one would expect for a server that serves scores of other VMs simultaneously (about 240Mbps maximum possible egress for all processes combined); see also T218338. Having NFS mounted also exposes the VMs to load and IO problems caused by NFS connectivity quirks, which is really not good either.

The only way to be 100% sure all NFS-focused settings are removed is to rebuild the VM. You can manually increase the tc limits and edit /etc/fstab by hand, but things are still left behind. Any VMs that have the mount_nfs: false hiera key applied should not have NFS mounted now.
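
As a quick first check before deciding on a rebuild, one can look for NFS entries in /proc/mounts. The following is a minimal sketch, not an existing Toolforge tool: the function names are made up for illustration, and it only detects live mounts, not the leftover tc rules or /etc/fstab entries mentioned above.

```python
from __future__ import annotations


def nfs_mounts(mounts_text: str) -> list[str]:
    """Return the mount points whose filesystem type is nfs or nfs4.

    Expects text in /proc/mounts format, where the whitespace-separated
    fields are: device, mount point, fs type, options, dump, pass.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            found.append(fields[1])
    return found


def read_proc_mounts(path: str = "/proc/mounts") -> str:
    """Read the kernel's view of current mounts (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

On a VM, `nfs_mounts(read_proc_mounts())` returning an empty list means nothing is NFS-mounted right now; it does not prove the VM is clean.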

Each cluster requires a different method for replication and failover to ensure that we don't lose state. The puppetmaster is standalone, and the commits in /var/lib/git/labs/private need to be copied to the new one before cutover or important state will be lost.
VMs currently in that category (mount_nfs: false applied) include:

tools-k8s-etcd-*
- Should be created with an anti-affinity server group
- Uses a special flavor from T267078: g2.cores1.ram2.disk20.4xiops
tools-k8s-haproxy-*
- Additional ask here: apply a strong (not soft) anti-affinity server group to these if there isn't one already, and set up the VIP system used in PAWS (automatic failover) if it isn't already there. See: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Keepalived
tools-redis-*
- Beware, has a server group
tools-elastic-*
- Beware, has a server group
tools-puppetmaster-02.tools.eqiad1.wikimedia.cloud
tools-k8s-ingress-1 and tools-k8s-ingress-2
- Beware, these have a server group
tools-prometheus-*
- Should be created with an anti-affinity server group. It's possible something like https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Keepalived could also be applied here, but not sure.
tools-proxy-*
- Should be created with an anti-affinity server group

When any of these are rebuilt entirely with anti-affinity server groups, they can be removed from the spread alarm in puppet.
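
For the entries above that ask for an anti-affinity server group, the rebuild could look roughly like the following sketch. It only builds the OpenStack CLI argument lists (`openstack server group create --policy ...` and `openstack server create --hint group=...`) rather than running them. The helper names, the group name and the image name are assumptions for illustration; network and security group options are omitted and would be needed in practice.

```python
from __future__ import annotations


def server_group_create_cmd(name: str, soft: bool = False) -> list[str]:
    """Build the CLI call that creates a server group.

    "anti-affinity" is the strong policy: scheduling fails rather than
    placing two members on the same hypervisor. "soft-anti-affinity"
    is best effort only.
    """
    policy = "soft-anti-affinity" if soft else "anti-affinity"
    return ["openstack", "server", "group", "create", "--policy", policy, name]


def server_create_cmd(name: str, flavor: str, image: str, group_id: str) -> list[str]:
    """Build the CLI call that creates a VM inside an existing server group."""
    return [
        "openstack", "server", "create",
        "--flavor", flavor,
        "--image", image,
        "--hint", f"group={group_id}",  # scheduler hint: place VM in this group
        name,
    ]


if __name__ == "__main__":
    # Hypothetical names; only the flavor string comes from this task.
    print(" ".join(server_group_create_cmd("tools-k8s-etcd")))
    print(" ".join(server_create_cmd(
        "tools-k8s-etcd-N",
        "g2.cores1.ram2.disk20.4xiops",
        "example-debian-image",
        "SERVER-GROUP-UUID",
    )))
```

The group is passed as a scheduler hint when the server is created, which is consistent with this task treating group membership as something to set up during a rebuild.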

It may be a good idea to try rebuilding the tools-k8s-control-* servers without NFS as well. Those currently have NFS mounted, and do not yet have the mount_nfs: false setting.

No review has been done of how the settings in the tc-setup script may affect the management of the iptables mangle table and similar by calico and kube-proxy. That gap is not good either, but unfortunately, that's where we are.

Details

Subject	Repo	Branch	Lines +/-
toolforge.checker: Update list of etcd nodes	operations/puppet	production	+6 -4
wmcs.toolforge.etcd: make sure the etcdctl node is not the new one	operations/cookbooks	master	+5 -0
cloud-vps: Change NFS mounts to default to false	operations/puppet	production	+2 -2


Related Objects

Status	Assigned	Task
Open	None	T262350 bad failure cases for wmcs custom puppet enc
Resolved	taavi	T267082 Rebuild Toolforge servers that should not have NFS mounted (and with affinity)
Resolved	dcaro	T267140 [toolsbeta] Rebuild servers to learn how to take down the services without downtime (and use affinities)
Resolved	taavi	T252239 Rebuild tools-k8s-haproxy-* as an anti-affinity server group
Resolved	taavi	T278541 Toolforge: migrate redis servers to Debian Buster or later
Resolved	taavi	T153810 Make switching Redis server simpler
Resolved	dcaro	T309014 sentinel and puppet overwriting toolforge redis config
Resolved	dcaro	T279723 Remove 2 nodes from the tools-k8s-etcd cluster

Event Timeline

• Bstorm triaged this task as Medium priority. Nov 3 2020, 12:24 AM

• Bstorm created this task.

Anywhere NFS is actually not mounted, things are probably ok. That can also be verified by the right tc commands.

In T267082#6598683, @Bstorm wrote:

Anywhere NFS is actually not mounted, things are probably ok. That can also be verified by the right tc commands.

By that I mean: if you have a clean server without tc rules and mangling enabled, then sudo /sbin/tc -s -d class show dev eth0 will show no output. If the limits are applied, you'll get a lot of output about the limits currently in place. If NFS isn't mounted, but there's still tc stuff that isn't applied by calico or similar, it is probably best to rebuild.
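
The check described above can be scripted. This is a minimal sketch under the assumption stated in the comment, namely that a clean server prints nothing for that command; the function names are illustrative, and it cannot tell NFS-related classes apart from ones applied by calico or similar, so non-empty output still needs a human look.

```python
from __future__ import annotations

import subprocess


def tc_class_cmd(device: str = "eth0") -> list[str]:
    """The command quoted in the comment above, as an argument list."""
    return ["sudo", "/sbin/tc", "-s", "-d", "class", "show", "dev", device]


def has_tc_classes(tc_output: str) -> bool:
    """True if the tc output contains any non-blank line."""
    return any(line.strip() for line in tc_output.splitlines())


def check_device(device: str = "eth0") -> bool:
    """Run tc on this host and report whether any classes are configured."""
    result = subprocess.run(
        tc_class_cmd(device), capture_output=True, text=True, check=True
    )
    return has_tc_classes(result.stdout)
```

`check_device()` returning False matches the "clean server" case; True means some traffic classes exist and the output should be inspected before deciding whether to rebuild.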

aborrero added a subscriber: dcaro. Nov 3 2020, 9:55 AM

aborrero subscribed.

dcaro claimed this task. Nov 3 2020, 1:53 PM

I guess we can do toolsbeta first, so we gain intel on how to do the rebuild without causing too much downtime.

dcaro added a subtask: T267140: [toolsbeta] Rebuild servers to learn how to take down the services without downtime (and use affinities). Nov 3 2020, 4:50 PM

Change 639297 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloud-vps: Change NFS mounts to default to false

https://gerrit.wikimedia.org/r/639297

gerritbot added a project: Patch-For-Review. Nov 4 2020, 9:47 PM

• Bstorm updated the task description. Nov 6 2020, 5:15 PM

• Bstorm mentioned this in T267140: [toolsbeta] Rebuild servers to learn how to take down the services without downtime (and use affinities).

Updated requirements a little since we are rebuilding. With the server groups updated, we can remove some of this from the spread alarms.

• Bstorm updated the task description. Nov 6 2020, 5:19 PM

• Bstorm updated the task description. Nov 10 2020, 5:15 PM

Change 639297 merged by Bstorm:
[operations/puppet@production] cloud-vps: Change NFS mounts to default to false

https://gerrit.wikimedia.org/r/639297

Maintenance_bot removed a project: Patch-For-Review. Nov 11 2020, 12:11 AM

• Bstorm mentioned this in T267966: Try to squeeze better performance out of k8s-etcd nodes. Nov 16 2020, 7:23 PM

aborrero renamed this task from "Rebuild Toolforge servers that should not have NFS mounted" to "Rebuild Toolforge servers that should not have NFS mounted (and with affinity)". Nov 17 2020, 3:43 PM

• Bstorm updated the task description. Nov 17 2020, 4:14 PM

taavi mentioned this in T278390: Toolforge root for Majavah. Mar 24 2021, 9:49 PM

taavi added a subtask: T252239: Rebuild tools-k8s-haproxy-* as an anti-affinity server group. Mar 26 2021, 12:16 PM

taavi added a project: Toolforge.

taavi added a subtask: T278541: Toolforge: migrate redis servers to Debian Buster or later.

I'm double checking, but I don't think the email server requires NFS.

In T267082#6960458, @aborrero wrote:

I'm double checking, but I don't think the email server requires NFS.

The only open files on NFS are those related to my home directory.

aborrero updated the task description. Mar 31 2021, 2:42 PM

In T267082#6960920, @aborrero wrote:

In T267082#6960458, @aborrero wrote:

I'm double checking, but I don't think the email server requires NFS.

The only open files on NFS are those related to my home directory.

How does the $HOME/.forward configuration work without NFS?

Mentioned in SAL (#wikimedia-cloud) [2021-03-31T15:57:35Z] <arturo> rebooting tools-mail-03 after enabling NFS (T267082, T278538)

aborrero updated the task description. Mar 31 2021, 3:57 PM

RhinosF1 subscribed. Mar 31 2021, 4:25 PM

Change 676373 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/cookbooks@master] wmcs.toolforge.etcd: make sure the etcdctl node is not the new one

https://gerrit.wikimedia.org/r/676373

gerritbot added a project: Patch-For-Review. Apr 1 2021, 1:47 PM

Mentioned in SAL (#wikimedia-cloud) [2021-04-01T15:18:57Z] <dcaro> adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-01T15:36:26Z] <dcaro> Added new etcd member tools-k8s-etcd-9.tools.eqiad1.wikimedia.cloud (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-01T15:43:29Z] <dcaro> Removing etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-01T15:53:30Z] <dcaro> Removed etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs, adding a new member (T267082)

Change 676373 merged by jenkins-bot:

[operations/cookbooks@master] wmcs.toolforge.etcd: make sure the etcdctl node is not the new one

https://gerrit.wikimedia.org/r/676373

Maintenance_bot removed a project: Patch-For-Review. Apr 1 2021, 4:11 PM

Change 676409 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] toolforge.checker: Update list of etcd nodes

https://gerrit.wikimedia.org/r/676409

gerritbot added a project: Patch-For-Review. Apr 1 2021, 4:17 PM

Mentioned in SAL (#wikimedia-cloud) [2021-04-06T08:55:12Z] <dcaro> adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-06T09:53:48Z] <dcaro> Removing etcd member tools-k8s-etcd-4.tools.eqiad.wmflabs (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-06T10:07:16Z] <dcaro> adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-06T10:31:34Z] <dcaro> Removing etcd member tools-k8s-etcd-6.tools.eqiad.wmflabs (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-06T12:59:37Z] <dcaro> Removing etcd member tools-k8s-etcd-7.tools.eqiad1.wikimedia.cloud to get an odd number (T267082)

Mentioned in SAL (#wikimedia-cloud) [2021-04-06T13:11:11Z] <dcaro> Removing etcd member toolsbeta-test-k8s-etcd-7.tools.eqiad1.wikimedia.cloud to get an odd number (T267082)
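
For context on the "to get an odd number" entries above: etcd needs a majority of members (a quorum) to accept writes, so an even-sized cluster tolerates no more failures than the odd-sized cluster one member smaller. The sketch below works through that arithmetic; the function names are illustrative and the member counts are examples, not the actual sizes of the Toolforge clusters.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a quorum remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for size in range(3, 7):
        print(f"{size} members: quorum {quorum(size)}, "
              f"tolerates {tolerated_failures(size)} failure(s)")
```

With 3 members the quorum is 2 and one failure is tolerated; with 4 members the quorum rises to 3 but still only one failure is tolerated, so the extra member adds load without adding resilience.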