Strange NFS client outage on VMs running on cloudvirt1036
Closed, Resolved · Public

Description

Toolschecker alerted at Thu Sep 17 20:03:58 UTC 2020 for a grid cron job, and many other grid crons and services died around that time.

The cause is clear from the dmesg output on tools-sgegrid-master.
The NFS issues began at:

[Thu Sep 17 19:53:40 2020] INFO: task sge_qmaster:2128 blocked for more than 120 seconds.
[Thu Sep 17 19:53:40 2020]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[Thu Sep 17 19:53:40 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Sep 17 19:53:40 2020] sge_qmaster     D    0  2128      1 0x00000000
[Thu Sep 17 19:53:40 2020]  ffff9f45314eda80 0000000000000000 ffff9f4532253000 ffff9f453fc98980
[Thu Sep 17 19:53:40 2020]  ffff9f4536343080 ffffbea781adfb90 ffffffffafc144b9 ffffffffc075f3c0
[Thu Sep 17 19:53:40 2020]  0000000000000000 ffff9f453fc98980 ffffffffafc19364 ffff9f4532253000
[Thu Sep 17 19:53:40 2020] Call Trace:
[Thu Sep 17 19:53:40 2020]  [<ffffffffafc144b9>] ? __schedule+0x239/0x6f0
...snip...
[Thu Sep 17 19:54:09 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Thu Sep 17 19:54:10 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Thu Sep 17 19:54:12 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Thu Sep 17 19:54:12 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Thu Sep 17 19:54:12 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Thu Sep 17 19:54:12 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Thu Sep 17 19:54:12 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Thu Sep 17 19:54:12 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying

and were resolved at:

[Thu Sep 17 20:04:10 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, timed out
[Thu Sep 17 20:04:10 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet OK

This precise pattern happened on all NFS client VMs running on cloudvirt1036 at that time and nowhere else.
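
For reference, the pattern above can be spotted quickly on any client VM by grepping dmesg for the hung-task and NFS server messages. A minimal sketch (run on the VM itself; -T just makes the timestamps human-readable):

$ sudo dmesg -T | grep -E 'blocked for more than 120 seconds|nfs: server'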

tools-k8s-worker-74 was moved to that host during that window and was still spinning up at the time, so its logs don't mention an NFS disconnect. We don't see any significant metric anomalies or errors on that hypervisor (so far) apart from a drop in CPU/RAM usage (probably because the affected VMs were all grid nodes and the grid ceased to function).

Event Timeline

Bstorm triaged this task as Medium priority. Sep 17 2020, 10:09 PM
Bstorm created this task.

I don't have a lot to add. The main thing I was doing today was moving VMs to ceph and/or resizing VMs (which entails a move between hosts). We should be alert to that causing problems if this issue recurs.
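
If this recurs, one quick way to see which VMs are sitting on the suspect hypervisor is an admin-side listing. A sketch only; it assumes admin credentials and that the scheduler records the FQDN as the host:

$ openstack server list --all-projects --host cloudvirt1036.eqiad.wmnet --long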

I double-checked the backup jobs as well, but they run at a totally different time.

For the record:

aborrero@cumin1001:~$ sudo cumin --force --timeout 500 "cloudvirt10*" "grep 'RULE_DELETE failed' /var/log/neutron/neutron-linuxbridge-agent.log | head"
27 hosts will be targeted:
cloudvirt[1012-1014,1016-1039].eqiad.wmnet
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====                                                                                                                                                                                             
(1) cloudvirt1036.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'grep 'RULE_DELET...agent.log | head' -----                                                                                                                                                        
line 35: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-PREROUTING                                                                                                                 
line 35: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-PREROUTING                                                                                                                 
2020-09-18 01:14:51.072 51103 ERROR neutron.plugins.ml2.drivers.agent._common_agent line 35: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-PREROUTING
line 35: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-PREROUTING
line 35: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-PREROUTING
2020-09-18 01:14:53.082 51103 ERROR neutron.plugins.ml2.drivers.agent._common_agent line 35: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-PREROUTING
line 35: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-PREROUTING
line 35: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-PREROUTING
2020-09-18 01:14:55.078 51103 ERROR neutron.plugins.ml2.drivers.agent._common_agent line 35: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-PREROUTING
line 35: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-PREROUTING
================     

Mentioned in SAL (#wikimedia-cloud) [2020-09-18T08:50:26Z] <arturo> installing iptables from buster-bpo in cloudvirt1036 (T263205 and T262979)
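
For context: "buster-bpo" presumably means buster-backports, so the install would look roughly like this (the exact package set is an assumption):

$ sudo apt-get update
$ sudo apt-get install -t buster-backports iptables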

Mentioned in SAL (#wikimedia-cloud) [2020-09-18T08:59:10Z] <arturo> disable puppet in all buster cloudvirts (cloudvirt[1024,1031-1039].eqiad.wmnet) to merge a patch for T263205 and T262979
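
Roughly, disabling puppet across that host set from the cumin master looks like this (the disable message here is made up for illustration):

$ sudo cumin 'cloudvirt[1024,1031-1039].eqiad.wmnet' 'puppet agent --disable "T263205: merging iptables patch"'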

aborrero claimed this task.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

With the patches I merged related to T262979: cloudvirts: the rocky/buster combo has iptables/ebtables issues, producing errors when launching VMs (and probably other stuff), I suspect this issue won't happen again.
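
If anyone wants to re-verify on a buster cloudvirt, a quick sanity check (a sketch, assuming the iptables/ebtables binaries are managed via update-alternatives on buster) is:

$ dpkg -l iptables ebtables
$ update-alternatives --display iptables
$ update-alternatives --display ebtables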

Closing task now. Please reopen if required!