
NFS issue affecting Toolforge SGE master
Closed, Resolved (Public)

Description

When trying to stop a running job, I see:

tools.persondata@tools-sgebastion-07:~$ jstop process_templatedata
error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
Traceback (most recent call last):
  File "/usr/bin/job", line 48, in <module>
    root = xml.etree.ElementTree.fromstring(proc.stdout.read())
  File "/usr/lib/python3.5/xml/etree/ElementTree.py", line 1345, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
SGE 8.1.9
usage: qdel [options] job_task_list
   [-f]                                     force action
   [-help]                                  print this help
   [-u user_list]                           delete all jobs of users specified in list
   job_task_list                            delete all jobs given in list

job_task_list           job_tasks[,job_tasks,...]
job_tasks               [job_id['.'task_id_range]|job_name|pattern][' -t 'task_id_range]
task_id_range           task_id['-'task_id[':'step]]
user_list               user[,user,...]
ERROR! no option argument

Similar message with qdel:

qdel process_templatedata
failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
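
The traceback comes from the job/jstop wrapper parsing the XML that qstat prints on stdout: when the qmaster times out, the GDI error goes to stderr, stdout is empty, and ElementTree fails with "no element found: line 1, column 0". As a minimal sketch (not the actual /usr/bin/job code, and assuming the wrapper shells out to qstat -xml), a guard like this would surface the underlying error instead of the traceback:

import subprocess
import sys
import xml.etree.ElementTree

# Run qstat in XML mode and capture both streams (illustrative only).
proc = subprocess.run(["qstat", "-xml"], stdout=subprocess.PIPE,
                      stderr=subprocess.PIPE, universal_newlines=True)
if proc.returncode != 0 or not proc.stdout.strip():
    # The qmaster timeout message lands on stderr; report it directly.
    sys.exit("qstat produced no XML output: " + proc.stderr.strip())
root = xml.etree.ElementTree.fromstring(proc.stdout)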

Event Timeline

bd808 triaged this task as Unbreak Now! priority.

Even on the Stretch grid master itself, qstat is timing out with the same error.

$ qstat
error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).

From /data/project/.system_sge/gridengine/spool/qmaster/messages:

03/11/2019 15:00:00|event_|tools-sgegrid-master|W|acknowledge timeout after 600 seconds for event client (schedd:0) on host "tools-sgegrid-master"
03/11/2019 15:00:00|event_|tools-sgegrid-master|E|removing event client (schedd:0) on host "tools-sgegrid-master" after acknowledge timeout from event client list

Mentioned in SAL (#wikimedia-cloud) [2019-03-11T15:42:38Z] <bd808> Rebooting tools-sgegrid-master (T218038)

Mentioned in SAL (#wikimedia-cloud) [2019-03-11T15:47:14Z] <bd808> Hard reboot of tools-sgegrid-master via Horizon UI (T218038)

Mentioned in SAL (#wikimedia-cloud) [2019-03-11T15:53:26Z] <bd808> Manually started service gridengine-master on tools-sgegrid-master after reboot (T218038)

After hard reboot:

$ qstat
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-sgegrid-master.tools.eqiad.wmflabs": got send error
$ sudo -i puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tools-sgegrid-master.tools.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1552319268'
Notice: /Stage[main]/Profile::Toolforge::Grid::Master/Sonofgridengine::Collectors::Hostgroups[@general]/Sonofgridengine::Collector[@general]/Exec[collect-@general-resource]/returns: executed successfully
Notice: /Stage[main]/Profile::Toolforge::Grid::Master/Sonofgridengine::Collectors::Queues[webgrid-lighttpd]/Sonofgridengine::Collector[webgrid-lighttpd]/Exec[collect-webgrid-lighttpd-resource]/returns: executed successfully
Notice: /Stage[main]/Profile::Toolforge::Grid::Master/Sonofgridengine::Collectors::Queues[webgrid-generic]/Sonofgridengine::Collector[webgrid-generic]/Exec[collect-webgrid-generic-resource]/returns: executed successfully
Notice: Applied catalog in 9.22 seconds
$ qstat
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-sgegrid-master.tools.eqiad.wmflabs": got send error

Ran sudo service gridengine-master start and things seemed to get much better.
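
For what it's worth, before the gridengine-master service was started, the "Connection refused" errors show that nothing was listening on the qmaster port. A quick, purely illustrative probe of the port named in the error (6444 on tools-sgegrid-master.tools.eqiad.wmflabs):

import socket

def qmaster_reachable(host="tools-sgegrid-master.tools.eqiad.wmflabs",
                      port=6444, timeout=5):
    # True if a plain TCP connection to the qmaster port succeeds.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(qmaster_reachable())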

bd808 lowered the priority of this task from Unbreak Now! to High. Mar 11 2019, 4:01 PM

Lowering from UBN! to high as the base problem seems to have been resolved.

Mar 11 14:53:01 tools-sgegrid-master kernel: [1224330.183189] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532427] sge_qmaster     D    0  2938      1 0x00000000
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532434]  ffff88eeb5504580 0000000000000000 ffff88eeb63ea200 ffff88eebfd98980
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532438]  ffff88eeb637a380 ffff9848c386bb90 ffffffff820144b9 ffffffff81aac9e2
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532441]  00000000000bd815 ffff88eebfd98980 ffffffff82019364 ffff88eeb63ea200
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532445] Call Trace:
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532457]  [<ffffffff820144b9>] ? __schedule+0x239/0x6f0
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532464]  [<ffffffff81aac9e2>] ? update_cfs_rq_load_avg+0x212/0x490
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532469]  [<ffffffff82019364>] ? __switch_to_asm+0x34/0x70
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532472]  [<ffffffff82015170>] ? bit_wait+0x50/0x50
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532476]  [<ffffffff820149a2>] ? schedule+0x32/0x80
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532478]  [<ffffffff82017d4d>] ? schedule_timeout+0x1dd/0x380
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532481]  [<ffffffff82019364>] ? __switch_to_asm+0x34/0x70
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532483]  [<ffffffff82019370>] ? __switch_to_asm+0x40/0x70
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532485]  [<ffffffff82019364>] ? __switch_to_asm+0x34/0x70
Mar 11 14:53:58 tools-sgegrid-master kernel: [1224387.532487]  [<ffffffff82019370>] ? __switch_to_asm+0x40/0x70

Looks like it lost the NFS connection somehow. I wonder whether that was widespread or a local network issue?
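
Side note: the sge_qmaster task in the trace above is stuck in uninterruptible sleep (state D) waiting on NFS, so any responsiveness check has to do the filesystem call in a separate process and give up after a timeout rather than wait for it. A rough sketch (my illustration, using the /data/project mount from this task):

import subprocess
import time

def nfs_mount_responsive(path="/data/project", timeout=10):
    # Start `stat -f` on the mount in a child and poll for completion; a child
    # stuck on a hung hard NFS mount sits in state D and never finishes.
    child = subprocess.Popen(["stat", "-f", path],
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
    deadline = time.time() + timeout
    while time.time() < deadline:
        if child.poll() is not None:
            return child.returncode == 0
        time.sleep(0.5)
    # Deliberately not waiting on the child: it cannot be killed while in state D.
    return False

print(nfs_mount_responsive())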

On labstore1004:

Mar 11 14:14:36 labstore1004 kernel: [16755325.654641] EXT4-fs warning: 20 callbacks suppressed
Mar 11 14:14:36 labstore1004 kernel: [16755325.654646] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 11 14:14:36 labstore1004 kernel: [16755325.658822] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 11 14:14:36 labstore1004 kernel: [16755325.658876] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 11 14:14:36 labstore1004 kernel: [16755325.658890] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 11 14:14:36 labstore1004 kernel: [16755325.658902] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 11 14:14:36 labstore1004 kernel: [16755325.672695] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 11 14:14:36 labstore1004 kernel: [16755325.672716] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 11 14:14:36 labstore1004 kernel: [16755325.672730] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 11 14:14:36 labstore1004 kernel: [16755325.672743] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!
Mar 11 14:14:36 labstore1004 kernel: [16755325.672755] EXT4-fs warning (device drbd4): ext4_dx_add_entry:2234: inode #236585023: comm nfsd: Directory index full!

It broke!

bd808 renamed this task from "jstop on stretch times out" to "Debian Stretch job grid master failure". Mar 11 2019, 4:09 PM
aborrero renamed this task from "Debian Stretch job grid master failure" to "NFS issue affecting Toolforge SGE master". Mar 11 2019, 4:20 PM

For record-keeping purposes: through this task and T216988 we have noticed that the Stretch grid is especially sensitive to NFS issues, whereas the Trusty grid is more prone to a brief hang that goes almost unnoticed and then recovers. The jump in kernel versions, which changes both the interpretation of NFS mount options and their defaults, is likely to blame, but nothing has been implemented yet that reduces this sensitivity.
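
One way to compare what each kernel actually applies is to read the effective options straight from /proc/mounts on a Trusty host and a Stretch host (e.g. hard/soft, timeo, retrans, vers). A small illustrative sketch, not something deployed anywhere:

def nfs_mount_options(proc_mounts="/proc/mounts"):
    # Map each NFS mountpoint to the option list the kernel is actually using.
    mounts = {}
    with open(proc_mounts) as f:
        for line in f:
            device, mountpoint, fstype, options = line.split()[:4]
            if fstype.startswith("nfs"):
                mounts[mountpoint] = options.split(",")
    return mounts

for mountpoint, options in sorted(nfs_mount_options().items()):
    print(mountpoint, ",".join(options))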

Also for record-keeping purposes, it is possible that the NFS issues are related to T217280. They could also be related to what I took care of here: T218486#5030023

This has not recurred in the last 2 months, so it was probably a one-off NFS issue.