tools-sgeexec-0906 seems down but still holds jobs
Closed, Resolved (Public)

Description

AnomieBOT jobs 1002344 and 1228705 both claim to be running on tools-sgeexec-0906, but neither seems to actually be running. I was unable to ssh to tools-sgeexec-0906 to check (ssh to tools-sgeexec-0907 worked fine). qdel -f does not seem to have killed the jobs either.

Event Timeline

[16:08]  <    bd808> !log tools.anomiebot Force deleted stuck jobs 1002344 and 1228705 per IRC request by anomie 

Those stuck tasks are at least gone for you now, @Anomie.

Ping works from inside the tools project for tools-sgeexec-0906, but ssh hangs.

The console log via Horizon shows a kernel panic:

[10572876.902848] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[10572876.906701] IP: [<ffffffffa2e60aac>] lock_get_status+0x5c/0x360
[10572876.910190] PGD 0 [10572876.911364] 
[10572876.912396] Oops: 0000 [#1] SMP
[10572876.914467] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace fscache sch_ingress cls_u32 sch_htb binfmt_misc qxl ttm drm_kms_helper crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drm virtio_balloon evdev joydev pcspkr serio_raw button act_mirred ifb sunrpc ip_tables x_tables autofs4 ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache hid_generic usbhid hid ata_generic dm_mod virtio_net virtio_blk crc32c_intel ata_piix uhci_hcd libata ehci_hcd aesni_intel scsi_mod aes_x86_64 glue_helper psmouse lrw gf128mul ablk_helper cryptd usbcore virtio_pci virtio_ring virtio usb_common i2c_piix4
[10572876.944422] CPU: 2 PID: 27123 Comm: lsof Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[10572876.947883] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.12.0-1 04/01/2014
[10572876.951635] task: ffff965dc61b5080 task.stack: ffffb2d512758000
[10572876.954125] RIP: 0010:[<ffffffffa2e60aac>]  [<ffffffffa2e60aac>] lock_get_status+0x5c/0x360
[10572876.959983] RSP: 0018:ffffb2d51275bde8  EFLAGS: 00010286
[10572876.964095] RAX: 0000000000000000 RBX: ffff965ef548cea0 RCX: ffffffffa3618e35
[10572876.967359] RDX: 0000000000000000 RSI: ffffffffa3845d20 RDI: ffff965ec6f84980
[10572876.970136] RBP: ffff965ea931a900 R08: ffffffffffffffff R09: 000000000000000f
[10572876.972999] R10: 0000000000000000 R11: 000000019d8bd52c R12: 00000000000039a4
[10572876.975785] R13: ffff965ec6e2bea0 R14: 0000000000000001 R15: ffffffffa3618e35
[10572876.978634] FS:  00007fd59234ff40(0000) GS:ffff965effd00000(0000) knlGS:0000000000000000
[10572876.981920] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10572876.984190] CR2: 0000000000000030 CR3: 0000000147838000 CR4: 0000000000140670
[10572876.986993] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10572876.989760] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[10572876.992528] Stack:
[10572876.993406]  ffff965ef548ceb8 ffff965ea931a900 ffff965ef548cea0 ffff965ec6e2bea0
[10572876.996423]  ffffffffa3610282 ffff965ea931a900 ffffffffa2e60ec0 0000000000000000
[10572876.999869]  ffffb2d51275bf08 ffff965ef1d1e300 0000000000001000 ffff965ef548ceb8
[10572877.003819] Call Trace:
[10572877.005031]  [<ffffffffa2e60ec0>] ? locks_show+0x60/0xa0
[10572877.007683]  [<ffffffffa2e31636>] ? seq_read+0x106/0x400
[10572877.010359]  [<ffffffffa2e7ac70>] ? proc_reg_read+0x40/0x70
[10572877.013240]  [<ffffffffa2e0b261>] ? vfs_read+0x91/0x130
[10572877.015493]  [<ffffffffa2e0c732>] ? SyS_read+0x52/0xc0
[10572877.017790]  [<ffffffffa2c03b7d>] ? do_syscall_64+0x8d/0xf0
[10572877.020200]  [<ffffffffa321924e>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
[10572877.023268] Code: 8b 40 28 48 8b b0 60 04 00 00 e8 80 74 e3 ff 85 c0 41 89 c4 0f 84 bd 01 00 00 48 8b 43 70 48 85 c0 0f 84 cc 01 00 00 48 8b 40 18 <4c> 8b 68 30 4c 89 f9 4c 89 f2 48 c7 c6 cc 01 61 a3 48 89 ef e8 
[10572877.035233] RIP  [<ffffffffa2e60aac>] lock_get_status+0x5c/0x360
[10572877.037842]  RSP <ffffb2d51275bde8>
[10572877.039283] CR2: 0000000000000030
[10572877.043461] ---[ end trace 37bb73a8ba11ad30 ]---
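For triage, the key fields of a trace like the one above can be pulled out mechanically. The following is a minimal Python sketch, assuming the console log has the same line layout as shown here. The function name `summarize_oops` and the returned keys are hypothetical, not part of any existing tool. For this trace it reports `lsof` (PID 27123) faulting in `lock_get_status`, reached through `locks_show` and `seq_read`, which is consistent with a read of /proc/locks.

```python
import re


def summarize_oops(log: str) -> dict:
    """Extract a few triage fields from a kernel oops console log.

    Hypothetical helper: it only understands the line layout seen in
    this ticket and returns None for fields it cannot find.
    """
    # "BUG: ..." states what kind of fault the kernel hit.
    bug = re.search(r"BUG: (.+)", log)
    # "IP: [<addr>] symbol+0xoff/0xlen" names the faulting function.
    ip = re.search(r"IP: \[<[0-9a-f]+>\]\s+(\w+)\+0x", log)
    # "PID: n Comm: name" identifies the process that triggered the fault.
    proc = re.search(r"PID: (\d+) Comm: (\S+)", log)
    # Call trace entries look like "[<addr>] ? symbol+0xoff/0xlen".
    trace = re.findall(r"\[<[0-9a-f]+>\] \? (\w+)\+0x", log)
    return {
        "bug": bug.group(1).strip() if bug else None,
        "function": ip.group(1) if ip else None,
        "pid": int(proc.group(1)) if proc else None,
        "comm": proc.group(2) if proc else None,
        "call_trace": trace,
    }


if __name__ == "__main__":
    # A few lines copied from the console log in this ticket.
    excerpt = "\n".join([
        "[10572876.902848] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030",
        "[10572876.906701] IP: [<ffffffffa2e60aac>] lock_get_status+0x5c/0x360",
        "[10572876.944422] CPU: 2 PID: 27123 Comm: lsof Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3",
        "[10572877.005031]  [<ffffffffa2e60ec0>] ? locks_show+0x60/0xa0",
        "[10572877.007683]  [<ffffffffa2e31636>] ? seq_read+0x106/0x400",
    ])
    for key, value in summarize_oops(excerpt).items():
        print(f"{key}: {value}")
```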

Tried to get a job list for posterity:

bstorm@tools-sgebastion-08:~$ qhost -j -h tools-sgeexec-0906.tools.eqiad.wmflabs
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
tools-sgeexec-0906.tools.eqiad.wmflabs lx-amd64        4    4    4    4 4.82K    7.8G    4.6G   23.9G     0.0
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
     66432 0.33540 signature- tools.signat Rr    10/14/2020 19:46:12 continuous MASTER
     85368 0.29826 musescore- tools.archiv Rr    11/28/2020 23:19:38 continuous MASTER
   1138789 0.33546 ptwikisbot tools.ptwiki Rr    10/29/2020 17:31:30 continuous MASTER
   4302517 0.28218 iabotwatch tools.botwik dr    12/18/2020 12:49:56 continuous MASTER
   1045324 0.25978 serobot.bo tools.serobo r     01/14/2021 18:41:54 continuous MASTER
   1221698 0.25725 patrolAfte tools.urbane dr    01/17/2021 20:45:09 continuous MASTER
     49590 0.35739 java       tools.sammou r     09/18/2020 01:15:19 task@tools MASTER
    728405 0.26436 cron-tools tools.lp-too r     01/09/2021 05:00:21 task@tools MASTER
   1228437 0.25715 n500v5ts   tools.jarbot r     01/17/2021 23:35:24 task@tools MASTER
   1230056 0.25713 shaher     tools.khanam r     01/18/2021 00:10:24 task@tools MASTER
   1230650 0.25712 zopsAnnoun tools.urbane r     01/18/2021 00:25:09 task@tools MASTER
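If the job list needs to be kept in a more usable form (for example to notify tool maintainers after the reboot), the `qhost -j` text above can be parsed into records. This is a minimal Python sketch, assuming the whitespace-separated column layout shown in this ticket. The names `Job` and `parse_qhost_jobs` are hypothetical. Note that qhost truncates the name, user, and queue columns, so those values are prefixes only.

```python
from typing import List, NamedTuple


class Job(NamedTuple):
    job_id: int
    name: str   # truncated by qhost, e.g. "signature-"
    user: str   # truncated by qhost, e.g. "tools.signat"
    state: str  # e.g. "r", "Rr", "dr"
    queue: str  # truncated by qhost, e.g. "task@tools"


def parse_qhost_jobs(output: str) -> List[Job]:
    """Return the job rows found in `qhost -j` text output."""
    jobs = []
    for line in output.splitlines():
        fields = line.split()
        # Job rows start with a numeric job-ID and have nine columns:
        # id, prior, name, user, state, date, time, queue, master.
        # Header, separator, and host rows do not start with a number.
        if len(fields) >= 9 and fields[0].isdigit():
            jobs.append(
                Job(int(fields[0]), fields[2], fields[3], fields[4], fields[7])
            )
    return jobs


if __name__ == "__main__":
    # Three rows copied from the listing in this ticket.
    listing = """\
     66432 0.33540 signature- tools.signat Rr    10/14/2020 19:46:12 continuous MASTER
   4302517 0.28218 iabotwatch tools.botwik dr    12/18/2020 12:49:56 continuous MASTER
   1230650 0.25712 zopsAnnoun tools.urbane r     01/18/2021 00:25:09 task@tools MASTER
"""
    for job in parse_qhost_jobs(listing):
        print(job.job_id, job.user, job.state, job.queue)
```

Filtering on a `d` in the state column picks out jobs that were already marked for deletion but could not be cleaned up while the host was unresponsive (4302517 and 1221698 in the listing above).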

Mentioned in SAL (#wikimedia-cloud) [2021-01-26T16:27:26Z] <bd808> Hard reboot of tools-sgeexec-0906 via Horizon for T272978

Post reboot the grid scheduler seems to have realized that the node is empty:

$ qhost -j -h tools-sgeexec-0906.tools.eqiad.wmflabs
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
tools-sgeexec-0906.tools.eqiad.wmflabs lx-amd64        4    4    4    4  0.51    7.8G  328.3M   23.9G     0.0

Bstorm claimed this task.