AnomieBOT jobs 1002344 and 1228705, both reported as running on tools-sgeexec-0906, do not actually seem to be running. I was unable to ssh to tools-sgeexec-0906 to check (ssh to tools-sgeexec-0907 worked fine), and qdel -f does not seem to have killed the jobs either.
Description
Event Timeline
[16:08] < bd808> !log tools.anomiebot Force deleted stuck jobs 1002344 and 1228705 per IRC request by anomie
Those stuck tasks are at least gone for you now, @Anomie.
The console log via Horizon shows a kernel panic:
[10572876.902848] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[10572876.906701] IP: [<ffffffffa2e60aac>] lock_get_status+0x5c/0x360
[10572876.910190] PGD 0
[10572876.911364]
[10572876.912396] Oops: 0000 [#1] SMP
[10572876.914467] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace fscache sch_ingress cls_u32 sch_htb binfmt_misc qxl ttm drm_kms_helper crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drm virtio_balloon evdev joydev pcspkr serio_raw button act_mirred ifb sunrpc ip_tables x_tables autofs4 ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache hid_generic usbhid hid ata_generic dm_mod virtio_net virtio_blk crc32c_intel ata_piix uhci_hcd libata ehci_hcd aesni_intel scsi_mod aes_x86_64 glue_helper psmouse lrw gf128mul ablk_helper cryptd usbcore virtio_pci virtio_ring virtio usb_common i2c_piix4
[10572876.944422] CPU: 2 PID: 27123 Comm: lsof Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[10572876.947883] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.12.0-1 04/01/2014
[10572876.951635] task: ffff965dc61b5080 task.stack: ffffb2d512758000
[10572876.954125] RIP: 0010:[<ffffffffa2e60aac>] [<ffffffffa2e60aac>] lock_get_status+0x5c/0x360
[10572876.959983] RSP: 0018:ffffb2d51275bde8 EFLAGS: 00010286
[10572876.964095] RAX: 0000000000000000 RBX: ffff965ef548cea0 RCX: ffffffffa3618e35
[10572876.967359] RDX: 0000000000000000 RSI: ffffffffa3845d20 RDI: ffff965ec6f84980
[10572876.970136] RBP: ffff965ea931a900 R08: ffffffffffffffff R09: 000000000000000f
[10572876.972999] R10: 0000000000000000 R11: 000000019d8bd52c R12: 00000000000039a4
[10572876.975785] R13: ffff965ec6e2bea0 R14: 0000000000000001 R15: ffffffffa3618e35
[10572876.978634] FS: 00007fd59234ff40(0000) GS:ffff965effd00000(0000) knlGS:0000000000000000
[10572876.981920] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10572876.984190] CR2: 0000000000000030 CR3: 0000000147838000 CR4: 0000000000140670
[10572876.986993] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10572876.989760] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[10572876.992528] Stack:
[10572876.993406] ffff965ef548ceb8 ffff965ea931a900 ffff965ef548cea0 ffff965ec6e2bea0
[10572876.996423] ffffffffa3610282 ffff965ea931a900 ffffffffa2e60ec0 0000000000000000
[10572876.999869] ffffb2d51275bf08 ffff965ef1d1e300 0000000000001000 ffff965ef548ceb8
[10572877.003819] Call Trace:
[10572877.005031] [<ffffffffa2e60ec0>] ? locks_show+0x60/0xa0
[10572877.007683] [<ffffffffa2e31636>] ? seq_read+0x106/0x400
[10572877.010359] [<ffffffffa2e7ac70>] ? proc_reg_read+0x40/0x70
[10572877.013240] [<ffffffffa2e0b261>] ? vfs_read+0x91/0x130
[10572877.015493] [<ffffffffa2e0c732>] ? SyS_read+0x52/0xc0
[10572877.017790] [<ffffffffa2c03b7d>] ? do_syscall_64+0x8d/0xf0
[10572877.020200] [<ffffffffa321924e>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
[10572877.023268] Code: 8b 40 28 48 8b b0 60 04 00 00 e8 80 74 e3 ff 85 c0 41 89 c4 0f 84 bd 01 00 00 48 8b 43 70 48 85 c0 0f 84 cc 01 00 00 48 8b 40 18 <4c> 8b 68 30 4c 89 f9 4c 89 f2 48 c7 c6 cc 01 61 a3 48 89 ef e8
[10572877.035233] RIP [<ffffffffa2e60aac>] lock_get_status+0x5c/0x360
[10572877.037842] RSP <ffffb2d51275bde8>
[10572877.039283] CR2: 0000000000000030
[10572877.043461] ---[ end trace 37bb73a8ba11ad30 ]---
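For triage, the key facts in a console log like this one (faulting function, process, call chain) can be pulled out with a few regular expressions. The following is only a rough sketch: `summarize_oops` and `OOPS_SAMPLE` are hypothetical names, the sample is a handful of lines copied from the log above, and the patterns assume the 4.9-era oops layout seen here rather than any stable kernel format.

```python
import re

# A few lines copied from the console log in this task (not the full log).
OOPS_SAMPLE = """\
[10572876.902848] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[10572876.906701] IP: [<ffffffffa2e60aac>] lock_get_status+0x5c/0x360
[10572876.944422] CPU: 2 PID: 27123 Comm: lsof Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[10572877.003819] Call Trace:
[10572877.005031] [<ffffffffa2e60ec0>] ? locks_show+0x60/0xa0
[10572877.007683] [<ffffffffa2e31636>] ? seq_read+0x106/0x400
[10572877.010359] [<ffffffffa2e7ac70>] ? proc_reg_read+0x40/0x70
"""


def summarize_oops(log):
    """Extract the faulting symbol, the process, and the call-trace symbols.

    Hypothetical helper: the regexes are fitted to the log format shown in
    this task and may need adjusting for other kernel versions.
    """
    # "IP: [<addr>] symbol+0xOFF/0xLEN" names the faulting function.
    ip = re.search(r"\bIP: \[<[0-9a-f]+>\] ([\w.]+)\+0x", log)
    # "PID: N Comm: name" identifies the process that triggered the oops.
    proc = re.search(r"PID: (\d+) Comm: (\S+)", log)
    # Call-trace frames look like "[<addr>] ? symbol+0xOFF/0xLEN".
    frames = re.findall(r"\] \? ([\w.]+)\+0x", log)
    return {
        "function": ip.group(1) if ip else None,
        "pid": int(proc.group(1)) if proc else None,
        "command": proc.group(2) if proc else None,
        "call_trace": frames,
    }


if __name__ == "__main__":
    print(summarize_oops(OOPS_SAMPLE))
```

On this log the summary points at `lock_get_status`, reached from `locks_show` through `seq_read` and `proc_reg_read` while `lsof` was running, which suggests the panic was hit during a procfs read rather than by one of the grid jobs directly.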
Tried to get a job list for posterity:
bstorm@tools-sgebastion-08:~$ qhost -j -h tools-sgeexec-0906.tools.eqiad.wmflabs
HOSTNAME                               ARCH      NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                                 -            -    -    -    -     -       -       -       -       -
tools-sgeexec-0906.tools.eqiad.wmflabs lx-amd64     4    4    4    4 4.82K    7.8G    4.6G   23.9G     0.0
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
     66432 0.33540 signature- tools.signat Rr    10/14/2020 19:46:12 continuous MASTER
     85368 0.29826 musescore- tools.archiv Rr    11/28/2020 23:19:38 continuous MASTER
   1138789 0.33546 ptwikisbot tools.ptwiki Rr    10/29/2020 17:31:30 continuous MASTER
   4302517 0.28218 iabotwatch tools.botwik dr    12/18/2020 12:49:56 continuous MASTER
   1045324 0.25978 serobot.bo tools.serobo r     01/14/2021 18:41:54 continuous MASTER
   1221698 0.25725 patrolAfte tools.urbane dr    01/17/2021 20:45:09 continuous MASTER
     49590 0.35739 java       tools.sammou r     09/18/2020 01:15:19 task@tools MASTER
    728405 0.26436 cron-tools tools.lp-too r     01/09/2021 05:00:21 task@tools MASTER
   1228437 0.25715 n500v5ts   tools.jarbot r     01/17/2021 23:35:24 task@tools MASTER
   1230056 0.25713 shaher     tools.khanam r     01/18/2021 00:10:24 task@tools MASTER
   1230650 0.25712 zopsAnnoun tools.urbane r     01/18/2021 00:25:09 task@tools MASTER
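To keep a record of which tools were affected, the job rows from `qhost -j` output like the above could be parsed into structured records. This is a minimal sketch, not part of any grid tooling: `Job`, `parse_jobs`, `pending_deletion` and `QHOST_SAMPLE` are hypothetical names, the sample rows are copied from the listing above, and the code assumes one job per line with the column order shown there. It also assumes that a `d` in the state column marks a job with deletion pending.

```python
import re
from typing import List, NamedTuple


class Job(NamedTuple):
    job_id: int
    name: str
    user: str
    state: str
    started: str
    queue: str


# A few rows copied from the qhost listing in this task.
QHOST_SAMPLE = """\
tools-sgeexec-0906.tools.eqiad.wmflabs lx-amd64 4 4 4 4 4.82K 7.8G 4.6G 23.9G 0.0
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
     66432 0.33540 signature- tools.signat Rr    10/14/2020 19:46:12 continuous MASTER
   4302517 0.28218 iabotwatch tools.botwik dr    12/18/2020 12:49:56 continuous MASTER
   1221698 0.25725 patrolAfte tools.urbane dr    01/17/2021 20:45:09 continuous MASTER
     49590 0.35739 java       tools.sammou r     09/18/2020 01:15:19 task@tools MASTER
"""

# job-ID, priority, name, user, state, "MM/DD/YYYY HH:MM:SS", queue
JOB_ROW = re.compile(
    r"^\s*(\d+)\s+\d+\.\d+\s+(\S+)\s+(\S+)\s+(\S+)\s+"
    r"(\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}:\d{2})\s+(\S+)"
)


def parse_jobs(text: str) -> List[Job]:
    """Return one Job per row that looks like a job line; skip everything else."""
    jobs = []
    for line in text.splitlines():
        m = JOB_ROW.match(line)
        if m:
            jobs.append(Job(int(m[1]), m[2], m[3], m[4], m[5], m[6]))
    return jobs


def pending_deletion(jobs: List[Job]) -> List[Job]:
    """Jobs whose state contains 'd' (assumed here to mean deletion pending)."""
    return [job for job in jobs if "d" in job.state]


if __name__ == "__main__":
    for job in pending_deletion(parse_jobs(QHOST_SAMPLE)):
        print(job.job_id, job.user, job.state)
```

Header, separator, and host rows are skipped because they do not start with a numeric job ID followed by a decimal priority.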
Mentioned in SAL (#wikimedia-cloud) [2021-01-26T16:27:26Z] <bd808> Hard reboot of tools-sgeexec-0906 via Horizon for T272978
After the reboot, the grid scheduler seems to have realized that the node is empty:
$ qhost -j -h tools-sgeexec-0906.tools.eqiad.wmflabs
HOSTNAME                               ARCH      NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                                 -            -    -    -    -     -       -       -       -       -
tools-sgeexec-0906.tools.eqiad.wmflabs lx-amd64     4    4    4    4  0.51    7.8G  328.3M   23.9G     0.0