QuickCategories background runner sometimes hangs for no apparent reason
Closed, Resolved · Public · BUG REPORT

Description

The background runner of the QuickCategories tool sometimes gets stuck making no edits even though there is work to be done. A restart seems to fix it, but today I wanted to look a bit more into it, and it looks like it’s some kind of network issue, though I don’t fully understand it yet.

It runs as the background-runner deployment in the quickcategories tool (manually created k8s deployment, not migrated to toolforge-jobs yet). Currently, it runs as the pod background-runner-7854584c48-lthpd, on the hostIP 172.16.4.200. So I SSHed into that system as root to have a closer look.

The background runner has the PID 793594:

root@tools-k8s-worker-nfs-63:~# ps -ef | grep [q]uickcategories
tools.q+  793594 2643199  0 Sep03 ?        00:00:00 python3 /srv/quickcategories/background_runner.py

It’s stuck reading from some network connection:

root@tools-k8s-worker-nfs-63:~# timeout 1m strace -p793594 -yy
strace: Process 793594 attached
read(4<TCP:[156922863]>, strace: Process 793594 detached
 <detached ...>

The other end of the connection appears to be text-lb.eqiad.wikimedia.org, port 443 (HTTPS):

root@tools-k8s-worker-nfs-63:~# cat /proc/793594/net/tcp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode                                                     
   0: 207DA8C0:EBEE E09A50D0:01BB 01 00000000:00000000 00:00000000 00000000 53976        0 156922863 1 ffff88c9a8484800 20 4 27 10 -1                
$ for hex in D0 50 9A E0 01BB; do bc <<< "ibase=16; $hex"; done # the rem_address is in unintuitive-endian hexadecimal
208
80
154
224
443
root@tools-k8s-worker-nfs-63:~# host 208.80.154.224
224.154.80.208.in-addr.arpa domain name pointer text-lb.eqiad.wikimedia.org.
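
For reference, the same decoding can be done in Python; this is just an illustrative snippet, not part of the tool:

# Illustrative decoder for the rem_address field of /proc/<pid>/net/tcp.
# The kernel prints the IPv4 address as a little-endian 32-bit hex number,
# so "E09A50D0:01BB" corresponds to 208.80.154.224, port 443.
import socket
import struct

def decode_rem_address(field):
    addr_hex, port_hex = field.split(':')
    # Repack the little-endian integer as big-endian bytes for inet_ntoa.
    addr = socket.inet_ntoa(struct.pack('<I', int(addr_hex, 16)))
    return addr, int(port_hex, 16)

print(decode_rem_address('E09A50D0:01BB'))  # ('208.80.154.224', 443)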

Connecting to text-lb makes sense (presumably this is an API call to some wiki, hard to tell which), but I have no idea why the connection is stuck. Does anyone know?

Event Timeline

I’m leaving the background runner in its current state for a bit in case someone else wants to take a look, but at some point I’ll restart it again to get the edits unstuck.

If we can’t figure out the underlying issue, I suppose I could:

  • convert the background runner to toolforge-jobs
  • make the code touch some tmp file each time an edit was made or there was no work to be done
  • add a health check that checks whether the file was touched within the last 5 minutes

That would (presumably) restart the background runner if it gets stuck; it’s not an ideal solution, but a rough sketch of what I mean is below.
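
Something along these lines, perhaps (the file path and function names are made up, not the actual QuickCategories code; a Kubernetes exec liveness probe would then run the check and restart the pod when it exits nonzero):

# Sketch of the touch-file health check; path and names are hypothetical.
import pathlib
import sys
import time

HEALTH_FILE = pathlib.Path('/tmp/background-runner-alive')  # hypothetical path
MAX_AGE = 5 * 60  # seconds

def record_progress():
    # Call after each edit, or after determining there is no work to do.
    HEALTH_FILE.touch()

def check_health():
    # Exit nonzero if the runner has not made progress recently,
    # so an exec liveness probe can restart the pod.
    try:
        age = time.time() - HEALTH_FILE.stat().st_mtime
    except FileNotFoundError:
        sys.exit(1)
    sys.exit(0 if age < MAX_AGE else 1)

if __name__ == '__main__':
    check_health()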

That read() seems to be happily blocking forever, by the way:

root@tools-k8s-worker-nfs-63:~# timeout 60m strace -p793594 -yy
strace: Process 793594 attached
read(4<TCP:[156922863]>, strace: Process 793594 detached
 <detached ...>
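
For illustration, setting an explicit timeout on the HTTP requests would at least turn a stuck connection into an exception instead of an indefinitely blocking read(). This sketch assumes the requests library, which may not match the tool’s actual HTTP stack:

# Sketch only: assumes requests is used for the API calls (the tool's
# real HTTP stack may differ). Without a timeout, a dead connection can
# block in read() forever; with one, it raises after 60 seconds.
import requests

response = requests.get(
    'https://commons.wikimedia.org/w/api.php',  # example wiki, not necessarily the stuck one
    params={'action': 'query', 'meta': 'siteinfo', 'format': 'json'},
    timeout=60,  # seconds, applied to connect and to each read separately
)
response.raise_for_status()
print(response.json()['query']['general']['sitename'])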

Mentioned in SAL (#wikimedia-cloud) [2024-09-14T13:44:48Z] <wmbot~lucaswerkmeister@tools-bastion-13> deployed 6d6aed57e0 (update dependencies); includes a restart of the background runner which probably “fixes” T374152 for now

Something’s definitely running again:

tools.quickcategories@tools-bastion-13:~/www/python/src$ kubectl logs background-runner-85d54d86b6-rqw6h --tail=10
Running command 1386441 of batch #7680... CommandEdit
Running command 1392724 of batch #7722... CommandEdit
Running command 1387741 of batch #7682... CommandNoop
Running command 1386442 of batch #7680... CommandEdit
Running command 1387742 of batch #7682... CommandEdit
Running command 1392725 of batch #7722... CommandEdit
Running command 1386443 of batch #7680... CommandEdit
Running command 1387743 of batch #7682... CommandEdit
Running command 1392726 of batch #7722... CommandEdit
Running command 1386444 of batch #7680... CommandEdit

I’ll still see if I can get the health check working.

Mentioned in SAL (#wikimedia-cloud) [2024-09-18T20:02:06Z] <wmbot~lucaswerkmeister@tools-bastion-13> deployed b9a658f45e (health check for background runner, T374152)

Alright, the background runner should now have a health check and hopefully restart itself if it gets stuck.

LucasWerkmeister claimed this task.

Looks like it already restarted itself once:

tools.quickcategories@tools-bastion-13:~$ kubectl get pods
NAME                                READY   STATUS    RESTARTS      AGE
background-runner-6d4b45f48-mfgkl   1/1     Running   1 (26h ago)   3d21h
quickcategories-76785cb97b-t7z6r    1/1     Running   0             3d21h

I can’t tell why (kubectl get events is empty; the events have already expired), but I think that’s good enough to close this task.

This happened again today, so I don’t know whether the restart system is working or whether it just takes a while to kick in.

That turned out to be a different problem (the “bad content format” pages at the bottom of https://quickcategories.toolforge.org/batch/7877/ used to make the whole background runner crash).