The background runner of the QuickCategories tool sometimes gets stuck and makes no edits even though there is work to be done. A restart seems to fix it, but today I looked into it a bit more: it appears to be some kind of network issue, though I don’t fully understand it yet.
It runs as the background-runner deployment in the quickcategories tool (manually created k8s deployment, not migrated to toolforge-jobs yet). Currently, it runs as the pod background-runner-7854584c48-lthpd, on the hostIP 172.16.4.200. So I SSHed into that system as root to have a closer look.
The background runner has the PID 793594:
root@tools-k8s-worker-nfs-63:~# ps -ef | grep [q]uickcategories
tools.q+  793594 2643199  0 Sep03 ?        00:00:00 python3 /srv/quickcategories/background_runner.py
It’s stuck reading from some network connection:
root@tools-k8s-worker-nfs-63:~# timeout 1m strace -p793594 -yy
strace: Process 793594 attached
read(4<TCP:[156922863]>, strace: Process 793594 detached
 <detached ...>
The other end of the connection appears to be text-lb.eqiad.wikimedia.org, port 443 (HTTPS):
root@tools-k8s-worker-nfs-63:~# cat /proc/793594/net/tcp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 207DA8C0:EBEE E09A50D0:01BB 01 00000000:00000000 00:00000000 00000000 53976        0 156922863 1 ffff88c9a8484800 20 4 27 10 -1
$ for hex in D0 50 9A E0 01BB; do bc <<< "ibase=16; $hex"; done # the rem_address is in unintuitive-endian hexadecimal
208
80
154
224
443
root@tools-k8s-worker-nfs-63:~# host 208.80.154.224
224.154.80.208.in-addr.arpa domain name pointer text-lb.eqiad.wikimedia.org.
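For anyone repeating this, the manual bc conversion can also be done in a few lines of Python. This is only a minimal sketch: the function name decode_proc_net_tcp_addr is made up for illustration, it handles IPv4 entries only (/proc/<pid>/net/tcp, not tcp6), and it assumes a little-endian host such as x86-64, where the address bytes appear reversed while the port is plain hexadecimal.

```python
import socket
import struct


def decode_proc_net_tcp_addr(field: str) -> tuple[str, int]:
    """Decode an IPv4 'ADDR:PORT' field from /proc/<pid>/net/tcp.

    Hypothetical helper for illustration; assumes a little-endian host.
    """
    addr_hex, port_hex = field.split(":")
    # The address is printed as one 32-bit hex number; repacking it as
    # little-endian restores the bytes to network (dotted-quad) order.
    packed = struct.pack("<I", int(addr_hex, 16))
    # The port is an ordinary hexadecimal number, no byte swapping needed.
    return socket.inet_ntoa(packed), int(port_hex, 16)


if __name__ == "__main__":
    # rem_address and local_address from the output above
    print(decode_proc_net_tcp_addr("E09A50D0:01BB"))  # ('208.80.154.224', 443)
    print(decode_proc_net_tcp_addr("207DA8C0:EBEE"))  # ('192.168.125.32', 60398)
```

The st column value 01 in the same row corresponds to an established connection, which is consistent with the process sitting in read() rather than in a connect attempt.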
Connecting to text-lb makes sense (presumably this is an API call to some wiki, hard to tell which), but I have no idea why the connection is stuck. Does anyone know?