
Codesearch is regularly partially or fully unavailable
Closed, Resolved · Public

Description

Some indexes are down right now:

image.png (577×233 px, 44 KB)

This has been happening about once a day for at least the past week (e.g. T417147: Codesearch is down for Pywikibot (2026-02-11)).

There have also been multiple full outages recently (not all of them have a task, since the service usually recovers after about an hour):
T416614: Codesearch down/unreachable (2026-02-05)
T416488: Codesearch down/unreachable (2026-02-04)

Event Timeline

krinkle@codesearch9:~$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
…
codesea+    2156       1  0 Jan16 ?        00:12:31 /usr/bin/docker run -p 3003:80 --name codesearch-frontend -e CODESEARCH_HOUND_BASE=http://172.17.0.1:3002
codesea+    2300       1  0 Jan16 ?        00:11:52 /usr/bin/docker run -p 6088:6080 --name hound-milkshake --user=root -v /srv/hound/hound-milkshake:/data -v
codesea+    2362       1  0 Jan16 ?        00:13:35 /usr/bin/docker run -p 6096:6080 --name hound-shouthow --user=root -v /srv/hound/hound-shouthow:/data -v /
codesea+    2430       1  0 Jan16 ?        00:13:49 /usr/bin/docker run -p 6087:6080 --name hound-armchairgm --user=root -v /srv/hound/hound-armchairgm:/data
root        2751     630  0 Jan16 ?        00:00:05 /usr/sbin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 3003 -container-ip 172.17.0.2 -container-por
root        2759     630  0 Jan16 ?        00:00:06 /usr/sbin/docker-proxy -proto tcp -host-ip :: -host-port 3003 -container-ip 172.17.0.2 -container-port 80
root        2869     630  0 Jan16 ?        00:03:18 /usr/sbin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 6088 -container-ip 172.17.0.3 -container-por
root        2876     630  0 Jan16 ?        00:00:05 /usr/sbin/docker-proxy -proto tcp -host-ip :: -host-port 6088 -container-ip 172.17.0.3 -container-port 608
root        3250     630  0 Jan16 ?        00:03:16 /usr/sbin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 6096 -container-ip 172.17.0.9 -container-por
root        3265     630  0 Jan16 ?        00:00:05 /usr/sbin/docker-proxy -proto tcp -host-ip :: -host-port 6096 -container-ip 172.17.0.9 -container-port 608
root        3551     630  0 Jan16 ?        00:03:41 /usr/sbin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 6087 -container-ip 172.17.0.11 -container-po
root        3567     630  0 Jan16 ?        00:00:05 /usr/sbin/docker-proxy -proto tcp -host-ip :: -host-port 6087 -container-ip 172.17.0.11 -container-port 60

…

root        2961       1  0 Jan16 ?        00:10:55 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 95d10e812a526149813567e8edc5238d06e7485bec2b6482f70d7
root        2966       1  0 Jan16 ?        00:06:42 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 3c1363bac8c2479ffe8160dd444b4e75790c9526b8b14033db23f
root        3470       1  0 Jan16 ?        00:06:06 /usr/bin/containerd-shim-runc-v2 -namespace moby -id ec42001862d13bee9d68759e4de20a9994a35f5a70e2d9442f817
root        3692       1  0 Jan16 ?        00:06:26 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 31b6fe19eae6278190967a43c0328e57ece87d0f1d354a85ee962

…

root        3044    2966  0 Jan16 ?        00:10:16 ./houndd -conf /data/config.json
root        3512    3470  0 Jan16 ?        00:13:29 ./houndd -conf /data/config.json
root        3765    3692  0 Jan16 ?        00:12:59 ./houndd -conf /data/config.json

…

root        3047    2961  0 Jan16 ?        00:04:09 apache2 -DFOREGROUND

…

root        2930    3512  0 Feb11 ?        00:00:00 [git] <defunct>
root        2959    3512  0 Feb11 ?        00:00:00 [git] <defunct>
root        3936    3512  0 Jan16 ?        00:00:00 [git] <defunct>
root        4116    3512  0 Jan16 ?        00:00:00 [git] <defunct>
root        6665    3512  0 Feb06 ?        00:00:00 [git] <defunct>
root        6716    3512  0 Feb06 ?        00:00:00 [git] <defunct>
root       11723    3512  0 Feb01 ?        00:00:00 [git] <defunct>
root       11735    3512  0 Feb01 ?        00:00:00 [git] <defunct>
root       21448    3512  0 Jan24 ?        00:00:00 [git] <defunct>
root       21470    3512  0 Jan24 ?        00:00:00 [git] <defunct>
root       30196    3512  0 Jan21 ?        00:00:00 [git] <defunct>
root       30208    3512  0 Jan21 ?        00:00:00 [git] <defunct>
root       60943    3512  0 Jan19 ?        00:00:00 [git] <defunct>
root       60973    3512  0 Jan19 ?        00:00:00 [git] <defunct>
root       69893    3512  0 Jan29 ?        00:00:00 [git] <defunct>
root       69915    3512  0 Jan29 ?        00:00:00 [git] <defunct>
root       77657    3512  0 Jan27 ?        00:00:00 [git] <defunct>
root       77684    3512  0 Jan27 ?        00:00:00 [git] <defunct>
…

There are roughly 1,000 defunct git processes in total; 910 of them belong to PID 3512, which is:

root        3512    3470  0 Jan16 ?        00:13:29 ./houndd -conf /data/config.json

Unfortunately, this lacks the context to identify the backend name. Going by start time, only one Hound container was started before this process. (One could presumably confirm this by mapping the moby/docker container hash, or by inspecting the process's Linux namespaces; a sketch follows below.)

codesea+    2300       1  0 Jan16 ?        00:11:52 /usr/bin/docker run -p 6088:6080 --name hound-milkshake --user=root -v /srv/hound/hound-milkshake:/data -v
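A minimal sketch of that confirmation, assuming a typical Docker cgroup layout (PID 3512 taken from the ps output above; this wasn't actually run as part of the investigation):

# The container ID appears in the process's cgroup path; Docker can then resolve it to a name.
$ grep -oE '[0-9a-f]{64}' /proc/3512/cgroup | head -n1
$ sudo docker inspect --format '{{.Name}}' <container-id-from-previous-command>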

For now I've sudo kill -9'ed the 3 houndd instances that had zombies. I'm looking at syslog next.
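(For reference, a sketch of roughly what that amounted to, using the three houndd PIDs from the ps output above; the exact PID list wasn't recorded:)

$ sudo kill -9 3044 3512 3765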

The last errors in syslog are caused by my kills. I also found what appear to be unrelated confd errors, which I filed at T417458: confd fails with "no such host" in SRV lookup from _etcd-client-ssl._tcp.codesearch.eqiad1.wikimedia.cloud.

Prior to my kills, the following stands out:

$ sudo tail -n10000 /var/log/syslog | grep -E 'hound|git|fatal|error'

2026-02-13T20:58:30+00:00 codesearch9 dockerd[630]: time="2026-02-13T20:58:30.582761078Z" level=error msg="Handler for POST /v1.41/containers/create returned error: mkdir /var/lib/docker/overlay2/42a40e41de17ec4242d62c3ce60b1b670346c189dbe855e254c0640581069dce-init/diff: no space left on device"
2026-02-13T20:58:30+00:00 codesearch9 systemd[1]: hound-core.service: Main process exited, code=exited, status=125/n/a
2026-02-13T20:58:30+00:00 codesearch9 systemd[1]: hound-core.service: Failed with result 'exit-code'.
2026-02-13T20:58:30+00:00 codesearch9 docker[1091236]: Error response from daemon: Cannot kill container: hound-skins: No such container: hound-skins
2026-02-13T20:58:30+00:00 codesearch9 docker[1091225]: Error: No such container: hound-services
2026-02-13T20:58:30+00:00 codesearch9 dockerd[630]: time="2026-02-13T20:58:30.626858965Z" level=error msg="Handler for POST /v1.41/containers/create returned error: mkdir /var/lib/docker/overlay2/307c526a7c37f326a1206c57a139ff4c27fdc599b291cd7cd0f6cbb20d2f7148-init/diff: no space left on device"
2026-02-13T20:58:30+00:00 codesearch9 systemd[1]: hound-operations.service: Main process exited, code=exited, status=125/n/a
2026-02-13T20:58:30+00:00 codesearch9 systemd[1]: hound-operations.service: Failed with result 'exit-code'.
2026-02-13T20:58:30+00:00 codesearch9 systemd[1]: hound-core.service: Scheduled restart job, restart counter is at 5.



2026-02-13T20:58:31+00:00 codesearch9 systemd[1]: hound-bundled.service: Main process exited, code=exited, status=125/n/a
2026-02-13T20:58:31+00:00 codesearch9 systemd[1]: hound-bundled.service: Failed with result 'exit-code'.

2026-02-13T20:58:31+00:00 codesearch9 docker[1091454]: Error response from daemon: Cannot kill container: hound-extensions: No such container: hound-extensions
2026-02-13T20:58:31+00:00 codesearch9 docker[1091466]: Error response from daemon: Cannot kill container: hound-things: No such container: hound-things
2026-02-13T20:58:31+00:00 codesearch9 systemd[1]: hound-bundled.service: Scheduled restart job, restart counter is at 341.

The Hound services are in a restart loop via systemd.
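A sketch of how to confirm the restart loop from the systemd side, using one of the unit names from the syslog excerpt above:

$ systemctl status hound-bundled.service
$ journalctl -u hound-bundled.service --since today | tail -n 50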

Disk usage looks fine, on the host anyway. (I'm not sure whether the containers are restricted to a separate disk quota; a sketch for checking that follows the df output below.)

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            3.9G     0  3.9G   0% /dev
tmpfs           796M  1.0M  795M   1% /run
/dev/sda1        20G   13G  5.9G  69% /
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      124M   12M  113M  10% /boot/efi
/dev/sdb        157G   74G   76G  50% /srv
tmpfs           796M     0  796M   0% /run/user/0
tmpfs           796M     0  796M   0% /run/user/2008
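A sketch for checking whether Docker itself imposes a separate storage limit (this assumes the default overlay2 setup and wasn't part of the original investigation):

# Storage driver and data root, plus Docker's own view of image/container/volume usage:
$ sudo docker info --format '{{.Driver}} {{.DockerRootDir}}'
$ sudo docker system df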
krinkle@codesearch9:~$ sudo cat /var/log/syslog.13.gz | gunzip | fgrep "no space left on device" | wc -l
0
krinkle@codesearch9:~$ sudo cat /var/log/syslog.10.gz | gunzip | fgrep "no space left on device" | wc -l
0
krinkle@codesearch9:~$ sudo cat /var/log/syslog.{8,7,6,5,4}.gz | gunzip | fgrep "no space left on device" | wc -l
0
krinkle@codesearch9:~$ sudo cat /var/log/syslog.2.gz | gunzip | fgrep "no space left on device" | wc -l
0
krinkle@codesearch9:~$ sudo head /var/log/syslog.1
2026-02-06T02:26:52 …

krinkle@codesearch9:~$ sudo cat /var/log/syslog.1 | fgrep "no space left on device" | wc -l
0
krinkle@codesearch9:~$ sudo head /var/log/syslog  
2026-02-07T03:53:19
krinkle@codesearch9:~$ sudo cat /var/log/syslog | fgrep "no space left on device" | wc -l
16918

If this is the problem, then it's a new one that started after Feb 7 (no matches in any of the rotated syslogs checked above, nearly 17,000 in the current one).
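(For future reference, the same check over all rotated logs can be done in one pass:)

$ sudo zgrep -c "no space left on device" /var/log/syslog.*.gz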

Mentioned in SAL (#wikimedia-cloud) [2026-02-13T21:29:45Z] <Krinkle> Reboot codesearch9 instance, ref T417397

Mentioned in SAL (#wikimedia-cloud) [2026-02-13T21:39:05Z] <Krinkle> Manually pruning codesearch9:/var/log/account/pacct, ref T417397 T413739

Ah, the curse of inodes again. This time on the root drive instead of the /srv volume.
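A quick sketch for locating where the inodes went (assumes GNU du, which supports --inodes):

$ sudo du --inodes -x --max-depth=2 /var | sort -n | tail -n 10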

Grafana dashboard:

Screenshot 2026-02-13 at 21.33.39.png (1×1 px, 145 KB)

krinkle@codesearch9:/var/log$ sudo find /var/log/account/ -type f -name "pacct*" -delete

I wasn't sure whether this command could actually finish (in case files were being created faster than it could delete them), but I waited until free inodes at least plateaued. It finished after 20 minutes.
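(A sketch of how one can watch for that plateau while the deletion runs:)

$ watch -n 60 'df -i /'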

Screenshot 2026-02-13 at 22.06.19.png (1×1 px, 150 KB)

$ df -i  
Filesystem       Inodes   IUsed   IFree IUse% Mounted on
/dev/sda1       1302528  120553 1181975   10% /

/dev/sdb       10485760 5526733 4959027   53% /srv

The backends are up again at https://codesearch.wmcloud.org/_health/ (except the "Everywhere" instance, which will take a bit longer).

I'm leaving this task open because it covers the recurring issue and, unless these were one-time leftovers, I expect it to happen again. The root cause is in all likelihood T413739: "no space left on device" for codesearch9 root disk (out of inodes due to /var/log/account/pacct), but I suggest keeping this open until that one is resolved, in case Codesearch goes down again for a different reason in the meantime.

I think this recursion in the logrotate config was the culprit: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239441
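For illustration only, one way such recursion can arise in a logrotate stanza; this is a hypothetical reconstruction, not copied from the linked change:

# Hypothetical: the glob also matches logrotate's own rotated output
# (pacct.1, pacct.1.1, ...), so each run re-rotates previously rotated files
# and the file count, and therefore inode usage, grows without bound.
/var/log/account/pacct* {
    daily
    rotate 30
    compress
}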

Deployed a fix and restarted logrotate.
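A hedged sketch of verifying that kind of fix (the config path here is a guess, not taken from the change):

# Dry run showing what logrotate would now do with the corrected rule:
$ sudo logrotate --debug /etc/logrotate.d/acct
# On Debian with systemd, logrotate is a one-shot service triggered by a timer:
$ sudo systemctl start logrotate.service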

Krinkle claimed this task.