While debugging something completely different, I noticed that the ores hosts are logging a few segfault errors in syslog at random.
They all appear to be the same and reference _tokenizer.cpython-35m-x86_64-linux-gnu.so, which seems to come from the mwparserfromhell Python package included in the ores virtualenvs.
$ sudo cumin -x 'A:ores' "zgrep -c 'segfault' /var/log/syslog.* | grep -v ':0$'"
IGNORE EXIT CODES mode enabled, all commands executed will be considered successful
18 hosts will be targeted:
ores[2001-2009].codfw.wmnet,ores[1001-1009].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) ores1002.eqiad.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.7.gz:2
===== NODE GROUP =====
(2) ores2009.codfw.wmnet,ores1003.eqiad.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.1:1
===== NODE GROUP =====
(1) ores2007.codfw.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.1:1
/var/log/syslog.3.gz:1
/var/log/syslog.4.gz:1
/var/log/syslog.5.gz:1
===== NODE GROUP =====
(1) ores1009.eqiad.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.1:1
/var/log/syslog.4.gz:1
===== NODE GROUP =====
(1) ores2001.codfw.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.7.gz:1
===== NODE GROUP =====
(1) ores2003.codfw.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.4.gz:1
/var/log/syslog.7.gz:1
===== NODE GROUP =====
(1) ores1006.eqiad.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.4.gz:1
/var/log/syslog.6.gz:1
===== NODE GROUP =====
(1) ores2004.codfw.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.6.gz:1
===== NODE GROUP =====
(1) ores2002.codfw.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.2.gz:1
===== NODE GROUP =====
(1) ores1004.eqiad.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.2.gz:1
/var/log/syslog.3.gz:1
===== NODE GROUP =====
(2) ores2005.codfw.wmnet,ores1005.eqiad.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.5.gz:1
===== NODE GROUP =====
(1) ores1001.eqiad.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.5.gz:1
/var/log/syslog.7.gz:2
================
PASS: |████████████████████████████████████████| 100% (18/18) [00:00<00:00, 20.95hosts/s]
FAIL: | | 0% (0/18) [00:00<?, ?hosts/s]
100.0% (18/18) success ratio (>= 100.0% threshold) for command: 'zgrep -c 'segfau... | grep -v ':0$''.
100.0% (18/18) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
This is a specific log example:
May 8 22:57:12 ores2007 kernel: [4695113.152807] celery[24047]: segfault at 8 ip 00007f0b48d8af10 sp 00007ffeb72bb610 error 4 in _tokenizer.cpython-35m-x86_64-linux-gnu.so[7f0b48d84000+d000]
May 8 22:57:12 ores2007 celery-ores-worker[13749]: [2019-05-08 22:57:12,361: ERROR/MainProcess] Process 'ForkPoolWorker-14771' pid:24047 exited with 'signal 11 (SIGSEGV)'
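To compare crashes across hosts, the kernel line can be decoded into the faulting address, the offset inside the shared object (instruction pointer minus mapping base) and the x86 page-fault error bits. The following is only a sketch: the regular expression and the `decode_segfault` helper are hypothetical names written for this report, and the regex assumes the kernel message format shown in the example above.

```python
import re

# Matches the kernel "segfault at ..." message as it appears in syslog.
# All numeric fields in this message are hexadecimal.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+)"
    r"(?: in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    error = int(m.group("error"), 16)
    info = {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "lib": m.group("lib"),
        # x86 page-fault error code bits:
        "protection_fault": bool(error & 1),  # 0 means the page was not present
        "write": bool(error & 2),             # 0 means the access was a read
        "user_mode": bool(error & 4),
    }
    if m.group("base") is not None:
        # Offset of the faulting instruction inside the mapped object.
        info["offset"] = int(m.group("ip"), 16) - int(m.group("base"), 16)
    return info


line = (
    "May 8 22:57:12 ores2007 kernel: [4695113.152807] celery[24047]: "
    "segfault at 8 ip 00007f0b48d8af10 sp 00007ffeb72bb610 error 4 in "
    "_tokenizer.cpython-35m-x86_64-linux-gnu.so[7f0b48d84000+d000]"
)
info = decode_segfault(line)
print(info["lib"], hex(info["offset"]), info)
```

For the example line this gives a user-mode read of a non-present page at address 0x8, which would be consistent with a NULL pointer dereference plus a small field offset. If the same offset shows up on every host, it could be resolved with a tool such as addr2line against a build of the .so that has debug symbols, assuming one is available.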
I didn't find any related upstream bug whose fix isn't already included in our deployed version of mwparserfromhell.
Unless this is something known, we might need to enable core dumps and wait for a repro.
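One possible way to collect more data on the next occurrence, sketched in Python: raise the core size limit from inside the process and enable the standard library's `faulthandler`, which writes the Python traceback when a fatal signal such as SIGSEGV arrives. `enable_crash_diagnostics` is a hypothetical helper, not something that exists in ores, and where it would be called from in the celery worker is an assumption left open. Whether a core file actually gets written also depends on the host configuration (for example `kernel.core_pattern` and any limits set by the service manager).

```python
import faulthandler
import resource


def enable_crash_diagnostics(traceback_file):
    """Hypothetical helper: turn on crash diagnostics for this process.

    traceback_file must be an open file with a valid fileno() and must stay
    open for the lifetime of the process, since faulthandler writes to its
    file descriptor from the signal handler.
    """
    # Dump the Python traceback of all threads on SIGSEGV, SIGFPE, SIGABRT,
    # SIGBUS and SIGILL.
    faulthandler.enable(file=traceback_file, all_threads=True)

    # Raise the soft core-size limit up to the hard limit; an unprivileged
    # process is always allowed to do this.
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

Limits and the `faulthandler` setup are normally inherited across `fork()`, so calling this once in the parent before the pool workers are forked might be enough, but that would need checking against how the workers are actually started.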