
Measure segfaults in mediawiki and parsoid servers
Open, Medium, Public

Description

We are aware that some mediawiki servers segfault daily, for example:

php-fpm7.2[18796]: segfault at 5ffddb105c ip 00005575f7234ae3 sp 00007ffd32aad530 error 4 in php-fpm7.2[5575f6f92000+410000]

php-fpm7.2[18795]: segfault at 215 ip 00007f8b49b4d3a9 sp 00007ffd32aad490 error 4 in libmemcached.so.11.0.0[7f8b49b3c000+30000]

It might make sense to use mtail to measure those segfaults. In the future we could potentially alert on them and avoid running into issues when we deploy PHP extensions and other related updates.
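As a rough illustration of what such a measurement would need to extract, here is a minimal Python sketch that parses kernel segfault lines in the format of the two examples above and counts them per binary and faulting object. It is not the mtail program itself; the function names (`parse_segfault`, `count_segfaults`) and the regular expression are assumptions made for illustration only.

```python
import re
from collections import Counter

# Pattern for the kernel "segfault at" message format seen in the two
# examples above. The trailing "in <object>[...]" part is treated as optional.
SEGFAULT_RE = re.compile(
    r"(?P<binary>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+)"
    r"(?: in (?P<object>[^\[\s]+)\[)?"
)


def parse_segfault(line):
    """Return a dict of fields for a segfault line, or None if it does not match."""
    match = SEGFAULT_RE.search(line)
    return match.groupdict() if match else None


def count_segfaults(lines):
    """Count segfaults keyed by (binary, faulting object)."""
    counts = Counter()
    for line in lines:
        fields = parse_segfault(line)
        if fields is not None:
            counts[(fields["binary"], fields["object"])] += 1
    return counts
```

Keying on the faulting object as well as the binary would separate crashes inside php-fpm itself from crashes inside a library such as libmemcached, which seems relevant when the concern is extension deployments.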

Event Timeline

jijiki triaged this task as Medium priority. Feb 28 2020, 6:13 PM

We do have this data in mtail already but it isn't the easiest to work with: https://w.wiki/Ttx

There's also something odd happening there where the rate of segfaults seems to decline after 16:00 UTC and reset at 00:00 UTC. I don't think that's accurate in reality:

✔️ root@centrallog2001.codfw.wmnet /srv/syslog 🕐⁉️ ls mw*/syslog.log-20200612 | xargs -n10 -P3 grep --line-buffered -hP 'mw\d{4} kernel: \[[0-9.]+\] .*: segfault at' | awk '$7 ~ /php-fpm7\.2/ {print $3}' | cut -d: -f1 | sort | uniq -c
12 00
18 01
9 02
16 03
14 04
12 05
3 06
3 07
14 08
15 09
13 10
25 11
22 12
4 13
3 14
5 15
7 16
8 17
18 18
21 19
4 20
30 21
16 22
3 23
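For anyone who wants to repeat this check outside the shell, the same hourly bucketing can be sketched in Python. The line layout below (month, day, time, host, `kernel:`, bracketed uptime, process) is an assumption inferred from the awk field numbers in the pipeline above, and `segfaults_per_hour` is a hypothetical helper name. The match is also slightly stricter than the awk filter, since it requires the process field to start with `php-fpm7.2`.

```python
import re
from collections import Counter

# Assumed syslog layout, inferred from the grep/awk fields used above:
#   "Jun 12 00:01:02 mw1234 kernel: [12345.678] php-fpm7.2[123]: segfault at ..."
LINE_RE = re.compile(
    r"^\S+\s+\d+\s+(?P<hour>\d{2}):\d{2}:\d{2} "
    r"mw\d{4} kernel: \[[0-9.]+\] php-fpm7\.2\[\d+\]: segfault at"
)


def segfaults_per_hour(lines):
    """Count php-fpm7.2 segfault lines per hour of day (as a 'HH' string)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group("hour")] += 1
    return counts
```

Comparing these raw per-hour counts with the per-hour increase of the mtail counter is one way to confirm whether the apparent decline after 16:00 UTC comes from the metric rather than from the logs.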

elukey added a project: observability.

Getting back to this, added the Observability tag to get some feedback from the team as well.

Change 721773 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] mtail: add counter for kernel traps

https://gerrit.wikimedia.org/r/721773

It seems that we do have an mtail metric for segfaults, but when checking Thanos I see a counter that never increases, even though I can find segfaults in recent days (maybe I am missing something trivial). Once we clear up this issue, and possibly update mtail for traps as well, we could add those to dashboards.

Change 721773 merged by Elukey:

[operations/puppet@production] mtail: add counter for kernel traps

https://gerrit.wikimedia.org/r/721773

Mentioned in SAL (#wikimedia-operations) [2021-09-29T09:54:21Z] <godog> bounce mtail on centrallog* - T246470

While investigating this with @elukey I noticed that mtail_lines_total stopped increasing for centrallog in March (!). That is a separate but certainly related problem, to be tackled in another task.

> While investigating this with @elukey I noticed mtail_lines_total has stopped increasing for centrallog

This has been fixed, and I can see trap metrics working now:

traps{binary="php-fpm7.2", cluster="syslog", hostname="mw1367", instance="centrallog2001:3903", job="mtail", prog="kernel.mtail", prometheus="ops", site="codfw"} 1
traps{binary="php-fpm7.2", cluster="syslog", hostname="mw1442", instance="centrallog1001:3903", job="mtail", prog="kernel.mtail", prometheus="ops", site="eqiad"} 1
traps{binary="php-fpm7.2", cluster="syslog", hostname="mw1442", instance="centrallog2001:3903", job="mtail", prog="kernel.mtail", prometheus="ops", site="codfw"} 1
traps{binary="php-fpm7.2", cluster="syslog", hostname="wtp1038", instance="centrallog2001:3903", job="mtail", prog="kernel.mtail", prometheus="ops", site="codfw"} 1
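Note that mw1442 appears under both centrallog instances in the samples above. If these series end up on a dashboard, one possible way to avoid counting the same host twice is to collapse the `instance` label. The sketch below does this offline in Python for sample lines in the format shown above; it assumes that two instances reporting the same host describe the same events, which has not been verified here, and the helper names (`parse_sample`, `traps_per_host`) are hypothetical.

```python
import re
from collections import defaultdict

# Matches one sample line of the form: name{label="value", ...} number
SAMPLE_RE = re.compile(r"^(?P<name>\w+)\{(?P<labels>.*)\}\s+(?P<value>\S+)$")
# Simple label matcher; it does not handle escaped quotes inside values.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Return (metric name, labels dict, value) or None if the line does not match."""
    match = SAMPLE_RE.match(line.strip())
    if not match:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def traps_per_host(lines):
    """Keep the highest traps value per (hostname, binary) across instances."""
    result = defaultdict(float)
    for line in lines:
        parsed = parse_sample(line)
        if parsed is None:
            continue
        name, labels, value = parsed
        if name != "traps":
            continue
        key = (labels.get("hostname"), labels.get("binary"))
        result[key] = max(result[key], value)
    return dict(result)
```

Taking the maximum rather than the sum is a deliberate choice under the assumption above: summing would double the count for any host whose logs reach both centrallog servers.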

I can confirm, thanks a lot for the quick fix!