
Phabricator: Clean up deadlocked apache processes
Closed, Resolved (Public)

Description

Until we can find a proper fix for T182832, we need to deal with the deadlocked processes to prevent further service outages.

My proposal, for lack of a better idea, is the following script in a daily cron job:

# adapted from http://snipplr.com/view/46373/kill-gracefully-stuck-startup-of-apache-childs/

# Drop the first 46 and last 14 lines of the status output, keep the PID ($2)
# and mode ($4) columns, and select the rows whose mode is G.
apache-status | tail -n +47 | head -n -14 | awk '{print $2,$4}' | grep G | awk '{print $1}' | grep -E '^[0-9]+$' | sort -un | while read -r pid
do
	# for testing purposes, make it a NO-OP command with echo:
	echo "kill -9 $pid"
done

Event Timeline

mmodell triaged this task as High priority.Feb 20 2018, 3:10 PM
mmodell created this task.

I'd personally just restart apache2 once a week at a low-traffic time of day rather than targeting specific httpd processes, since httpd can transition to G state even without any error being registered. If we want to target only specific processes, the SS field should be taken into account so that we kill only the ones that have been stuck in G state for a long time.
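As a rough sketch of the SS-based filtering suggested above (not something that was deployed): the function below reads scoreboard rows on stdin and prints only the PIDs of workers that are in G state and whose SS value (seconds since the beginning of the most recent request) is at or above a threshold. The function name `stuck_graceful_pids`, the default threshold of 3600 seconds, and the assumed column order `Srv PID Acc M CPU SS ...` are assumptions for illustration; check the column order against the actual apache-status output before relying on it.

```shell
# Print PIDs of workers in G state whose SS value is >= the given threshold.
# Reads scoreboard rows on stdin. Assumed (unverified) column layout:
#   $1=Srv  $2=PID  $3=Acc  $4=M  $5=CPU  $6=SS  ...
stuck_graceful_pids() {
    min_ss="${1:-3600}"   # hypothetical default threshold, in seconds
    awk -v min="$min_ss" '
        # Only rows with mode G, a numeric PID and a numeric SS value.
        $4 == "G" && $2 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ && ($6 + 0) >= (min + 0) {
            print $2
        }
    ' | sort -un
}
```

It could be combined with the status command from the description, still as a no-op for testing: `apache-status | stuck_graceful_pids 3600 | while read -r pid; do echo "kill -9 $pid"; done`. Because rows are matched by their fields, the `tail`/`head` trimming of the header and footer is not needed in this variant.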

Every week would leave a ton of resources tied up in the meantime - those processes are keeping a lot of memory and presumably other resources in use. Would it really be harmful to kill a gracefully restarting process that isn't deadlocked?

I would proceed with the simplest solution first, then see how it goes and refine if needed :)

Well, that really depends on how you define "needed." The proposed script is definitely not simple, but it appears to work. My only reservation is that a full Apache restart causes a brief outage. Even if we do that at a low-traffic time, there is still a non-zero chance of causing someone to lose work if they happen to click submit at exactly the wrong time.

I will concede that it is a pretty low risk, so I'll defer to you, @elukey, unless someone else has another opinion. I'm not attached to the script in the description and, unfortunately, anything we do here is going to be less than ideal.

Care to chime in on this one, @Dzahn? I can submit a patch if that would be helpful.

Also, I've updated the incident document at https://wikitech.wikimedia.org/wiki/Incident_documentation/20180206-Phabricator with some conclusions and actionables (this task is one of them).

Re: "simplest solution first", I would suggest we just do

systemctl restart apache2

This is also in line with what elukey said above ("the transition to G state can happen for httpd even without any error registered"), and it keeps things simple by not relying on the status page and the head/awk/grep/sort construct.
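For illustration only, the restart-only approach could be wrapped as below. This is a sketch, not the content of the merged Puppet change: the function name `restart_apache`, the `DRY_RUN` switch, the script path, and the schedule in the comment are all hypothetical. The configuration check before the restart is an added precaution, so that a broken config does not turn the brief outage into a long one.

```shell
#!/bin/sh
# Hypothetical wrapper, meant to be called from a weekly cron entry such as
#   0 23 * * 0  root  /usr/local/sbin/restart-apache
# (the schedule and path are assumptions, not taken from the actual patch).
restart_apache() {
    # With DRY_RUN=1, print the commands instead of running them.
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    # Refuse to restart if the configuration does not parse.
    $run apache2ctl configtest || return 1
    # Full restart (not graceful), so lingering workers are terminated as well.
    $run systemctl restart apache2
}
```

Running it with `DRY_RUN=1` prints the two commands without executing them, which mirrors the echo-based no-op used for testing in the task description.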

Change 413114 had a related patch set uploaded (by 20after4; owner: 20after4):
[operations/puppet@production] Phabricator: restart apache every sunday night

https://gerrit.wikimedia.org/r/413114

Change 413114 merged by Dzahn:
[operations/puppet@production] Phabricator: restart apache every sunday night

https://gerrit.wikimedia.org/r/413114

The restart cron has been installed on both servers.

Change 512977 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: remove cron to restart httpd

https://gerrit.wikimedia.org/r/512977

Change 512977 merged by Dzahn:
[operations/puppet@production] phabricator: remove cron to restart httpd

https://gerrit.wikimedia.org/r/512977

Mentioned in SAL (#wikimedia-traffic) [2019-05-28T20:57:59Z] <mutante> phab1003 / phab2001 - removing 'apache restart' from root's crontab (gerrit:512977) (T187790)

Removed the cron again: we are no longer seeing the leaks since our recent upgrade to stretch and phab1003.