Phabricator: Clean up deadlocked apache processes
Closed, Resolved (Public)

Assigned To

Authored By

	• mmodell
	Feb 20 2018, 3:10 PM

Description

Until we can find a proper fix for T182832, we need to deal with the deadlocked processes to prevent further service outages.

My proposal, for lack of a better idea, is the following script in a daily cron job:

# adapted from http://snipplr.com/view/46373/kill-gracefully-stuck-startup-of-apache-childs/

# Trim the status page header and footer, keep the PID ($2) and mode ($4) columns,
# select workers in G ("Gracefully finishing") state, and emit each numeric PID once.
apache-status | tail -n +47 | head -n -14 | awk '{print $2,$4}' | grep G | awk '{print $1}' | grep -E '^[0-9]+$' | sort -nu | while read -r pida
do
	# for testing purposes, make it a NO-OP command with echo:
	echo "kill -9 $pida"
done

Details

	Subject	Repo	Branch	Lines +/-
	phabricator: remove cron to restart httpd	operations/puppet	production	+0 -10
	Phabricator: restart apache every sunday night	operations/puppet	production	+10 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• mmodell	T182832 Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state
		Resolved		• mmodell	T187790 Phabricator: Clean up deadlocked apache processes

Event Timeline

• mmodell triaged this task as High priority. Feb 20 2018, 3:10 PM

• mmodell created this task.

• mmodell moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board. Feb 20 2018, 3:12 PM

I'd personally just restart apache2 once a week at a low-traffic time of day rather than targeting specific httpd processes, since httpd can transition to the G state even without any error being registered. If we want to target only specific processes, the SS field should be taken into account so that we kill only the ones that have been stuck in G state for a long time.
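The SS-based filtering suggested above could look roughly like the following. This is only a sketch: it assumes the text status output uses the column order Srv, PID, Acc, M, CPU, SS (so the mode is field 4 and SS is field 6), which this task does not confirm, and the function name stuck_graceful_pids and its threshold argument are made up for the example.

```shell
#!/bin/sh
# Print the PIDs of workers that have been in G state for at least N seconds.
# Reads status lines on stdin.
# ASSUMPTION: columns are "Srv PID Acc M CPU SS ...", i.e. $2=PID, $4=mode, $6=SS.
stuck_graceful_pids() {
    min_seconds="$1"
    awk -v min="$min_seconds" '
        # Keep only rows with a numeric PID, mode exactly "G", and a numeric SS
        # value at or above the threshold; header/footer lines fail these checks.
        $4 == "G" && $2 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ && $6 + 0 >= min + 0 { print $2 }
    ' | sort -nu
}
```

It would be fed from the same source as the script in the description, for example `apache-status | stuck_graceful_pids 3600`, where the one-hour threshold is an arbitrary illustrative value. The output is one PID per line, so it can be piped into the same `while read -r` loop.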

Restarting only once a week would leave a ton of resources tied up in the meantime: those processes are keeping a lot of memory, and presumably other resources, in use. Would it really be harmful to kill a gracefully restarting process that isn't deadlocked?

I would proceed with the simplest solution first, then see how it goes and refine if needed :)

Well, that really depends on how you define "needed." The proposed script is definitely not simple but it appears to work. My only reservation is that a full Apache restart causes a brief outage. Even if we do that at a low traffic time, there is still a non-zero chance of causing someone to lose work if they happen to click submit at exactly the wrong time.

I will concede that it is a pretty low risk, so I'll defer to you, @elukey, unless someone else has another opinion. I'm not attached to the script in the description and, unfortunately, anything we do here is going to be less than ideal.

Care to chime in on this one, @Dzahn? I can submit a patch if that would be helpful.

Also, I've updated the incident document at https://wikitech.wikimedia.org/wiki/Incident_documentation/20180206-Phabricator with some conclusions and actionables (this task is one of them).

Re: "simplest solution first", I would suggest we just do

systemctl restart apache2

also per what elukey said above ("the transition to G state can happen for httpd even without any error registered"), and to keep things simple and not rely on the status page and the head/awk/grep/sort construct.
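For illustration, the restart could be wrapped so that it only happens when the configuration still parses, which avoids turning a scheduled restart into a longer outage. This is a sketch and not the content of the merged puppet patch; the function name restart_apache and the DRY_RUN variable are hypothetical.

```shell
#!/bin/sh
# Restart apache2 only if its configuration passes a syntax check.
# DRY_RUN=1 (hypothetical switch for this example) prints the commands
# instead of running them.
restart_apache() {
    run=""
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"
    # configtest exits non-zero on a broken config, which skips the restart.
    $run apache2ctl configtest && $run systemctl restart apache2
}
```

Calling `DRY_RUN=1 restart_apache` shows what would be executed. A cron entry would call the function, or a script containing it, at whatever low-traffic time is chosen.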

Change 413114 had a related patch set uploaded (by 20after4; owner: 20after4):
[operations/puppet@production] Phabricator: restart apache every sunday night

https://gerrit.wikimedia.org/r/413114

gerritbot added a project: Patch-For-Review. Feb 21 2018, 8:23 AM

Change 413114 merged by Dzahn:
[operations/puppet@production] Phabricator: restart apache every sunday night

https://gerrit.wikimedia.org/r/413114

The restart cron has been installed on both servers.

• mmodell closed this task as Resolved. Feb 21 2018, 10:38 PM

Change 512977 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: remove cron to restart httpd

https://gerrit.wikimedia.org/r/512977

Change 512977 merged by Dzahn:
[operations/puppet@production] phabricator: remove cron to restart httpd

https://gerrit.wikimedia.org/r/512977

Mentioned in SAL (#wikimedia-traffic) [2019-05-28T20:57:59Z] <mutante> phab1003 / phab2001 - removing 'apache restart' from root's crontab (gerrit:512977) (T187790)

Removed again, since we are no longer seeing the leaks following our recent upgrade to stretch and phab1003.

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Apr 28 2020, 9:50 PM

Maintenance_bot removed a project: Patch-For-Review. Apr 28 2020, 10:15 PM
