
php5 session cleanup script can go nuts
Closed, ResolvedPublic

Description

I think I may have tracked down a cause of Trusty VMs (and labs instances) occasionally locking up.

The php5 package shipped with Ubuntu 14.04 includes a cron job to clean up stale session files stored on disk. A similar cron job has been present in previous versions of Ubuntu, but the version shipped with 14.04 attempts to fix a long-standing bug (https://bugs.debian.org/626640) by scanning for sessions in active use before purging. To do this, the upstream maintainer decided that running /usr/bin/lsof -w -l +d "/var/lib/php5" would be a great way to find out if any processes had session files open. As pointed out in an upstream bug report (https://bugs.launchpad.net/ubuntu/+source/php5/+bug/1356113), lsof can have a highly variable runtime cost depending on the processes running at the time it is invoked. I have now several times caught multiple instances of the /usr/lib/php5/sessionclean script running on my VM. Since cron only invokes this script at :09 and :39 of each hour, seeing multiple instances implies that the oldest one has been running for at least 30 minutes.
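Because cron starts the script every 30 minutes, any sessionclean process older than 1800 seconds overlaps with a newer run. The following is a minimal sketch for spotting such processes, not something taken from the package. The helper name older_than is hypothetical, and the usage line assumes a procps ps that supports the etimes (elapsed seconds) column.

```shell
# older_than LIMIT
# Hypothetical helper: reads "PID ELAPSED_SECONDS" pairs on stdin and
# prints the PIDs whose elapsed time is greater than LIMIT seconds.
older_than() {
    awk -v limit="$1" '$2 + 0 > limit + 0 { print $1 }'
}
```

Assuming the process name shows up as sessionclean (as it does in pstree), it could be used as: ps -C sessionclean -o pid= -o etimes= | older_than 1800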

The Ubuntu 12.04 version of the cleanup script was inlined in /etc/cron.d/php5:

[ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete
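For readability, the one-liner above can be unpacked into a small function. This is only a sketch: the name purge_stale_sessions and its two parameters are hypothetical, and it assumes the value printed by /usr/lib/php5/maxlifetime is an age in minutes, as the -cmin test expects.

```shell
# purge_stale_sessions DIR MAX_AGE_MINUTES
# Hypothetical wrapper around the 12.04 find expression.
purge_stale_sessions() {
    dir=$1
    max_age=$2

    # Nothing to do if the session directory does not exist.
    [ -d "$dir" ] || return 0

    # -mindepth 1 -maxdepth 1 : only direct children of DIR
    # -type f                 : regular files only
    # -cmin +N                : status changed more than N minutes ago
    # ! -execdir fuser -s {}  : keep going only if no process has the file
    #                           open (fuser -s exits 0 when it is in use)
    # -delete                 : remove whatever passed all tests above
    # Errors go to /dev/null, as in the original cron entry.
    find "$dir" -depth -mindepth 1 -maxdepth 1 -type f \
        -cmin +"$max_age" \
        ! -execdir fuser -s {} \; \
        -delete 2>/dev/null
}
```

With these assumptions, the cron entry would correspond to: purge_stale_sessions /var/lib/php5 "$(/usr/lib/php5/maxlifetime)". Note that fuser is run once per candidate file only, so its cost scales with the number of stale sessions rather than with the number of processes on the host.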

The proposed fix in the upstream Debian package is at http://anonscm.debian.org/cgit/pkg-php/php.git/tree/debian/sessionclean. This fix is not currently scheduled for backporting to Ubuntu 14.04.

I think we should do one of the following: 1) revert the cron job to the 12.04 version, 2) backport the Debian fix as part of our Puppet configuration, or 3) remove the cleanup job entirely as part of our Puppet configuration.

Since we do not currently set $wgSessionsInObjectCache, we are actively using files to store PHP sessions. I think this implies that option 3 would be less than optimal.


Version: unspecified
Severity: normal

Details

Reference
bz71645

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:44 AM
bzimport set Reference to bz71645.
bzimport added a subscriber: Unknown Object (MLST).

I set up a brand new VM and left it running overnight. When I checked it today, I found this in the process list:

$ pstree
init─┬─VBoxService───7*[{VBoxService}]
     ├─acpid
     ├─apache2───5*[apache2]
     ├─atd
     ├─cron───20*[cron───sh───sessionclean─┬─awk]
     │                                     ├─lsof───lsof]
     │                                     └─xargs]

That is 20 copies of the session cleanup script running simultaneously.

gerritadmin wrote:

Change 164877 had a related patch set uploaded by BryanDavis:
Backport sessionclean from Debian package

https://gerrit.wikimedia.org/r/164877

gerritadmin wrote:

Change 164877 merged by jenkins-bot:
Backport sessionclean from Debian package

https://gerrit.wikimedia.org/r/164877

I just saw this same problem pop up in Tool Labs on bastion-02. There were 61 copies of the cleanup script running. I wonder if I should port my backport to operations/puppet.git? @yuvipanda thoughts?

tools-bastion-05 had 68 copies running just now.