mw1209 /usr/bin/timeout: the monitored command dumped core
Closed, Resolved, Public

Description

Seeing a lot of these in the logs today, both before and after rolling forward wmf.11. All of them are coming from mw1209.eqiad.wmnet:

/usr/bin/timeout: the monitored command dumped core

and

/srv/mediawiki/php-1.30.0-wmf.11/includes/limit.sh: line 101: 8405 File size limit exceeded /usr/bin/timeout $MW_WALL_CLOCK_LIMIT /bin/bash -c "$1" 3>&-

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 27 2017, 8:46 PM

Without knowing the command passed to it, I am not sure how to track the root cause of that.

The ulimit error "File size limit exceeded" sounds familiar. We had the exact same issue when invoking HHVM, which tries to update its cache at /var/cache/hhvm/cli.hhbc.sq3

Turns out mw1209.eqiad.wmnet has a 512MB cache file. So I am pretty sure this is the same as T145819 (Jobs invoking SiteConfiguration::getConfig cause HHVM to fail updating the bytecode cache due to being filesize limited to 512MBytes), and the HHVM bytecode cache needs to be deleted.
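
For anyone unfamiliar with the mechanism: a process that tries to write past its RLIMIT_FSIZE soft limit receives SIGXFSZ, whose default action is to terminate the process with a core dump, which would explain both log messages above. Below is a minimal Python sketch of that behaviour. It is an illustration only, not what limit.sh or HHVM actually do; the function name and the byte counts are made up for the example.

```python
import os
import resource
import signal
import tempfile


def write_under_fsize_limit(limit_bytes, write_bytes):
    """Write write_bytes to a temp file in a child capped at limit_bytes.

    Returns the signal that killed the child, or None if it exited normally.
    (Hypothetical helper, for illustration only.)
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    pid = os.fork()
    if pid == 0:
        # Child process: apply the limits, then write until done or killed.
        try:
            # Suppress the core file so the demo does not litter the disk.
            resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
            resource.setrlimit(resource.RLIMIT_FSIZE, (limit_bytes, limit_bytes))
            # CPython ignores SIGXFSZ at startup; restore the default action.
            signal.signal(signal.SIGXFSZ, signal.SIG_DFL)
            out = os.open(path, os.O_WRONLY)
            remaining = memoryview(b"\0" * write_bytes)
            while len(remaining) > 0:
                # A write that starts at the limit triggers SIGXFSZ.
                written = os.write(out, remaining)
                remaining = remaining[written:]
            os._exit(0)
        except BaseException:
            os._exit(1)
    # Parent process: wait for the child and report how it ended.
    _, status = os.waitpid(pid, 0)
    os.unlink(path)
    if os.WIFSIGNALED(status):
        return signal.Signals(os.WTERMSIG(status))
    return None
```

Calling write_under_fsize_limit(2048, 4096) should report SIGXFSZ, while a write that fits under the cap returns None. In production the core dump is not suppressed, which is presumably how the core files pile up.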

hrm, sounds like we need Operations to clear out the cache on this machine, adding to task.

@Joe and I were just looking at this because icinga had fired a disk alert.

The 512M /var/cache/hhvm/cli.hhbc.sq3 file has been removed, and /var/tmp/core has been cleaned up. I left the most recent core in root's home just in case someone still wanted to look at it.

Also we checked for large /var/cache/hhvm/cli.hhbc.sq3 files elsewhere in the cluster and didn't see any >400M.

The /var/cache/hhvm/cli.hhbc.sq3 caches were cleared when I upgraded to 3.18, so I doubt any of them have grown to 512MB again. I also created https://gerrit.wikimedia.org/r/#/c/359120/ to monitor depletion, but haven't found the time to complete it so far.
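
For reference, a size check along the following lines could give early warning before a cache file reaches the limit. This is only a rough sketch of a Nagios/Icinga-style plugin and is not the content of the Gerrit change linked above. The function names, the thresholds (400 MiB warning, 480 MiB critical) and the choice to treat a missing cache file as OK are all assumptions.

```python
import os
import sys

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
MIB = 1024 * 1024


def check_cache_size(path, warn_bytes=400 * MIB, crit_bytes=480 * MIB):
    """Return (status, message) for the cache file at path.

    Thresholds are hypothetical and chosen to sit below the 512MB limit.
    """
    try:
        size = os.path.getsize(path)
    except FileNotFoundError:
        # Assumption: a freshly cleared cache is not a problem.
        return OK, f"OK: {path} does not exist"
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: cannot stat {path}: {exc.strerror}"

    detail = f"{path} is {size / MIB:.1f} MiB"
    if size >= crit_bytes:
        return CRITICAL, f"CRITICAL: {detail}"
    if size >= warn_bytes:
        return WARNING, f"WARNING: {detail}"
    return OK, f"OK: {detail}"


def main(argv):
    """Print the check result and return the plugin exit code."""
    path = argv[1] if len(argv) > 1 else "/var/cache/hhvm/cli.hhbc.sq3"
    status, message = check_cache_size(path)
    print(message)
    return status
```

A wrapper script would end with sys.exit(main(sys.argv)) so that the monitoring system sees the status as the exit code.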

thcipriani closed this task as Resolved. Jul 28 2017, 6:37 PM
thcipriani assigned this task to herron.

This cleared up the error log, calling this one resolved. Thanks all!

herron added a comment. Oct 5 2017, 7:37 PM

FWIW this cropped up on mw1262 today. Same symptoms, large /var/cache/hhvm/cli.hhbc.sq3 file causing rapid core dumps that filled the disk.

@herron: Thanks for addressing it on mw1262. I have a WIP Icinga check for this at https://gerrit.wikimedia.org/r/#/c/359120, and I'll pick it up again next week.

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:10 PM