
HHVM core dumps in Beta Cluster
Closed, Resolved, Public

Description

https://gerrit.wikimedia.org/r/171206 made core dumps go to /var/tmp, which is bad for labs since /var is by default tiny (2G!) and also hosts logs. This means just a few core dumps can easily fill up /var (happened to two instances in tools today).

For labs, perhaps we can put them on NFS?

Event Timeline

yuvipanda set the priority of this task to Needs Triage.
yuvipanda updated the task description.
yuvipanda added a project: acl*sre-team.
yuvipanda changed Security from none to None.
yuvipanda added subscribers: yuvipanda, ori, BBlack.

We could, yes. Then core dumps will just go on /tmp as they do now and fill up the slightly-larger-but-still-not-large 8G root ;)

But that would definitely happen far less frequently.

...I meant disabling core dumps for labs vm's by default.

Ah, hmm. I wouldn't be opposed to that. I'm not sure if people actually use them - they might be useful on deployment-prep for debugging hhvm, perhaps.

Adding Tim & Coren to see if they can chip in

I think https://bugzilla.wikimedia.org/69979 is the consumer side of that. Apart from deployment-prep, I think the default "core" in the current directory is a useful solution, as most scripts & Co. will be run out of /home or /data/project. But I certainly would support disabling core files in Labs in general, as most users (including me :-)) rarely use them unless they happen to be compiling C sources manually.

On Tools, this has become very annoying. Today, I had to purge /var/tmp/core on nine instances as some maintainers seem to run bots without looking after them (bene, intelibot, spbot, totoazero) thus creating a new core dump every few hours. So I'd prefer fixing this very soon.

> ...I meant disabling core dumps for labs vm's by default.

Sounds right to me.

@coren: Core dumps from tools-webgrid-04 are in /home/yuvipanda/core if you want to investigate.

I'm still not sure of the best way to disable core dumps, even if only for toollabs.

Ok, so https://gerrit.wikimedia.org/r/173484 will allow us to override the kernel core pattern back to just 'core' on toollabs (via hiera), so at least we can return to the status quo (CWD core dumps, which end up on NFS, I think).
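
A rough sketch of how the kernel setting maps to where dumps land, in case it helps when checking an instance after the override. The helper names `classify_core_pattern` and `read_core_pattern` are hypothetical and not part of the patch; the rules (leading `|` pipes to a program, leading `/` is an absolute path, anything else is relative to the crashing process's working directory) follow the core(5) man page.

```python
def classify_core_pattern(pattern):
    """Return where the kernel would send a dump for this core_pattern.

    "pipe"     -> the dump is piped to a helper program
    "absolute" -> the dump is written under a fixed directory
    "cwd"      -> the dump is written relative to the process's CWD
    """
    pattern = pattern.strip()
    if pattern.startswith("|"):
        return "pipe"
    if pattern.startswith("/"):
        return "absolute"
    return "cwd"


def read_core_pattern(path="/proc/sys/kernel/core_pattern"):
    """Read the live setting (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()
```

With the hiera override in place, `classify_core_pattern(read_core_pattern())` would be expected to report "cwd" on a toollabs instance.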

yuvipanda mentioned this in Unknown Object (Diffusion Commit). Nov 15 2014, 1:55 PM

Patch got merged, should put things back in CWD. Let's see how that goes.

The patch also probably needs at least a bit more cleanup, since it creates directories unnecessarily and sets up tidy on them unnecessarily.

So, no more problems with core dumps filling up /var, but this is still just a 'workaround' - need to figure out how to actually disable them.

Another option is to put them on NFS and then tidy them very aggressively (once every 2 days or so?). Every project has NFS these days. Security might be a concern tho.

yuvipanda mentioned this in Unknown Object (Diffusion Commit). Dec 10 2014, 8:37 PM
fgiunchedi subscribed.

looks like we're close to resolution for this?

More like we have a bandaid that works :)

The 'solution', I would say, is to have some way to make the kernel *not* do core dumps at all - BSD does this, but on Linux I can't find a reliable puppetizable way...

Under Linux it should be enough to set the process's RLIMIT_CORE to 0 to disable core dumps, unless I'm missing something?
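
A minimal sketch of that idea, assuming the process (or a wrapper that launches it) can set its own limits at startup. It uses Python's standard `resource` module; the helper name `disable_core_dumps` is hypothetical.

```python
import resource


def disable_core_dumps():
    """Set the soft and hard RLIMIT_CORE of this process to 0.

    Child processes inherit the limit, and an unprivileged process
    cannot raise a hard limit again once it has been lowered.
    """
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

The shell equivalent for a single session is `ulimit -c 0`. Neither form changes the system-wide default, so it would still need to be applied wherever the services are started.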

betalabs' mediawiki-* instances filled up again, so I've set their core dumps to go to NFS: https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&diff=139270&oldid=138008

Let's see if this works. Also, all of these are hhvm - are the hhvm core dumps from beta useful at all, or should we disable them?

> Also, all of these are hhvm - are the hhvm core dumps from beta useful at all, or should we disable them?

@Joe or @ori or @bd808 ?

In T1259#955291, @greg wrote:

>> Also, all of these are hhvm - are the hhvm core dumps from beta useful at all, or should we disable them?
>
> @Joe or @ori or @bd808 ?

I think we should be watching for HHVM core dumps everywhere very carefully. Beta is our canary deploy and should help us catch problems before they get to production. What I can't answer is if anyone is watching for these in beta on any regular schedule and taking any action based on them. Sadly many people treat errors found in beta as an annoyance rather than the success of the pre-production testing environment.

@ori, ideas on how to manage these? Should you (or someone else close to HHVM) take a weekly gander at the dumps on beta or something else? Ideally we'd have auto bug reporting with stacktrace or similar a la Apport from Ubuntu (which is user facing, but just a reference/mental model), since that would make it easier to mark as duplicate/close as invalid/etc (basic tracking stuff).

coren removed coren as the assignee of this task. Mar 26 2015, 1:43 PM

The new partitioning scheme has more room in /var for stray core dumps, though this does not address the necessity of cleaning/collecting them as appropriate.

Perhaps a rename of the ticket to refocus the scope to that aspect is in order?

fgiunchedi lowered the priority of this task from High to Medium. Mar 30 2015, 2:45 PM

The immediate issue of /var filling up with core dumps seems fixed, hence priority normal. However, the path used for cores doesn't seem to exist (on purpose, or a typo?):

deployment-mediawiki02:~$ cat /proc/sys/kernel/core_pattern 
/data/project/cores/deployment-mediawiki02-core.%h.%e.%p.%t
deployment-mediawiki02:~$ ls -lad /data/project/cores
ls: cannot access /data/project/cores: No such file or directory
deployment-mediawiki02:~$
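
For reference, a small sketch of how that pattern expands into a file name. Per core(5), `%h` is the hostname, `%e` the executable name, `%p` the PID and `%t` the dump time in seconds since the epoch. The function name `expand_core_pattern` and the sample values (hhvm, 1234, 1427726700) are made up for illustration. Only these four specifiers plus `%%` are handled; anything else is left untouched, which differs from what the kernel does.

```python
import re


def expand_core_pattern(pattern, hostname, executable, pid, timestamp):
    """Expand a subset of core_pattern specifiers: %h, %e, %p, %t, %%."""
    values = {
        "h": hostname,
        "e": executable,
        "p": str(pid),
        "t": str(timestamp),
        "%": "%",
    }

    def substitute(match):
        # Specifiers outside the supported subset are returned unchanged.
        return values.get(match.group(1), match.group(0))

    return re.sub(r"%(.)", substitute, pattern)


example = expand_core_pattern(
    "/data/project/cores/deployment-mediawiki02-core.%h.%e.%p.%t",
    hostname="deployment-mediawiki02",
    executable="hhvm",   # hypothetical crashing binary
    pid=1234,            # hypothetical PID
    timestamp=1427726700,  # hypothetical epoch seconds
)
```

The kernel does not create the directory part of the pattern, so /data/project/cores has to exist before a dump can be written there.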

Also, in base the cleanup of the core dump dir is already taken care of:

tidy { '/var/tmp/core':
    age     => '1w',
    recurse => 1,
    matches => 'core.*',
}
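
A rough Python equivalent of what that resource is meant to do, as a sketch only: remove direct children of the directory whose name matches `core.*` and which are older than one week. The function names `stale_cores` and `tidy_cores` are hypothetical, and using the modification time to measure age is an assumption of this sketch; Puppet's tidy may compare a different timestamp.

```python
import fnmatch
import os
import time

ONE_WEEK = 7 * 24 * 60 * 60  # seconds, mirrors age => '1w'


def stale_cores(directory, max_age=ONE_WEEK, pattern="core.*", now=None):
    """List regular files directly inside `directory` that match
    `pattern` and whose mtime is more than `max_age` seconds old."""
    now = time.time() if now is None else now
    stale = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            if not fnmatch.fnmatch(entry.name, pattern):
                continue
            if now - entry.stat(follow_symlinks=False).st_mtime > max_age:
                stale.append(entry.path)
    return sorted(stale)


def tidy_cores(directory, **kwargs):
    """Delete the stale core files and return the removed paths."""
    removed = stale_cores(directory, **kwargs)
    for path in removed:
        os.remove(path)
    return removed
```

Calling `tidy_cores("/data/project/cores", max_age=2 * 24 * 60 * 60)` would approximate the more aggressive two-day cleanup floated earlier for NFS.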
fgiunchedi renamed this task from Core dumps fill up /var on labs instances to HHVM core dumps in labs. Mar 30 2015, 2:46 PM
greg renamed this task from HHVM core dumps in labs to HHVM core dumps in Beta Cluster. Apr 29 2015, 2:24 PM
hashar subscribed.

I have deleted /data/project/core and created /data/project/cores, which is where /proc/sys/kernel/core_pattern points. Should be fine now.