HHVM core dumps in Beta Cluster
Closed, Resolved, Public

Assigned To

None

Authored By

	yuvipanda
	Nov 12 2014, 10:09 PM

Description

https://gerrit.wikimedia.org/r/171206 made core dumps go to /var/tmp, which is bad for labs since /var is by default tiny (2G!) and also hosts logs. This means just a few core dumps can easily fill up /var (happened to two instances in tools today).

For labs, perhaps we can put them on NFS?

Related Objects

Status	Assigned	Task
Resolved	hashar	T71590 rsync errors to beta cluster, inconsistent state after scap
Resolved	hashar	T71601 Log files on labs instance fill up disk (/var is only 2GB) (tracking)
Resolved	None	T1259 HHVM core dumps in Beta Cluster

Event Timeline

yuvipanda created this task. Nov 12 2014, 10:09 PM

yuvipanda raised the priority of this task to Needs Triage.

yuvipanda updated the task description.

yuvipanda added a project: acl*sre-team.

yuvipanda changed Security from none to None.

yuvipanda added subscribers: yuvipanda, ori, BBlack.

Let's exclude labs altogether?

We could, yes. Then core dumps will just go on /tmp as they do now and fill up the slightly-larger-but-still-not-large 8G root ;)

But that would definitely happen far less frequently.

...I meant disabling core dumps for labs VMs by default.

Ah, hmm. I wouldn't be opposed to that. I'm not sure if people actually use them - they might be useful on deployment-prep for debugging hhvm, perhaps.

Adding Tim & Coren to see if they can chip in

yuvipanda added subscribers: coren, scfc. Nov 12 2014, 10:25 PM

I think https://bugzilla.wikimedia.org/69979 is the consumer side of that. Apart from deployment-prep, I think the default "core" in the current directory is a useful solution, as most scripts & Co. will be run out of /home or /data/project. But I would certainly support disabling core files in Labs in general, as most users (including me :-)) rarely use them unless they happen to be compiling C sources manually.

On Tools, this has become very annoying. Today, I had to purge /var/tmp/core on nine instances, as some maintainers seem to run bots without looking after them (bene, intelibot, spbot, totoazero), which creates a new core dump every few hours. So I'd prefer fixing this very soon.

paging @ori... :)

yuvipanda assigned this task to coren. Nov 14 2014, 10:40 PM

In T1259#21753, @chasemp wrote:

...I meant disabling core dumps for labs vm's by default.

Sounds right to me.

@coren: Core dumps from tools-webgrid-04 are in /home/yuvipanda/core if you want to investigate.

I'm still not sure of the best way to disable core dumps, even if only for toollabs.

Ok, so https://gerrit.wikimedia.org/r/173484 will allow us to override the kernel core path back to just 'core' on toollabs (via hiera), so at least we can return to the status quo (CWD core dumps, which end up on NFS, I think).

https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ATools&diff=134584&oldid=132589 is the associated hiera change.

yuvipanda mentioned this in Unknown Object (Diffusion Commit). Nov 15 2014, 1:55 PM

Patch got merged, should put things back in CWD. Let's see how that goes.

The patch also probably needs at least a bit more cleanup, since it creates directories unnecessarily and sets up tidy on them unnecessarily.

So, no more problems with core dumps filling up /var, but this is still just a 'workaround' - need to figure out how to actually disable them.

Another option is to put them on NFS and then tidy them very aggressively (once every 2 days or so?). Every project has NFS these days. Security might be a concern tho.

chasemp unsubscribed. Dec 9 2014, 5:09 AM

yuvipanda added projects: Cloud-Services, Beta-Cluster-Infrastructure. Dec 9 2014, 1:10 PM

yuvipanda mentioned this in Unknown Object (Diffusion Commit). Dec 10 2014, 8:37 PM

looks like we're close to resolution for this?

More like we have a bandaid that works :)

The 'solution', I would say, is to have some way to make the kernel *not* do core dumps at all - BSD does this, but on Linux I can't find a reliable puppetizable way...

Under Linux it should be enough to set the process's RLIMIT_CORE to 0 to disable core dumps, unless I'm missing something?
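As a minimal sketch of that suggestion (not the fix that was applied in this task), the snippet below sets the soft and hard core-size limits of the current process to 0 using Python's standard `resource` module. The function name `disable_core_dumps` is hypothetical. The limit is inherited by child processes, and an unprivileged process cannot raise the hard limit again afterwards.

```python
import resource


def disable_core_dumps():
    """Set the soft and hard RLIMIT_CORE of this process to 0.

    Returns the (soft, hard) limits as reported by the kernel afterwards.
    """
    # Lowering a limit is always permitted; raising the hard limit
    # back up would require privileges.
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    print(disable_core_dumps())
```

For a service started by an init system, the same limit would normally be set in the service's startup configuration rather than in code; how that would be puppetized here is left open.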

betalabs' mediawiki-* instances filled up again, and I've set their core dumps to go to NFS: https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&diff=139270&oldid=138008

Let's see if this works. Also, all of these are hhvm - are the hhvm core dumps from beta useful at all, or should we disable them?

yuvipanda mentioned this in rOPUPecd3dbeb2a66: base: Allow overriding core dump path via hiera. Dec 30 2014, 7:59 PM

In T1259#943783, @yuvipanda wrote:

Also, all of these are hhvm - are the hhvm core dumps from beta useful at all, or should we disable them?

@Joe or @ori or @bd808 ?

In T1259#955291, @greg wrote:

In T1259#943783, @yuvipanda wrote:

Also, all of these are hhvm - are the hhvm core dumps from beta useful at all, or should we disable them?

@Joe or @ori or @bd808 ?

I think we should be watching for HHVM core dumps everywhere very carefully. Beta is our canary deploy and should help us catch problems before they get to production. What I can't answer is if anyone is watching for these in beta on any regular schedule and taking any action based on them. Sadly many people treat errors found in beta as an annoyance rather than the success of the pre-production testing environment.

@ori, ideas on how to manage these? Should you (or someone else close to HHVM) take a weekly gander at the dumps on beta or something else? Ideally we'd have auto bug reporting with stacktrace or similar a la Apport from Ubuntu (which is user facing, but just a reference/mental model), since that would make it easier to mark as duplicate/close as invalid/etc (basic tracking stuff).

greg moved this task from To Triage to Backlog on the Beta-Cluster-Infrastructure board. Jan 12 2015, 10:32 PM

coren moved this task from Triage to Backlog on the Cloud-Services board. Feb 10 2015, 9:23 PM

greg added a parent task: T71601: Log files on labs instance fill up disk (/var is only 2GB) (tracking). Mar 10 2015, 8:53 PM

The new partitioning scheme has more room in /var for stray core dumps, though this does not address the necessity of cleaning/collecting them as appropriate.

Perhaps a rename of the ticket to refocus the scope to that aspect is in order?

The immediate issue of /var filling up with core dumps seems fixed, hence priority normal. However, the path used for cores doesn't seem to exist (on purpose or typo?):

deployment-mediawiki02:~$ cat /proc/sys/kernel/core_pattern 
/data/project/cores/deployment-mediawiki02-core.%h.%e.%p.%t
deployment-mediawiki02:~$ ls -lad /data/project/cores
ls: cannot access /data/project/cores: No such file or directory
deployment-mediawiki02:~$
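To make the pattern above concrete, here is a rough sketch of how the `%h`, `%e`, `%p` and `%t` specifiers (hostname, executable name, PID, and dump time in seconds since the epoch) would expand into a file name, plus a check for the parent directory. The helper names and the sample values (`hhvm`, `1234`, the timestamp) are hypothetical, and the kernel supports more specifiers than the few handled here.

```python
import os


def expand_core_pattern(pattern, hostname, exe, pid, timestamp):
    """Expand the %h, %e, %p, %t and %% specifiers of a core_pattern string."""
    values = {"h": hostname, "e": exe, "p": str(pid), "t": str(timestamp), "%": "%"}
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "%":
            out.append(ch)
            i += 1
        elif i + 1 < len(pattern):
            # Specifiers this sketch does not know about are dropped.
            out.append(values.get(pattern[i + 1], ""))
            i += 2
        else:
            # A single trailing % is dropped.
            i += 1
    return "".join(out)


def target_dir_exists(core_path):
    """Return True if the directory that would hold core_path exists."""
    return os.path.isdir(os.path.dirname(core_path) or ".")


if __name__ == "__main__":
    pattern = "/data/project/cores/deployment-mediawiki02-core.%h.%e.%p.%t"
    path = expand_core_pattern(pattern, "deployment-mediawiki02", "hhvm", 1234, 1427726760)
    print(path)
    print("directory exists:", target_dir_exists(path))
```

With the directory missing, as in the listing above, the second check would report False, which is the situation being questioned here.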

Also, in base the cleanup of the core dump dir is already taken care of:

tidy { '/var/tmp/core':
    age     => '1w',
    recurse => 1,
    matches => 'core.*',
}
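For readers unfamiliar with Puppet's tidy resource, the following is an approximate Python equivalent of the rule above: it lists files directly inside a directory whose names match `core.*` and that are older than one week. It is a sketch only; the function name `stale_cores` is hypothetical, and it compares modification time, whereas tidy's age check defaults to access time unless told otherwise.

```python
import fnmatch
import os
import time

ONE_WEEK = 7 * 24 * 3600


def stale_cores(directory, max_age_seconds=ONE_WEEK, pattern="core.*", now=None):
    """Return paths of regular files in `directory` matching `pattern`
    whose modification time is more than `max_age_seconds` ago."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Only direct children are considered, like recurse => 1;
        # subdirectories themselves are never returned.
        if not os.path.isfile(path) or not fnmatch.fnmatch(name, pattern):
            continue
        if now - os.stat(path).st_mtime > max_age_seconds:
            stale.append(path)
    return stale


if __name__ == "__main__":
    if os.path.isdir("/var/tmp/core"):
        for path in stale_cores("/var/tmp/core"):
            print("would remove", path)
```

The sketch only reports candidates; actually deleting them (`os.remove`) is deliberately left out. Note that this rule covers /var/tmp/core, so a core path overridden to another location would need its own cleanup.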

fgiunchedi renamed this task from Core dumps fill up /var on labs instances to HHVM core dumps in labs. Mar 30 2015, 2:46 PM

greg renamed this task from HHVM core dumps in labs to HHVM core dumps in Beta Cluster. Apr 29 2015, 2:24 PM

yuvipanda removed a project: Cloud-Services. Jun 14 2015, 12:17 AM

I have deleted /data/project/core and created /data/project/cores, which is where /proc/sys/kernel/core_pattern points. Should be fine now.

Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 7:05 PM
