Page MenuHomePhabricator

jobrunner memory leaks
Closed, ResolvedPublic

Description

Since approximately Dec 15th there has been an increase in the rate jobrunners consume their memory. The effects appears to be cumulative, i.e. trending to exhaust all memory on the system and OOM (which has happened already on most of them). The memory levels of the jobrunners before used to be fairly steady, so this is new.

This seems to correlate with two changes happening that day:

@ori has restarted mw1015 with jemalloc profiling and is collecting heap stats.

In the meantime, I restarted HHVM across the whole jobrunner fleet to avoid a fleet-wide OOM. Extrapolating from the current trend it appears that we'll get to the OOM threshold again (unless we restart again) in approximately 4 days.

Keyword: jobqueue, job queue

Event Timeline

faidon created this task.Dec 21 2015, 7:26 PM
faidon updated the task description. (Show Details)
faidon raised the priority of this task from to Unbreak Now!.
faidon added subscribers: faidon, ori, Joe.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 21 2015, 7:26 PM
Krinkle set Security to None.Dec 21 2015, 7:47 PM
Krinkle moved this task from Inbox to Doing on the Performance-Team board.
Krinkle added a subscriber: aaron.
aaron added a comment.Dec 21 2015, 8:57 PM

The first change only effected JobChron. From http://graphite.wikimedia.org/render/?width=1887&height=960&_salt=1450730658.495&target=jobrunner.memory.*.count&from=-7days jobchron and jobrunner have stable memory use. The total use on mw1015 (for example) is low:

15781 www-data  20   0  843844 256568   5952 S   0.3  2.1  48:18.89 /usr/bin/php /srv/deployment/jobrunner/jobrunner/redisJobChronService --config-file=/etc/jobrunner/jobrunn+ 
12555 www-data  20   0  648312  60208   6200 S   8.6  0.5  76:59.66 /usr/bin/php /srv/deployment/jobrunner/jobrunner/redisJobRunnerService --config-file=/etc/jobrunner/jobrun+

Incidentally, I noticed that hhvm hangs on startup on mw1001 and those two processes were not running there either. Using php5 worked fine. mw101[56] didn't have this problem.

I assume the OOMs are for the actual web server hhvm then.

ori added a comment.Dec 22 2015, 12:13 AM

Diffing heaps on mw1015 points to xmlSearchNsByHref being the culprit.

ori added a comment.EditedDec 22 2015, 2:00 AM

I disabled puppet on mw1014 and mw1015 and excluded specific job types on each of them to see if it helps isolate the cause.

  • on mw1015: excluded cirrusSearchLinksUpdate and cirrusSearchLinksUpdatePrioritized
  • on mw1014: set restbase runners to 0

edit: ...but I forgot to restart HHVM. Doing so now (04:47 UTC).

Just in case it may be helpful: P2446

I disabled puppet on mw1014 and mw1015 and excluded specific job types on each of them to see if it helps isolate the cause.

  • on mw1015: excluded cirrusSearchLinksUpdate and cirrusSearchLinksUpdatePrioritized
  • on mw1014: set restbase runners to 0

FWIW, this doesn't seem to have had an effect.

ori added a comment.Dec 23 2015, 10:49 AM

I disabled puppet on mw1014 and mw1015 and excluded specific job types on each of them to see if it helps isolate the cause.

  • on mw1015: excluded cirrusSearchLinksUpdate and cirrusSearchLinksUpdatePrioritized
  • on mw1014: set restbase runners to 0

FWIW, this doesn't seem to have had an effect.

This ran for 12 hours (I used my time-bounded Puppet disabling trick), during which memory usage on mw1015 was consistently higher than mw1014: http://graphite.wikimedia.org/S/a

It is not decisive, but it'd be worthwhile to repeat this experiment and have it run a little longer this time.

aaron added a comment.Dec 24 2015, 8:53 AM

I'd assume you'd also want to exclude refreshLinks.

The first change only effected JobChron. From http://graphite.wikimedia.org/render/?width=1887&height=960&_salt=1450730658.495&target=jobrunner.memory.*.count&from=-7days jobchron and jobrunner have stable memory use.

The count property represents the number of data points submitted. It does not reflect the value. In theory this could represent uptime, but I don't think it reflects stable memory usage.

See also https://wikitech.wikimedia.org/wiki/Graphite#Extended_properties for more information.

I do not know if the test is over (but puppet is still disabled). If that is not the case, mw1015 has just freezed again right now (presumably because of OOM).

I've restarted HHVM on jobrunners thrice now, to avoid further OOMs (one cut it real close too). I'd like to revert the two commits that we identified above as possible suspects of this leak. I'd prefer playing it safe, especially now, given the limited availability of people around the holidays.

Any objections?

I've restarted HHVM on jobrunners thrice now, to avoid further OOMs (one cut it real close too). I'd like to revert the two commits that we identified above as possible suspects of this leak. I'd prefer playing it safe, especially now, given the limited availability of people around the holidays.

Any objections?

Reverting 6483f1ad828b49 would probably cause problems with job pickup, those services it runs in their own hhvm process rather than using the fcgi server, and the service that curls to the fcgi server was not changed, so I wouldn't recommend that.

Change 261104 had a related patch set uploaded (by Ori.livneh):
Restart HHVM on the jobrunners daily, as temp. workaround for T122069

https://gerrit.wikimedia.org/r/261104

ori added a comment.Dec 27 2015, 6:57 AM

I've restarted HHVM on jobrunners thrice now, to avoid further OOMs (one cut it real close too). I'd like to revert the two commits that we identified above as possible suspects of this leak. I'd prefer playing it safe, especially now, given the limited availability of people around the holidays.

Any objections?

I718b3a1e4 ("Restart HHVM on the jobrunners daily, as temp. workaround for T122069") would be substantially safer.

Change 261104 merged by Ori.livneh:
Restart HHVM on the jobrunners daily, as temp. workaround for T122069

https://gerrit.wikimedia.org/r/261104

ori added a comment.Jan 10 2016, 8:24 AM

The time it takes each job runner to OOM has been steadily shrinking, so restarting once a day is now inadequate.

jcrespo added a comment.EditedJan 10 2016, 4:06 PM

:-/ https://grafana.wikimedia.org/dashboard/db/job-queue-health

Jobs queued went from 8K to 3 million in the last week.

Samtar added a subscriber: Samtar.Jan 10 2016, 4:10 PM

@jcrespo getting a tonne of timeout alerts for RAID etc in #wikimedia-operations for mw1008, mw1015 and mw1004 - related?

ori added a comment.Jan 11 2016, 7:30 PM

What we know

  • The rate of memory growth after restarting HHVM has been increasing over the past two weeks:

  • Memory growth appears to have suddenly flattened out at 06:26 UTC on Jan 11:

No deployments occurred during this period, so the changes in the rate of memory growth are not attributable to code changes. It must have something to do with the state of the queue (how many jobs of each type there are). This is why I still think a specific job type is responsible, even though I have not been able to isolate it.

CategoryMembershipChangeJob is suspect, because it is new, having rolled out to production on Dec 8, which roughly fits the timeline. However, when I disabled the job on mw1166 and restarted HHVM (on 08:55 UTC on Jan 10), the problem did not go away.

(I'll add more notes to this comment shortly.)

06:25 UTC is cron.daily, which includes, among others, logrotate. We have three HHVM/MediaWiki-related logrotates, but only /etc/logrotate.d/mediawiki_jobrunner seems to be relevant here: it issues a /sbin/restart jobrunner after rotating the logs, which restarts the redisJobRunnerService HHVM.

ori added a comment.Jan 12 2016, 7:41 AM

To try to detect what caused the change at ~6:30, I counted jobs by type for the two hours before and after the change. I did this on mw1166, where the pattern was particularly clear:

The results are available on Google Docs (open access).

ori added a comment.Jan 12 2016, 9:44 PM

Interesting that gwtoolsetUpload* jobs ran at 91-97% less rate (almost not running) in the later period of the above graph that the former period.

The gwtoolset jobs process XML, too, so this would at least be consistent with what I saw when diffing heaps (T122069#1896643).

Change 263724 had a related patch set uploaded (by Ori.livneh):
Disable gwtoolsetUpload* jobs on even-numbered jobrunners

https://gerrit.wikimedia.org/r/263724

Change 263724 merged by Ori.livneh:
Disable gwtoolsetUpload* jobs on even-numbered jobrunners

https://gerrit.wikimedia.org/r/263724

Possibly related to https://github.com/facebook/hhvm/issues/3899 (which is the same pattern the GWT code uses).

Change 263751 had a related patch set uploaded (by Aaron Schulz):
Make sure XMLReader::close() is always called

https://gerrit.wikimedia.org/r/263751

Change 263776 had a related patch set uploaded (by Aaron Schulz):
Let different jobs use different dispatchers

https://gerrit.wikimedia.org/r/263776

Change 263751 merged by jenkins-bot:
Make sure XMLReader::close() is always called

https://gerrit.wikimedia.org/r/263751

ori added a comment.Jan 13 2016, 2:10 AM

Yep, it's GWT:

Change 263811 had a related patch set uploaded (by Ori.livneh):
Make sure XMLReader::close() is always called

https://gerrit.wikimedia.org/r/263811

Change 263812 had a related patch set uploaded (by Ori.livneh):
Make sure XMLReader::close() is always called

https://gerrit.wikimedia.org/r/263812

Change 263811 merged by jenkins-bot:
Make sure XMLReader::close() is always called

https://gerrit.wikimedia.org/r/263811

Change 263812 merged by jenkins-bot:
Make sure XMLReader::close() is always called

https://gerrit.wikimedia.org/r/263812

ori added a comment.Jan 14 2016, 1:07 AM

I think I have it isolated -- it's XMLReader::expand:

T122069.php
<?php
# To run:
# hhvm -vEval.Jit=1 --no-config --count 10 T122069.php

$xml = <<<XML
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<schema xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <record>
    <FAMILY>Dicruridae</FAMILY>
    <GENUS>Dicrurus</GENUS>
  </record>
</schema>
XML;


// The loop isn't necessary, but it helps make the maxrss delta between
// runs more obvious.

for ( $i = 0; $i < 1000; $i++ ) {
    $reader = new XMLReader();
    $reader->xml( $xml );
    $reader->read();
    $reader->expand(); // this is the crucial bit
    $reader->close();
}

echo getrusage()['ru_maxrss'] . "\n";
ori added a comment.Jan 14 2016, 6:49 AM

This leak was reported and fixed in 2014; the example I pasted above does not leak on 3.11. However, rushing to upgrade HHVM across the fleet because of this issue would be unwise, since we have only done minimal testing of 3.11 with our workload. We need a fix for a few weeks. The best idea so far is to segregate GWToolset jobs to a small subset of the jobrunner pool (two servers), and use cron to restart HHVM periodically on those machines.

Change 264055 had a related patch set uploaded (by Giuseppe Lavagetto):
jobrunner: contain gwt jobs to run on two specific hosts

https://gerrit.wikimedia.org/r/264055

Change 264055 merged by Giuseppe Lavagetto:
jobrunner: contain gwt jobs to run on two specific hosts

https://gerrit.wikimedia.org/r/264055

ori lowered the priority of this task from Unbreak Now! to Normal.Jan 14 2016, 4:56 PM

Change 263776 abandoned by Ori.livneh:
Let different jobs use different dispatchers

Reason:
Wrong approach

https://gerrit.wikimedia.org/r/263776

Krinkle moved this task from Doing to Inbox on the Performance-Team board.Mar 17 2016, 8:42 PM

@aaron says we recently upgraded to HHVM 3.12 which presumably contains the fix for 3.11 as well.

Active memory still shows the sawtooth pattern, not sure if it's better or not...

aaron removed aaron as the assignee of this task.Jul 9 2016, 12:45 PM
Joe closed this task as Resolved.Sep 5 2017, 8:23 AM