Maniphest T122069

jobrunner memory leaks
Closed, Resolved · Public

Assigned To

None

Authored By

	faidon
	Dec 21 2015, 7:26 PM

Description

Since approximately Dec 15th there has been an increase in the rate at which jobrunners consume memory. The effect appears to be cumulative, i.e. trending towards exhausting all memory on the system and OOMing (which has already happened on most of them). Memory levels on the jobrunners used to be fairly steady, so this is new.

This seems to correlate with two changes happening that day:

Make use of the per-server jobqueue:s-queuesWithJobs key (6483f1ad828b49070f4c86b1abb5b8b97105b1c2)
Convert mw1162-1169 to job runners (ba0a47b56ded3dc748c436c2940f114389b312f2)

@ori has restarted mw1015 with jemalloc profiling and is collecting heap stats.

In the meantime, I restarted HHVM across the whole jobrunner fleet to avoid a fleet-wide OOM. Extrapolating from the current trend it appears that we'll get to the OOM threshold again (unless we restart again) in approximately 4 days.
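For illustration, an estimate like the "approximately 4 days" above can be obtained by fitting a straight line to memory samples taken since the last restart and solving for the time at which the line crosses the OOM threshold. The sketch below is a minimal example under that assumption; the function name `hoursUntilThreshold`, the sample values and the 12 GB threshold are all hypothetical and are not taken from the actual graphs.

```php
<?php
// Minimal sketch: least-squares linear extrapolation of memory usage.
// $samples is a list of [hoursSinceRestart, memoryUsedGb] pairs.
// All numbers below are made up for illustration.
function hoursUntilThreshold(array $samples, float $thresholdGb): ?float {
    $n = count($samples);
    if ($n < 2) {
        return null; // not enough data to fit a line
    }
    $sumX = $sumY = $sumXY = $sumXX = 0.0;
    $lastX = 0.0;
    foreach ($samples as [$x, $y]) {
        $sumX += $x;
        $sumY += $y;
        $sumXY += $x * $y;
        $sumXX += $x * $x;
        $lastX = max($lastX, $x);
    }
    $denom = $n * $sumXX - $sumX * $sumX;
    if ($denom == 0.0) {
        return null; // all samples share the same timestamp
    }
    $slope = ($n * $sumXY - $sumX * $sumY) / $denom;
    $intercept = ($sumY - $slope * $sumX) / $n;
    if ($slope <= 0.0) {
        return null; // memory is flat or shrinking, no crossing ahead
    }
    $crossing = ($thresholdGb - $intercept) / $slope;
    return max(0.0, $crossing - $lastX); // hours left after the latest sample
}

// Hypothetical readings: 2 GB right after restart, growing 1 GB per 12 hours.
$samples = [[0, 2.0], [12, 3.0], [24, 4.0], [36, 5.0]];
$hours = hoursUntilThreshold($samples, 12.0);
echo 'Estimated days until threshold: ' . ($hours / 24) . "\n";
```

A linear fit is the simplest choice; if the growth rate itself is increasing, this would overestimate the time remaining.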

Keyword: jobqueue, job queue

Details

Subject	Repo	Branch	Lines +/-
Let different jobs use different dispatchers	mediawiki/services/jobrunner	master	+17 -9
jobrunner: contain gwt jobs to run on two specific hosts	operations/puppet	production	+20 -7
Make sure XMLReader::close() is always called	mediawiki/extensions/GWToolset	wmf/1.27.0-wmf.9	+7 -0
Make sure XMLReader::close() is always called	mediawiki/extensions/GWToolset	wmf/1.27.0-wmf.10	+7 -0
Make sure XMLReader::close() is always called	mediawiki/extensions/GWToolset	master	+7 -0
Disable gwtoolsetUpload* jobs on even-numbered jobrunners	operations/puppet	production	+1 -1

Related Objects

Status	Assigned	Task
Resolved	ori	T124194 Job queue is growing and growing
Resolved	None	T122069 jobrunner memory leaks
Resolved	aaron	T123284 Record per-job-type memory usage statistics

Event Timeline

faidon created this task. Dec 21 2015, 7:26 PM

faidon raised the priority of this task from to Unbreak Now!.

faidon updated the task description.

faidon added projects: SRE, Performance-Team.

faidon added subscribers: faidon, ori, Joe.

Restricted Application added a subscriber: Aklapper. Dec 21 2015, 7:26 PM

Krinkle set Security to None. Dec 21 2015, 7:47 PM

Krinkle moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.

Krinkle added a subscriber: aaron.

The first change only affected JobChron. According to http://graphite.wikimedia.org/render/?width=1887&height=960&_salt=1450730658.495&target=jobrunner.memory.*.count&from=-7days, jobchron and jobrunner have stable memory use. The total use on mw1015 (for example) is low:

15781 www-data  20   0  843844 256568   5952 S   0.3  2.1  48:18.89 /usr/bin/php /srv/deployment/jobrunner/jobrunner/redisJobChronService --config-file=/etc/jobrunner/jobrunn+ 
12555 www-data  20   0  648312  60208   6200 S   8.6  0.5  76:59.66 /usr/bin/php /srv/deployment/jobrunner/jobrunner/redisJobRunnerService --config-file=/etc/jobrunner/jobrun+

Incidentally, I noticed that hhvm hangs on startup on mw1001 and those two processes were not running there either. Using php5 worked fine. mw101[56] didn't have this problem.

I assume the OOMs are for the actual web server hhvm then.

Diffing heaps on mw1015 points to xmlSearchNsByHref being the culprit.

I disabled puppet on mw1014 and mw1015 and excluded specific job types on each of them to see if it helps isolate the cause.

on mw1015: excluded cirrusSearchLinksUpdate and cirrusSearchLinksUpdatePrioritized
on mw1014: set restbase runners to 0

edit: ...but I forgot to restart HHVM. Doing so now (04:47 UTC).

Steinsplitter subscribed. Dec 22 2015, 11:37 AM

Just in case it may be helpful: P2446

jcrespo mentioned this in T121623: Job runners throw lots of "Can't connect to MySQL server" exceptions. Dec 22 2015, 11:51 AM

In T122069#1896872, @ori wrote:

I disabled puppet on mw1014 and mw1015 and excluded specific job types on each of them to see if it helps isolate the cause.

on mw1015: excluded cirrusSearchLinksUpdate and cirrusSearchLinksUpdatePrioritized

on mw1014: set restbase runners to 0

FWIW, this doesn't seem to have had an effect.

Glaisher subscribed. Dec 23 2015, 8:59 AM

In T122069#1900375, @faidon wrote:

In T122069#1896872, @ori wrote:

I disabled puppet on mw1014 and mw1015 and excluded specific job types on each of them to see if it helps isolate the cause.

on mw1015: excluded cirrusSearchLinksUpdate and cirrusSearchLinksUpdatePrioritized

on mw1014: set restbase runners to 0

FWIW, this doesn't seem to have had an effect.

This ran for 12 hours (I used my time-bounded Puppet disabling trick), during which memory usage on mw1015 was consistently higher than mw1014: http://graphite.wikimedia.org/S/a

It is not decisive, but it'd be worthwhile to repeat this experiment and have it run a little longer this time.

I'd assume you'd also want to exclude refreshLinks.

In T122069#1895995, @aaron wrote:

The first change only affected JobChron. According to http://graphite.wikimedia.org/render/?width=1887&height=960&_salt=1450730658.495&target=jobrunner.memory.*.count&from=-7days, jobchron and jobrunner have stable memory use.

The count property represents the number of data points submitted. It does not reflect the value. In theory this could represent uptime, but I don't think it reflects stable memory usage.

See also https://wikitech.wikimedia.org/wiki/Graphite#Extended_properties for more information.
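To make the distinction concrete, here is a small sketch of how a set of reported values might be summarised into several properties per flush interval. The property names `count`, `lower`, `upper` and `mean` are assumed to follow the usual statsd-style aggregates, and the sample values are invented; the point is only that `count` can be identical for a steady process and a leaking one.

```php
<?php
// Minimal sketch: summarise the values reported during one flush interval.
// The set of properties is an assumption modelled on statsd-style aggregates.
function summarize(array $samples): array {
    if ($samples === []) {
        return ['count' => 0];
    }
    return [
        'count' => count($samples),                   // how many data points arrived
        'lower' => min($samples),                     // smallest reported value
        'upper' => max($samples),                     // largest reported value
        'mean'  => array_sum($samples) / count($samples),
    ];
}

// Hypothetical memory readings (MB) from two processes in the same interval.
$steady  = summarize([250, 251, 249, 250]);
$leaking = summarize([250, 900, 1800, 3600]);

// Both report the same count, so a graph of *.count looks equally flat for
// both; only the value-based properties reveal the growth.
printf("steady:  count=%d upper=%d\n", $steady['count'], $steady['upper']);
printf("leaking: count=%d upper=%d\n", $leaking['count'], $leaking['upper']);
```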

I do not know whether the test is over (puppet is still disabled). If it is still running, note that mw1015 has just frozen again (presumably because of OOM).

I've restarted HHVM on jobrunners thrice now, to avoid further OOMs (one cut it real close too). I'd like to revert the two commits that we identified above as possible suspects of this leak. I'd prefer playing it safe, especially now, given the limited availability of people around the holidays.

Any objections?

In T122069#1904624, @faidon wrote:

I've restarted HHVM on jobrunners thrice now, to avoid further OOMs (one cut it real close too). I'd like to revert the two commits that we identified above as possible suspects of this leak. I'd prefer playing it safe, especially now, given the limited availability of people around the holidays.

Any objections?

Reverting 6483f1ad828b49 would probably cause problems with job pickup. Those services run in their own hhvm process rather than using the fcgi server, and the service that curls to the fcgi server was not changed, so I wouldn't recommend that.

Krenair subscribed. Dec 27 2015, 3:49 AM

Change 261104 had a related patch set uploaded (by Ori.livneh):
Restart HHVM on the jobrunners daily, as temp. workaround for T122069

https://gerrit.wikimedia.org/r/261104

gerritbot added a project: Patch-For-Review. Dec 27 2015, 6:52 AM

In T122069#1904624, @faidon wrote:

I've restarted HHVM on jobrunners thrice now, to avoid further OOMs (one cut it real close too). I'd like to revert the two commits that we identified above as possible suspects of this leak. I'd prefer playing it safe, especially now, given the limited availability of people around the holidays.

Any objections?

I718b3a1e4 ("Restart HHVM on the jobrunners daily, as temp. workaround for T122069") would be substantially safer.

Change 261104 merged by Ori.livneh:
Restart HHVM on the jobrunners daily, as temp. workaround for T122069

https://gerrit.wikimedia.org/r/261104

Nemo_bis updated the task description. Dec 28 2015, 7:17 AM

Nemo_bis added a project: WMF-General-or-Unknown.

Trijnstel subscribed. Jan 2 2016, 1:50 PM

The time it takes each job runner to OOM has been steadily shrinking, so restarting once a day is now inadequate.

:-/ https://grafana.wikimedia.org/dashboard/db/job-queue-health

Jobs queued went from 8K to 3 million in the last week.

@jcrespo getting a tonne of timeout alerts for RAID etc in #wikimedia-operations for mw1008, mw1015 and mw1004 - related?

Addshore subscribed. Jan 10 2016, 4:59 PM

Legoktm subscribed. Jan 11 2016, 6:53 PM

What we know

The rate of memory growth after restarting HHVM has been increasing over the past two weeks:

Memory growth appears to have suddenly flattened out at 06:26 UTC on Jan 11:

memory-11-jan-2016.png (400×800 px, 74 KB)

No deployments occurred during this period, so the changes in the rate of memory growth are not attributable to code changes. It must have something to do with the state of the queue (how many jobs of each type there are). This is why I still think a specific job type is responsible, even though I have not been able to isolate it.

CategoryMembershipChangeJob is suspect, because it is new, having rolled out to production on Dec 8, which roughly fits the timeline. However, when I disabled the job on mw1166 and restarted HHVM (at 08:55 UTC on Jan 10), the problem did not go away.

(I'll add more notes to this comment shortly.)

ori mentioned this in T123284: Record per-job-type memory usage statistics. Jan 11 2016, 8:39 PM

06:25 UTC is cron.daily, which includes, among others, logrotate. We have three HHVM/MediaWiki-related logrotates, but only /etc/logrotate.d/mediawiki_jobrunner seems to be relevant here: it issues a /sbin/restart jobrunner after rotating the logs, which restarts the redisJobRunnerService HHVM.

MZMcBride subscribed. Jan 12 2016, 5:25 AM

To try to detect what caused the change at ~6:30, I counted jobs by type for the two hours before and after the change. I did this on mw1166, where the pattern was particularly clear:

The results are available on Google Docs (open access).

Interesting that gwtoolsetUpload* jobs ran at a 91-97% lower rate (almost not running) in the later period of the above graph than in the former period.

The saw-tooth pattern of memory build up and restart seems to mesh (proportionately) with the integral of the run rate of those jobs:
http://graphite.wikimedia.org/render/?width=1887&height=960&_salt=1452630580.203&from=00%3A00_20160101&until=23%3A59_20160112&target=movingAverage(servers.mw1010.memory.Active%2C10)&target=scale(integral(MediaWiki.jobqueue.run.gwtoolsetUpload*.count)%2C1e5).

http://graphite.wikimedia.org/render/?width=1887&height=960&_salt=1452636031.651&from=-90days&target=servers.mw1010.memory.Active&target=scale(MediaWiki.jobqueue.run.gwtoolsetUploadMetadataJob.count%2C1e7)

That is what I'd expect if each job leaked a similar amount of memory. Other jobs, like refreshLinks, do not show this correspondence, e.g.:
http://graphite.wikimedia.org/render/?width=1887&height=960&_salt=1452631318.01&from=00%3A00_20160101&until=23%3A59_20160112&target=movingAverage(servers.mw1010.memory.Active%2C10)&target=scale(integral(MediaWiki.jobqueue.run.refreshLinks*.count)%2C1e1)
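The reasoning behind the graphs above can be sketched in code: if every run of a job leaks roughly the same amount, memory since the last restart should track the running total (Graphite's `integral()`) of that job's run count. The snippet below is a minimal illustration with invented numbers; the helper names `cumulativeSum` and `pearson` are hypothetical, and a correlation coefficient is used here only as a stand-in for eyeballing the two curves.

```php
<?php
// Running total of per-interval counts, analogous to Graphite's integral().
function cumulativeSum(array $perInterval): array {
    $total = 0;
    $out = [];
    foreach ($perInterval as $value) {
        $total += $value;
        $out[] = $total;
    }
    return $out;
}

// Pearson correlation coefficient of two equally long series.
// Returns null when it is undefined (too few points or zero variance).
function pearson(array $a, array $b): ?float {
    $n = count($a);
    if ($n < 2 || $n !== count($b)) {
        return null;
    }
    $meanA = array_sum($a) / $n;
    $meanB = array_sum($b) / $n;
    $cov = $varA = $varB = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $da = $a[$i] - $meanA;
        $db = $b[$i] - $meanB;
        $cov += $da * $db;
        $varA += $da * $da;
        $varB += $db * $db;
    }
    if ($varA == 0.0 || $varB == 0.0) {
        return null;
    }
    return $cov / sqrt($varA * $varB);
}

// Invented data: job runs per interval and memory (MB) at the end of each
// interval, constructed so that each run leaks 5 MB on top of a 1000 MB base.
$runsPerInterval = [10, 20, 0, 30];
$memoryMb = [1050, 1150, 1150, 1300];

$r = pearson(cumulativeSum($runsPerInterval), $memoryMb);
echo 'correlation between integral of runs and memory: ' . $r . "\n";
```

A value close to 1 for one job type and not for others would be consistent with that job type being the source of the leak, although correlation alone does not prove it.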

In T122069#1929172, @aaron wrote:

Interesting that gwtoolsetUpload* jobs ran at a 91-97% lower rate (almost not running) in the later period of the above graph than in the former period.

The gwtoolset jobs process XML, too, so this would at least be consistent with what I saw when diffing heaps (T122069#1896643).

Change 263724 had a related patch set uploaded (by Ori.livneh):
Disable gwtoolsetUpload* jobs on even-numbered jobrunners

https://gerrit.wikimedia.org/r/263724

Change 263724 merged by Ori.livneh:
Disable gwtoolsetUpload* jobs on even-numbered jobrunners

https://gerrit.wikimedia.org/r/263724

Possibly related to https://github.com/facebook/hhvm/issues/3899 (which is the same pattern the GWT code uses).

Change 263751 had a related patch set uploaded (by Aaron Schulz):
Make sure XMLReader::close() is always called

https://gerrit.wikimedia.org/r/263751
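As a general illustration of what "always called" means in practice (this is not the content of the change above, which is not reproduced in this thread), a reader can be wrapped in `try`/`finally` so that `close()` runs even if processing throws. The function `countElements` and its input are hypothetical.

```php
<?php
// Minimal sketch: guarantee XMLReader::close() on every exit path.
// countElements() is a made-up example function, not GWToolset code.
function countElements(string $xml): int {
    $reader = new XMLReader();
    if (!$reader->XML($xml)) {
        throw new RuntimeException('Could not load XML');
    }
    $count = 0;
    try {
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT) {
                $count++; // count opening element nodes only
            }
        }
    } finally {
        // Runs whether the loop finished normally or an exception was thrown.
        $reader->close();
    }
    return $count;
}

$xml = '<schema><record><FAMILY>Dicruridae</FAMILY><GENUS>Dicrurus</GENUS></record></schema>';
echo countElements($xml) . "\n";
```

This only guarantees cleanup from the PHP side; it cannot help if the runtime itself fails to release memory after `close()`.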

Change 263776 had a related patch set uploaded (by Aaron Schulz):
Let different jobs use different dispatchers

https://gerrit.wikimedia.org/r/263776

Change 263751 merged by jenkins-bot:
Make sure XMLReader::close() is always called

https://gerrit.wikimedia.org/r/263751

ReleaseTaggerBot added a project: MW-1.27-release (WMF-deploy-2016-01-19_(1.27.0-wmf.11)). Jan 13 2016, 2:00 AM

Yep, it's GWT:

Change 263811 had a related patch set uploaded (by Ori.livneh):
Make sure XMLReader::close() is always called

https://gerrit.wikimedia.org/r/263811

Change 263812 had a related patch set uploaded (by Ori.livneh):
Make sure XMLReader::close() is always called

https://gerrit.wikimedia.org/r/263812

Change 263811 merged by jenkins-bot:
Make sure XMLReader::close() is always called

https://gerrit.wikimedia.org/r/263811

Change 263812 merged by jenkins-bot:
Make sure XMLReader::close() is always called

https://gerrit.wikimedia.org/r/263812

ReleaseTaggerBot added projects: MW-1.27-release (WMF-deploy-2015-12-15_(1.27.0-wmf.9)), MW-1.27-release (WMF-deploy-2016-01-12_(1.27.0-wmf.10)). Jan 13 2016, 6:00 AM

fgiunchedi subscribed. Jan 13 2016, 2:22 PM

aaron closed subtask T123284: Record per-job-type memory usage statistics as Resolved. Jan 13 2016, 11:53 PM

I think I have it isolated -- it's XMLReader::expand:

T122069.php

<?php
# To run:
# hhvm -vEval.Jit=1 --no-config --count 10 T122069.php

$xml = <<<XML
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<schema xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <record>
    <FAMILY>Dicruridae</FAMILY>
    <GENUS>Dicrurus</GENUS>
  </record>
</schema>
XML;


// The loop isn't necessary, but it helps make the maxrss delta between
// runs more obvious.

for ( $i = 0; $i < 1000; $i++ ) {
    $reader = new XMLReader();
    $reader->xml( $xml );
    $reader->read();
    $reader->expand(); // this is the crucial bit
    $reader->close();
}

echo getrusage()['ru_maxrss'] . "\n";

Agabi10 subscribed. Jan 14 2016, 4:09 AM

This leak was reported and fixed in 2014; the example I pasted above does not leak on 3.11. However, rushing to upgrade HHVM across the fleet because of this issue would be unwise, since we have only done minimal testing of 3.11 with our workload. We need a stopgap for the next few weeks. The best idea so far is to segregate GWToolset jobs to a small subset of the jobrunner pool (two servers), and use cron to restart HHVM periodically on those machines.

Change 264055 had a related patch set uploaded (by Giuseppe Lavagetto):
jobrunner: contain gwt jobs to run on two specific hosts

https://gerrit.wikimedia.org/r/264055

Change 264055 merged by Giuseppe Lavagetto:
jobrunner: contain gwt jobs to run on two specific hosts

https://gerrit.wikimedia.org/r/264055

ori lowered the priority of this task from Unbreak Now! to Medium. Jan 14 2016, 4:56 PM

ori added a parent task: T124194: Job queue is growing and growing. Jan 21 2016, 1:43 AM

Change 263776 abandoned by Ori.livneh:
Let different jobs use different dispatchers

Reason:
Wrong approach

https://gerrit.wikimedia.org/r/263776

Krinkle moved this task from Doing (old) to Inbox, needs triage on the Performance-Team board. Mar 17 2016, 8:42 PM

@aaron says we recently upgraded to HHVM 3.12, which presumably contains the fix from 3.11 as well.

Krinkle assigned this task to aaron. Mar 28 2016, 7:10 PM

Krinkle moved this task from Inbox, needs triage to To-do: Goals prioritized current Quarter on the Performance-Team board.

Krinkle removed projects: MW-1.27-release (WMF-deploy-2016-01-12_(1.27.0-wmf.10)), MW-1.27-release (WMF-deploy-2015-12-15_(1.27.0-wmf.9)), MW-1.27-release (WMF-deploy-2016-01-19_(1.27.0-wmf.11)), Patch-For-Review.

Active memory still shows the sawtooth pattern, not sure if it's better or not...

elukey subscribed. May 24 2016, 5:48 PM

aaron moved this task from To-do: Goals prioritized current Quarter to Backlog: Maintenance, non-prioritized on the Performance-Team board. Jun 9 2016, 10:47 PM

aaron removed aaron as the assignee of this task. Jul 9 2016, 12:45 PM

Krinkle edited projects, added WMF-JobQueue; removed Performance-Team. Dec 6 2016, 12:34 AM

Joe closed this task as Resolved. Sep 5 2017, 8:23 AM

	F3221921: download.png
	Jan 13 2016, 2:10 AM

	F3220227: download.png
	Jan 12 2016, 7:41 AM

	F3219173: memory-jan-2016.png
	Jan 11 2016, 7:30 PM
