
snapshot1004 running dumps very slowly, investigate
Closed, Resolved · Public

Description

I got reports on Wednesday that dumps appeared to be stalling on snapshot1004. The two dumpBackup.php scripts running, one for metawiki and the other for wikidata, were both using large amounts of memory, and the machine was swapping. Investigate and fix.

Event Timeline

ArielGlenn claimed this task.
ArielGlenn raised the priority of this task from to High.
ArielGlenn updated the task description. (Show Details)
ArielGlenn added a project: acl*sre-team.
ArielGlenn added subscribers: ArielGlenn, hoo, greg.

I suspect a memory leak in dumpBackup.php and/or the MediaWiki core code it uses. Running the command by hand instead of from the wrapper script shows that the leak is indeed somewhere in the PHP code. The host is running PHP 5.3.10 (Ubuntu Precise).
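For reference, a minimal sketch of how one might confirm the leak while running dumpBackup.php by hand: sample the process's resident memory from /proc and watch for steady growth. The pid argument and sampling interval here are placeholders, not the actual tooling used.

```python
import sys
import time


def rss_kb(pid):
    """Return the resident set size in kB, read from /proc/<pid>/status (Linux)."""
    with open("/proc/%d/status" % pid) as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0


if __name__ == "__main__":
    pid = int(sys.argv[1])   # pid of the dumpBackup.php process under test
    interval = 60            # seconds between samples
    while True:
        try:
            print("%s VmRSS: %d kB" % (time.strftime("%H:%M:%S"), rss_kb(pid)))
        except IOError:      # process exited; /proc entry is gone
            break
        time.sleep(interval)
```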

snapshot1004 has also been causing us tons of grief during deploys:

May 6th:
20:14 twentyafterfour: ignore all rumors of scap failures, the scaps were successful, with the exception of snapshot1004.eqiad.wmnet which hangs every time
00:13 bd808: Aborted sync-common on snapshot1004; host is starved for RAM and using swap heavily
May 5th:
23:57 bd808: aborted and restarted sync-common on snapshot1004.eqiad.wmnet manually after waiting 24 minutes with no progress
April 30th:
21:05 bd808: Finally got sync-common to run to completion on snapshot1004; runtime 45 minutes!
17:41 bd808: sync-common on snapshot1004 failed after 33 minutes with rsync timeout
April 29th:
21:04 bd808: load avg on snapshot04 11.11; scap slow waiting on it
April 23rd:
17:28 ori: scap stuck on snapshot1004; not accepting mwdeploy key
greg added a subscriber: bd808.

The grief during deploys was due to heavy swapping caused by the aforementioned memory leak. Except for April 23; I don't know what that was.

In the meantime, please pick up the latest stubs dump of wikidata here: http://dumps.wikimedia.org/wikidatawiki/20150423/ Skip the pieces; I will be tossing them later. They were just a shortcut to getting the dump out the door sooner. I have started on the content dumps; they should be done by late tomorrow or possibly early Wednesday.

Well, I now know what this is. It's not a new leak; it's just that the largest single stubs file in our dump runs is now produced by wikidata! And given that the script runs for over a day, it's no surprise that PHP eventually starts eating into swap. I'm writing a workaround now so that we'll never have this issue again for any stubs; it will probably take through the weekend to finish up.
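For illustration only, a sketch of the general idea behind such a workaround: instead of one PHP process living for the entire stubs run, produce the stubs in fixed page-ID ranges so each process stays short-lived. The chunk size, the page-ID ceiling, and the exact dumpBackup.php arguments here are assumptions; the real change is in the gerrit changesets referenced later in this task.

```python
import subprocess

CHUNK = 500000           # assumed page-ID range per invocation, not the real setting
MAX_PAGE_ID = 20000000   # assumed upper bound for the wiki being dumped

# Run dumpBackup.php once per page-ID range so no single PHP process lives long
# enough for its memory use to creep into swap. The --start/--end page-ID options
# are assumed here for the sake of the sketch.
start = 1
while start <= MAX_PAGE_ID:
    end = start + CHUNK
    subprocess.check_call([
        "php", "maintenance/dumpBackup.php", "--wiki=wikidatawiki",
        "--stub", "--start=%d" % start, "--end=%d" % end,
        "--output=gzip:stubs-%d-%d.xml.gz" % (start, end),
    ])
    start = end
```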

Do the partial-dump wikis keep their indicated place in the queue, or will they jump ahead of wikis whose last full dumps are more recent than their own?

I have rewritten things to work around the issue and have done a full test run; it looks good. I'll try to get some of the tables run over the next couple of days, and as far into a full run as I can, one job at a time. After that we'll be doing stubs near the first of every month, then the rest of the dump jobs for the same date.

It will become clearer in a few days, when I start the new June runs. :-)

New June runs are underway for all wikis, stubs first. In a couple of days, when those are all done, I'll do tables for all wikis, then the rest of the dump jobs.
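Roughly, and only as an illustration of the run order described above (the wiki list, job names, and helper are placeholders, not the real scheduler):

```python
def run_dump_job(wiki, job):
    """Placeholder for kicking off one dump job for one wiki."""
    print("running %s for %s" % (job, wiki))


# Illustration only: finish each stage across every wiki before starting the next,
# so the memory-heavy stubs step runs first and on its own.
WIKIS = ["wikidatawiki", "metawiki", "enwiktionary"]   # placeholder wiki list
STAGES = ["stubs", "tables", "remaining dump jobs"]    # per the run order above

for stage in STAGES:
    for wiki in WIKIS:
        run_dump_job(wiki, stage)
```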

I've run a full round of both stubs and logs on snapshot1004 and memory usage was nice and low. Closing. For future reference, the applicable changesets are https://gerrit.wikimedia.org/r/#/c/215666/ and https://gerrit.wikimedia.org/r/#/c/215671/

Why is this closed before we know whether the fix actually fixed things all the way through to completion of the runs on which the process choked before?

Reopening for now, pending completion of a full cycle of runs.

It was closed because the step killing the box was xmlstubs. I've fixed that.

If this is to be an application open to ordinary Wiktionary project contributors, then the criterion for closure has to be something like "what failed ran", not "one person (no matter how skilled!) confidently predicts that what failed will run".

From the point of view of the implementer of the fix, this may indeed be closed, but not from a user point of view.

It's not a prediction; it's based on observation after running the problematic step. I ran a full set of stub dumps in the middle of the month already with the new code, and I've done a second run already this month. That was the step with the issue; I watched memory very closely and everything was well behaved. In addition we ran 7 processes, slightly more than the usual number. Still fine.

I assumed you had a sound basis for your prediction, but the prediction isn't the fact.

Do you think you could humor me?

Thanks for your work on this, Ariel! To be fair, the title of the ticket is "snapshot1004 running dumps very slowly", not "wiktionary/(insert other project name) db dumps not available".

But other end-user items were subsumed into this one.

Of course, I appreciate the attention and effort and expect the final results to be everything ArielGlenn has indicated.

What were the other end-user items? I only looked at the memory use issue.

One end-user related thing I've noticed is that the multistream dumps are still missing: http://dumps.wikimedia.org/enwiktionary/20150602/

That's because the last "stage" of dumps is now running across all wikis. As you can see by looking at the index page, projects are changing from "partial dump" to "dump complete". I guess it will be another day or two before it gets to en wiktionary. But was this part of the original ticket?
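For what it's worth, a throwaway way to watch for that from the user side is to poll the per-run index page. The URL pattern follows the link given above, but the exact "Dump complete" marker text (and its capitalization) is an assumption based on the status strings mentioned here, not a stable API.

```python
try:
    from urllib.request import urlopen   # python 3
except ImportError:
    from urllib2 import urlopen          # python 2

WIKI = "enwiktionary"    # placeholder wiki
DATE = "20150602"        # the run linked above
url = "http://dumps.wikimedia.org/%s/%s/" % (WIKI, DATE)

# Crude check against the per-run index page; the marker string is an assumption.
html = urlopen(url).read().decode("utf-8", "replace")
if "Dump complete" in html:
    print("%s %s: complete" % (WIKI, DATE))
else:
    print("%s %s: still partial or in progress" % (WIKI, DATE))
```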

I filed this one at the end of May, related to skwiki: https://phabricator.wikimedia.org/T100877, which I believe was related to this issue. Not that it has gained any attention at all... :)

Indeed you are right. I've responded on that ticket; if there is more information you need, please ask there, I've claimed it though it's marked resolved.

I suppose I thought that the incomplete May dumps, e.g. no pages-articles dump for enwikt, should have been a red flag.

Next time there is such a failure, should I just put in a task? Is there something else I should do?

A task assigned to me is great; it may be something I know about, in which case I can explain it (as in this case), or it may be something I missed (hopefully not!) and need to investigate. Thanks! If you don't see a response on a ticket after about a week, feel free to hunt me down on IRC and leave me a message there (user name "apergos").

Re multistream: no, I don't think it was part of the original ticket. It's just strange that these now get generated almost a week after the initial files; that's why it looked "broken" to me.

OK, looks like the full wiktionary dump has successfully completed now. Do you expect future dumps will also have this 10-day window from start to finish? It seems rather long.

Just a few more big wikis to go before we can declare victory, at least in the battle, if not the war.

This schedule for running the dumps makes them a little (about a week for enwikt) more stale than they used to be by the time they are finished. Are there capacity/efficiency/reliability gains that offset the staleness?


Not to disregard the value of your question, but, with all due respect, it again shows why this task should be kept on topic. That topic is (as already pointed out above by @jberkel) the problem described in the task description: that two particular scripts on one server (snapshot1004) were running slowly due to a memory leak. I don't see a reason to doubt @ArielGlenn's statement that this memory leak has been fixed, so I'm going to close this now.

The question you are asking was discussed in T89273: Produce stub dumps for all wikis as soon as a new month starts, then generate all other dumps on second round-robin cycle. Feel free to weigh in there, and open new tasks for other issues such as missing dumps for a particular project.