I got reports on Wednesday that dumps appeared to be stalling on snapshot1004. The two dumpBackup.php scripts running, one for metawiki and the other for wikidata, were both using large amounts of memory, and the machine was swapping. Investigate and fix.
I suspect a memory leak in dumpBackup.php and/or the MediaWiki core code it uses. Running the command by hand instead of from the wrapper script shows that the leak is indeed somewhere in the PHP code. The host is running PHP 5.3.10 (Ubuntu Precise).
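One way to confirm a leak like this when running the script by hand is to sample the process's resident set size over time and watch it grow. A minimal, Linux-specific sketch (the backgrounded `sleep` here is a stand-in for the actual dumpBackup.php invocation, whose exact arguments aren't in this ticket):

```shell
#!/bin/sh
# Sample the resident set size (VmRSS, in kB) of a process once per
# second until it exits. In practice $1 would be the PID of the
# dumpBackup.php process started by hand; a short 'sleep' stands in here.
sample_rss() {
    pid=$1
    while kill -0 "$pid" 2>/dev/null; do
        # /proc/<pid>/status carries VmRSS on Linux; ignore the race
        # where the process exits between the kill -0 and the read
        awk '/^VmRSS/ {print $2, $3}' "/proc/$pid/status" 2>/dev/null
        sleep 1
    done
}

sleep 3 &
sample_rss $!
```

A steadily climbing VmRSS over hours, rather than a plateau, points at a leak rather than a legitimately large working set.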
snapshot1004 has also been causing us tons of grief during deploys:
May 6th:
20:14 twentyafterfour: ignore all rumors of scap failures, the scaps were successful, with the exception of snapshot1004.eqiad.wmnet which hangs every time
00:13 bd808: Aborted sync-common on snapshot1004; host is starved for RAM and using swap heavily

May 5th:
23:57 bd808: aborted and restarted sync-common on snapshot1004.eqiad.wmnet manually after waiting 24 minutes with no progress

April 30th:
21:05 bd808: Finally got sync-common to run to completion on snapshot1004; runtime 45 minutes!
17:41 bd808: sync-common on snapshot1004 failed after 33 minutes with rsync timeout

April 29th:
21:04 bd808: load avg on snapshot04 11.11; scap slow waiting on it

April 23rd:
17:28 ori: scap stuck on snapshot1004; not accepting mwdeploy key
The grief during deploys was caused by swapping due to the aforementioned memory leak. The exception is April 23rd; I don't know what caused that one.
In the meantime: please pick up the latest stubs dump of wikidata here: http://dumps.wikimedia.org/wikidatawiki/20150423/ Skip the pieces; I will be tossing them later. They were just a shortcut to getting the dumps out the door sooner. I have started on the content dumps; they should be done late tomorrow or possibly early Wednesday.
Well, I now know what this is. It's not a new leak; it's just that the largest single stubs file in our dump runs is now produced by wikidata! And given that the script runs for over a day, it's no surprise that PHP eventually starts eating into swap. I'm writing a workaround now so that we'll never have this issue again for any stubs; it will probably take through the weekend to finish up.
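The ticket doesn't spell out what the workaround is, but a common pattern for a leaking long-running PHP maintenance script is to split the work into page-id ranges, each dumped by a fresh, short-lived process, so no single process lives long enough to exhaust memory. A hypothetical dry-run sketch (wiki name, batch size, max page id, and file naming are all illustrative; `--start`/`--end` are dumpBackup.php's page-id bounds):

```shell
#!/bin/sh
# Emit one dumpBackup.php command per page-id batch (dry run: the
# commands are printed, not executed). Piping the output to sh would
# run them sequentially, each in its own fresh PHP process.
# print_batches MAXID BATCHSIZE
print_batches() {
    maxid=$1; batch=$2; start=1
    while [ "$start" -le "$maxid" ]; do
        end=$((start + batch - 1))
        echo "php maintenance/dumpBackup.php --wiki=wikidatawiki --full" \
             "--start=$start --end=$end --output=gzip:stubs-$start.xml.gz"
        start=$((end + 1))
    done
}

# e.g. 1.5M pages in batches of 500k -> three short-lived dump processes
print_batches 1500000 500000
```

Whether the actual fix batched by page range, by wiki, or some other axis isn't stated here; this only illustrates the general "many short processes instead of one day-long one" approach.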
I have rewritten things to work around the issue and have done a full test run. It looks good. I'll try to get some of the tables run over the next couple of days, and get as far into a full run as I can, one job at a time. After that we'll be doing stubs near the first of each month, then the rest of the dump jobs for the same date.
It will become clearer in a few days, when I start the new June runs. :-)
If this is to be an application open to ordinary Wiktionary project contributors, then the criterion for closure has to be something like "what failed ran", not "one person (no matter how skilled!) confidently predicts that what failed will run".
From the point of view of the implementer of the fix, this may indeed be closed, but not from a user's point of view.
It's not a prediction; it's based on observation after running the problematic step. I ran a full set of stub dumps in the middle of the month with the new code, and I've already done a second run this month. That was the step with the issue; I watched memory very closely and everything was well behaved. In addition, we ran 7 processes, slightly more than the usual number. Still fine.
That's because the last "stage" of the dumps is now running across all wikis. As you can see from the index page, projects are changing from "partial dump" to "dump complete". I guess it will be another day or two before it gets to en wiktionary. But was this part of the original ticket?
I suppose I thought that the incomplete May dumps (e.g., no pages-articles dump for enwikt) should have been a red flag.
Next time there is such a failure, should I just file a task? Is there something else I should do?
A task assigned to me is great; it may be something I know about, in which case I can explain it (as in this case), or it may be something I missed (hopefully not!) and need to investigate. Thanks! If you don't see a response on a ticket after about a week, feel free to hunt me down on IRC and leave me a message there (username "apergos").
Re: multistream, no, I don't think it was part of the original ticket. It's just strange that these now get generated almost a week after the initial files; that's why it looked "broken" to me.
This schedule for running the dumps makes them somewhat more stale (about a week for enwikt) by the time they are finished than they used to be. Are there capacity/efficiency/reliability gains that offset the staleness?
Not to disregard the value of your question, but - with all due respect - it again shows why this task should be kept on topic. That topic is (as already pointed out above by @jberkel) the problem described in the task description: That two particular scripts on one server (snapshot1004) were running slow due to a memory leak. I don't see a reason to doubt @ArielGlenn's statement that this memory leak has been fixed, so I'm going to close this now.
The question you are asking was discussed in T89273: Produce stub dumps for all wikis as soon as a new month starts, then generate all other dumps on second round-robin cycle. Feel free to weigh in there, and open new tasks for other issues such as missing dumps for a particular project.