For years, probably forever, the dumps have been run by hand out of screen and restarted by hand when they break. This is partly due to the rolling nature of the dumps; there was never the concept of being 'done' with a full run. But it's also partly because the dumps were a smallish project back in the day. With staged dumps, we are now in a position to run them out of cron twice a month on each snapshot host.
Description
Status | Assigned | Task
---|---|---
Resolved | ArielGlenn | T107750 Make dumps run via cron on each snapshot host
Resolved | ArielGlenn | T107757 staged dumps implementation
Resolved | ArielGlenn | T107758 allow dumps to be treated as 'done' even though some steps are skipped
Resolved | ArielGlenn | T107759 worker bash script terminates early when there are still more wikis to run
Resolved | ArielGlenn | T108077 copy partial dumps from dataset host to labs
Resolved | ArielGlenn | T110305 staged dumps: use the "cutoff" option as little as possible
Resolved | ArielGlenn | T107760 need script that handles all bash worker scripts on a given snapshot, per stage, rerunning failures as appropriate, managing resources as appropriate
Resolved | ArielGlenn | T107767 move some wikis from small to big dumps config
Resolved | ArielGlenn | T107860 generate command lists for dump scheduler
Resolved | ArielGlenn | T110888 redo dumps monitor so it runs as a service
Event Timeline
Things that need to be done for this to happen:
- make sure we can skip some jobs on a particular run and still call the dump 'complete' (lets us run dumps without full content for all revisions as 'complete dumps' since most users don't need those)
- have a script that runs a batch of commands in sequence, keeps track of return codes and flags them for email notification or rerun or skipping, and can recover if it dies; this replaces running the worker bash script out of 8 screens or 4 or 1 depending on the host, watching each stage to see when it completes, running the next stage manually and so on
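A minimal sketch of what such a batch runner could look like. This is not the scheduler that was eventually written: `run_batch`, the JSON state file and the `max_retries` parameter are all made up for illustration.

```python
import json
import subprocess
from pathlib import Path


def run_batch(commands, state_path, max_retries=1):
    """Run shell commands in order and record each result in a JSON file.

    Hypothetical helper. Commands already recorded as "done" are skipped,
    so a runner restarted after a crash picks up where the last one died.
    Results are keyed by the command string, so commands must be unique.
    """
    state_file = Path(state_path)
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    failed = []
    for command in commands:
        if state.get(command) == "done":
            continue  # finished in an earlier invocation
        for _attempt in range(max_retries + 1):
            returncode = subprocess.run(command, shell=True).returncode
            if returncode == 0:
                break  # no need to retry a success
        state[command] = "done" if returncode == 0 else "failed"
        # Persist after every command so a crash loses at most one result.
        state_file.write_text(json.dumps(state, indent=2))
        if returncode != 0:
            failed.append(command)  # candidates for email notification
    return failed
```

The returned list is what a wrapper would mail out or feed to a later rerun; calling `run_batch` again with the same state file retries only the commands that are not yet marked done.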
Some set of circumstances causes the worker bash script to believe there are failures for a run when no dump was run for any wiki. It may be a race condition; either way the script exits early, which may leave some wikis not dumped for that stage.
What's left:
- clean up the cron classes in the last commit https://gerrit.wikimedia.org/r/#/c/263807/ so that all conf file and other dependencies are called out
- make the invocations do a pgrep to make sure no previous instance is still running, etc
- make sure we have all dump stages lists required for these cron jobs
- document the intended use of the script since in about 5 minutes I'll forget why I wrote it that way
- remove extra whitespace from script, any other small cleanup
- check that we do the right thing in case a new wiki has never been dumped but all other wikis have completed their run for the month (the right thing in this case is not to run anything)
- check that we do the right thing in case a wiki dump run failed in its run for the month (the right thing is no new dump dir creation but rerunning all wikis for the existing run; those that are complete will be skipped and only the failed/missing steps on the one wiki rerun)
- redirect output to /dev/null, we already log; otherwise the spam mail messages will be ginormous
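The two "right thing" checks above boil down to a small decision, sketched here. The names (`Action`, `choose_action`) and the status labels are hypothetical; the real script works from the dump directories and status files, not from a dict like this.

```python
from enum import Enum


class Action(Enum):
    NEW_RUN = "create new dump directories and start a run"
    RERUN_EXISTING = "rerun the existing run; complete steps get skipped"
    NOTHING = "do nothing"


def choose_action(statuses):
    """Decide what a cron invocation should do.

    `statuses` maps wiki name to "complete" or "failed" for this month's
    run, or to None if the wiki has no dump in this run (for example a
    brand-new wiki). These labels are assumptions for the sketch.
    """
    started = [status for status in statuses.values() if status is not None]
    if not started:
        return Action.NEW_RUN  # no run exists yet for this month
    if all(status == "complete" for status in started):
        # A never-dumped wiki alone must not trigger a new run.
        return Action.NOTHING
    # Something failed or is missing: reuse the existing run directories.
    return Action.RERUN_EXISTING
```

So `choose_action({"enwiki": "complete", "newwiki": None})` picks `Action.NOTHING`, and a single failed wiki among complete ones picks `Action.RERUN_EXISTING`.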
After all that, enable the jobs. They won't run until next month, and they will be for the full monthly run only. We can add the partial run to cron when we have the replacement eqiad hardware; right now I have to juggle the steps of the second monthly run manually to get everything done in time.
These jobs are now all enabled. The first attempt to run will be Feb 2 early in the morning. I'll be checking to make sure everything started properly.
One catch is that, because more than one host does the dumps of the 'regular' (not en wikipedia) wikis, if one host is still running a previous dump run but the other has completed it, we can have problems. Specifically, one host will prepare all unlocked wikis for the new run, but it won't be able to prepare the wikis with dumps in progress. The second host, when it finally completes its jobs from the previous dump run, will run the cron job and try to prepare the wikis again with a different start date. This is not a blocker, but it does mean I need to run the second monthly run manually and watch its completion until that is fixed up.
I should have updated this this morning. Anyways, cron jobs didn't start up the dumps because of a silly typo in the script. Fixed here: https://gerrit.wikimedia.org/r/#/c/267843/
Small/big wiki dumps were hung on start of dump run for labtestwiki which can't be dumped from snapshots. I added that to the list of dbs to skip and those jobs are now rolling. https://gerrit.wikimedia.org/r/#/c/268064/
Third time's a charm: after a rewrite, a new dryrun option, and a ton more testing of the dumps cron script on all the hosts, we should be set to pick up tomorrow where we left off last night. This morning I found that small/big wikis had started from scratch with a new day's run, absolutely not the desired behavior. Part of that was my not deploying the current versions of the dumpadmin.py and WikiDump.py files over to the snapshot hosts for small/big wikis. At any rate, that's all been cleaned up; the rewrite of the script is here and has been merged: https://gerrit.wikimedia.org/r/#/c/268433/
And the saga goes on. pgrep works differently depending on whether the script you're searching for was run with /bin/bash scriptname or not. Grrrrr! So now I get all running instances and check for those that don't have the pgroup of the script doing the checking. Latest change (tested as thoroughly as I could with and without the cron job already running): https://gerrit.wikimedia.org/r/#/c/268659/
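For the record, the idea of "find every running instance, then ignore the ones in my own process group" can be sketched like this. Note the swap: this sketch scans Linux `/proc` instead of calling pgrep, and `other_instances` is a made-up name, so it illustrates the check rather than reproducing the merged change.

```python
import os
from pathlib import Path


def other_instances(pattern):
    """Return PIDs whose command line contains `pattern` and which are
    NOT in the caller's process group.

    Assumes a Linux /proc filesystem. Matching is a plain substring test
    on the full command line, so it does not matter whether the script
    was started directly or as "/bin/bash scriptname".
    """
    my_pgid = os.getpgrp()
    others = []
    for entry in Path("/proc").iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        pid = int(entry.name)
        try:
            raw = (entry / "cmdline").read_bytes()
            cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
            if pattern in cmdline and os.getpgid(pid) != my_pgid:
                others.append(pid)
        except (FileNotFoundError, ProcessLookupError):
            continue  # process exited while we were looking at it
    return others
```

A cron wrapper would bail out when the returned list is non-empty. Subshells and pipelines spawned by the checking script share its process group, so they never count as a "previous instance".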
Sure hope this is the last of it.
Why do I forget to log success here? Anyways the jobs are humming along so I can finally close this. No more screen sessions!