
Make dumps run via cron on each snapshot host
Closed, Resolved · Public

Description

For years, probably forever, the dumps have been run by hand out of screen sessions and restarted by hand when they break. This is partly due to the rolling nature of the dumps; there was never the concept of being 'done' with a full run. But it's also partly due to the dumps being a smallish project back in the day. Given staged dumps, we are now in a position to run them out of cron twice a month on each snapshot host.

Event Timeline

ArielGlenn claimed this task.
ArielGlenn raised the priority of this task from to High.
ArielGlenn updated the task description. (Show Details)
ArielGlenn added a project: acl*sre-team.
ArielGlenn subscribed.

Things that need to be done for this to happen:

make sure we can skip some jobs on a particular run and still call the dump 'complete' (this lets us treat dumps without full content for all revisions as 'complete dumps', since most users don't need that content)

have a script that runs a batch of commands in sequence, keeps track of return codes, and flags failures for email notification, rerun, or skipping, and that can recover if it dies. This replaces running the worker bash script out of 8 screen sessions (or 4, or 1, depending on the host), watching each stage to see when it completes, running the next stage manually, and so on.
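
For illustration only, a minimal sketch of that kind of batch runner, assuming a JSON status file for crash recovery; the file path, function names, and stage format are hypothetical, not the actual script:

```
# Hypothetical sketch, not the real dumps script: run stages in sequence,
# record every return code, and persist progress so a restarted run can
# skip stages that already succeeded.
import json
import os
import subprocess

STATUS_FILE = "/var/run/dumps/stage_status.json"  # illustrative path

def load_status():
    if os.path.exists(STATUS_FILE):
        with open(STATUS_FILE) as fh:
            return json.load(fh)
    return {}

def save_status(status):
    with open(STATUS_FILE, "w") as fh:
        json.dump(status, fh)

def run_stages(stages):
    """stages: list of (name, command_list) pairs, run strictly in order."""
    status = load_status()
    failures = []
    for name, command in stages:
        if status.get(name) == "done":
            continue  # finished in an earlier (possibly crashed) invocation
        retcode = subprocess.call(command)
        status[name] = "done" if retcode == 0 else "failed:%d" % retcode
        save_status(status)  # write after every stage so we can recover
        if retcode != 0:
            failures.append((name, retcode))
    return failures  # caller can mail, rerun, or skip these
```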

There is some set of circumstances that causes the worker bash script to believe there are failures for a run when no dump was run for any wiki; it may be a race condition, but it causes the script to exit early, which can leave some wikis not dumped for that stage.

ArielGlenn set Security to None.
ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.

What's left:

* clean up the cron classes in the last commit https://gerrit.wikimedia.org/r/#/c/263807/ so that all conf file and other dependencies are called out
* make the invocations do a pgrep to make sure no previous instance is still running, etc
* make sure we have all the dump stage lists required for these cron jobs
* document the intended use of the script, since in about 5 minutes I'll forget why I wrote it that way
* remove extra whitespace from the script, plus any other small cleanup
* check that we do the right thing in case a new wiki has never been dumped but all other wikis have completed their run for the month (the right thing in this case is not to run anything)
* check that we do the right thing in case a wiki dump run failed in its run for the month (the right thing is no new dump dir creation, but rerunning all wikis for the existing run; those that are complete will be skipped and only the failed/missing steps on the one wiki rerun). A sketch of these two checks follows the list.
* redirect output to /dev/null since we already log; otherwise the spam mail messages will be ginormous
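
To make the last two checks concrete, here is a rough sketch of the decision the cron wrapper has to make; latest_run_date, run_is_complete, and the date format are hypothetical stand-ins, not the real dump scheduling API:

```
# Sketch of the intended "do the right thing" logic; all helper names and
# the date-string format are illustrative only.
def plan_cron_action(wikis, latest_run_date, run_is_complete, this_month):
    """Return what the cron job should do for this cycle."""
    dates = [latest_run_date(w) for w in wikis if latest_run_date(w)]
    current = sorted(d for d in dates if d.startswith(this_month))

    if not current:
        # nothing has run yet this month: normal case, start a fresh run
        return ("new_run", this_month)

    dumped = [w for w in wikis if latest_run_date(w)]
    if all(run_is_complete(w) for w in dumped):
        # every wiki that has ever been dumped finished this month's run;
        # a brand-new, never-dumped wiki alone is no reason to run anything
        return ("noop", None)

    # some wiki failed or was missed: rerun against the existing run date
    # rather than creating a new dated directory; complete wikis get skipped
    return ("rerun", current[-1])
```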

After all that, enable the jobs. They won't run until next month, and they will be for the full monthly run only. We can add the partial run to cron when we have the replacement eqiad hardware; right now I have to juggle the steps of the second monthly run manually to get everything done in time.

These jobs are now all enabled. The first attempt to run will be Feb 2 early in the morning. I'll be checking to make sure everything started properly.

One catch is that, because more than one host does the dumps of the 'regular' (not en wikipedia) wikis, if one host is still running a previous dump run but the other has completed it, we can have problems. Specifically, one host will prepare all unlocked wikis for the new run, but it won't be able to prepare the wikis with dumps in progress. The second host, when it finally completes its jobs from the previous dump run, will run the cron job and try to prepare those wikis again with a different start date. This is not a blocker, but it does mean I need to run the second monthly run manually and watch its completion until that is fixed up.
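
This isn't the fix, but just to sketch the shape of one possible guard: before a host prepares wikis for the new cycle, it could look for a run another host has already started this cycle and reuse that start date instead of creating a second one. All the helper names below are made up:

```
# Hypothetical guard, not deployed code: reuse a run date already prepared
# by the other snapshot host for this cycle instead of inventing a new one.
def prepare_all(wikis, today, find_run_date_for_cycle, prepare_wiki, is_locked):
    existing = find_run_date_for_cycle(wikis)  # date the other host used, if any
    run_date = existing if existing is not None else today
    for wiki in wikis:
        if is_locked(wiki):
            continue  # still mid-run from the previous cycle on some host
        prepare_wiki(wiki, run_date)
```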

I should have updated this this morning. Anyways, cron jobs didn't start up the dumps because of a silly typo in the script. Fixed here: https://gerrit.wikimedia.org/r/#/c/267843/

Small/big wiki dumps were hung at the start of the dump run on labtestwiki, which can't be dumped from the snapshot hosts. I added it to the list of dbs to skip and those jobs are now rolling. https://gerrit.wikimedia.org/r/#/c/268064/

Third time's a charm. After a rewrite, adding a dryrun option, and a ton more testing of the dumps cron script on all the hosts, we should be set to pick up tomorrow where we left off last night. This morning I found that small/big wikis had started from scratch with a new day's run, absolutely not the desired behavior. Part of that was my not deploying the current versions of the dumpadmin.py and WikiDump.py files over to the snapshot hosts for small/big wikis. At any rate, that's all been cleaned up; the rewrite of the script is here and has been merged: https://gerrit.wikimedia.org/r/#/c/268433/

And the saga goes on. pgrep works differently depending on whether the script you're searching for was run with /bin/bash scriptname or not. Grrrrr! So now I get all running instances and check for those that don't have the pgroup of the script doing the checking. Latest change (tested as thoroughly as I could with and without the cron job already running): https://gerrit.wikimedia.org/r/#/c/268659/
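
For the record, the shape of that check looks roughly like the sketch below, assuming GNU pgrep; the script name in the usage comment is only an example, not the real wrapper's name:

```
# Minimal sketch of the check described above: list processes whose command
# line mentions the script, then ignore any in our own process group so the
# wrapper doesn't count itself or its children as a running instance.
import os
import subprocess

def other_instances(script_name):
    try:
        out = subprocess.check_output(["pgrep", "-f", script_name])
    except subprocess.CalledProcessError:
        return []  # pgrep exits nonzero when nothing matches
    my_pgrp = os.getpgrp()
    others = []
    for field in out.decode().split():
        pid = int(field)
        try:
            if os.getpgid(pid) != my_pgrp:
                others.append(pid)
        except OSError:
            pass  # process went away between pgrep and this check
    return others

# usage in the wrapper (script name is just an example):
#   if other_instances("dumpcron.sh"):
#       sys.exit("previous run still in progress")
```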

Sure hope this is the last of it.

Why do I forget to log success here? Anyways, the jobs are humming along, so I can finally close this. No more screen sessions!