
JobRunner database connection journey
Closed, Resolved · Public

Description

I was tracking down some test failures in Cirrus and found some funkiness with the job runners. First, the failure: some pages aren't being added to the index. This is quite reproducible locally. Anyway, you reproduce it by quickly creating a couple of pages, one linking to another. Like this:
Page A -> Page C, Page B -> Page C, Page C.
That's right, you are creating red links and un-red-ing them. This causes this funky log message:

[CirrusSearch] Ignoring an update for a nonexistent page: Page C

That page exists. I know it exists. My job is triggered after the LinksUpdate phase of page creation. That succeeded. I saw it in the logs! The SQL and all. Ohhh, the SQL.

Watch this:

manybubbles@manybubbles-laptop:~/Workspaces/vagrant$ grep WikiPage::insertOn\\\|nonexistent\\\|WikiPage::pageData logs/mediawiki-cirrustestwiki-debug.log| grep 'Page[_ ]C'
Query cirrustestwiki (15) (slave): INSERT /* WikiPage::insertOn Admin */ IGNORE INTO `page` (page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES (NULL,'0','Page_C','','0','1','0.182385825401','20150529190511','0','0')
Query cirrustestwiki (13) (slave): SELECT /* WikiPage::pageData 127.0.0.1 */  page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model  FROM `page`   WHERE page_namespace = '0' AND page_title = 'Page_C'  LIMIT 1
[CirrusSearch] Ignoring an update for a nonexistent page: Page C

So the job isn't seeing the page! It really isn't. And the clue is in the sequence numbers: they aren't in order. The job runner gets its own database connection, obviously; it's a different process. It's the one that makes the WikiPage::pageData query and gets nothing. The web process does WikiPage::insertOn. Anyway, if you trace the job runner process back, back, back you see:

Query cirrustestwiki (1) (slave): BEGIN /* DatabaseBase::query (User::loadFromDatabase)  */

Long, long before Page C is created. That's right, it's our friend REPEATABLE READ, MySQL's default isolation level, come to play!
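
If you want to watch the isolation level do this outside of MediaWiki, here's a minimal standalone sketch. It uses plain PDO with made-up credentials and a scratch two-column page table, not MediaWiki's Database wrapper:

	<?php
	// Minimal sketch of MySQL REPEATABLE READ snapshot behavior.
	// Assumes: CREATE TABLE page (page_namespace INT, page_title VARBINARY(255));
	// DSN and credentials are placeholders.
	$dsn = 'mysql:host=127.0.0.1;dbname=scratch';

	$jobRunner  = new PDO( $dsn, 'wikiuser', 'secret' ); // long-lived connection
	$webRequest = new PDO( $dsn, 'wikiuser', 'secret' ); // short-lived connection

	// The job runner opens a transaction early (think User::loadFromDatabase).
	// Under REPEATABLE READ the snapshot is pinned by the first read and never moves.
	$jobRunner->beginTransaction();
	$jobRunner->query( 'SELECT COUNT(*) FROM page' )->fetch();

	// Meanwhile the web request creates Page_C; autocommit makes it durable.
	$webRequest->exec( "INSERT INTO page VALUES (0, 'Page_C')" );

	// The job runner's snapshot predates the insert, so the committed row is
	// invisible to it; this is exactly WikiPage::pageData coming up empty.
	$stmt = $jobRunner->query( "SELECT * FROM page WHERE page_title = 'Page_C'" );
	var_dump( $stmt->fetch() ); // bool(false)

	// Ending the transaction discards the stale snapshot...
	$jobRunner->rollBack();
	// ...and the next statement sees current data.
	$stmt = $jobRunner->query( "SELECT * FROM page WHERE page_title = 'Page_C'" );
	var_dump( $stmt->fetch() ); // now returns the row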

So, you can fix it by making the job runner never process more than one job at a time, but that's not really a good idea. I suspect the job runner should pitch its db connection or at least ROLLBACK its transaction.

I imagine holding a transaction open for ~30 seconds like this isn't good for MySQL either.
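
You can actually see these long-lived transactions server-side. Here's a quick sketch, placeholder credentials again, that lists any InnoDB transaction whose snapshot has been open for more than 30 seconds:

	<?php
	// Hedged sketch: find InnoDB transactions open longer than 30 seconds,
	// e.g. a job runner sitting on an old REPEATABLE READ snapshot.
	$db = new PDO( 'mysql:host=127.0.0.1;dbname=information_schema', 'root', 'secret' );
	$stmt = $db->query(
		"SELECT trx_id, trx_mysql_thread_id, trx_started,
		        TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds
		 FROM innodb_trx
		 WHERE trx_started < NOW() - INTERVAL 30 SECOND"
	);
	foreach ( $stmt as $trx ) {
		printf( "trx %s (thread %s) open for %d seconds\n",
			$trx['trx_id'], $trx['trx_mysql_thread_id'], $trx['age_seconds'] );
	}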

Stakeholder: (1) Cirrus Engineers, (2) Everyone
Benefit: (1) The Cirrus integration tests become tons more stable and (2) all jobs become more correct everywhere - meaning we are less likely to leave the cluster in a weird state. Updates are more likely to correctly hit the search index.
Estimate: 1 day

Event Timeline

Manybubbles raised the priority of this task from to Needs Triage.
Manybubbles updated the task description.
Manybubbles added a project: WMF-JobQueue.
Restricted Application added a subscriber: Aklapper. May 29 2015, 7:57 PM
Manybubbles triaged this task as Normal priority. May 29 2015, 7:59 PM
Manybubbles updated the task description.
Manybubbles set Security to None.

Manybubbles added a comment.

I looked into this a bit more: it looks like the job runner intentionally holds transactions open. See $wgJobSerialCommitThreshold, which kind of scares me. I don't see how you can in good conscience report a job as successful before you've committed it. Maybe the JobRunner handles it and I haven't traced that part through.

aaron added a comment. May 29 2015, 9:29 PM

I looked into this a bit more: it looks like the job runner intentionally holds transactions open. See $wgJobSerialCommitThreshold, which kind of scares me. I don't see how you can in good conscience report a job as successful before you've committed it. Maybe the JobRunner handles it and I haven't traced that part through.

What code are you reading? I'm looking at:

try {
	++$jobsRun;
	$status = $job->run();
	$error = $job->getLastError();
	$this->commitMasterChanges( $job );

	DeferredUpdates::doUpdates();
	$this->commitMasterChanges( $job );
} catch ( Exception $e ) {
	MWExceptionHandler::rollbackMasterChangesAndLog( $e );
	$status = false;
	$error = get_class( $e ) . ': ' . $e->getMessage();
	MWExceptionHandler::logException( $e );
}

Note that $jobsRun should really be called $jobAttempted, and is only used for limiting. $status is used to determine success, which is false in the commit DBError case. The serial commit thing either finishes or causes a rollback in the above code, so it doesn't hold the trx across jobs or anything.

aaron added a comment. Edited · May 29 2015, 9:32 PM

Anyway, there is the question of slave connections, which indeed do not commit between jobs. It makes sense to end those transactions between jobs to cut down on snapshot lag (which is slave lag + jobrunner lifetime, e.g. ~1 min or so). Likewise with master transactions that did no writes.

Manybubbles added a comment.

I looked into this a bit more: it looks like the job runner intentionally holds transactions open. See $wgJobSerialCommitThreshold, which kind of scares me. I don't see how you can in good conscience report a job as successful before you've committed it. Maybe the JobRunner handles it and I haven't traced that part through.

What code are you reading?

Looks like I was just reading the documentation for `$wgJobSerialCommitThreshold`. Rereading the code, it looks like it can cause the job runner to wait for slaves to catch up, but otherwise every job is committed like you'd expect. So it looks like it'd be safe to trash the db connections between jobs, or at least force a commit or rollback so you get a new snapshot. I'm not really sure what the right thing to do is in MySQL with our replication setup, though.
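
To make that concrete, here's roughly the shape of the fix in the run loop aaron quoted above. A sketch only, not the actual patch; it assumes the 2015-era wfGetLBFactory() / LBFactory::commitAll() entry points for flushing every tracked connection:

	try {
		++$jobsRun;
		$status = $job->run();
		$error = $job->getLastError();
		$this->commitMasterChanges( $job );

		DeferredUpdates::doUpdates();
		$this->commitMasterChanges( $job );

		// Sketch: commit the open transactions on *all* tracked connections,
		// slave ones included, so their REPEATABLE READ snapshots end and
		// the next job reads current data.
		wfGetLBFactory()->commitAll();
	} catch ( Exception $e ) {
		MWExceptionHandler::rollbackMasterChangesAndLog( $e );
		$status = false;
		$error = get_class( $e ) . ': ' . $e->getMessage();
		MWExceptionHandler::logException( $e );
	}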

Change 214772 had a related patch set uploaded (by Manybubbles):
WIP: Commit all connections after each job

https://gerrit.wikimedia.org/r/214772

Manybubbles added a comment.

Added to the Cirrus sprint because it's causing us trouble. I won't be able to work on it today, though; maybe tomorrow.

Change 214772 merged by jenkins-bot:
Commit all connections after each job

https://gerrit.wikimedia.org/r/214772

Manybubbles closed this task as Resolved. Jun 3 2015, 2:52 PM
Manybubbles moved this task from in progress to Done on the Discovery-Search (Current work) board.