I was tracking down some test failures in Cirrus and found some funkiness with the job runners. First, the failure: some pages aren't being added to the index. This is quite reproducible locally. Anyway, you reproduce it by creating a couple of pages in quick succession, one linking to another. Like this:
Page A -> Page C, Page B -> Page C, Page C.
That's right, you are creating red links and un-red-ing them. That produces this funky log message:
```
[CirrusSearch] Ignoring an update for a nonexistent page: Page C
```
That page exists. I know it exists. My job is triggered after the LinksUpdate phase of page creation. That succeeded. I saw it in the logs! The SQL and all. Ohhh, the SQL.
Watch this:
```
manybubbles@manybubbles-laptop:~/Workspaces/vagrant$ grep WikiPage::insertOn\\\|nonexistent\\\|WikiPage::pageData logs/mediawiki-cirrustestwiki-debug.log| grep 'Page[_ ]C'
Query cirrustestwiki (15) (slave): INSERT /* WikiPage::insertOn Admin */ IGNORE INTO `page` (page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES (NULL,'0','WeightedLink1432927448_2/1','','0','1','0.182385825401','20150529190511','0','0')
Query cirrustestwiki (13) (slave): SELECT /* WikiPage::pageData 127.0.0.1 */ page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model FROM `page` WHERE page_namespace = '0' AND page_title = 'WeightedLink1432927448_2' LIMIT 1
[CirrusSearch] Ignoring an update for a nonexistent page: WeightedLink1432927448 2
```
So the job isn't seeing the page! It really isn't. And the clue is in the sequence numbers. They aren't in order. The job runner gets its own database connection - obviously, it's a different process. It's the one that makes the WikiPage::pageData query and gets nothing. The web process does the WikiPage::insertOn. Anyway, if you trace the job runner process back, back, back, back you see:
```
Query cirrustestwiki (1) (slave): BEGIN /* DatabaseBase::query (User::loadFromDatabase) */
```
Long, long before Page C is created. That's right, it's our friend REPEATABLE READ, MySQL's default isolation level, come out to play! Under REPEATABLE READ, the first consistent read in a transaction pins a snapshot, and every later read in that transaction sees the same snapshot - so the job runner's ancient transaction can't see rows committed after its snapshot was taken, no matter how long ago they were inserted.
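Here's the effect in miniature at a MySQL prompt. This is a sketch, assuming two sessions and a cut-down stand-in for the `page` table (the real MediaWiki schema has many more columns):

```sql
-- Setup: a simplified stand-in for MediaWiki's page table.
CREATE TABLE page (
  page_id INT AUTO_INCREMENT PRIMARY KEY,
  page_namespace INT NOT NULL,
  page_title VARCHAR(255) NOT NULL
) ENGINE=InnoDB;

-- Session 1 (the job runner's long-lived connection):
BEGIN;
SELECT page_id FROM page WHERE page_title = 'Page_C';
-- ^ first read pins the snapshot: 0 rows

-- Session 2 (the web request), meanwhile:
INSERT INTO page (page_namespace, page_title) VALUES (0, 'Page_C');
COMMIT;

-- Session 1 again, inside the same transaction:
SELECT page_id FROM page WHERE page_title = 'Page_C';
-- ^ still 0 rows: the snapshot predates session 2's COMMIT

COMMIT;
SELECT page_id FROM page WHERE page_title = 'Page_C';
-- ^ fresh snapshot: 1 row
```

Session 1 is the job runner, session 2 is the web request that created the page, and the 0-row result in the middle is exactly the "nonexistent page" the log complains about.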
So, you could fix it by making the job runner never process more than one job at a time, but that's not really a good idea. I suspect the job runner should pitch its db connection between jobs, or at least ROLLBACK its transaction.
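In SQL terms, either of these between jobs would clear the stale snapshot. This is a sketch of the idea, not MediaWiki's actual job runner code, and dropping to READ COMMITTED is an alternative I'm naming, not something from the codebase:

```sql
-- Option 1: end the long-lived transaction; the next read
-- starts a fresh snapshot.
ROLLBACK;

-- Option 2: use READ COMMITTED on this connection, where each
-- statement reads its own fresh snapshot instead of one pinned
-- at the transaction's first read.
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
```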
I imagine holding a transaction open for ~30 seconds like this isn't good for MySQL either: InnoDB has to keep old row versions around until the oldest open read view no longer needs them, so long-lived transactions delay purge and let the undo history pile up.
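If you want to catch these long-lived transactions in the act, the information_schema exposes them (assuming a MySQL new enough to have the INNODB_TRX table, which anything recent is):

```sql
-- Transactions open for more than 10 seconds, oldest first,
-- with the connection id that owns each one.
SELECT trx_id,
       trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS open_seconds,
       trx_mysql_thread_id
FROM information_schema.INNODB_TRX
WHERE trx_started < NOW() - INTERVAL 10 SECOND
ORDER BY trx_started;
```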