I was tracking down some test failures in Cirrus and found some funkiness with the job runners. First, the failure: some pages aren't being added to the index. This is quite reproducible locally. You reproduce it by quickly creating a couple of pages that link to a page that doesn't exist yet, then creating that page. Like this:
Page A -> Page C, Page B -> Page C, Page C.
That's right, you are creating red links and un-red-ing them. This causes this funky log message:
[CirrusSearch] Ignoring an update for a nonexistent page: Page C
That page exists. I know it exists. My job is triggered after the LinksUpdate phase of page creation. That succeeded. I saw it in the logs! The SQL and all. Ohhh. The SQL.
manybubbles@manybubbles-laptop:~/Workspaces/vagrant$ grep WikiPage::insertOn\\\|nonexistent\\\|WikiPage::pageData logs/mediawiki-cirrustestwiki-debug.log| grep 'Page[_ ]C'
Query cirrustestwiki (15) (slave): INSERT /* WikiPage::insertOn Admin */ IGNORE INTO `page` (page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len) VALUES (NULL,'0','Page_C','','0','1','0.182385825401','20150529190511','0','0')
Query cirrustestwiki (13) (slave): SELECT /* WikiPage::pageData 127.0.0.1 */ page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model FROM `page` WHERE page_namespace = '0' AND page_title = 'Page_C' LIMIT 1
[CirrusSearch] Ignoring an update for a nonexistent page: Page C
So the job isn't seeing the page! It really isn't. And the clue is in the sequence numbers. They aren't in order. The job runner gets its own database connection - obviously. It's a different process. It's the one that makes the WikiPage::pageData query and gets nothing. The web process does WikiPage::insertOn. Anyway, if you trace the job runner process back back back back you see:
Query cirrustestwiki (1) (slave): BEGIN /* DatabaseBase::query (User::loadFromDatabase) */
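Long, long before Page C is created. That's right, it's our friend REPEATABLE READ, MySQL's default isolation level, come to play! Every plain SELECT in that transaction reads from the snapshot taken at its first read, so a row committed afterwards by the web process is simply invisible to the job runner until its transaction ends.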
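Here is a rough sketch of the same effect in miniature. It is an analogy, not MySQL: it uses Python's built-in sqlite3 module with SQLite in WAL mode, where a reader inside an open transaction also keeps seeing the snapshot from its first read. The names `stale_read_demo`, `web`, `runner` and the one-column `page` table are made up for illustration.

```python
import os
import sqlite3
import tempfile


def stale_read_demo():
    """Return (rows the long-lived reader sees, rows a fresh read sees)."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")

    # "web" stands in for the web process. isolation_level=None puts the
    # connection in autocommit mode, so every statement commits on its own.
    web = sqlite3.connect(path, isolation_level=None)
    web.execute("PRAGMA journal_mode=WAL").fetchall()
    web.execute("CREATE TABLE page (page_title TEXT PRIMARY KEY)")

    # "runner" stands in for the job runner: it opens a transaction early
    # and its first read pins the snapshot it will keep using.
    runner = sqlite3.connect(path, isolation_level=None)
    runner.execute("BEGIN")
    runner.execute("SELECT COUNT(*) FROM page").fetchall()

    # The web process creates Page C after the runner's snapshot was taken.
    web.execute("INSERT INTO page (page_title) VALUES ('Page_C')")

    query = "SELECT COUNT(*) FROM page WHERE page_title = 'Page_C'"
    seen_by_runner = runner.execute(query).fetchall()[0][0]
    seen_by_web = web.execute(query).fetchall()[0][0]

    runner.execute("ROLLBACK")
    runner.close()
    web.close()
    return seen_by_runner, seen_by_web


if __name__ == "__main__":
    print(stale_read_demo())
```

The runner's count stays at zero even though the row is committed, which is the same shape as the "nonexistent page" message above.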
So, you can fix it by making the job runner never process more than one job at a time, but that's not really a good idea. I suspect the job runner should pitch its db connection or at least ROLLBACK its transaction.
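To make the "at least ROLLBACK" idea concrete, here is a minimal sketch of a runner loop that ends its transaction after every job. It is not MediaWiki's job runner; it reuses SQLite in WAL mode as a stand-in for MySQL, and `run_jobs`, `simulate`, `page_exists` and the `fresh_snapshot_per_job` flag are hypothetical names.

```python
import os
import sqlite3
import tempfile


def page_exists(conn, title):
    rows = conn.execute(
        "SELECT 1 FROM page WHERE page_title = ?", (title,)
    ).fetchall()
    return bool(rows)


def run_jobs(runner, jobs, fresh_snapshot_per_job):
    """Run each job on one connection, optionally resetting the snapshot."""
    results = []
    runner.execute("BEGIN")
    for job in jobs:
        results.append(job(runner))
        if fresh_snapshot_per_job:
            # End the transaction so the next job's first read starts a
            # new snapshot that includes everything committed so far.
            runner.execute("ROLLBACK")
            runner.execute("BEGIN")
    runner.execute("ROLLBACK")
    return results


def simulate(fresh_snapshot_per_job):
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    web = sqlite3.connect(path, isolation_level=None)  # autocommit
    web.execute("PRAGMA journal_mode=WAL").fetchall()
    web.execute("CREATE TABLE page (page_title TEXT PRIMARY KEY)")
    runner = sqlite3.connect(path, isolation_level=None)

    def earlier_job(conn):
        seen = page_exists(conn, "Page_C")  # first read pins the snapshot
        # Stand-in for a concurrent web request creating Page C meanwhile.
        web.execute("INSERT INTO page (page_title) VALUES ('Page_C')")
        return seen

    def index_page_c(conn):
        return page_exists(conn, "Page_C")

    results = run_jobs(runner, [earlier_job, index_page_c], fresh_snapshot_per_job)
    runner.close()
    web.close()
    return results


if __name__ == "__main__":
    print(simulate(False), simulate(True))
```

With the flag off, the second job inherits the first job's snapshot and misses Page C; with it on, the second job sees the page. Dropping and reopening the connection between jobs would have the same effect, at the cost of a reconnect.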
I imagine holding a transaction open for ~30 seconds like this isn't good for MySQL either.
Stakeholder: (1) Cirrus Engineers, (2) Everyone
Benefit: (1) The Cirrus integration tests become tons more stable and (2) all jobs become more correct everywhere - meaning we are less likely to leave the cluster in a weird state. Updates are more likely to correctly hit the search index.
Estimate: 1 day