Page MenuHomePhabricator

Job cirrusSearchIncomingLinkCount failures "Read timeout is reached" from Title::getFirstRevision
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error

Request ID: W7PV9ArAIFsAADtu0UIAAABX
Channel: JobExecutor

message
A database query timeout has occurred. 

Query: SELECT  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_text_id,rev_content_format,rev_content_model  FROM `revision` IGNORE INDEX (rev_timestamp) LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id))   WHERE rev_page = '#######'  ORDER BY rev_timestamp ASC, rev_id ASC LIMIT 1  

Function: Title::getFirstRevision
Error: 2062 Read timeout is reached
stacktrace
#1 /srv/mediawiki/php-1.32.0-wmf.23/includes/libs/rdbms/database/Database.php(1198): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.32.0-wmf.23/includes/libs/rdbms/database/Database.php(1655): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.32.0-wmf.23/includes/libs/rdbms/database/Database.php(1746): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#4 /srv/mediawiki/php-1.32.0-wmf.23/includes/Title.php(4429): Wikimedia\Rdbms\Database->selectRow(array, array, array, string, array, array)
#5 /srv/mediawiki/php-1.32.0-wmf.23/includes/page/WikiPage.php(705): Title->getFirstRevision()
#6 /srv/mediawiki/php-1.32.0-wmf.23/extensions/CirrusSearch/includes/Updater.php(342): WikiPage->getOldestRevision()
#7 /srv/mediawiki/php-1.32.0-wmf.23/extensions/CirrusSearch/includes/Updater.php(407): CirrusSearch\Updater::buildDocument(CirrusSearch, WikiPage, CirrusSearch\Connection, integer, integer, integer)
#8 /srv/mediawiki/php-1.32.0-wmf.23/extensions/CirrusSearch/includes/Updater.php(205): CirrusSearch\Updater->buildDocumentsForPages(array, integer)
#9 /srv/mediawiki/php-1.32.0-wmf.23/extensions/CirrusSearch/includes/Updater.php(468): CirrusSearch\Updater->updatePages(array, integer)
#10 /srv/mediawiki/php-1.32.0-wmf.23/extensions/CirrusSearch/includes/Job/IncomingLinkCount.php(36): CirrusSearch\Updater->updateLinkedArticles(array)
#11 /srv/mediawiki/php-1.32.0-wmf.23/extensions/CirrusSearch/includes/Job/Job.php(99): CirrusSearch\Job\IncomingLinkCount->doJob()
#12 /srv/mediawiki/php-1.32.0-wmf.23/extensions/EventBus/includes/JobExecutor.php(64): CirrusSearch\Job\Job->run()

Notes

Tentatively triaging as high priority because this appears to be a regression that started quite recently at a very specific point in time:

capture.png (859×2 px, 67 KB)

It started September 12 with the deployment of 1.32.0-wmf.20, it appears to be specific to CirrusSearch jobs for en.wikipedia.org

Event Timeline

Krinkle triaged this task as High priority.Oct 2 2018, 9:31 PM
Krinkle created this task.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

This affects job_type cirrusSearchLinksUpdatePrioritized as well. Identical stacktrace otherwise.

We started calling Revision::getFirstRevision (via WikiPage::getOldestRevision) in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CirrusSearch/+/433986/3/includes/Updater.php as part of T195071. Nothing should be new to this code path on the CirrusSearch side since that patch was deployed in late may.

My guess would be that either the query changed, or query planning changed. Git logs suggest the main query declaration hasn't changed much since at least 2017, although potentially joins requested from Revision::getQueryInfo() that i'm not familiar with could have changed the underlying query. A check in prod shows this is (roughly) the query issued:

SELECT rev_id, 
       rev_page, 
       rev_timestamp, 
       rev_minor_edit, 
       rev_deleted, 
       rev_len, 
       rev_parent_id, 
       rev_sha1, 
       Coalesce(comment_rev_comment.comment_text, rev_comment) AS `rev_comment_text`, 
       comment_rev_comment.comment_data                        AS  `rev_comment_data`, 
       comment_rev_comment.comment_id                          AS `rev_comment_cid`, 
       rev_user, 
       rev_user_text, 
       NULL                                                    AS `rev_actor`,
       rev_text_id, 
       rev_content_format, 
       rev_content_model 
FROM   `revision` IGNORE INDEX (rev_timestamp) 
       LEFT JOIN `revision_comment_temp` `temp_rev_comment` 
              ON (( temp_rev_comment.revcomment_rev = rev_id )) 
       LEFT JOIN `comment` `comment_rev_comment` 
              ON (( comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id )) 
WHERE  rev_page = '15580374' 
ORDER  BY rev_timestamp ASC, 
          rev_id ASC
LIMIT  1

And an explain (from db2072, experience says this can vary):

+------+-------------+---------------------+--------+------------------------------------------------+----------------+---------+-----------------------------------------------+------+-------------+
| id   | select_type | table               | type   | possible_keys                                  | key            | key_len | ref                                           | rows | Extra       |
+------+-------------+---------------------+--------+------------------------------------------------+----------------+---------+-----------------------------------------------+------+-------------+
|    1 | SIMPLE      | revision            | ref    | page_timestamp,page_user_timestamp,rev_page_id | page_timestamp | 4       | const                                         | 4193 | Using where |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                         | PRIMARY        | 4       | enwiki.revision.rev_id                        |    1 | Using index |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                        | PRIMARY        | 8       | enwiki.temp_rev_comment.revcomment_comment_id |    1 | Using where |
+------+-------------+---------------------+--------+------------------------------------------------+----------------+---------+-----------------------------------------------+------+-------------+

The joins are overkill for the CirrusSearch use case, we only actually want rev_timestamp, but since this has a limit 1 attached i wouldn't expect the join being expensive enough to matter. The choice of the page_timestamp (rev_page, rev_timestamp) index seems correct, innodb should be doing a lookup on rev_page and then a reverse scan of rev_timestamp to fetch the first revision.

Prior experience suggests perhaps different servers are choosing different query plans. I checked all s1 servers and they all come back with the expected page_timestamp index being used.

serverindex chosen
db2048 (master)page_timestamp
db2055page_timestamp
db2062page_timestamp
db2070page_timestamp
db2071page_timestamp
db2072page_timestamp
db2085page_timestamp
db2088page_timestamp
db2092page_timestamp

Overall I'm not sure yet, although for the purposes of CirrusSearch we could probably specialize a simpler query to return only the timestamp of the first revision.

Change 466814 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Updater: Pull only the timestamp of oldest revision

https://gerrit.wikimedia.org/r/466814

Attached patch doesn't really solve the goal of this patch, which seems to identify what happened in the sept 13 deploy to change the behaviour of a previously working query, but it does simplify the query used so we might stop seeing this error.

Change 466814 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Updater: Pull only the timestamp of oldest revision

https://gerrit.wikimedia.org/r/466814

I'm not seeing these errors generated anymore, willing to call it done.

debt claimed this task.
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM