
Bulk update failures: "id is missing"
Closed, ResolvedPublic

Description

Since Jul 6 at 9 PM UTC, many updates appear to be failing.

In the logs we can see:

 2017-07-07 12:01:00 [1375ed096c43ceaf4057c9dd] mw1163 commonswiki 1.30.0-wmf.7 CirrusSearchChangeFailed WARNING: Update for doc ids: 5697776,5697776,5697776,5697776,15126022,,,3804380,3804380,8253703  
[Exception Elastica\Exception\ResponseException] (/srv/mediawiki/php-1.30.0-wmf.7/vendor/ruflin/elastica/lib/Elastica/Transport/Http.php:178) Validation Failed: 1: id is missing;2: id is missing;
  #0 /srv/mediawiki/php-1.30.0-wmf.7/vendor/ruflin/elastica/lib/Elastica/Request.php(171): Elastica\Transport\Http->exec(Elastica\Request, array)
  #1 /srv/mediawiki/php-1.30.0-wmf.7/vendor/ruflin/elastica/lib/Elastica/Client.php(662): Elastica\Request->send()
  #2 /srv/mediawiki/php-1.30.0-wmf.7/vendor/ruflin/elastica/lib/Elastica/Bulk.php(357): Elastica\Client->request(string, string, string, array)
  #3 /srv/mediawiki/php-1.30.0-wmf.7/extensions/CirrusSearch/includes/DataSender.php(189): Elastica\Bulk->send()
  #4 [internal function]: CirrusSearch\DataSender->sendData(string, array, string)
  #5 /srv/mediawiki/php-1.30.0-wmf.7/extensions/CirrusSearch/includes/Job/ElasticaWrite.php(83): call_user_func_array(array, array)
  #6 /srv/mediawiki/php-1.30.0-wmf.7/extensions/CirrusSearch/includes/Job/Job.php(98): CirrusSearch\Job\ElasticaWrite->doJob()
  #7 /srv/mediawiki/php-1.30.0-wmf.7/includes/jobqueue/JobRunner.php(293): CirrusSearch\Job\Job->run()
  #8 /srv/mediawiki/php-1.30.0-wmf.7/includes/jobqueue/JobRunner.php(193): JobRunner->executeJob(CirrusSearch\Job\ElasticaWrite, Wikimedia\Rdbms\LBFactoryMulti, BufferingStatsdDataFactory, integer)
  #9 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run(array)
  #10 {main}

It looks like we are sending bulk updates in which some documents lack a doc id.
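For context, the Elasticsearch bulk API requires an `_id` on every update action, so a single null page id in a batch fails the whole request with exactly this validation error. A minimal sketch in plain Python (hypothetical doc data and index name, no actual cluster; the real code builds the payload via Elastica in PHP):

```python
import json

# Hypothetical docs as a bulk-update batch might contain them;
# one has a null page id, modeling a NULL ar_page_id from the database.
docs = [
    {"id": 5697776, "title": "File:A.jpg"},
    {"id": None, "title": "File:B.jpg"},  # null id -> "id is missing"
]

def bulk_lines(docs, index="commonswiki_file"):
    """Build bulk-update NDJSON. An action line carrying "_id": null is
    what Elasticsearch rejects with "Validation Failed: ... id is missing"."""
    lines = []
    for doc in docs:
        action = {"update": {"_index": index, "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps({"doc": {"title": doc["title"]}}))
    return "\n".join(lines) + "\n"

payload = bulk_lines(docs)
print('"_id": null' in payload)  # True: the malformed action is in the payload
```

The failure mode matters for the symptom above: the whole batch is rejected, which is why valid doc ids show up in the failure log alongside the empty ones.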

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcausse renamed this task from Bulk update failures: missing ids to Bulk update failures: "id is missing".Jul 7 2017, 12:11 PM

@Smalyshev I see that you're running forceSearchIndex for the archive; is it possible that it sends documents with missing/null ids?

@dcausse I don't see any reason why, but I'll read the code and see if it can happen. IIRC the id used is the page id, so it should never be missing, but maybe something is wrong.

forceSearchIndex does seem like the most likely culprit: these are full batches of 10 docs (our largest batch size), which can pretty much only come from bulk indexing. Additionally, there is a 'reqId' field in logging that is generated by the process that creates the jobs and is reported when the jobs error; grabbing a random reqId from a failure (reqId:49fb1b3b07efb0c78330b3b1) turns up thousands of events. I'm not sure, though, whether it's archive or regular search, although archive seems plausible since it's new.

But I wonder: are those from the archive reindex or from the regular reindex? Is there any way to know which wiki they relate to? Then I could correlate the logs for that wiki's indexing and maybe see what it was doing at the time.

Logstash shows the wiki in the 'wiki' field. Here is a URL for the standard dashboard, searching for "validation failed" with the top-wikis visualization added: https://logstash.wikimedia.org/goto/9f6e1cb1b42a3d49a8dd56f20d1265d9

Since the last 15 minutes contain plenty of wikis that all start with "s", and the reindex runs in alphabetical order, it looks like the error happens for almost all wikis :S

The main reindexing happens inside Elasticsearch rather than from CirrusSearch jobs, so it can't be that. It could be the forceSearchIndex calls that try to catch up on any writes that happened while the reindex was running, or it could be the call to forceSearchIndex using --archive.

To narrow things down, I took a reqId from one of the failures and restricted to logs from terbium that include an exception trace, with the query: "reqId:11e0b8d7b31487837191ed10 AND host:terbium AND _exists_:exception.trace"

Looking at a couple of messages, they all have similar exception traces:

#0 /srv/mediawiki/php-1.30.0-wmf.7/vendor/ruflin/elastica/lib/Elastica/Request.php(171): Elastica\Transport\Http->exec(Elastica\Request, array)
#1 /srv/mediawiki/php-1.30.0-wmf.7/vendor/ruflin/elastica/lib/Elastica/Client.php(662): Elastica\Request->send()
#2 /srv/mediawiki/php-1.30.0-wmf.7/vendor/ruflin/elastica/lib/Elastica/Bulk.php(357): Elastica\Client->request(string, string, string, array)
#3 /srv/mediawiki/php-1.30.0-wmf.7/extensions/CirrusSearch/includes/DataSender.php(189): Elastica\Bulk->send()
#4 [internal function]: CirrusSearch\DataSender->sendData(string, array, string)
#5 /srv/mediawiki/php-1.30.0-wmf.7/extensions/CirrusSearch/includes/Job/ElasticaWrite.php(83): call_user_func_array(array, array)
#6 /srv/mediawiki/php-1.30.0-wmf.7/extensions/CirrusSearch/includes/Job/Job.php(98): CirrusSearch\Job\ElasticaWrite->doJob()
#7 /srv/mediawiki/php-1.30.0-wmf.7/extensions/CirrusSearch/includes/Updater.php(280): CirrusSearch\Job\Job->run()
#8 /srv/mediawiki/php-1.30.0-wmf.7/extensions/CirrusSearch/maintenance/forceSearchIndex.php(198): CirrusSearch\Updater->archivePages(array, boolean)
#9 /srv/mediawiki/php-1.30.0-wmf.7/maintenance/doMaintenance.php(111): CirrusSearch\ForceSearchIndex->execute()
#10 /srv/mediawiki/php-1.30.0-wmf.7/extensions/CirrusSearch/maintenance/forceSearchIndex.php(608): require_once(string)
#11 /srv/mediawiki/multiversion/MWScript.php(99): require_once(string)
#12 {main}

The reason these messages come from both terbium and the job runners is that terbium attempts the writes first. When a write fails, it is enqueued into the job queue with a delay to retry.
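The flow described above (inline write first, then a delayed retry via the job queue on failure) can be sketched in plain Python; the `JobQueue` class and function names here are hypothetical stand-ins, not the MediaWiki job queue API:

```python
import time

class JobQueue:
    """Toy queue standing in for the MediaWiki job queue (assumption)."""
    def __init__(self):
        self.jobs = []

    def push(self, job, delay):
        # Record when the job becomes eligible to run again.
        self.jobs.append((time.time() + delay, job))

def send_with_retry(send, job, queue, delay=30):
    """Attempt the write inline; on failure, re-enqueue with a delay
    so a job runner retries it later. This is why the same failure
    appears in logs from both terbium and the job runners."""
    try:
        send(job)
        return True
    except Exception:
        queue.push(job, delay)
        return False

def failing_send(job):
    raise RuntimeError("Validation Failed: id is missing")

queue = JobQueue()
print(send_with_retry(failing_send, {"doc_ids": [5697776]}, queue))  # False
print(len(queue.jobs))  # 1: the failed write was re-enqueued
```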

I am looking at the code, and page IDs for deletions come straight from ar_page_id; but the archive table, e.g. for commonswiki, does have entries with null page IDs. I'd say we probably have to ignore those.
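The fix direction can be sketched in plain Python (hypothetical row data modeling SQL NULL as `None`; the actual patch is in CirrusSearch's PHP, not this code): filter out archive rows with a null ar_page_id before building documents for the bulk request.

```python
# Hypothetical archive rows as fetched from the database.
rows = [
    {"ar_page_id": 3804380, "ar_title": "Old_page"},
    {"ar_page_id": None, "ar_title": "Deleted_in_2007"},  # pre-2008 deletion
]

def indexable(rows):
    # Ignore records with a NULL page id: without an id they cannot be
    # addressed in a bulk update, and one such row fails the whole batch.
    return [r for r in rows if r["ar_page_id"] is not None]

print(len(indexable(rows)))  # 1: the NULL-id row is dropped
```

Equivalently, the query that feeds the indexer could simply add an `ar_page_id IS NOT NULL` condition, which is what "handle null page ids in the query" amounts to.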

Digging through git history, I found where ar_page_id became nullable: https://github.com/wikimedia/mediawiki/commit/5a9878f32d491ecd561045595e8d4f9bfeefc257

The important part of the commit message is:

* Make the field nullable for consistency with ar_rev_id and ar_text_id and between MySQL and PostgreSQL and generally so you can cleanly tell between bogus fields and ones that predate use of the field

This implies that we should not see any null ar_page_id for rows created after this was deployed, sometime in late 2007. Checking a few tables in MySQL:

select max(ar_timestamp) from siwiki.archive where ar_page_id is null;
+-------------------+
| max(ar_timestamp) |
+-------------------+
| 20070822142005    |
+-------------------+
select max(ar_timestamp) from scwiktionary.archive where ar_page_id is null;
+-------------------+
| max(ar_timestamp) |
+-------------------+
| 20061025200141    |
+-------------------+
select max(ar_timestamp) from commonswiki.archive where ar_page_id is null;
+-------------------+
| max(ar_timestamp) |
+-------------------+
| 20070824134347    |
+-------------------+

I think it's safe to document that archive search won't work for any page that was deleted in 2007 or earlier, adjust the archive code to handle null page ids in the query that indexes history, and call it a day?

Change 363895 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/CirrusSearch@master] Ignore archive records with null page_id

https://gerrit.wikimedia.org/r/363895

Smalyshev triaged this task as Medium priority.Jul 7 2017, 10:02 PM
Smalyshev moved this task from needs triage to Current work on the Discovery-Search board.
Smalyshev moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

Change 363895 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Ignore archive records with null page_id

https://gerrit.wikimedia.org/r/363895

Change 363996 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/CirrusSearch@wmf/1.30.0-wmf.7] Ignore archive records with null page_id

https://gerrit.wikimedia.org/r/363996

Change 363996 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.30.0-wmf.7] Ignore archive records with null page_id

https://gerrit.wikimedia.org/r/363996

Mentioned in SAL (#wikimedia-operations) [2017-07-10T19:14:10Z] <niharika29@tin> Synchronized php-1.30.0-wmf.7/extensions/CirrusSearch/: Ignore archive records with null page_id (T169977) (duration: 00m 52s)

Smalyshev claimed this task.