
Fatal error during CirrusSearch-LinksUpdate job (CirrusTitleJob) from JobQueueGroup->push
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error

Request ID: 86c45a3738404b7a666307f2

message
JobQueueError:
Could not enqueue jobs: Unable to serialize events
trace
#0 /srv/mediawiki/php-1.34.0-wmf.8/includes/jobqueue/JobQueue.php(342): JobQueueEventBus->doBatchPush(array, integer)
#1 /srv/mediawiki/php-1.34.0-wmf.8/includes/jobqueue/JobQueue.php(312): JobQueue->batchPush(array, integer)
#2 /srv/mediawiki/php-1.34.0-wmf.8/includes/jobqueue/JobQueueGroup.php(162): JobQueue->push(array)
#3 /srv/mediawiki/php-1.34.0-wmf.8/extensions/CirrusSearch/includes/Job/ElasticaWrite.php(238): JobQueueGroup->push(array)
#4 /srv/mediawiki/php-1.34.0-wmf.8/extensions/CirrusSearch/includes/Job/ElasticaWrite.php(159): CirrusSearch\Job\ElasticaWrite->requeueError(CirrusSearch\Connection)
#5 /srv/mediawiki/php-1.34.0-wmf.8/extensions/CirrusSearch/includes/Job/JobTraits.php(137): CirrusSearch\Job\ElasticaWrite->doJob()
#6 /srv/mediawiki/php-1.34.0-wmf.8/extensions/CirrusSearch/includes/Updater.php(231): CirrusSearch\Job\CirrusGenericJob->run()
#7 /srv/mediawiki/php-1.34.0-wmf.8/extensions/CirrusSearch/includes/Updater.php(90): CirrusSearch\Updater->updatePages(array, integer)
#8 /srv/mediawiki/php-1.34.0-wmf.8/extensions/CirrusSearch/includes/Job/LinksUpdate.php(51): CirrusSearch\Updater->updateFromTitle(Title)
#9 /srv/mediawiki/php-1.34.0-wmf.8/extensions/CirrusSearch/includes/Job/JobTraits.php(137): CirrusSearch\Job\LinksUpdate->doJob()
#10 /srv/mediawiki/php-1.34.0-wmf.8/extensions/EventBus/includes/JobExecutor.php(66): CirrusSearch\Job\CirrusTitleJob->run()
#11 /srv/mediawiki/rpc/RunSingleJob.php(77): JobExecutor->execute(array)

Impact

(To be determined.) Part of the work the CirrusTitleJob is meant to do is likely skipped without obvious recovery.

Notes

Seems to have started today, with 1.34-wmf.8.

Logstash query: +message:"Unable to serialize events" AND exception.trace:"CirrusTitleJob"

Event Timeline

dcausse triaged this task as Unbreak Now! priority.
dcausse moved this task from needs triage to Current work on the Discovery-Search board.
dcausse lowered the priority of this task from Unbreak Now! to Medium. Jun 6 2019, 1:09 PM

This error is not new, it's just that we changed some class names (message:"Unable to serialize events" AND exception.trace:"CirrusSearch"). Removing from deployment blocker.
I think it's caused by something similar to T184698

Krinkle renamed this task from Fatal error during CirrusTitleJob from JobQueueGroup->push to Fatal error during CirrusSearch-LinksUpdate job (CirrusTitleJob) from JobQueueGroup->push. Jun 8 2019, 2:43 PM

If I understand correctly, this regression started with the transition of the JobQueue from Redis (in which we stored serialised PHP) to EventBus/Kafka (in which we store JSON).

It seems the way EventBus encodes these parameters in JSON is stricter about character encoding than MediaWiki itself.

This particular task is about cirrusSearchLinksUpdate jobs (CirrusSearch\Job\LinksUpdate) that fail to execute an ad hoc "CirrusSearch\Job\ElasticaWrite" job (instantiated and run without being queued first), and upon that failure the ElasticaWrite job tries to re-queue itself for retry. During this queueing, the serialisation step in EventBus fails because the job parameters for "CirrusSearch\Job\ElasticaWrite" contain bytes it does not accept in its serialisation format.
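
For illustration, a minimal sketch of the underlying PHP behaviour (plain PHP, not EventBus code itself, and assuming EventBus's JSON encoding ultimately comes down to json_encode()): json_encode() rejects strings that are not valid UTF-8, while PHP's native serialize(), which the old Redis queue stored, accepts them:

// Hypothetical example bytes: "\xC3\x28" is an invalid UTF-8 sequence.
$params = [ 'source_text' => "some text \xC3\x28 more text" ];
print( json_encode( $params ) === false ? json_last_error_msg() . "\n" : "encoded\n" );
// Prints something like "Malformed UTF-8 characters, possibly incorrectly encoded"
print( strlen( serialize( $params ) ) . " bytes via serialize()\n" );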

This could be resolved in a number of ways:

  • If these invalid values come from a specific source external to CirrusSearch that is known not to produce new such values, but has known-invalid values from a past version of itself (e.g. corrupt database rows for old comments only), then that could be fixed by a one-off maintenance script that we run in prod and upon upgrades for third parties.
  • If these values aren't logically invalid within MW, and CirrusSearch does something somewhat unusual with them, then CirrusSearch may need to detect, transform or skip them in some way so that they can play along with EventBus.
  • If these values aren't logically invalid within MW and are expected as parameters for jobs in other extensions as well, then it may be an issue for EventBus to find a way to store or tolerate them in a way that doesn't result in loss of jobs.

Checked a few examples and there seems to be a mix of:

  • LiquidThread pages
  • old pages (generally in CJK languages) that do not seem to have been updated in a while (~2004)

Example of page id 220386 on jawiki:

Loading the data from mwmaint1002 I see invalid UTF-8 sequences in the wikitext:

// Run via eval.php on mwmaint1002: check each field of the search index document for invalid UTF-8
$p = \WikiPage::factory( Title::newFromID( 220386 ) );
print( $p->getContentModel() . "\n" );
$po = $p->getContentHandler()->getParserOutputForIndexing( $p );
$ar = $p->getContentHandler()->getDataForSearchIndex( $p, $po, new \CirrusSearch\CirrusSearch() );
foreach ( $ar as $k => $v ) { if ( is_string( $v ) && !mb_check_encoding( $v, 'UTF-8' ) ) print( "BAD $k " . gettype( $v ) . "\n" ); else print( "OK $k\n" ); }

Output:
wikitext
OK category
OK external_link
OK outgoing_link
OK template
OK text
BAD source_text string
OK text_bytes
OK content_model
OK coordinates
OK language
OK heading
OK opening_text
OK auxiliary_text
OK defaultsort
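
For reference, and purely as an assumption on my part rather than anything a patch does: PHP can replace the offending bytes by re-encoding the field, e.g.:

// Hypothetical cleanup: re-encoding UTF-8 as UTF-8 replaces invalid sequences
// with the substitute character ('?' by default), after which the field passes the check.
$clean = mb_convert_encoding( $ar['source_text'], 'UTF-8', 'UTF-8' );
print( mb_check_encoding( $clean, 'UTF-8' ) ? "clean\n" : "still bad\n" );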

But what I don't understand yet with this example is that we did index this doc in the past, as it is available in elasticsearch: https://ja.wikipedia.org/wiki/Wikipedia:%E5%89%8A%E9%99%A4%E8%A8%98%E9%8C%B2/%E9%81%8E%E5%8E%BB%E3%83%AD%E3%82%B0_2004%E5%B9%B411%E6%9C%88?action=cirrusDump
The source_text is not available there, which means we somehow managed to send this document to elasticsearch only partially... This puzzles me, as I don't see nor remember anything in the code that could have sent partial docs to elastic and would explain the presence of this data inside elasticsearch...

Tested various PHP versions hoping to find an explanation, in vain.
Moving forward I think I'll add some temporary code in Cirrus to detect invalid UTF-8 sequences in the source text so that we can flag and fix such pages in the elasticsearch index. Since Cirrus is constantly refreshing its index, we might be able to use its data to build the list of non-deleted pages that contain invalid UTF-8 sequences in their text content.
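
A rough sketch of what such detection could look like (illustrative only, written against the eval.php snippet above; not the actual gerrit change):

// Illustrative only: scan the fields produced by getDataForSearchIndex()
// and report which ones carry invalid UTF-8, so affected pages can be
// flagged for later cleanup.
function findInvalidUtf8Fields( array $doc ) {
	$bad = [];
	foreach ( $doc as $field => $value ) {
		if ( is_string( $value ) && !mb_check_encoding( $value, 'UTF-8' ) ) {
			$bad[] = $field;
		}
	}
	return $bad;
}
$badFields = findInvalidUtf8Fields( $ar );
if ( $badFields ) {
	wfDebugLog( 'CirrusSearch', 'Invalid UTF-8 in fields: ' . implode( ', ', $badFields ) );
}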

Change 517083 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Investigate UTF-8 issues by tagging pages with problems

https://gerrit.wikimedia.org/r/517083

Change 517083 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Investigate UTF-8 issues by tagging pages with problems

https://gerrit.wikimedia.org/r/517083

Change 520185 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Fix logging of invalid UTF-8 bytes in source_text

https://gerrit.wikimedia.org/r/520185

Change 520263 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Detect and track invalid UTF8 sequences in more fields

https://gerrit.wikimedia.org/r/520263

Change 520185 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Fix logging of invalid UTF-8 bytes in source_text

https://gerrit.wikimedia.org/r/520185

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:06 PM

Change 520263 abandoned by DCausse:
Detect and track invalid UTF8 sequences in more fields

Reason:
Does not seem to be needed

https://gerrit.wikimedia.org/r/520263

Moving to done: the number of non-deleted pages that contain invalid sequences is extremely small; over all the production wikis there are: