Fatal "cannot perform this operation with arrays" from CirrusSearch/ElasticaWrite (using JobQueueDB)
Closed, ResolvedPublicPRODUCTION ERROR

Description

Some of my JSON blobs are larger than the max allowed size of a MySQL blob (job_params is a blob column):

2016-01-20 18:09:16 cirrusSearchElasticaWrite Q14475 clientSideTimeout= method=sendData arguments=array(3) cluster=default createdAt=1452822217 errorCount=1 retryCount=0 (id=139253,timestamp=20160115014337) t=593 good

Notice: Unable to unserialize: [a:7:{s:17:"clientSideTimeout";N;s:6:"method";s:8:"sendData";s:9:"arguments";a:3:{i:0;s:7:"content";i:1;a:10:{i:0;O:15:"Elastica\Script":5:{s:24:"]. Unexpected end of buffer during unserialization. in /var/www/wiki/w/includes/jobqueue/JobQueueDB.php on line 803

Fatal error: Invalid operand type was used: cannot perform this operation with arrays in /var/www/wiki/w/extensions/CirrusSearch/includes/Job/ElasticaWrite.php on line 45

I have a lot of Wikidata stuff that I am putting in, and it would be nice if the code could cope with this better (or if job_params were made bigger).

This might not affect production, which doesn't use JobQueueDB.
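
For illustration, here is a minimal standalone PHP sketch (not the actual MediaWiki code) of the failure mode: the serialized params exceed the 64 KiB BLOB limit, the stored blob gets cut off, and unserialize() then returns false instead of the expected params array, so any later array operation on the result fatals.

// Hypothetical example; sizes and param shapes are made up for illustration.
$params = [
    'method'    => 'sendData',
    'arguments' => [ 'content', array_fill( 0, 10, str_repeat( 'x', 10000 ) ) ],
];

$blob      = serialize( $params );       // roughly 100 KB of serialized data
$truncated = substr( $blob, 0, 65535 );  // what a plain BLOB column can hold

$restored = @unserialize( $truncated );  // fails on the cut-off string
var_dump( $restored );                   // bool(false), not the original array

// Code that then assumes $restored is an array (e.g. merging it with
// defaults) produces a fatal like the one reported above.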

Event Timeline

aude raised the priority of this task from to Medium.
aude updated the task description. (Show Details)
aude added projects: CirrusSearch, Wikidata.
aude subscribed.
Restricted Application added a subscriber: Aklapper.

The job is batching 10 updates (e.g. 10 Cirrus documents for Wikibase items), and together they are too large for a MySQL blob :/

Maybe the batch could be smaller / configurable, and/or job_params could be a mediumblob.
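
As a rough sketch of the smaller/configurable batch idea (hypothetical helper, not existing CirrusSearch code), the documents could be chunked before building job params so each job stays under the blob limit:

// Hypothetical sketch; buildWriteJobParams() does not exist in CirrusSearch,
// and the default batch size here is made up for illustration.
function buildWriteJobParams( array $documents, int $batchSize = 5 ): array {
    $jobs = [];
    foreach ( array_chunk( $documents, $batchSize ) as $chunk ) {
        // One sendData-style job per chunk instead of one job for all
        // documents; smaller chunks mean smaller serialized job_params.
        $jobs[] = [ 'method' => 'sendData', 'arguments' => [ 'content', $chunk ] ];
    }
    return $jobs;
}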

When will this bug get fixed? We are using Redis instead of MySQL to store jobs now.
However, we can't afford a lot of memory, which causes jobs stored in Redis to be truncated when the memory cap is reached (2,000 to 3,000 jobs).

It is not immediately obvious how to install Redis for this scenario. For my setup, on CentOS 7, I had to do the following:

sudo yum install redis php70u-pecl-redis
sudo systemctl start redis.service
sudo systemctl enable redis.service

Then, per the Redis setup docs (https://www.mediawiki.org/wiki/Redis#Job_queue), I added the following to my LocalSettings.php:

$wgJobTypeConf['default'] = [
    'class' => 'JobQueueRedis',
    'redisServer' => '127.0.0.1:6379',
    'redisConfig' => [],
    'claimTTL' => 3600,
    'daemonized' => true
];

NOTE: The daemonized parameter is required; see: https://www.mediawiki.org/w/index.php?title=Topic:Ss9ues5n7gtctppm&topic_showPostId=tbv2820lgm06ipbo#flow-post-tbv2820lgm06ipbo

At this point I expected my setup to be functional again; however, it was not. I now received errors when searching. Maybe I didn't give it long enough, but I waited a few minutes and restarted Apache with no luck. Finally, I just re-indexed per the CirrusSearch README: https://phabricator.wikimedia.org/diffusion/ECIR/browse/master/README

I have verified that both pre-existing and new content is now searchable as expected, without any errors.

mobrovac subscribed.

Removing WMF-JobQueue as we don't use JobQueueDB in production.

Why does the job itself contain all of the transformed text, rather than just a revision/page ID from which the transformed text could be derived? I get that some metadata is not stored elsewhere and would have to go in the job.
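
Purely for illustration, the two shapes of job params being contrasted might look like this (made-up values, not the real param layout):

// Job that carries only identifiers and derives the document at run time:
$lightweightJobParams = [ 'method' => 'sendData', 'pageId' => 123, 'revId' => 456 ];

// Job as it exists today, carrying the fully transformed documents,
// which is what can push the serialized params past the blob limit:
$currentJobParams = [
    'method'    => 'sendData',
    'arguments' => [ 'content', [ /* full Elastica documents */ ] ],
];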

Krinkle moved this task from Untriaged to Jan2020/1.35-wmf.14 on the Wikimedia-production-error board.

Error

Request ID: fe2ef475a4166ddf6d2b9afa

message
PHP Notice: Unable to unserialize: [a:9:{s:6:"method";s:8:"sendData";s:9:"arguments";a:2:{i:0;s:7:"content";i:1;a:1:{i:0;a:3:{s:4:"data";a:23:{s:7:"version"……;i:10]. Unexpected end of buffer during unserialization.
trace
#0 /srv/mediawiki/php-1.33.0-wmf.22/includes/jobqueue/JobQueueDB.php(858): MWExceptionHandler::handleError(integer, string, string, integer, array, array)
#1 /srv/mediawiki/php-1.33.0-wmf.22/includes/jobqueue/JobQueueDB.php(312): JobQueueDB::extractBlob(string)
#2 /srv/mediawiki/php-1.33.0-wmf.22/includes/jobqueue/JobQueue.php(377): JobQueueDB->doPop()
#3 /srv/mediawiki/php-1.33.0-wmf.22/includes/jobqueue/JobQueueGroup.php(260): JobQueue->pop()
#4 /srv/mediawiki/php-1.33.0-wmf.22/includes/jobqueue/JobRunner.php(167): JobQueueGroup->pop(integer, integer, array)
#5 /srv/mediawiki/php-1.33.0-wmf.22/maintenance/runJobs.php(90): JobRunner->run(array)
#6 /srv/mediawiki/php-1.33.0-wmf.22/maintenance/doMaintenance.php(94): RunJobs->execute()
#7 /srv/mediawiki/php-1.33.0-wmf.22/maintenance/runJobs.php(126): include(string)

Impact

Whichever jobs these represented cannot be run; this means some jobs (e.g. e-mails, notifications, or derived data updates like category membership, page links, etc.) are not being processed on wikitech.wikimedia.org.

Notes

This appears to be a new regression in 1.33.0-wmf.22; there were no reports of this from before the branch went out.

Krinkle renamed this task from Notice: Unable to unserialize job_params in some CirrusSearch jobs (when using JobQueueDB) to Fatal "cannot perform this operation with arrays" from CirrusSearch/ElasticaWrite (using JobQueueDB). Mar 30 2019, 3:23 AM
Krinkle added subscribers: EBernhardson, debt, GTirloni.

Still seen. Causing some search jobs to fail for wikitech.wikimedia.org.

error
[c032e62f71eb06fbe34c1b7a] /srv/mediawiki/multiversion/MWScript.php   PHP Fatal Error from line 79 of /srv/mediawiki/php-1.33.0-wmf.22/extensions/CirrusSearch/includes/Job/ElasticaWrite.php: Invalid operand type was used: cannot perform this operation with arrays
trace
#0 [internal function]: MWExceptionHandler::handleFatalError()
#1 {main}

The two available fixes are a complete rewrite of the cirrussearch indexing retry pipeline, or changing the job queue to use a non-size limited field type [..]

Those are two ways to actually build the behaviour that the code currently pretends exists. That's cool, but the immediate fix is to not produce a fatal error.

A fatal error signifies that the code is broken, raises our error levels, and may abort a deployment or trigger Ops pages.

In this case, however, it is known that this ability doesn't exist. Until this ability exists, the code either needs to be disabled (e.g. not deployed on Wikitech), or the code needs to handle this error and respond in some way. E.g. avoid queuing updates of this type or this size (possibly configurable), or run them differently, or try it as today and then catch/suppress the failure, maybe logging a warning in its stead.
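
A rough sketch of that last option (catch the failure and log a warning instead of fataling); hypothetical code, not the actual ElasticaWrite implementation:

// Hypothetical guard; the real job code is structured differently.
function runElasticaWriteJob( $params ): bool {
    if ( !is_array( $params ) ) {
        // The blob was truncated and unserialize() returned false; the update
        // is lost either way, but we degrade to a warning instead of a fatal.
        trigger_error( 'ElasticaWrite job has unusable params (truncated blob?)', E_USER_WARNING );
        return false;
    }
    // ... perform the write as today ...
    return true;
}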

Until this ability exists, the code either needs to be disabled (e.g. not deployed on Wikitech), or the code needs to handle this error and respond in some way. E.g. avoid queuing updates of this type or this size (possibly configurable), or run them differently,

What you just described is option 1: rewrite the indexing retry pipeline. If you think turning off CirrusSearch on wikitech is the best alternative, I can float that to the wikitech list, but perhaps unsurprisingly I don't prefer that option.

E.g. avoid queuing updates of this type or this size (possibly configurable), or run them differently, or to try it as today and then catch/suppress the failure - maybe logging a warning in its stead.

IMO the JobQueue should raise an error if it's not able to save the message correctly. Since the queue owns the way the message is serialized, it's hard for an extension to determine what the actual size of the stored message will be.
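
For example, a size check at push time might look roughly like this (hypothetical code, not the real JobQueueDB; the 64 KiB limit matches a plain MySQL BLOB column):

// Hypothetical enqueue-time check; the constant and function names are illustrative.
const MAX_JOB_BLOB_SIZE = 65535; // MySQL BLOB limit that job_params was hitting

function serializeJobParams( array $params ): string {
    $blob = serialize( $params );
    if ( strlen( $blob ) > MAX_JOB_BLOB_SIZE ) {
        // Fail loudly at enqueue time instead of silently truncating and
        // producing an unreadable job later.
        throw new RuntimeException(
            'Serialized job params are ' . strlen( $blob ) . ' bytes, over the storage limit'
        );
    }
    return $blob;
}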

Change 500481 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/core@master] Change job table params from blob to mediumblob

https://gerrit.wikimedia.org/r/500481

Change 500481 merged by jenkins-bot:
[mediawiki/core@master] Change job table params from blob to mediumblob

https://gerrit.wikimedia.org/r/500481

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:11 PM