
Exception thrown while running DataSender::sendData in cluster codfw: Data should be a Document, a Script or an array containing Documents and/or Scripts
Closed, Resolved · Public

Description

Seen in logstash:

#0 /srv/mediawiki/php-1.31.0-wmf.27/extensions/CirrusSearch/includes/DataSender.php(192): Elastica\Bulk->addData(array, string)
#1 [internal function]: CirrusSearch\DataSender->sendData(string, array)
#2 /srv/mediawiki/php-1.31.0-wmf.27/extensions/CirrusSearch/includes/Job/ElasticaWrite.php(87): call_user_func_array(array, array)
#3 /srv/mediawiki/php-1.31.0-wmf.27/extensions/CirrusSearch/includes/Job/Job.php(98): CirrusSearch\Job\ElasticaWrite->doJob()
#4 /srv/mediawiki/php-1.31.0-wmf.27/extensions/EventBus/includes/JobExecutor.php(59): CirrusSearch\Job\Job->run()
#5 /srv/mediawiki/rpc/RunSingleJob.php(79): JobExecutor->execute(array)
#6 {main}

The cluster param is set to codfw, so I bet it's a retried job due to a frozen cluster. At the time I was restarting Elasticsearch in codfw.

We need to investigate why we receive such errors.
The data sent to DataSender::sendData() is taken from $this->params['arguments'].
The retried job is created by cloning itself: $job = clone $this;
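As a rough sketch of that retry path (everything other than params['method'], params['arguments'], errorCount and the `clone $this` idiom is an assumption for illustration, not the actual CirrusSearch code):

```
<?php
// Simplified sketch of how ElasticaWrite replays a DataSender call and
// re-enqueues itself on failure. makeDataSender() is a hypothetical helper;
// params['method'], params['arguments'] and errorCount come from the job
// shown in the logs above.
class ElasticaWriteSketch extends \Job {
	public function run() {
		$sender = $this->makeDataSender();        // hypothetical helper
		$method = $this->params['method'];        // e.g. 'sendData'
		$arguments = $this->params['arguments'];  // e.g. [ 'content', [ $doc, ... ] ]

		$status = call_user_func_array( [ $sender, $method ], $arguments );

		if ( !$status->isOK() ) {
			// Retry by cloning the current job, so the clone carries the same
			// params — including whatever is (or is not) left in 'arguments'
			// after the queue round-trip.
			$job = clone $this;
			$job->params['errorCount'] = ( $this->params['errorCount'] ?? 0 ) + 1;
			\JobQueueGroup::singleton()->push( $job );
		}
		return $status->isOK();
	}
}
```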

Event Timeline

Also seeing:

Failed executing job: cirrusSearchElasticaWrite VisualEditor/Diffs arguments=["content",[[]]] cluster=codfw createdAt=1522329753 doNotRetry= errorCount=2 jobReleaseTimestamp=1522334866 method=sendData requestId=WrzoWQpAAE4AAFyMoWoAAABL retryCount=9

The value inside arguments is weird.

Is this something that was happening before but was not logged, or is that a new error due to the transition to EventBus?

I think it's new; I see it only for the wikis we switched over:
other wikis
vs
mediawikiwiki and test wikis

Ok, I think I know what happened here.

The ElasticaWrite job does its own retrying, and the old queue respected allowRetries = false set on the job, so even though the job reported as failed, it wasn't logged/retried in the old queue.

In the new queue we log all job failures, because we're mostly in testing mode. So we need to change one of two things:

  1. We could change the ElasticaWrite job to report success if the retry job was successfully enqueued, and only report an error if some unexpected exception happened.
  2. We could skip logging the error in the JobExecutor when allowRetries is false and treat the error as success, but that seems a bit convoluted.

I personally vote for option 1. In that case, success of the job execution would mean that it either successfully wrote to Elastic or successfully posted a retry message. That would leave failure for unexpected, disastrous errors, in which case ChangeProp will attempt to retry the original job. This will also not affect how the old queue works.
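A minimal sketch of what option 1 could look like (hypothetical structure, not the actual patch): report failure only on unexpected exceptions, and treat a successfully enqueued retry as success:

```
<?php
// Hypothetical sketch of option 1 inside the job's doJob() path.
// sendToCluster() and enqueueRetry() are made-up helpers standing in for
// the real DataSender call and the clone-and-push retry logic.
public function doJob() {
	try {
		$status = $this->sendToCluster();
	} catch ( \Exception $e ) {
		// Unexpected, disastrous error: report failure so ChangeProp
		// retries the original job.
		return false;
	}

	if ( !$status->isOK() ) {
		// Expected failure (e.g. frozen cluster): we enqueue our own retry,
		// so from the executor's point of view this job has succeeded and
		// should not be logged as failed or retried again.
		$this->enqueueRetry();
	}
	return true;
}
```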

I can return true to tell the job executor that we succeeded, but I still don't understand why the arguments in the params of the retried job are not correct.
Can there be serialization issues?
I think the job holds a list of instances of the class \Elastica\Script\Script in its params; is that supported by the new job queue?

I think the job holds a list of instances of the class \Elastica\Script\Script in its params; is that supported by the new job queue?

Ulala.. Per T161647 the new queue serializes the job params into JSON instead of using PHP serialize/unserialize. For most jobs it's a drop-in replacement, because the job params are just strings/numbers.
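To make the failure mode concrete, here is a small, self-contained illustration (using a stand-in class instead of the real \Elastica\Script\Script) of what the JSON round-trip does to object params:

```
<?php
// Stand-in for \Elastica\Script\Script: an object whose state lives in
// non-public properties, as is typical for Elastica value objects.
class FakeScript {
	private $script;
	public function __construct( $script ) {
		$this->script = $script;
	}
}

// Job params roughly as ElasticaWrite would store them: [ indexType, data ].
$params = [ 'arguments' => [ 'content', [ new FakeScript( 'ctx._source.x = 1' ) ] ] ];

// PHP serialize/unserialize (old queue) preserves the object.
$phpRoundTrip = unserialize( serialize( $params ) );
var_dump( $phpRoundTrip['arguments'][1][0] instanceof FakeScript ); // bool(true)

// A JSON round-trip (new queue) silently drops the class and its non-public
// state: the object comes back as an empty array, which is why the logged job
// shows arguments=["content",[[]]] and Bulk->addData() rejects it.
$jsonRoundTrip = json_decode( json_encode( $params ), true );
var_dump( $jsonRoundTrip['arguments'][1][0] ); // array(0) {}
```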

Do other Cirrus jobs use non-trivial types in params, or is ElasticaWrite the only one?

I have to check but I hope not.
I guess we have to fix that and handle the serialization on our own.

EBjune triaged this task as Medium priority. Mar 29 2018, 5:06 PM

I guess we have to fix that and handle the serialization on our own.

I think so. The CirrusSearch jobs extend the deprecated Job class from MediaWiki, while the new JobSpecification class doesn't allow non-JSON-serializable parameters.

So it seems the best way forward is to modify the CirrusSearch jobs.
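In that spirit, one possible way to handle the serialization ourselves, assuming the retried data is a list of \Elastica\Document objects, is to flatten them to JSON-safe arrays before enqueueing and rebuild them when the job runs. This is a sketch of the idea, not the actual patches referenced below:

```
<?php
// Sketch: only JSON-safe arrays go into the job params, and real Document
// objects are rebuilt at execution time. Scripts would need their own encoding.
use Elastica\Document;

function encodeDocumentsForJob( array $docs ) {
	return array_map( function ( Document $doc ) {
		return [ 'id' => $doc->getId(), 'source' => $doc->getData() ];
	}, $docs );
}

function decodeDocumentsFromJob( array $encoded ) {
	return array_map( function ( array $d ) {
		return new Document( $d['id'], $d['source'] );
	}, $encoded );
}
```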

Change 424039 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Restrict DataSender::sendData to Documents

https://gerrit.wikimedia.org/r/424039

Change 424040 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Convert ElasticaWrite job to use json compatible params

https://gerrit.wikimedia.org/r/424040

Change 424039 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Restrict DataSender::sendData to Documents

https://gerrit.wikimedia.org/r/424039

Change 424040 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Convert ElasticaWrite job to use json compatible params

https://gerrit.wikimedia.org/r/424040

Change 427827 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Revert "Convert ElasticaWrite job to use json compatible params"

https://gerrit.wikimedia.org/r/427827

Change 427893 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Do not propagate Elastica doc modifications out of DataSender

https://gerrit.wikimedia.org/r/427893

Please ignore, unrelated patch. I tagged the wrong bug when sending this patch.

Change 427827 abandoned by EBernhardson:
Revert "Convert ElasticaWrite job to use json compatible params"

Reason:
turned out to not be this patch causing problems.

https://gerrit.wikimedia.org/r/427827

I believe the fix for it has been deployed, and we can try to proceed with switching CirrusSearch for some more wikis?

@Pchelolo yes, I believe so as well.
We might want to test more wikis, or perhaps all of them?
We have a cluster restart in progress that triggers write freezes (using the job queue to store writes as jobs). It could be interesting to see how it reacts in such conditions.

We might want to test more wikis, or perhaps all of them?

Right now we're running the Kafka queue on the test wikis and the MediaWiki wiki. I don't think we should switch all of them right away; maybe choose a couple of projects with a bit more traffic for the next iteration?

We have a cluster restart in progress that triggers write freezes (using the job queue to store writes as jobs). It could be interesting to see how it reacts in such conditions.

I can see that from the graphs actually. It's not entirely clear how exactly it reacts, because it's only enabled for super-low-traffic wikis.

I think we can resolve this task and move the conversation to the original T189137.