
Exception thrown while running DataSender::sendData in cluster codfw: Data should be a Document, a Script or an array containing Documents and/or Scripts
Closed, Resolved · Public

Description

Seen in logstash:

#0 /srv/mediawiki/php-1.31.0-wmf.27/extensions/CirrusSearch/includes/DataSender.php(192): Elastica\Bulk->addData(array, string)
#1 [internal function]: CirrusSearch\DataSender->sendData(string, array)
#2 /srv/mediawiki/php-1.31.0-wmf.27/extensions/CirrusSearch/includes/Job/ElasticaWrite.php(87): call_user_func_array(array, array)
#3 /srv/mediawiki/php-1.31.0-wmf.27/extensions/CirrusSearch/includes/Job/Job.php(98): CirrusSearch\Job\ElasticaWrite->doJob()
#4 /srv/mediawiki/php-1.31.0-wmf.27/extensions/EventBus/includes/JobExecutor.php(59): CirrusSearch\Job\Job->run()
#5 /srv/mediawiki/rpc/RunSingleJob.php(79): JobExecutor->execute(array)
#6 {main}

The cluster param is set to codfw, so I bet it's a retried job due to a frozen cluster. At the time I was restarting Elasticsearch in codfw.

We need to investigate why we receive such errors.
The data sent to DataSender::sendData() is taken from $this->params['arguments'].
The retried job is created by cloning itself: $job = clone $this;
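As a rough sketch of that retry path (everything other than params['method'], params['arguments'], errorCount and the `clone $this` idiom is an assumption for illustration, not the actual CirrusSearch code):

```
<?php
// Simplified sketch of how ElasticaWrite replays a DataSender call and
// re-enqueues itself on failure. makeDataSender() is a hypothetical helper;
// params['method'], params['arguments'] and errorCount come from the job
// shown in the logs above.
class ElasticaWriteSketch extends \Job {
	public function run() {
		$sender = $this->makeDataSender();        // hypothetical helper
		$method = $this->params['method'];        // e.g. 'sendData'
		$arguments = $this->params['arguments'];  // e.g. [ 'content', [ $doc, ... ] ]

		$status = call_user_func_array( [ $sender, $method ], $arguments );

		if ( !$status->isOK() ) {
			// Retry by cloning the current job, so the clone carries the same
			// params — including whatever is (or is not) left in 'arguments'
			// after the queue round-trip.
			$job = clone $this;
			$job->params['errorCount'] = ( $this->params['errorCount'] ?? 0 ) + 1;
			\JobQueueGroup::singleton()->push( $job );
		}
		return $status->isOK();
	}
}
```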

Event Timeline

Also seeing:

Failed executing job: cirrusSearchElasticaWrite VisualEditor/Diffs arguments=["content",[[]]] cluster=codfw createdAt=1522329753 doNotRetry= errorCount=2 jobReleaseTimestamp=1522334866 method=sendData requestId=WrzoWQpAAE4AAFyMoWoAAABL retryCount=9

The value inside arguments is weird.

Is this something that was happening before but was not logged, or is that a new error due to the transition to EventBus?

I think it's new; I see it only for the wikis we switched over:
other wikis
vs
mediawikiwiki and test wikis

Ok, I think I know what happened here.

The ElasticaWrite job does its own retrying, and the old queue respected allowRetries = false set on the job, so even though the job reported as failed, it wasn't logged/retried in the old queue.

In the new queue we log all job failures, because we're mostly in testing mode. So we need to change one of two things:

  1. We could change the ElasticaWrite job to report success if the retry job was successfully enqueued, and only report an error if some unexpected exception happened.
  2. We could skip logging the error in the JobExecutor when allowRetries is false and treat the error as success, but that seems a bit convoluted.

I personally vote for option 1. In that case, success of the job execution would mean that it either successfully wrote to Elastic or successfully posted a retry message. That would leave failure for unexpected, disastrous errors, in which case ChangeProp will attempt to retry the original job. This will also not affect how the old queue works.
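A minimal sketch of what option 1 could look like (hypothetical structure, not the actual patch): report failure only on unexpected exceptions, and treat a successfully enqueued retry as success:

```
<?php
// Hypothetical sketch of option 1 inside the job's doJob() path.
// sendToCluster() and enqueueRetry() are made-up helpers standing in for
// the real DataSender call and the clone-and-push retry logic.
public function doJob() {
	try {
		$status = $this->sendToCluster();
	} catch ( \Exception $e ) {
		// Unexpected, disastrous error: report failure so ChangeProp
		// retries the original job.
		return false;
	}

	if ( !$status->isOK() ) {
		// Expected failure (e.g. frozen cluster): we enqueue our own retry,
		// so from the executor's point of view this job has succeeded and
		// should not be logged as failed or retried again.
		$this->enqueueRetry();
	}
	return true;
}
```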

I can return true to tell the job executor that we succeeded, but I still don't understand why the arguments in the params of the retried job are not correct.
Can there be serialization issues?
I think the job holds a list of instances of the class \Elastica\Script\Script in its params; is that supported by the new job queue?

I think the job holds a list of instances of the class \Elastica\Script\Script in its params; is that supported by the new job queue?

Ulala.. Per T161647 the new queue serializes the job params into JSON instead of using PHP serialize/unserialize. For most jobs it's a drop-in replacement, because the job params are just strings/numbers.
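To make the failure mode concrete, here is a small, self-contained illustration (using a stand-in class instead of the real \Elastica\Script\Script) of what the JSON round-trip does to object params:

```
<?php
// Stand-in for \Elastica\Script\Script: an object whose state lives in
// non-public properties, as is typical for Elastica value objects.
class FakeScript {
	private $script;
	public function __construct( $script ) {
		$this->script = $script;
	}
}

// Job params roughly as ElasticaWrite would store them: [ indexType, data ].
$params = [ 'arguments' => [ 'content', [ new FakeScript( 'ctx._source.x = 1' ) ] ] ];

// PHP serialize/unserialize (old queue) preserves the object.
$phpRoundTrip = unserialize( serialize( $params ) );
var_dump( $phpRoundTrip['arguments'][1][0] instanceof FakeScript ); // bool(true)

// A JSON round-trip (new queue) silently drops the class and its non-public
// state: the object comes back as an empty array, which is why the logged job
// shows arguments=["content",[[]]] and Bulk->addData() rejects it.
$jsonRoundTrip = json_decode( json_encode( $params ), true );
var_dump( $jsonRoundTrip['arguments'][1][0] ); // array(0) {}
```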

Do other Cirrus jobs use non-trivial types in params, or is ElasticaWrite the only one?

I have to check but I hope not.
I guess we have to fix that and handle the serialization on our own.

EBjune triaged this task as Medium priority. Mar 29 2018, 5:06 PM

I guess we have to fix that and handle the serialization on our own.

I think so. The CirrusSearch jobs extend the deprecated Job class from MediaWiki, while the new JobSpecification class doesn't allow non-JSON-serializable parameters.

So it seems the best way forward is to modify the CirrusSearch jobs.
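In that spirit, one possible way to handle the serialization ourselves, assuming the retried data is a list of \Elastica\Document objects, is to flatten them to JSON-safe arrays before enqueueing and rebuild them when the job runs. This is a sketch of the idea, not the actual patches referenced below:

```
<?php
// Sketch: only JSON-safe arrays go into the job params, and real Document
// objects are rebuilt at execution time. Scripts would need their own encoding.
use Elastica\Document;

function encodeDocumentsForJob( array $docs ) {
	return array_map( function ( Document $doc ) {
		return [ 'id' => $doc->getId(), 'source' => $doc->getData() ];
	}, $docs );
}

function decodeDocumentsFromJob( array $encoded ) {
	return array_map( function ( array $d ) {
		return new Document( $d['id'], $d['source'] );
	}, $encoded );
}
```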

Change 424039 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Restrict DataSender::sendData to Documents

https://gerrit.wikimedia.org/r/424039

Change 424040 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Convert ElasticaWrite job to use json compatible params

https://gerrit.wikimedia.org/r/424040

Change 424039 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Restrict DataSender::sendData to Documents

https://gerrit.wikimedia.org/r/424039

Change 424040 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Convert ElasticaWrite job to use json compatible params

https://gerrit.wikimedia.org/r/424040

Change 427827 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Revert "Convert ElasticaWrite job to use json compatible params"

https://gerrit.wikimedia.org/r/427827

Change 427893 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Do not propagate Elastica doc modifications out of DataSender

https://gerrit.wikimedia.org/r/427893

Please ignore, unrelated patch. I tagged the wrong bug when sending this patch.

Change 427827 abandoned by EBernhardson:
Revert "Convert ElasticaWrite job to use json compatible params"

Reason:
turned out to not be this patch causing problems.

https://gerrit.wikimedia.org/r/427827

I believe the fix for it has been deployed, and we can try to proceed with switching CirrusSearch for some more wikis?

@Pchelolo yes, I believe so as well.
We might want to test more wikis, or perhaps all of them?
We have a cluster restart in progress that triggers write freezes (using the job queue to store writes as jobs). It could be interesting to see how it reacts in such conditions.

We might want to test more wikis, or perhaps all of them?

Right now we're running the Kafka queue on the test wikis and the MediaWiki wiki. I don't think we should switch all of them right away; maybe choose a couple of projects with a bit more traffic for the next iteration?

We have a cluster restart in progress that triggers write freezes (using the job queue to store writes as jobs). It could be interesting to see how it reacts in such conditions.

I can see that from the graphs actually. It's not entirely clear how exactly it reacts, because it's only enabled for super-low-traffic wikis.

I think we can resolve this task and move the conversation to the original T189137.