Search backend error during sending {numBulk} documents to the {index} index(s) after {tookMs}: {error_message}
Closed, Resolved · Public

Description

Rolling 1.31.0-wmf.30 out to group1 wikis caused the MediaWiki error rate to climb, mostly due to this warning in the CirrusSearch channel.

Details

Related Gerrit Patches:
mediawiki/extensions/CirrusSearch : wmf/1.31.0-wmf.30 : Do not propagate Elastica doc modifications out of DataSender
mediawiki/extensions/CirrusSearch : master : Do not propagate Elastica doc modifications out of DataSender
mediawiki/extensions/CirrusSearch : master : Revert "Convert ElasticaWrite job to use json compatible params"

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 19 2018, 10:46 PM
Restricted Application added a project: Discovery. · Apr 19 2018, 10:47 PM
thcipriani triaged this task as Unbreak Now! priority. · Apr 19 2018, 10:48 PM

Changed priority to UBN! since I set this as a train blocker.

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · Apr 19 2018, 10:50 PM

The error message seems to be mostly of the form:

/somewiki_type_1234567/page/54321 caused failed to execute script

This is likely related to https://gerrit.wikimedia.org/r/#/q/If61bd58065b45ccfb1542f4f81409b8c21160d17, which was meant to resolve T191024.

Change 427827 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Revert "Convert ElasticaWrite job to use json compatible params"

https://gerrit.wikimedia.org/r/427827

I'm having a hard time reproducing this directly, although I am seeing semi-regular occurrences on mediawiki.org. For reference, this isn't limited to the page type; I've seen logs for archive as well. It's some sort of generic problem, but Elasticsearch isn't logging any errors and MediaWiki isn't logging any useful ones. We will need to revisit what is logged on the MediaWiki side after figuring out what should have been logged here.

dcausse added a subscriber: dcausse. · Edited · Apr 20 2018, 10:19 AM

I think I was able to reproduce this locally:

[_response:protected] => Array
    (
        [_index] => cirrustestwiki_content_1524217213
        [_type] => page
        [_id] => 340
        [status] => 400
        [error] => Array
            (
                [type] => illegal_argument_exception
                [reason] => failed to execute script
                [caused_by] => Array
                    (
                        [type] => class_cast_exception
                        [reason] => java.base/java.util.ArrayList cannot be cast to java.base/java.util.Map
                    )

            )

    )

I can only trigger this if I set up two Elasticsearch clusters, and it is always the last one that fails.

I believe that because the noop handlers are emptied by the first run against the first cluster, the second run sends an empty array, which is probably serialized as a JSON array (instead of an empty object), causing the cast failure in the extra plugin code:

Map<String, String> detectorConfigs = (Map<String, String>) params.get("handlers");
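
To make the failure mode concrete, here is a minimal, self-contained sketch of how an unchecked cast like the one above behaves when "handlers" arrives as a JSON array. The class and method names (HandlersParam, readUnchecked, readLenient) are hypothetical and are not the plugin's code; the lenient variant only illustrates one possible defensive reading and is an assumption, not something the plugin is known to do.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not the extra plugin's code.
final class HandlersParam {
    private HandlersParam() {}

    /** Same shape as the quoted cast: throws ClassCastException if the value is a List. */
    @SuppressWarnings("unchecked")
    static Map<String, String> readUnchecked(Map<String, Object> params) {
        return (Map<String, String>) params.get("handlers");
    }

    /** Assumed defensive variant: a missing value or an empty list is read as "no handlers". */
    static Map<String, String> readLenient(Map<String, Object> params) {
        Object raw = params.get("handlers");
        if (raw == null || (raw instanceof List && ((List<?>) raw).isEmpty())) {
            return new HashMap<>();
        }
        if (raw instanceof Map) {
            Map<String, String> out = new HashMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) raw).entrySet()) {
                out.put(String.valueOf(e.getKey()), String.valueOf(e.getValue()));
            }
            return out;
        }
        throw new IllegalArgumentException(
            "handlers must be a JSON object, got " + raw.getClass().getName());
    }

    public static void main(String[] args) {
        Map<String, Object> params = new HashMap<>();
        // A JSON parser typically turns `"handlers": []` into a List, not a Map.
        params.put("handlers", new ArrayList<String>());

        System.out.println("lenient read is empty: " + readLenient(params).isEmpty());
        try {
            readUnchecked(params);
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed: " + e.getMessage());
        }
    }
}
```

The unchecked version fails at the cast itself, which matches the class_cast_exception in the reproduced response. The fix adopted in this task was on the sending side rather than in the plugin.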

I think we should duplicate the doc before making any change to it, so that the second pass for codfw runs with exactly the same data as eqiad.
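
The idea of duplicating the document before mutating it can be sketched as follows. This is a simplified model in Java, not the CirrusSearch or Elastica code (which is PHP): MultiClusterSend, sendMutating, sendIsolated, and the sample handler values are all hypothetical names and data chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the bug; names do not correspond to CirrusSearch or Elastica APIs.
final class MultiClusterSend {
    private MultiClusterSend() {}

    /**
     * Simulates a sender that consumes the "handlers" entry while building a request.
     * Returns the handlers that were actually sent.
     */
    static Map<String, String> sendMutating(Map<String, Map<String, String>> doc) {
        Map<String, String> handlers = doc.get("handlers");
        Map<String, String> sent = new HashMap<>(handlers);
        handlers.clear(); // side effect that leaks back into the caller's document
        return sent;
    }

    /** Copies the document (one level deep, enough for this model) before sending. */
    static Map<String, String> sendIsolated(Map<String, Map<String, String>> doc) {
        Map<String, Map<String, String>> copy = new HashMap<>();
        for (Map.Entry<String, Map<String, String>> e : doc.entrySet()) {
            copy.put(e.getKey(), new HashMap<>(e.getValue()));
        }
        return sendMutating(copy);
    }

    /** Builds a sample document; the handler value is made up for the example. */
    static Map<String, Map<String, String>> sampleDoc() {
        Map<String, String> handlers = new HashMap<>();
        handlers.put("version", "documentVersion");
        Map<String, Map<String, String>> doc = new HashMap<>();
        doc.put("handlers", handlers);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> shared = sampleDoc();
        System.out.println("first cluster:  " + sendMutating(shared));
        System.out.println("second cluster: " + sendMutating(shared)); // handlers already emptied

        Map<String, Map<String, String>> original = sampleDoc();
        System.out.println("first cluster:  " + sendIsolated(original));
        System.out.println("second cluster: " + sendIsolated(original)); // same data as the first
    }
}
```

With the shared document, the second send sees an emptied handlers map, which mirrors the observation that only the last cluster fails. Copying first keeps both passes identical, at the cost of one extra allocation per document.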

Change 427893 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Do not propagate Elastica doc modifications out of DataSender

https://gerrit.wikimedia.org/r/427893

Change 427827 abandoned by EBernhardson:
Revert "Convert ElasticaWrite job to use json compatible params"

Reason:
Turned out this patch was not the one causing problems.

https://gerrit.wikimedia.org/r/427827

Looks like you all were able to recreate the issue \o/

Let me know if there are tests you'd like to run on the mwdebug servers that would be helpful for troubleshooting.

Change 427893 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Do not propagate Elastica doc modifications out of DataSender

https://gerrit.wikimedia.org/r/427893

Change 427927 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@wmf/1.31.0-wmf.30] Do not propagate Elastica doc modifications out of DataSender

https://gerrit.wikimedia.org/r/427927

Change 427927 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.31.0-wmf.30] Do not propagate Elastica doc modifications out of DataSender

https://gerrit.wikimedia.org/r/427927

Mentioned in SAL (#wikimedia-operations) [2018-04-20T16:44:42Z] <dcausse@tin> Synchronized php-1.31.0-wmf.30/extensions/CirrusSearch/: T192609: Do not propagate Elastica doc modifications out of DataSender (duration: 01m 34s)

dcausse added a comment. · Edited · Apr 20 2018, 4:54 PM

The last error of this kind was at 2018-04-20T16:42:36, and I could not find any later occurrence in logstash. Traffic is so low that it is perhaps too early to say it's fixed.
For reference, the bug can be identified in logstash by searching for "Search backend error during sending" AND "caused failed to execute script".
I'll check the logs a bit later and close the issue if nothing wrong shows up.

dcausse closed this task as Resolved. · Apr 22 2018, 8:27 PM
dcausse claimed this task.
mmodell changed the subtype of this task from "Task" to "Production Error". · Aug 28 2019, 11:09 PM