
Search backend error during sending {numBulk} documents to the {index} index(s) after {tookMs}: {error_message}
Closed, Resolved | Public | PRODUCTION ERROR

Description

Rolling 1.31.0-wmf.30 out to group1 wikis caused the MediaWiki error rate to climb, mostly due to this warning in the CirrusSearch channel.

Event Timeline

thcipriani triaged this task as Unbreak Now! priority. (Apr 19 2018, 10:48 PM)

Changed priority to UBN! since I set this as a train blocker.

The error message seems to be mostly of the form:

/somewiki_type_1234567/page/54321 caused failed to execute script

This is likely related to https://gerrit.wikimedia.org/r/#/q/If61bd58065b45ccfb1542f4f81409b8c21160d17, which was to resolve T191024

Change 427827 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Revert "Convert ElasticaWrite job to use json compatible params"

https://gerrit.wikimedia.org/r/427827

Having a hard time reproducing this directly, although I am seeing semi-regular occurrences on mediawiki.org. For reference, this isn't limited to the page type; I've seen logs for archive as well. It's some sort of generic problem, but Elasticsearch isn't logging any errors and MediaWiki isn't logging any useful ones. We will need to revisit what is logged on the MediaWiki side after figuring out what should have been logged here.

I think I was able to reproduce this locally:

[_response:protected] => Array
    (
        [_index] => cirrustestwiki_content_1524217213
        [_type] => page
        [_id] => 340
        [status] => 400
        [error] => Array
            (
                [type] => illegal_argument_exception
                [reason] => failed to execute script
                [caused_by] => Array
                    (
                        [type] => class_cast_exception
                        [reason] => java.base/java.util.ArrayList cannot be cast to java.base/java.util.Map
                    )

            )

    )

I can only trigger this if I set up two Elasticsearch clusters, and it's always the last one that fails.

I believe that because the noop handlers are emptied by the first run against the first cluster, the second run sends an empty array, which is probably serialized as a JSON array ([]) instead of an empty object ({}). That would cause the cast failure in the extra plugin code:

Map<String, String> detectorConfigs = (Map<String, String>) params.get("handlers");
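To illustrate the suspected failure mode, here is a minimal standalone sketch. It is not the plugin's code: the class and method names are hypothetical, and it assumes that a JSON [] arrives in the script params as an ArrayList and a JSON {} as a Map, which matches the class_cast_exception reported above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; only the cast itself mirrors the plugin line quoted above.
class HandlersCastDemo {

    // Puts the given value under "handlers" and attempts the same cast as the plugin.
    // Returns true if the cast succeeds, false if it throws ClassCastException.
    @SuppressWarnings("unchecked")
    static boolean castSucceeds(Object handlersValue) {
        Map<String, Object> params = new HashMap<>();
        params.put("handlers", handlersValue);
        try {
            Map<String, String> detectorConfigs = (Map<String, String>) params.get("handlers");
            return true;
        } catch (ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed shape of an empty JSON object {} after parsing.
        System.out.println(castSucceeds(new HashMap<String, String>())); // true
        // Assumed shape of an empty JSON array [] after parsing.
        System.out.println(castSucceeds(new ArrayList<String>()));       // false
    }
}
```

Under these assumptions, an empty handlers value only breaks the script when it is serialized as an array, which would explain why the first cluster succeeds and the second fails.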

I think we should duplicate the doc before making any change to it, so that the second pass for codfw runs with exactly the same data as the first pass for eqiad.
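The copy-before-mutate idea can be sketched as follows. This is a simplified model rather than the CirrusSearch change itself: the class, the method names, and the sample handler entry are hypothetical, and the "send" step merely clears the handlers to imitate the side effect described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of sending one document's handlers to several clusters.
class CopyBeforeSendDemo {

    // Stand-in for a send step that consumes the handlers while building the request.
    // Returns how many handlers this send actually saw.
    static int sendAndConsume(Map<String, String> handlers) {
        int seen = handlers.size();
        handlers.clear(); // the side effect that leaks into later sends when the map is shared
        return seen;
    }

    // Sends to each cluster in turn, optionally working on a copy each time.
    static List<Integer> sendToClusters(Map<String, String> handlers, int clusters, boolean copyFirst) {
        List<Integer> seen = new ArrayList<>();
        for (int i = 0; i < clusters; i++) {
            Map<String, String> toSend = copyFirst ? new HashMap<>(handlers) : handlers;
            seen.add(sendAndConsume(toSend));
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, String> handlers = new HashMap<>();
        handlers.put("version", "documentVersion"); // illustrative entry only

        System.out.println(sendToClusters(new HashMap<>(handlers), 2, false)); // [1, 0]
        System.out.println(sendToClusters(new HashMap<>(handlers), 2, true));  // [1, 1]
    }
}
```

A shallow copy is enough in this sketch because the values are strings; a nested document would need a deep copy for the second pass to see identical data.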

Change 427893 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Do not propagate Elastica doc modifications out of DataSender

https://gerrit.wikimedia.org/r/427893

Change 427827 abandoned by EBernhardson:
Revert "Convert ElasticaWrite job to use json compatible params"

Reason:
turned out to not be this patch causing problems.

https://gerrit.wikimedia.org/r/427827

Looks like you all were able to recreate the issue \o/

Let me know if there are tests you'd like to run on the mwdebug servers that would be helpful for troubleshooting.

Change 427893 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Do not propagate Elastica doc modifications out of DataSender

https://gerrit.wikimedia.org/r/427893

Change 427927 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@wmf/1.31.0-wmf.30] Do not propagate Elastica doc modifications out of DataSender

https://gerrit.wikimedia.org/r/427927

Change 427927 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.31.0-wmf.30] Do not propagate Elastica doc modifications out of DataSender

https://gerrit.wikimedia.org/r/427927

Mentioned in SAL (#wikimedia-operations) [2018-04-20T16:44:42Z] <dcausse@tin> Synchronized php-1.31.0-wmf.30/extensions/CirrusSearch/: T192609: Do not propagate Elastica doc modifications out of DataSender (duration: 01m 34s)

The last error of this kind was at 2018-04-20T16:42:36, and I could not see any other occurrence in Logstash after that. Traffic is so low that it's perhaps too early to say it's fixed.
For reference, the bug can be identified in Logstash by searching for "Search backend error during sending" AND "caused failed to execute script".
I'll check the logs a bit later and close the issue if nothing wrong shows up.

dcausse claimed this task.
mmodell changed the subtype of this task from "Task" to "Production Error". (Aug 28 2019, 11:09 PM)