Maniphest T192609

Search backend error during sending {numBulk} documents to the {index} index(s) after {tookMs}: {error_message}
Closed, Resolved · Public · PRODUCTION ERROR

Description

Rolling 1.31.0-wmf.30 out to group1 wikis caused the MediaWiki error rate to climb, mostly due to this warning in the CirrusSearch channel.

Details

Subject	Repo	Branch	Lines +/-
Do not propagate Elastica doc modifications out of DataSender	mediawiki/extensions/CirrusSearch	wmf/1.31.0-wmf.30	+13 -0
Do not propagate Elastica doc modifications out of DataSender	mediawiki/extensions/CirrusSearch	master	+13 -0
Revert "Convert ElasticaWrite job to use json compatible params"	mediawiki/extensions/CirrusSearch	master	+28 -169


Related Objects

		Status	Subtype	Assigned	Task
		Resolved	Release	thcipriani	T183969 1.31.0-wmf.30 deployment blockers
		Resolved	PRODUCTION ERROR	dcausse	T192609 Search backend error during sending {numBulk} documents to the {index} index(s) after {tookMs}: {error_message}

Event Timeline

thcipriani created this task. Apr 19 2018, 10:46 PM

Restricted Application added a subscriber: Aklapper. Apr 19 2018, 10:46 PM

Reedy added projects: Discovery-Search, CirrusSearch. Apr 19 2018, 10:47 PM

Restricted Application added a project: Discovery-ARCHIVED. Apr 19 2018, 10:47 PM

Changed priority to UBN! since I set this as a train blocker.

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. Apr 19 2018, 10:50 PM

The error message seems to be mostly of the form:

/somewiki_type_1234567/page/54321 caused failed to execute script

This is likely related to https://gerrit.wikimedia.org/r/#/q/If61bd58065b45ccfb1542f4f81409b8c21160d17, which was meant to resolve T191024.

Change 427827 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Revert "Convert ElasticaWrite job to use json compatible params"

https://gerrit.wikimedia.org/r/427827

gerritbot added a project: Patch-For-Review. Apr 19 2018, 10:56 PM

I'm having a hard time reproducing this directly, although I am seeing semi-regular occurrences on mediawiki.org. For reference, this isn't limited to the page type; I've seen logs for archive as well. It's some sort of generic problem, but Elasticsearch isn't logging any errors and MediaWiki isn't logging any useful ones. We will need to revisit what is logged on the MediaWiki side after figuring out what should have been logged here.

I think I could reproduce this locally:

[_response:protected] => Array
    (
        [_index] => cirrustestwiki_content_1524217213
        [_type] => page
        [_id] => 340
        [status] => 400
        [error] => Array
            (
                [type] => illegal_argument_exception
                [reason] => failed to execute script
                [caused_by] => Array
                    (
                        [type] => class_cast_exception
                        [reason] => java.base/java.util.ArrayList cannot be cast to java.base/java.util.Map
                    )

            )

    )

I can only trigger this if I set up two elastic clusters, and it's always the last one that fails.

I believe that since the noop handlers array is emptied by the first run against the first cluster, the second run sends an empty array that is probably serialized as a JSON array (instead of an empty object), causing the cast failure in the extra plugin code:

Map<String, String> detectorConfigs = (Map<String, String>) params.get("handlers");
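For illustration, the cast failure can be reproduced outside the plugin with a minimal Java sketch. It assumes only what the response above reports: that "handlers" arrives as a list (a JSON array) rather than a map (a JSON object). The class name HandlersCastDemo and the helpers readHandlers and tryRead are hypothetical and are not part of the extra plugin.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

class HandlersCastDemo {
    // Same shape of cast as the plugin line quoted above (hypothetical wrapper).
    @SuppressWarnings("unchecked")
    static Map<String, String> readHandlers(Map<String, Object> params) {
        return (Map<String, String>) params.get("handlers");
    }

    // Puts the given value under "handlers" and reports whether the cast succeeds.
    static String tryRead(Object handlers) {
        Map<String, Object> params = new HashMap<>();
        params.put("handlers", handlers);
        try {
            readHandlers(params);
            return "ok";
        } catch (ClassCastException e) {
            return "class_cast_exception";
        }
    }

    public static void main(String[] args) {
        // A JSON object ({}) is parsed into a Map, so the cast works.
        System.out.println(tryRead(new HashMap<String, String>())); // ok
        // A JSON array ([]) is parsed into a List, so the cast fails.
        System.out.println(tryRead(new ArrayList<Object>()));       // class_cast_exception
    }
}
```

An empty PHP array carries no information about whether it was meant as a list or a map, which is presumably why the second request ends up with [] where {} was expected.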

I think we should duplicate the doc prior to making any change to it so that the second pass for codfw runs exactly with the same data as eqiad.
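A rough sketch of the copy-before-modify idea, written in Java only to stay consistent with the plugin snippet above (the actual change is in the PHP DataSender and may look quite different). Every name here is hypothetical: sendDestructively stands in for a send step that empties the handlers as a side effect, and the "some_field"/"some_handler" entry is placeholder data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CopyBeforeSendDemo {
    // Hypothetical sender: consumes the handlers while building its request,
    // leaving the document it was given modified.
    static int sendDestructively(Map<String, Map<String, String>> doc) {
        Map<String, String> handlers = doc.get("handlers");
        int sent = handlers.size();
        handlers.clear();
        return sent;
    }

    // Copies the document and its nested maps so a send cannot touch the original.
    static Map<String, Map<String, String>> copyOf(Map<String, Map<String, String>> doc) {
        Map<String, Map<String, String>> copy = new HashMap<>();
        for (Map.Entry<String, Map<String, String>> e : doc.entrySet()) {
            copy.put(e.getKey(), new HashMap<>(e.getValue()));
        }
        return copy;
    }

    // Sends one document to several clusters and records how many handlers each one saw.
    static List<Integer> sendToClusters(int clusters, boolean copyFirst) {
        Map<String, String> handlers = new HashMap<>();
        handlers.put("some_field", "some_handler"); // placeholder data
        Map<String, Map<String, String>> doc = new HashMap<>();
        doc.put("handlers", handlers);

        List<Integer> handlersSeen = new ArrayList<>();
        for (int i = 0; i < clusters; i++) {
            handlersSeen.add(sendDestructively(copyFirst ? copyOf(doc) : doc));
        }
        return handlersSeen;
    }

    public static void main(String[] args) {
        System.out.println(sendToClusters(2, false)); // [1, 0]: second cluster gets emptied handlers
        System.out.println(sendToClusters(2, true));  // [1, 1]: both clusters get the same data
    }
}
```

Without the copy, only the first cluster sees the handlers, which matches the observation that it is always the last cluster that fails.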

Change 427893 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Do not propagate Elastica doc modifications out of DataSender

https://gerrit.wikimedia.org/r/427893

Change 427827 abandoned by EBernhardson:
Revert "Convert ElasticaWrite job to use json compatible params"

Reason:
turned out to not be this patch causing problems.

https://gerrit.wikimedia.org/r/427827

Looks like you all were able to recreate the issue \o/

Let me know if there are tests you'd like to run on the mwdebug servers that would be helpful for troubleshooting.

Change 427893 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Do not propagate Elastica doc modifications out of DataSender

https://gerrit.wikimedia.org/r/427893

Change 427927 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@wmf/1.31.0-wmf.30] Do not propagate Elastica doc modifications out of DataSender

https://gerrit.wikimedia.org/r/427927

Change 427927 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.31.0-wmf.30] Do not propagate Elastica doc modifications out of DataSender

https://gerrit.wikimedia.org/r/427927

Mentioned in SAL (#wikimedia-operations) [2018-04-20T16:44:42Z] <dcausse@tin> Synchronized php-1.31.0-wmf.30/extensions/CirrusSearch/: T192609: Do not propagate Elastica doc modifications out of DataSender (duration: 01m 34s)

The last error of this kind was at 2018-04-20T16:42:36, and I could not see any other occurrence in logstash. The traffic is so low that it's perhaps too early to say it's fixed.
For reference, the bug can be identified in logstash by searching for "Search backend error during sending" AND "caused failed to execute script".
I'll check the logs a bit later and close the issue if nothing wrong shows up.

ReleaseTaggerBot added projects: MW-1.32-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), MW-1.31-release-notes (WMF-deploy-2018-04-17 (1.31.0-wmf.30)). Apr 20 2018, 5:00 PM

EBernhardson moved this task from needs triage to Current work on the Discovery-Search board. Apr 20 2018, 6:05 PM

EBernhardson edited projects, added Discovery-Search (Current work); removed Discovery-Search.

Smalyshev moved this task from Incoming to Needs Reporting on the Discovery-Search (Current work) board. Apr 21 2018, 12:03 AM

greg moved this task from Untriaged to Dec2019/1.35.wmf.10+ on the Wikimedia-production-error board. Apr 22 2018, 6:13 AM

dcausse closed this task as Resolved. Apr 22 2018, 8:27 PM

dcausse claimed this task.

Liuxinyu970226 unsubscribed. Apr 23 2018, 12:56 PM

Jdforrester-WMF mentioned this in T192855: Remex enabled on all wikis in MW 1.30-wmf.30 exposing corruption (signatures coloring unrelated follow-up sections, etc.) on unfixed content. Apr 24 2018, 6:45 PM

Reedy removed a project: Patch-For-Review. May 17 2019, 9:42 PM

Krinkle moved this task from Dec2019/1.35.wmf.10+ to Resolved on the Wikimedia-production-error board. May 29 2019, 4:00 PM

• mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:09 PM