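Rolling 1.31.0-wmf.30 out to group1 wikis caused the MediaWiki error rate to climb, mostly due to this warning in the CirrusSearch channel: Search backend error during sending {numBulk} documents to the {index} index(s) after {tookMs}: {error_message}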
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | Release | thcipriani | T183969 1.31.0-wmf.30 deployment blockers
Resolved | PRODUCTION ERROR | dcausse | T192609 Search backend error during sending {numBulk} documents to the {index} index(s) after {tookMs}: {error_message}
Event Timeline
The error message seems to be mostly of the form:
/somewiki_type_1234567/page/54321 caused failed to execute script
This is likely related to https://gerrit.wikimedia.org/r/#/q/If61bd58065b45ccfb1542f4f81409b8c21160d17, which was meant to resolve T191024.
Change 427827 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Revert "Convert ElasticaWrite job to use json compatible params"
Having a hard time reproducing directly, although I am seeing semi-regular occurrences on mediawiki.org. For reference, this isn't limited to the page type; I've seen logs for archive as well. It's some sort of generic problem, but Elasticsearch isn't logging any errors and MediaWiki isn't logging any useful errors. Will need to revisit what is logged on the MediaWiki side after figuring out what should have been logged here.
I think I was able to reproduce this locally:
[_response:protected] => Array
(
    [_index] => cirrustestwiki_content_1524217213
    [_type] => page
    [_id] => 340
    [status] => 400
    [error] => Array
    (
        [type] => illegal_argument_exception
        [reason] => failed to execute script
        [caused_by] => Array
        (
            [type] => class_cast_exception
            [reason] => java.base/java.util.ArrayList cannot be cast to java.base/java.util.Map
        )
    )
)
I can only trigger this if I set up two Elasticsearch clusters, and it's always the last one that fails.
I believe that, since the noop handlers are emptied by the first run on the first cluster, the second run sends an empty array, which is probably serialized as a JSON array (instead of an empty object). That causes the cast failure in the extra plugin code:
Map<String, String> detectorConfigs = (Map<String, String>) params.get("handlers");
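To illustrate why that cast fails, here is a minimal sketch. It assumes the script params are deserialized into plain `java.util` collections, so that a JSON `[]` becomes an `ArrayList` and a JSON `{}` becomes a `Map`. The class and method names are hypothetical, and this is not the plugin's actual code; only the cast itself is taken from the line quoted above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; only the cast mirrors the plugin line quoted above.
final class HandlersCastDemo {

    // Returns true if the unchecked cast of params["handlers"] throws.
    @SuppressWarnings("unchecked")
    static boolean castFails(Map<String, Object> params) {
        try {
            Map<String, String> detectorConfigs = (Map<String, String>) params.get("handlers");
            return false;
        } catch (ClassCastException e) {
            // e.g. "java.util.ArrayList cannot be cast to java.util.Map"
            return true;
        }
    }

    public static void main(String[] args) {
        // Assumption: an empty JSON object arrives as a Map.
        Map<String, Object> asObject = new HashMap<>();
        asObject.put("handlers", new HashMap<String, String>());

        // Assumption: an empty JSON array arrives as an ArrayList.
        Map<String, Object> asArray = new HashMap<>();
        asArray.put("handlers", new ArrayList<String>());

        System.out.println("{} fails: " + castFails(asObject));
        System.out.println("[] fails: " + castFails(asArray));
    }
}
```

Under these assumptions only the array form trips the cast, which would match the `class_cast_exception` seen in the response above.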
I think we should duplicate the doc prior to making any change to it so that the second pass for codfw runs exactly with the same data as eqiad.
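The copy-before-modify idea can be sketched as follows. This is a language-neutral illustration written in Java, not the CirrusSearch (PHP) change itself: `send` is a hypothetical stand-in for a sender that empties the handlers as a side effect, and the handler entry used in `main` is illustrative only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo of duplicating a document before a mutating send.
final class CopyBeforeSendDemo {

    // Stand-in sender: records what it would send, then empties the input.
    static Map<String, String> send(Map<String, String> handlers) {
        Map<String, String> sent = new HashMap<String, String>(handlers);
        handlers.clear(); // side effect that leaks to the caller's object
        return sent;
    }

    // Sends the same handlers to each cluster, optionally copying first.
    static List<Map<String, String>> sendToClusters(
            Map<String, String> handlers, int clusters, boolean copyFirst) {
        List<Map<String, String>> requests = new ArrayList<>();
        for (int i = 0; i < clusters; i++) {
            Map<String, String> input =
                    copyFirst ? new HashMap<String, String>(handlers) : handlers;
            requests.add(send(input));
        }
        return requests;
    }

    public static void main(String[] args) {
        Map<String, String> handlers = new HashMap<>();
        handlers.put("version", "documentVersion"); // illustrative entry

        // With a copy per cluster, every request carries the same handlers.
        System.out.println(sendToClusters(handlers, 2, true));

        // Without a copy, the second request goes out with empty handlers.
        System.out.println(sendToClusters(handlers, 2, false));
    }
}
```

A shallow copy is enough in this sketch because the map holds only strings; a nested document would need a deep copy for the second pass to see identical data.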
Change 427893 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Do not propagate Elastica doc modifications out of DataSender
Change 427827 abandoned by EBernhardson:
Revert "Convert ElasticaWrite job to use json compatible params"
Reason:
turned out to not be this patch causing problems.
Looks like you all were able to recreate the issue \o/
Let me know if there are tests you'd like to run on the mwdebug servers that would be helpful for troubleshooting.
Change 427893 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Do not propagate Elastica doc modifications out of DataSender
Change 427927 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@wmf/1.31.0-wmf.30] Do not propagate Elastica doc modifications out of DataSender
Change 427927 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.31.0-wmf.30] Do not propagate Elastica doc modifications out of DataSender
Mentioned in SAL (#wikimedia-operations) [2018-04-20T16:44:42Z] <dcausse@tin> Synchronized php-1.31.0-wmf.30/extensions/CirrusSearch/: T192609: Do not propagate Elastica doc modifications out of DataSender (duration: 01m 34s)
The last error of this kind was at 2018-04-20T16:42:36, and I could not see any other occurrence in logstash. Traffic is so low that it's perhaps too early to say it's fixed.
For reference the bug can be identified in logstash searching for "Search backend error during sending" AND "caused failed to execute script".
I'll check the logs a bit later and close the issue if nothing wrong shows up.