
CirrusSearch: Failing to reindex Meta
Closed, Resolved (Public)

Description

We're having trouble reindexing meta because we're hitting a page with an external link that contains invalid UTF-8:

[2014-06-26 18:43:29,960][DEBUG][action.bulk ] [elastic1018] [metawiki_general_1403807864][5] failed to execute bulk item (index) index {[metawiki_general_1403807864][page][661035], source[{"namespace":2,"namespace_text":"User","title":"COIBot/Local/selftrans.narod.ru","timestamp":"2011-10-11T04:19:40Z","category":["Pages where template include size is exceeded","Noindexed pages","COIBot Local Reports"],"external_link":["wikipediatools.appspot.com/linksearch.jsp?set=top20&link=selftrans.narod.ru","wikipediatools.appspot.com/linksearch.jsp?set=top40&link=selftrans.narod.ru","//wikipediatools.appspot.com/linksearch.jsp?set=major&link=selftrans.narod.ru","http://www.google.com/search?num=10&hl=en&rls=en&q=selftrans.narod.ru","//www.google.com/search?num=100?h1=en&rls=en&q=selftrans.narod.ru+site:en.wikipedia.org","//www.google.com/search?num=100&hl=en&rls=en&q=selftrans.narod.ru+site:fr.wikipedia.org","//www.google.com/search?num=100&hl=en&rls=en&q=selftrans.narod.ru+site:de.wikipedia.org","//www.google.com/search?num=100&hl=en&rls=en&q=selftrans.narod.ru+site:meta.wikimedia.org","http://siteexplorer.search.yahoo.com/advsearch?p=selftrans.narod.ru&bwm=i&bwmf=d&bwms=p","//toolserver.org/~erwin85/xwiki.php?report=User:COIBot/LinkReports/selftrans.narod.ru&forcelive=1","//toolserver.org/~erwin85/xwiki.php?report=User:COIBot/Local/selftrans.narod.ru&forcelive=1","//tools.wmflabs.org/searchsbl/?url=selftrans.narod.ru","http://whois.domaintools.com/selftrans.narod.ru","http://www.aboutus.org/selftrans.narod.ru","http://www.malwaredomainlist.com/mdl.php?search=selftrans.narod.ru&colsearch=Domain&quantity=50","http://www.alexa.com/data/details/main?url=selftrans.narod.ru","http://213.180.199.13","//wikipediatools.appspot.com/linksearch.jsp?set=top20&link=213.180.199.13","//wikipediatools.appspot.com/linksearch.jsp?set=top40&link=213.180.199.13","//wikipediatools.appspot.com/linksearch.jsp?set=major&link=213.180.199.13","http://www.google.com/search?num=10&hl=en&rls=en&q=213.18
0.199.13","//www.google.com/search?num=100?h1=en&rls=en&q=213.180.199.13+site:en.wikipedia.org","//www.google.com/search?num=100&hl=en&rls=en&q=213.180.199.13+site:fr.wikipedia.org","//www.google.com/search?num=100&hl=en&rls=en&q=213.180.199.13+site:de.wikipedia.org","//www.google.com/search?num=100&hl=en&rls=en&q=213.180.199.13+site:meta.wikimedia.org","http://siteexplorer.search.yahoo.com/advsearch?p=213.180.199.13&bwm=i&bwmf=d&bwms=p","//tools.wmflabs.org/searchsbl/?url=213.180.199.13","http://whois.domaintools.com/213.180.199.13","http://www.aboutus.org/213.180.199.13","http://www.malwaredomainlist.com/mdl.php?search=213.180.199.13&colsearch=Domain&quantity=50","http://www.alexa.com/data/details/main?url=213.180.199.13","http://uk.wikipedia.org/wiki/Mediawiki:Spam-whitelist","http://www.google.com/search?q=%C3%83%C2%83%C3%82%C2%83%C3%83%C2%82%C3%82%C2%83%C3%83%C2%83%C3%82%C2%82%C3%83%C2%82%C3%82%C2%83%C3%83%C2%83%C3%82%C2%83%C3%83%C2%82%C3%82%C2%82%C3%83%C2%83%C3%82%C2%82%C3%83%C2%82%C3%82%C2%83%C3%83%C2%83%C3%82%C2%83%C3%83%C2%82%C3%82%C2%83%C3%83%C2%83%C3%82%C2%82%C3%83%C2%82%C3%82%C2%82%C3%83%C2%83%C3%82%C2%83%C3%83%C2%82%C3%82%C2%82%C3%83%C2%83%C3%82%C2%82%C3%83%C2%82%C3%82%C2%83%C3%83%C2%83%C3%82%C2%83%C3%83%C2%82%C3%82%C2%83%C3%83%C2%83%C3%82%C2%82%C3%83%C2%82%C3%82%C2%83%C3%83%C2%83%C3%82%C2%83%C3%83%C2%82%C3%82%C2%82%C3%83%C2%83%C3%82%C2%82%C3%83%C2%82%C3%82%C2%82%C3%83%C2%83%C3%82%C2%83%C3%83%C2%82%C3%82%C2%83%C3%83%C2%83%C3%82%C2%82%C3%83%C2%82%C3%82%C2%82%C3%83%C2%83%C3%82%C2%83%C3%83%C2%82%C3%82%C2%82%C3%83%C2%83%C3%82%C2%82%C3%83%C2%82%C3%82%C2%83%C3%83%C2%83%C3%82%C2%83%C3%83%C2%82%C3%82%C2%83%C3%83%C2%83%C3%82%C2%82%C3%83%C2%82%C3%82%C2%83%C3%83%C2%83%C3%82%C2%83%C3%83%C2%82%C3%82%C2%82%C3%83%C2%83%C3%82%C2%82%C3%83%C2%82%C3%82%C2%83%C3%83%C2%83%C3%82%C2%83%C3

...

java.lang.IllegalArgumentException: Document contains at least one immense term in field="external_link" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[68 74 74 70 3a 2f 2f 77 77 77 2e 67 6f 6f 67 6c 65 2e 63 6f 6d 2f 73 65 61 72 63 68 3f 71]...'
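For reference, the hex prefix in the exception can be decoded to see which link is the offender. This is only a small sketch: the helper name `decode_term_prefix` is made up for illustration, and the byte dump is copied from the message above.

```python
def decode_term_prefix(hex_dump: str) -> str:
    """Turn a space-separated hex byte dump (as printed in the exception) into text."""
    return bytes.fromhex(hex_dump).decode("utf-8", errors="replace")


# Byte dump copied from the "prefix of the first immense term" in the exception.
prefix = decode_term_prefix(
    "68 74 74 70 3a 2f 2f 77 77 77 2e 67 6f 6f 67 6c 65 2e 63 6f 6d "
    "2f 73 65 61 72 63 68 3f 71"
)
print(prefix)  # http://www.google.com/search?q
```

That matches the start of the long, repeatedly percent-encoded google.com search link at the end of the logged document.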

I'm not sure whether this is new behavior in 1.2.1 or something else.


Version: unspecified
Severity: normal

Details

Reference
bz67157

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:27 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz67157.

index: /metawiki_general_1403807864/page/661035 caused IllegalArgumentException[Document contains at least one immense term in field="external_link" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[68 74 74 70 3a 2f 2f 77 77 77 2e 67 6f 6f 67 6c 65 2e 63 6f 6d 2f 73 65 61 72 63 68 3f 71]...']
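One way to find such documents before they reach the index is to measure each `external_link` value in UTF-8 bytes against the 32766 limit quoted in the exception. The sketch below is an illustration only, not necessarily what the eventual patch does; `split_indexable` and the sample links are hypothetical.

```python
MAX_TERM_BYTES = 32766  # limit quoted in the exception message


def utf8_len(text: str) -> int:
    """Length of the string once encoded as UTF-8, which is what the limit counts."""
    return len(text.encode("utf-8"))


def split_indexable(links, limit=MAX_TERM_BYTES):
    """Separate links that fit in a single term from those that would be rejected."""
    keep, skipped = [], []
    for link in links:
        (keep if utf8_len(link) <= limit else skipped).append(link)
    return keep, skipped


# Hypothetical sample: one ordinary link and one bloated by repeated percent-encoding.
short_link = "http://whois.domaintools.com/selftrans.narod.ru"
long_link = "http://www.google.com/search?q=" + "%C3%83%C2%83" * 3000
keep, skipped = split_indexable([short_link, long_link])
print(len(keep), len(skipped))
```

Whether over-long links should be dropped, truncated, or handled in the mapping is a separate decision; this only shows the check.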

Looks like there are more such issues:
cirrus_log/arwikisource.reindex.log:Warning: Search backend error during reindex. Error message is: No enabled connection [Called from CirrusSearch\UpdateOneSearchIndexConfig::reindexInternal in /usr/local/apache/common-local/php-1.24wmf10/extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php at line 794] in /usr/local/apache/common-local/php-1.24wmf10/includes/debug/Debug.php on line 303
cirrus_log/commonswiki.reindex.log:Warning: Search backend error during sending 10 documents to the file index after 49. Regex syntax error: failed to execute script [Called from CirrusSearch\ElasticsearchIntermediary::failure in /usr/local/apache/common-local/php-1.24wmf10/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php at line 98] in /usr/local/apache/common-local/php-1.24wmf10/includes/debug/Debug.php on line 303
cirrus_log/commonswiki.reindex.log:Warning: Search backend error during sending 8 documents to the file index after 89. Regex syntax error: failed to execute script [Called from CirrusSearch\ElasticsearchIntermediary::failure in /usr/local/apache/common-local/php-1.24wmf10/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php at line 98] in /usr/local/apache/common-local/php-1.24wmf10/includes/debug/Debug.php on line 303
cirrus_log/ltwiktionary.reindex.log:Warning: Search backend error during sending 1 documents to the general index after 75. Regex syntax error: failed to execute script [Called from CirrusSearch\ElasticsearchIntermediary::failure in /usr/local/apache/common-local/php-1.24wmf10/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php at line 98] in /usr/local/apache/common-local/php-1.24wmf10/includes/debug/Debug.php on line 303
cirrus_log/metawiki.reindex.log:Warning: Search backend error during reindex. Error message is: Error in one or more bulk request actions:

Though it isn't clear what the actual error is, because of the broken syntax checker that we just fixed.

OK! Those error messages (the ones about regex syntax errors) will stop masking their real errors tonight. The underlying failures are update errors. They are simple enough to fix, and I'll put that in the same patch that fixes meta's problem.

arwikisource is different; I'm not sure what is up with it. It errors out (every time) with:
Warning: Search backend error during reindex. Error message is: No enabled connection [Called from CirrusSearch\UpdateOneSearchIndexConfig::reindexInternal in /usr/local/apache/common-local/php-1.24wmf10/extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php at line 794] in /usr/local/apache/common-local/php-1.24wmf10/includes/debug/Debug.php on line 303

That means it got multiple HTTP failures.
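A rough model of how repeated HTTP failures can end in "No enabled connection": the client disables each connection that fails, and once none are left it gives up. This is a simplified sketch under that assumption, not the client library's actual code; the class names, `send_with_failover`, and the host names are all hypothetical.

```python
from typing import Callable, List


class NoEnabledConnection(Exception):
    """Raised when every connection in the pool has been disabled."""


class Connection:
    def __init__(self, host: str) -> None:
        self.host = host
        self.enabled = True


def send_with_failover(connections: List[Connection],
                       send: Callable[[Connection], str]) -> str:
    """Try each enabled connection in turn, disabling the ones that fail."""
    for conn in connections:
        if not conn.enabled:
            continue
        try:
            return send(conn)
        except ConnectionError:
            conn.enabled = False  # this host is not retried
    raise NoEnabledConnection("No enabled connection")


def always_failing(conn: Connection) -> str:
    # Stand-in for a request that hits an HTTP failure on every host.
    raise ConnectionError(f"HTTP failure talking to {conn.host}")


pool = [Connection("es-a.example"), Connection("es-b.example")]
try:
    send_with_failover(pool, always_failing)
    outcome = "sent"
except NoEnabledConnection as exc:
    outcome = str(exc)
print(outcome)  # No enabled connection
```

In this model a single bad host is survivable; the error only appears after every host in the pool has failed at least once.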

Change 142404 had a related patch set uploaded by Manybubbles:
Fix rare-ish errors

https://gerrit.wikimedia.org/r/142404

Change 142404 merged by jenkins-bot:
Fix rare-ish errors

https://gerrit.wikimedia.org/r/142404

Change 142412 had a related patch set uploaded by Manybubbles:
Fix rare-ish errors

https://gerrit.wikimedia.org/r/142412

Change 142413 had a related patch set uploaded by Manybubbles:
Fix rare-ish errors

https://gerrit.wikimedia.org/r/142413

Change 142413 merged by jenkins-bot:
Fix rare-ish errors

https://gerrit.wikimedia.org/r/142413

Change 142412 merged by jenkins-bot:
Fix rare-ish errors

https://gerrit.wikimedia.org/r/142412