
Better classification of CirrusSearch errors
Open, Needs Triage · Public · 5 Estimated Story Points

Description

As a maintainer of the search infrastructure, I want more precise metrics on the errors that occur between CirrusSearch and Elasticsearch so that I can better understand problems on the cluster.

CirrusSearch failures are currently categorized into 3 buckets: rejected, failed and unknown. The unknown bucket currently sees 1 error/minute (roughly 1,440/day), so it would be useful to know what these errors are, especially whether they relate to indexing documents.

Graph: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&viewPanel=9

AC:

  • unknown errors should be exceptional (close to 0/day)

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 790323 had a related patch set uploaded (by EJoseph; author: EJoseph):

[mediawiki/extensions/CirrusSearch@master] Add a way to identify CirrusSearch error types on logstash

https://gerrit.wikimedia.org/r/790323

Change 790323 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add a way to identify CirrusSearch error types on logstash

https://gerrit.wikimedia.org/r/790323

The unknown error messages

  • Search backend error during comp_suggest search for '[redacted]' after 252: unknown: Status code 503; upstream connect error or disconnect/reset before headers. reset reason: connection failure (the same error occurs on more_like and comp_suggest queries; the reset reason can also be connection termination)
  • Search backend error during sending {numBulk} documents to the UNKNOWN index(s) after 253: (the error message appears to be empty; this happens during send_data_write)
  • Search backend error during counting links to 1 pages after 250: unknown: Received null for link count on 1 out of 1 pages (on count_links)
  • Search backend error during full_text search for '&&' after 88: parse_exception: parse_exception: Encountered " <AND> "&& "" at line 1, column 0.
    Was expecting one of:
        <NOT> ...
        "+" ...
        "-" ...
        <BAREOPER> ...
        "(" ...
        "*" ...
        <QUOTED> ...
        <TERM> ...
        <PREFIXTERM> ...
        <WILDTERM> ...
        <REGEXPTERM> ...
        "[" ...
        "{" ...
        <NUMBER> ...
        <TERM> ...
    (on full_text)

  • Search backend error during regex search for 'insource://>/' after 8: parsing_exception: Required [regex] (on regex)
  • index_not_found_exception: no such index (on full_text)
  • Search backend error during sending 1 documents to the enwiki_general_1617317067 index(s) after 146: bulk: Error in one or more bulk request actions:
    update: /enwiki_general_1617317067/_doc/15024081 caused shard is not in primary mode [index: enwiki_general_1617317067]
      - update: /enwiki_general_1617317067/_doc/15024081 caused shard is not in primary mode [index: enwiki_general_1617317067]: retry_on_primary_exception: shard is not in primary mode (CurrentState[STARTED] shard is not in primary mode)
    (on send_data_write)
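
Most of the messages listed above follow the shape "Search backend error during <action> after <number>: <type>: <message>". As a rough sketch of how such log lines could be split into fields for triage, the Python below extracts the action, the number logged after "after", the error type and the remaining message. The BackendError tuple, the two regular expressions and the parse_backend_error helper are hypothetical names introduced for illustration; this is not how CirrusSearch itself produces or parses these messages.

```python
import re
from typing import NamedTuple, Optional


class BackendError(NamedTuple):
    """Fields recovered from one log line (hypothetical structure)."""
    action: str                 # e.g. "counting links to 1 pages"
    took: int                   # number logged after "after" (unit not stated in the task)
    error_type: Optional[str]   # e.g. "parsing_exception"; None if no type prefix is present
    message: str                # remaining text, possibly empty or multi-line


# ASSUMPTION: the line layout is inferred from the samples in this task only.
LINE_RE = re.compile(
    r"^Search backend error during (?P<action>.+?) after (?P<took>\d+):\s*(?P<rest>.*)$",
    re.DOTALL,
)
TYPE_RE = re.compile(r"^(?P<type>[a-z_]+): (?P<message>.*)$", re.DOTALL)


def parse_backend_error(line: str) -> Optional[BackendError]:
    """Return the parsed fields, or None when the line has a different shape."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    rest = match.group("rest")
    typed = TYPE_RE.match(rest)
    if typed is None:
        # Covers the empty-message case seen during send_data_write.
        return BackendError(match.group("action"), int(match.group("took")), None, rest)
    return BackendError(
        match.group("action"),
        int(match.group("took")),
        typed.group("type"),
        typed.group("message"),
    )


sample = (
    "Search backend error during counting links to 1 pages after 250: "
    "unknown: Received null for link count on 1 out of 1 pages"
)
print(parse_backend_error(sample))
```

Lines that lack the "Search backend error during" prefix, such as the bare index_not_found_exception entry, return None here and would need separate handling.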

Change 802082 had a related patch set uploaded (by EJoseph; author: EJoseph):

[mediawiki/extensions/CirrusSearch@master] Improve error classification on cirrussearch

https://gerrit.wikimedia.org/r/802082

Change 802082 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Improve error classification on cirrussearch

https://gerrit.wikimedia.org/r/802082

Change 804282 had a related patch set uploaded (by EJoseph; author: EJoseph):

[mediawiki/extensions/CirrusSearch@master] Fix numBulk sometimes not set in DataSender

https://gerrit.wikimedia.org/r/804282

Change 804282 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Fix numBulk sometimes not set in DataSender

https://gerrit.wikimedia.org/r/804282

Change 814854 had a related patch set uploaded (by EJoseph; author: EJoseph):

[mediawiki/extensions/CirrusSearch@master] Throw proper response exception when response is invalid

https://gerrit.wikimedia.org/r/814854

Change 815920 had a related patch set uploaded (by EJoseph; author: EJoseph):

[mediawiki/extensions/CirrusSearch@master] Add logic to validate response data on sendData

https://gerrit.wikimedia.org/r/815920

Change 815920 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add logic to validate response data on sendData

https://gerrit.wikimedia.org/r/815920

New unknown error messages

  • Search backend error during sending {numBulk} documents to the UNKNOWN index(s) after 253: (the error message appears to be empty; this happens during send_data_write)
  • index_not_found_exception: no such index and [action.auto_create_index] contains [-*] (on send_data_write)
  • unknown: Received null for link count on 1 out of 1 pages (on count_links)
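
Some of the residual cases above still carry a recognizable Elasticsearch error type, for example index_not_found_exception. As a rough sketch of how type-based heuristics could route such errors into named buckets instead of unknown, the Python below matches error types against regular expressions. The rejected, failed and unknown bucket names come from the task description; the config_issue bucket, the type-to-bucket assignments in the HEURISTICS table, and the classify and bucket_counts helpers are assumptions made for illustration and do not reproduce CirrusSearch's actual classification (CirrusSearch itself is written in PHP).

```python
import re
from collections import Counter
from typing import Iterable, Optional

# ASSUMPTION: illustrative mapping only; the real assignments may differ.
HEURISTICS = {
    "rejected": [r"^parse_exception$", r"^parsing_exception$"],
    "failed": [r"^retry_on_primary_exception$"],
    "config_issue": [r"^index_not_found_exception$"],  # hypothetical bucket
}


def classify(error_type: Optional[str]) -> str:
    """Return the first bucket whose pattern matches, else "unknown"."""
    if not error_type:
        # Empty or missing types (e.g. an empty error message) stay unknown.
        return "unknown"
    for bucket, patterns in HEURISTICS.items():
        if any(re.search(pattern, error_type) for pattern in patterns):
            return bucket
    return "unknown"


def bucket_counts(error_types: Iterable[Optional[str]]) -> Counter:
    """Tally errors per bucket, e.g. to watch how many remain unknown."""
    return Counter(classify(error_type) for error_type in error_types)


observed = ["index_not_found_exception", "parsing_exception", None, "unknown"]
print(bucket_counts(observed))
```

Counting the unknown bucket over a day of logs gives a direct check against the acceptance criterion of close to 0/day.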

Change 828490 had a related patch set uploaded (by EJoseph; author: EJoseph):

[mediawiki/extensions/CirrusSearch@master] Throw exceptions when multisearch fails due to connection

https://gerrit.wikimedia.org/r/828490

Change 828798 had a related patch set uploaded (by EJoseph; author: EJoseph):

[mediawiki/extensions/CirrusSearch@master] Create new heuristic for config related issues

https://gerrit.wikimedia.org/r/828798

Change 828490 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Throw exceptions when multisearch fails due to connection

https://gerrit.wikimedia.org/r/828490

Change 828798 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Create new heuristic for config related issues

https://gerrit.wikimedia.org/r/828798

Change 814854 abandoned by EJoseph:

[mediawiki/extensions/CirrusSearch@master] Throw proper response exception when response is invalid

Reason:

https://gerrit.wikimedia.org/r/814854