
OOM Issues in Elasticsearch 5.x on cindy
Closed, Resolved · Public

Description

While getting cindy ready to run the browser test suite with Elasticsearch 5, I ran into one of the tests causing Elasticsearch to explode from ~200M of memory usage up to 2G (the limit) and then die due to OOM. Look into why this happens and what we can do to fix it.

Event Timeline

Approximately (I didn't exhaustively try query lengths) the simplest query that triggers the problem of eating all available memory is:

{
  "query": {
    "multi_match": {
      "fields": [
        "suggest"
      ],
      "query": "vdyējūyeyafqhrqtwtfmvvbv不顾要死不活的姑娘风景如小D3:n t q h ra r n q r n q n r q r n w t n ran s g是否能Z或者"
    }
  }
}

If we trim the query down a bit more, Elasticsearch fails less catastrophically and instead issues org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024:

{
  "query": {
    "multi_match": {
      "fields": [
        "suggest"
      ],
      "query": "vdyējūyeyafqhrqtwtfmvvbv不顾要死不活的姑娘风景如小D3" 
    }
  }
}

I took jstack dumps in a while loop while crashing the instance; they are in cirrus-browser-bot.eqiad.wmflabs:/srv/mediawiki-vagrant/es5_crash_jstack. The files are named by the order in which the dumps were taken. I started the query around stack dump 7 to 9; by 14 the server had crashed.
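
A capture loop of that kind can be as simple as the following sketch (the one-second interval and the numeric file names are assumptions, not the exact commands used):

ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
i=0
while true; do
  # one dump per iteration, numbered in the order it was taken
  jstack "$ES_PID" > "/srv/mediawiki-vagrant/es5_crash_jstack/$i"
  i=$((i + 1))
  sleep 1
done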

My best guess here, not having dug through the relevant Lucene code yet, is that the expansion of the search and the addition of clauses to the boolean query do not happen in strict lockstep. The query is expected to fail at the point where clauses are added to the boolean query, but the prior expansion step manages to exhaust memory completely before that check is ever reached.

Simple reproduction script outside mediawiki:

curl -s -XDELETE localhost:9200/oom_repro | jq .
curl -s -XPUT localhost:9200/oom_repro -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "suggest": {
            "tokenizer": "standard",
            "type": "custom",
            "filter": [
              "suggest_shingle"
            ]
          }
        },
        "filter": {
          "suggest_shingle": {
            "type": "shingle",
            "output_unigrams": true,
            "min_shingle_size": "2",
            "max_shingle_size": "3"
          }
        }
      }
    }
  },
  "mappings": {
    "page": {
      "properties": {
        "suggest": {
          "analyzer": "suggest",
          "type": "text"
        }
      }
    }
  }
}' | jq .

curl -XPOST localhost:9200/oom_repro/page/_search --data-binary '{
  "query": {
    "multi_match": {
      "fields": [
        "suggest"
      ],
      "query": "vdyējūyeyafqhrqtwtfmvvbv不顾要死不活的姑娘风景如小D3:n t q h ra r n q r n q n r q r n w t n ran s g是否能Z或者"
    }
  }
}'

5.2.2: OOM
5.2.0: OOM
5.1.2: Happy

So the regression happened somewhere between 5.1.2 and 5.2.0.
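
The per-version check can be scripted along these lines (a sketch assuming the repro above is saved as repro.sh and the standard tarball URLs; wait times and paths are illustrative):

for v in 5.1.2 5.2.0 5.2.2; do
  curl -sLO "https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-$v.tar.gz"
  tar xzf "elasticsearch-$v.tar.gz"
  "elasticsearch-$v/bin/elasticsearch" -d -p es.pid   # start daemonized, write a pid file
  sleep 30                                            # give the node time to come up
  bash repro.sh                                       # reproduction script above
  kill "$(cat es.pid)"
  sleep 5
done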

Looking at this now as it seems to be a blocker.

Created https://github.com/elastic/elasticsearch/issues/23509
I think we can work around the problem by setting a limit on the shingle search analyzer.
I will create a patch to change the analysis chain and expose an analyzer like that.
Unfortunately there's no guarantee that this only happens with shingles; any other analyzer that implements PositionLengthAttribute (the attribute responsible for analyzing a graph) could trigger it.
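
To illustrate the idea (a sketch, not the actual CirrusSearch patch; the suggest_search and token_limit names and the max_token_count value are made up), the field keeps its index-time shingle analyzer and gets a separate search analyzer whose chain is capped with a limit token filter:

curl -s -XPUT localhost:9200/oom_repro -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "suggest": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "suggest_shingle" ]
          },
          "suggest_search": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "token_limit", "suggest_shingle" ]
          }
        },
        "filter": {
          "token_limit": {
            "type": "limit",
            "max_token_count": 20
          },
          "suggest_shingle": {
            "type": "shingle",
            "output_unigrams": true,
            "min_shingle_size": "2",
            "max_shingle_size": "3"
          }
        }
      }
    }
  },
  "mappings": {
    "page": {
      "properties": {
        "suggest": {
          "type": "text",
          "analyzer": "suggest",
          "search_analyzer": "suggest_search"
        }
      }
    }
  }
}' | jq .

Since only the search-time chain is capped, the indexed shingles stay exactly as they are.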

Change 341795 had a related patch set uploaded (by DCausse):
[mediawiki/extensions/CirrusSearch] Workaround OOM issue on ngrams field

https://gerrit.wikimedia.org/r/341795

The workaround requires a change in the analysis settings. I think it's possible to apply it without a reindex, but we need to close the index. This may be doable during the upgrade phase, just after we depool codfw. I'll write a bash script to confirm the feasibility.
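
Roughly, against the oom_repro index from the reproduction above (a sketch; the real indices and the exact analyzer definition come from the CirrusSearch config, and token_limit/suggest_search are the illustrative names from the previous comment):

# analysis settings can only be changed while the index is closed
curl -s -XPOST localhost:9200/oom_repro/_close | jq .
curl -s -XPUT localhost:9200/oom_repro/_settings -d '{
  "analysis": {
    "analyzer": {
      "suggest_search": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [ "token_limit", "suggest_shingle" ]
      }
    },
    "filter": {
      "token_limit": {
        "type": "limit",
        "max_token_count": 20
      }
    }
  }
}' | jq .
curl -s -XPOST localhost:9200/oom_repro/_open | jq .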

Change 341795 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch] Workaround OOM issue on ngrams field

https://gerrit.wikimedia.org/r/341795