
ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726"
Closed, ResolvedPublicBUG REPORT

Description

Steps to Reproduce:

Load Home dashboard from https://logstash-next.wikimedia.org

Actual Results:

Multiple notifications appear with error:

2 of 105 shards failed

The data you are seeing might be incomplete or wrong.

Looking at response details:

"caused_by": {
  "type": "illegal_argument_exception",
  "reason": "field expansion matches too many fields, limit: 1024, got: 1726"
}

Expected Results:

Dashboard loads without errors.

Event Timeline

Change 576967 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] elasticsearch: add max_clause_count setting

https://gerrit.wikimedia.org/r/576967

In testing I was able to work around this by increasing indices.query.bool.max_clause_count to a value greater than the number of fields matched. This comes with some tradeoffs wrt resource utilization, but it does appear to resolve the issue.
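For reference, a sketch of what that workaround looks like in elasticsearch.yml; it's a static node setting, so it needs a restart to take effect, and the value 2048 here is illustrative (it just needs to exceed the 1726 matched fields):

```yaml
# elasticsearch.yml — static node setting, requires a restart
# 2048 is an illustrative value chosen to exceed the 1726 matched fields
indices.query.bool.max_clause_count: 2048
```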

Created above patch for that approach at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/576967/

And of course open to alternate solutions

Additionally, before moving on to max_clause_count, I had experimented with settings like "default_field": "*" and "default_field": "message" in the Kibana query:queryString:options config, but the errors persisted.

It looks like the issue has been run into before in the Beats family of software. There is a template setting that allows us to define an array of fields that are default query fields:

{
  "settings": {
    "index": {
      "query": {
        "default_field": [
          "message",
          "host",
          "..."
        ]
      }
    }
  }
}

The downside of this is that it needs to be set on index creation. I haven't tested whether or not it can be added to an existing index.
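For what it's worth, index.query.default_field is documented as a dynamic index setting, so it should be possible to apply it to an existing index via the settings API, though that's untested here. A hypothetical example (the index name and field list below are illustrative):

```sh
# Hypothetical, untested example; index name and field list are illustrative
curl -XPUT 'localhost:9200/logstash-2020.02.22/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"query": {"default_field": ["message", "host"]}}}'
```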

Thank you @Gehel @EBernhardson @dcausse @colewhite for looking at this!

To try and re-state an alternate approach that has emerged from the comments in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/576967/ and this task, we should be able to solve this by using copy_to in the logstash template to build more or less an in-house version of the (deprecated in 6 and removed in 7) _all field, and specify this as the default_field. Is that accurate?

I'll get to experimenting along these lines in the ELK7 cluster

I have concerns about re-implementing the _all field given that it is no longer "free." If we use copy_to, each log will take roughly twice the disk space, with indexing cost growing in kind. With stack traces and request/response logs including response bodies, I can see this adding up quickly (unless we omit those from the new _all field).

I'm curious to see whether just populating default_field with a set of fields will save us the disk and index space while still letting us query a reasonable subset of fields. If it doesn't work, then we can consider eating the extra resource cost.

The _all field wasn't free before; there was simply an implicit copy_to that existed everywhere. There will be a disk space cost, probably not double but still significant (there are already at least 3 copies of all the data: in the _source stored field, and then indexed both as text and keyword, though the term dictionary is shared by all fields in the same segment). With the all field existing to hold term positions (supporting queries with quoted strings), positions could be turned off on the remaining auto-created fields. We could drop term frequencies at the same time; they are useless in cases that never sort by relevance (kibana). On a random logstash-next index I checked, positions and frequencies account for 20% of the data on disk.
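As a sketch of what that combination could look like in a mapping (the message field name is illustrative; each auto-created text field would get the same treatment):

```json
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "index_options": "docs",
        "copy_to": "all"
      },
      "all": {
        "type": "text",
        "norms": false
      }
    }
  }
}
```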

I hope you don't mind: I started up a test on logstash-next with logstash-2020.02.20 recreated as ebernhardson-size-test, where I think I've adjusted all text fields to have index_options: docs and copy_to: all, along with the all field defined as only a text field with standard indexing. A reindex is running to copy all the docs over, so we can see how the sizes differ with a little less guessing on my part. It hasn't finished yet, but with 2M docs indexed it looks like the change might only be from ~800 bytes/doc to ~950 bytes/doc.

Separately, I noticed while testing this that while the production logstash runs _forcemerge?max_num_segments=1 on indices once they are no longer being written to, the logstash-next indices haven't been similarly optimized. It might save a little space (it will de-dupe the term dictionaries), but the main win is that at query time only 1 segment has to be queried instead of 30. If these indices were populated by logstash it might be worth looking into, but if they were created by a manual copy/reindex procedure it was probably simply missed.
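For reference, the optimization mentioned is a single API call per index once it is no longer being written to (index name illustrative):

```sh
# Force-merge a no-longer-written index down to a single segment
curl -XPOST 'localhost:9200/logstash-2020.02.20/_forcemerge?max_num_segments=1'
```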

I hope you don't mind: I started up a test on logstash-next with logstash-2020.02.20 recreated as ebernhardson-size-test, where I think I've adjusted all text fields to have index_options: docs and copy_to: all, along with the all field defined as only a text field with standard indexing. A reindex is running to copy all the docs over, so we can see how the sizes differ with a little less guessing on my part. It hasn't finished yet, but with 2M docs indexed it looks like the change might only be from ~800 bytes/doc to ~950 bytes/doc.

Not at all! That's very much appreciated. I'd be interested to see more about how you executed the test, and also how you might recommend structuring/altering the template to accommodate this. I haven't worked closely yet with copy_to so this is a learning experience for me. So far the ES5 logstash template has been adapted for ES7 here https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/571622/2/modules/profile/files/logstash/elasticsearch-template-7.json

Worth mentioning too that while the patch itself has yet to be merged, it has been manually applied in logstash-next eqiad while work is in progress.

Separately, I noticed while testing this that while the production logstash runs _forcemerge?max_num_segments=1 on indices once they are no longer being written to, the logstash-next indices haven't been similarly optimized. It might save a little space (it will de-dupe the term dictionaries), but the main win is that at query time only 1 segment has to be queried instead of 30. If these indices were populated by logstash it might be worth looking into, but if they were created by a manual copy/reindex procedure it was probably simply missed.

These were populated by logstash, and there is a curator job configured to delete old indices, reduce replicas as indices age, and forcemerge after 2 days. But indeed it looks like the forcemerge part of the job is silently failing, or not executing at all. I don't see anything being logged for the forcemerge job ID by curator at all.
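For comparison while debugging, a minimal curator forcemerge action file looks roughly like this; the filter values here are assumptions for illustration, not our actual production config:

```yaml
# Sketch of a curator action file; filter values are illustrative
actions:
  1:
    action: forcemerge
    description: "forcemerge indices older than 2 days down to 1 segment"
    options:
      max_num_segments: 1
      ignore_empty_list: true
    filters:
      - filtertype: pattern
        kind: prefix
        value: logstash-
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 2
```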

... with 2M docs indexed it looks like the change might only be from ~800 bytes/doc to ~950 bytes/doc.

Wow... that's far less than I expected. Thanks for the measurement, @EBernhardson!

akosiaris triaged this task as Medium priority.Mar 6 2020, 11:31 AM
health status index                  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green open    logstash-2020.02.22    hyEd8KXSRduj_-KaGG-tAw 1 2 8861462 0 19gb 6.3gb
green open    ebernhardson-size-test cUArIWh6RIyJsFqEfneRTg 1 0 8861462 0 7.2gb 7.2gb

So, we went from 769 bytes/doc to 878 bytes/doc, an increase of 14%. This does lose the ability to do phrase searches on fields other than all, but my guess is that ability would be used exceptionally rarely, if ever.

To run this test i basically:

  • Fetch the existing settings + mappings into a text file, using jq to drop the top level map: curl localhost:9200/logstash-2020.02.22 | jq '.["logstash-2020.02.22"]'
  • find/replace to apply mapping changes: %s/"type" : "text"/\0, "index_options": "docs", "copy_to": "all"/g
  • Add the all field to properties. If not done it would get auto-created and double in size due to the keyword sub-field:
"all": {
    "type": "text",
    "norms": false
}
  • In the settings key, remove a variety of things that aren't settable: provided_name, creation_date, uuid, version
  • Send back to elasticsearch: curl -XPUT localhost:9200/ebernhardson-size-test -H 'Content-Type: application/json' -d @myfile.json
  • Run the reindex: curl -XPOST localhost:9200/_reindex -H 'Content-Type: application/json' -d '{"source":{"index": "logstash-2020.02.22"}, "dest":{"index": "ebernhardson-size-test"}}'

Change 571622 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: add ES 7 compatible logstash template

https://gerrit.wikimedia.org/r/571622

@EBernhardson thanks!

I've updated the elasticsearch template in https://gerrit.wikimedia.org/r/571622 to include "index_options": "docs", "copy_to": "all" and properties for the "all" field. I've manually applied the template from this patch to eqiad logstash-next.

Also appended "default_field": "all" to the Kibana query:queryString:options (in logstash-next eqiad)
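For reference, that Kibana advanced setting ends up as a small JSON blob along these lines; the other key shown is illustrative of what might already be there, and only default_field was appended:

```json
{ "analyze_wildcard": true, "default_field": "all" }
```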

Then, after that, kicked off a reindex for logstash-restbase-reindex-2020.03.09 (because it's small) which looked ok, and now reindexes of logstash-2020.03.09 and logstash-mediawiki-2020.03.09 are under way. I also briefly updated the logstash es output index prefix to include v2 to see how this behaves as new logs arrive.

With these changes I'm no longer seeing "field expansion matches too many fields" errors, and from a cursory look saved queries/visualizations/dashboards are becoming more populated as the reindex progresses (and from the "v2" prefixed indices). But many dashboards/visualizations are still incomplete (which is no huge surprise). After the indices roll over tonight we should have a more complete idea of how well this is working.

Change 571622 merged by Herron:
[operations/puppet@production] logstash: add ES 7 compatible logstash template

https://gerrit.wikimedia.org/r/571622

Change 579346 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: curator: set ignore_empty_list true for replica job

https://gerrit.wikimedia.org/r/579346

Change 579346 merged by Herron:
[operations/puppet@production] logstash: curator: set ignore_empty_list true for replica job

https://gerrit.wikimedia.org/r/579346

Change 576967 abandoned by Herron:
elasticsearch: add max_clause_count setting

Reason:
going with I2e690d26e5bd4d9961f261eb049f33ef58ad2588 instead

https://gerrit.wikimedia.org/r/576967

Change 583449 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] ELK7: set "index.query.default_field":"all" in logstash ES template

https://gerrit.wikimedia.org/r/583449

Change 583449 merged by Herron:
[operations/puppet@production] ELK7: set "index.query.default_field":"all" in logstash ES template

https://gerrit.wikimedia.org/r/583449
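i.e., roughly this addition to the settings block of the logstash template (the index_patterns value is illustrative; the actual template carries the full mappings as well):

```json
{
  "index_patterns": ["logstash-*"],
  "settings": {
    "index": {
      "query": {
        "default_field": "all"
      }
    }
  }
}
```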

Not sure if the same issue or not, but I'm seeing storms of "shards failed" errors when using the new Logstash:

Screenshot 2020-03-31 at 19.20.07.png (1×2 px, 547 KB)

Not sure if the same issue or not, but I'm seeing storms of "shards failed" errors when using the new Logstash:

Screenshot 2020-03-31 at 19.20.07.png (1×2 px, 547 KB)

It is indeed related, you should see this go away after Mar 25 (there's a note about that in the motd too, but maybe it could be made clearer). The older indices are in the process of (slowly) being reindexed so that the fix is applied to all logs in ELK7.

herron claimed this task.