
Reduce the number of fields declared in elasticsearch by logstash
Open, LowPublic


P6289 is the Elasticsearch mapping of the current logstash index. We are hitting the 1000 fields per index limit. Just raising that limit is not a good idea; we should sanitize what we send to Elasticsearch.
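For reference, the limit being hit is Elasticsearch's index.mapping.total_fields.limit setting, which defaults to 1000 per index. A minimal sketch of the settings body that raising it would involve (the index name and the value 2000 are illustrative, not a recommendation):

```python
import json

# index.mapping.total_fields.limit is the Elasticsearch setting behind the
# "1000 fields per index" limit (default: 1000). Raising it is a stopgap,
# not a fix; the value below is illustrative.
settings_body = {"index.mapping.total_fields.limit": 2000}

# This body would be sent as, e.g.: PUT /logstash-2017.11.09/_settings
print(json.dumps(settings_body))
```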

A quick look at that mapping indicates that some log producers are throwing mostly random objects at logstash (we can see field names like Q27938#X, which don't seem to make any sense in logstash).

A rough count shows that the types with the most fields are mediawiki, followed closely by restbase. In the mediawiki type, there seems to be a large number of fields generated by eventbus, including some slightly complex objects. (The field counts are more a guess than a precise measurement.)

I see a number of fields starting with err_ and ending with a number. These look like they should at least be consolidated into a single field.

More analysis is required to actually get anywhere...

Event Timeline

I think we should introduce a pattern where log emitters can freely send large and complex objects that would only be available for debugging purposes on a per-event basis: no need to search/aggregate them.
The current strategy is to index everything, and to allow aggregation we index two Elasticsearch fields per JSON field.
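The two-fields-per-JSON-field behaviour comes from a dynamic template of roughly this shape (sketched here as a Python dict mirroring the mapping JSON; the template and sub-field names are illustrative, patterned on Logstash's default index template):

```python
# Sketch of a dynamic template that creates two Elasticsearch fields per
# JSON string field: an analyzed "text" field for full-text search, plus
# a "raw" keyword sub-field for aggregation. Names are illustrative.
dynamic_template = {
    "strings_as_text_and_keyword": {
        "match_mapping_type": "string",
        "mapping": {
            "type": "text",
            "fields": {
                "raw": {"type": "keyword", "ignore_above": 256},
            },
        },
    }
}
```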
Looking at some EventBus logs I see: */logstash-2017.11.09/mediawiki?id=AV-gV9NNSUnOz-leF_9O&_g=h@44136fa

I'm afraid that we waste a lot of fields here:

  • response being a complex PHP object, we end up having everything duplicated with numeric keys (0 == "code", 2 == "headers", ...)
  • it's very likely that most of the fields indexed here will never be searched

This is just an example but this problem is certainly present in other parts of the system.
Would log emitters agree to follow some naming convention where everything nested under "debug" (or another name) would be ignored in the mappings?
This would keep the data readable in the JSON event without polluting the Elasticsearch mapping.
When a specific field turns out to be interesting, we can then move it to the top level to make it searchable/aggregatable.
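For what it's worth, Elasticsearch supports this pattern directly: an object field mapped with "enabled": false is kept in _source (so it stays readable in the event) but nothing nested under it is ever parsed or mapped. A sketch, assuming the proposed field is called debug:

```python
# Sketch of the "debug" convention as an Elasticsearch mapping: the
# "enabled": false object keeps its contents in _source but creates no
# mappings for anything nested under it. "debug" is the hypothetical
# field name from the proposal above.
mapping = {
    "properties": {
        "message": {"type": "text"},
        "debug": {"type": "object", "enabled": False},
    }
}
```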

Typically logstash/elastic is not able to sustain this kind of event: */logstash-2017.11.08/mediawiki?id=AV-ZBn5-gaOKEclNGWio&_g=h@8b5b71a should be marked as debug-only in some way.

In general there are several issues we've observed:

Mapping conflicts between different services/areas/extensions - currently property->type mappings are global, so conflicts can appear between completely unrelated areas of code. This is really hard to prevent, as developers might not even know about the other service whose logs might conflict with theirs. In the past several months we've had several occasions when RESTBase, Event-Platform, and other logs completely disappeared because their types conflicted with absolutely unrelated pieces of infrastructure. I think enforcing a global property->type mapping across the whole organization just by coding conventions is not practical - mistakes will still be made. So we need a programmatic solution. A couple of ideas, though I'm not sure they're feasible:

  1. Make the mappings per log channel - in that case conflicts can only arise within a service, and service developers have more control over their own logging.
  2. Prepend all property names with the log channel name - effectively the same as the previous option, but nastier.

The second issue is the number of fields. I completely support @dcausse's idea of having a way to dump random objects into logs for debugging purposes without mapping them. Normally these big random objects are logged to debug some rare condition in production, at a point when it's not even clear which fields will be in the object or which fields to look at. Alternatively, we could force JSON-stringifying big nested objects, but that's imperfect because not everything is JSON-stringifiable.
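The JSON-stringify fallback could be sketched like this (a standalone Python sketch, not actual Logstash filter code; json.dumps's default=str hook papers over values that aren't natively serializable):

```python
import json

# Sketch of the "force JSON-stringify" fallback: collapse an arbitrary
# nested object into a single string field. default=str is used as a
# last resort so non-serializable values don't crash the pipeline.
def stringify_blob(obj):
    try:
        return json.dumps(obj)
    except (TypeError, ValueError):
        return json.dumps(obj, default=str)

# Hypothetical event with a value that json.dumps cannot serialize natively.
event = {"response": {"code": 200, "headers": {"x": object()}}}
event["response"] = stringify_blob(event["response"])  # now a single string field
```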

My fear is that the "too many fields" problem is going to be more painful than the mapping conflicts one.
Speaking only about short term solutions:

  • For mapping conflicts we can rename the conflicting fields
  • For too many fields I don't have yet a short term solution

Basically, I don't know if we need to worry right now, nor why we've started hitting this issue only recently.

Setting up a long-term solution will take time, so we should find a short-term solution to apply when this problem occurs.

For long-term solutions I'm more and more against dynamic mappings; I'm afraid they'll never work well...

Change 392590 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [logstash] add debug_blob field

Change 392591 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [WIP] [logstash] Add a way to move some data to debug_blob

Change 392603 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [logstash] log all elastic queries

I'm investigating two approaches here:

  1. provide a way inside logstash filters to blacklist some known fields (move them into a debug_blob field that is not indexed)
  2. investigate disabling dynamic mapping, where the first step would be to log all elastic queries to discover which fields we currently use. That would allow us to create the first static mapping.
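The logic behind option 1 could be sketched as follows (a standalone Python sketch of what the Logstash filter would do; the blacklist contents and field names are illustrative):

```python
# Sketch of option 1: move a blacklist of known-noisy fields under a
# single debug_blob field, which the index template would leave unmapped.
# The field names here are illustrative, not an actual blacklist.
DEBUG_FIELDS = ["response", "request", "stack"]

def move_to_debug_blob(event):
    blob = {}
    for name in DEBUG_FIELDS:
        if name in event:
            blob[name] = event.pop(name)
    if blob:
        event["debug_blob"] = blob
    return event
```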

The first option requires logstash 5.6 because of a bug that prevents manipulating complex fields.
The second option will require additional work to write a parser to extract the list of fields, and as far as I understood it is not something we are enthusiastic about (it requires manipulating the mapping ahead of time for new loggers). More resources supporting the idea that dynamic mappings can lead to a giant mess:

debt triaged this task as High priority.Nov 28 2017, 6:03 PM
debt moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.
debt lowered the priority of this task from High to Low.Dec 12 2017, 6:17 PM
debt subscribed.

We'll table this work for now - until the new person who will take over Logstash is hired.

This is (unsurprisingly) still going on. Over the last 30 days we have ~10k unique fields. A quick parse shows the average day has at least 1000 fields, and the worst ~1600. This seems like a problem that needs a better solution than asking developers to pretty-please log within this (unwritten) set of guidelines. As time goes by, a pre-defined list of a hundred or so fields that will be indexed starts to seem reasonable. It would be nice if we could somehow copy the text content of everything else into some unstructured indexed field so that it's still searchable, but it wouldn't be the end of the world if we can't.
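One possible shape for that catch-all idea, sketched as an Elasticsearch dynamic template: map stray string fields as non-indexed fields while copy_to-ing their values into a single searchable text field. The names below (strings_to_catch_all, all_text) are hypothetical, not an existing convention:

```python
# Sketch of a dynamic template that keeps stray string fields out of the
# searchable mapping (index: False) but copies their text into one
# catch-all field ("all_text") so it remains searchable. All names here
# are illustrative.
dynamic_templates = [
    {
        "strings_to_catch_all": {
            "match_mapping_type": "string",
            "mapping": {
                "type": "keyword",
                "index": False,
                "copy_to": "all_text",
            },
        }
    }
]
catch_all_field = {"all_text": {"type": "text"}}
```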

An additional 2¢

When we implemented logging for Kask we experimented with using a schema for log messages. Using arbitrary attributes isn't possible without modifying the LogMessage struct (which I hope would at least prompt a conversation).

We had this particular problem in mind when doing so, but I think the idea stands on its own regardless. Where we're logging similar things across systems, they should be consistently named, if not to simplify searches, then so people don't have to guess at all the attributes used to mean the same thing.

@Eevans I think this is the way forward; as for schemas, Elastic is promoting what they call ECS (Elastic Common Schema). This might be a good source of inspiration if we want to follow this route.

We've been working with service owners to fix the obvious offenders in terms of "fields spam" and bumped the fields limit to 2048. We're also alerting on indexing failures when Logstash gets errors from Elasticsearch. ATM only kartotherian bumps into the limit, although that doesn't necessarily mean kartotherian is the "fields spammer" in this case. I'll be following up with a patch to further bump the limit to 4096, which should be plenty to fully ingest all logs we're producing now.

Change 554303 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: bump max fields to 4096

Change 554303 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: bump max fields to 4096

Change 392591 abandoned by DCausse:
[WIP] [logstash] Add a way to move some data to debug_blob

Change 392590 abandoned by DCausse:
[logstash] add debug_blob field

Change 392603 abandoned by DCausse:
[logstash] log all elastic queries

Mentioned in SAL (#wikimedia-operations) [2023-03-30T09:04:46Z] <godog> silence LogstashIndexingFailures during investigation T180051

Change 904264 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: collapse eventgate response_body field

Change 904264 merged by Cwhite:
[operations/puppet@production] logstash: remove eventgate response_body field

Change 904265 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: restore logstash index patch level

Change 904265 merged by Cwhite:
[operations/puppet@production] logstash: restore logstash index patch level

Change 942799 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash remove wikifunctions response field

Change 942799 merged by Cwhite:
[operations/puppet@production] logstash remove wikifunctions response field

Change 944194 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] Revert "logstash remove wikifunctions response field"

Change 944194 merged by Cwhite:
[operations/puppet@production] Revert "logstash remove wikifunctions response field"