
Reduce the number of fields declared in elasticsearch by logstash
Open, LowPublic

Description

P6289 is the elasticsearch mapping of the current logstash index. We are hitting the 1000 fields per index limit. Just raising that limit is not a good idea, we should sanitize what we send to elasticsearch.

A quick look at that mapping indicates that some log producers are throwing mostly random objects at logstash (we can see field names like Q27938#X, which don't seem to make any sense in logstash).

A rough count shows that the type with the most fields is mediawiki, followed closely by restbase. In the mediawiki type, a large number of fields seem to be generated by eventbus, including some fairly complex objects. (This ranking is more a guess than a precise count.)

I see a number of fields starting with err_ and ending with a number. These look like they should at least be consolidated into a single field.

More analysis is required to actually get anywhere...
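
To illustrate the err_ pattern, here is a hypothetical event (invented for illustration, not copied from production) of the kind that inflates the mapping: under dynamic mapping, every distinct key becomes a new field in the index.

```json
{
  "channel": "mediawiki",
  "message": "request failed",
  "err_1": "timeout while contacting the backend",
  "err_2": "retry failed",
  "err_3": "giving up"
}
```

err_1, err_2, and err_3 each get their own mapping entry (plus the analyzed/not-analyzed pair created by the current template), whereas a single err field holding an array of messages would cost one entry regardless of how many errors are attached.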

Event Timeline

Gehel created this task.Nov 8 2017, 6:00 PM
Restricted Application added a subscriber: Aklapper.Nov 8 2017, 6:00 PM
dcausse added a comment.EditedNov 9 2017, 10:53 AM

I think we should introduce a pattern where log emitters can freely send large and complex objects that would only be available for debugging purposes on a per-event basis: no need to search/aggregate them.
The current strategy is to index everything, and to allow aggregation we index two elastic fields per JSON field.
Looking at some EventBus logs I see: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.11.09/mediawiki?id=AV-gV9NNSUnOz-leF_9O&_g=h@44136fa

I'm afraid that we waste a lot of fields here:

  • response being a complex PHP object, we end up with everything duplicated under numeric keys (0 == "code", 2 == "headers", ...)
  • it's very likely that most of the fields indexed here will never be searched

This is just an example but this problem is certainly present in other parts of the system.
Would log emitters agree to follow a naming convention where everything nested under "debug" (or another name) would be ignored in the mappings?
This would keep the data readable in the JSON event without polluting the elasticsearch mapping.
When a specific field is found to be interesting, we can then move it to the top level to make it searchable/aggregatable.
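
A minimal sketch of what this could look like in the logstash index template, assuming the convention field is called debug (both the field name and the use of a _default_ mapping are illustrative): an object declared with "enabled": false is kept in _source, so it stays readable in the event JSON, but nothing underneath it is parsed or added to the mapping.

```json
{
  "mappings": {
    "_default_": {
      "properties": {
        "debug": {
          "type": "object",
          "enabled": false
        }
      }
    }
  }
}
```

Fields nested under debug would then never count toward the 1000-field limit, and promoting one of them later only means adding an explicit top-level mapping for it.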

dcausse added a comment.EditedNov 9 2017, 11:28 AM

Typically logstash/elastic is not able to sustain this kind of event: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.11.08/mediawiki?id=AV-ZBn5-gaOKEclNGWio&_g=h@8b5b71a
EventBus.events.params should be marked as debug-only in some way.

In general there are several issues we've observed:

Mapping conflicts between different services/areas/extensions - currently property->type mappings are global, so conflicts can appear between completely unrelated areas of code. This is really hard to prevent, as developers might not even know about the other service whose logs theirs might conflict with. In the past several months we've had several occasions where RESTBase, EventBus, and other logs completely disappeared because their types conflicted with absolutely unrelated pieces of infrastructure. I think enforcing a global property->type mapping across the whole organization just by coding conventions is not practical - mistakes will still be made - so we need a programmatic solution. A couple of ideas, though I'm not sure they're feasible:

  1. Make the mappings per log channel - in that case conflicts can only arise within a service, and service developers have more control over their own logging.
  2. Prepend all property names with the log channel name - effectively the same as the previous idea, but nastier.

The second issue is the number of fields. I completely support @dcausse's idea of having the possibility to dump random objects into logs for debugging purposes without mapping them. Normally these big random objects are logged to debug some rare condition in production, at a point when it's not even clear which fields will be in the object or which fields to look at. Alternatively we could force big nested objects to be JSON-stringified, but that's imperfect because not everything is JSON-stringifiable.
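
As a concrete, made-up illustration of the conflict problem: if these two events from unrelated producers land in the same daily index, whichever arrives first fixes the type of response in the shared mapping, and the other producer's documents are then rejected with a mapping conflict.

```json
{ "type": "restbase",  "channel": "request", "response": { "status": 504, "headers": { "content-type": "text/html" } } }
{ "type": "mediawiki", "channel": "api",     "response": "HTTP 504 Gateway Timeout" }
```

Per-channel mappings (idea 1) would scope response to its own producer; prefixing (idea 2) would turn these into restbase_response and mediawiki_response, which never collide but make queries uglier.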

Pchelolo moved this task from Backlog to watching on the Services board.Nov 9 2017, 3:08 PM
Pchelolo edited projects, added Services (watching); removed Services.

My fear is that the "too many fields" problem is going to be more painful than the mapping conflicts one.
Speaking only about short term solutions:

  • For mapping conflicts we can rename
  • For too many fields I don't yet have a short term solution

Basically I don't know whether we need to worry right now, nor why we have started hitting this issue recently.

Setting up a long term solution will take time; we should find a short term solution to apply when this problem occurs.

For long term solutions I'm more and more against dynamic mappings; I'm afraid they will never work well...

Change 392590 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [logstash] add debug_blob field

https://gerrit.wikimedia.org/r/392590

Change 392591 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [WIP] [logstash] Add a way to move some data to debug_blob

https://gerrit.wikimedia.org/r/392591

Change 392603 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [logstash] log all elastic queries

https://gerrit.wikimedia.org/r/392603

I'm investigating two approaches here:

  1. provide a way inside logstash filters to blacklist some known fields (move them into a debug_blob field that is not indexed)
  2. investigate disabling dynamic mapping, where the first step would be to log all elastic queries to discover which fields we currently use; that will allow us to create a first static mapping.

The first option requires logstash 5.6 because of a bug that prevents manipulating complex fields.
The second option will require additional work to write a parser to extract the list of fields, and as far as I understand it is not something we are enthusiastic about (it requires manipulating the mapping ahead of time for new loggers); there is more material out there supporting the idea that dynamic mappings can lead to a giant mess.
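
For the second approach, a rough sketch of what a first static mapping could look like once the list of fields actually in use has been extracted (Elasticsearch 5.x-style template, field names are examples only): with "dynamic": false, events carrying unknown fields are still accepted and stored in _source, but nothing new is ever added to the mapping.

```json
{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "dynamic": false,
      "properties": {
        "@timestamp": { "type": "date" },
        "host":       { "type": "keyword" },
        "type":       { "type": "keyword" },
        "channel":    { "type": "keyword" },
        "level":      { "type": "keyword" },
        "message":    { "type": "text" }
      }
    }
  }
}
```

This differs from "dynamic": "strict", which would reject such events outright; false simply ignores the extra fields for indexing purposes, which fits better with emitters that occasionally attach large debug payloads.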

debt triaged this task as High priority.
debt lowered the priority of this task from High to Low.Dec 12 2017, 6:17 PM
debt added a subscriber: debt.

We'll table this work for now, until the new person who will take over Logstash is hired.

This is (unsurprisingly) still going on. Over the last 30 days we have ~10k unique fields. A quick parse shows the average day has at least 1,000 fields, and the worst ~1,600. This seems like a problem that needs a better solution than asking developers to pretty-please log within an (unwritten) set of guidelines. As time goes by, a pre-defined list of a hundred or so fields that will be indexed starts to seem reasonable. It would be nice if we could somehow copy the text content of everything else into some unstructured indexed field so that it stays searchable, but it's not the end of the world if we can't.
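
If the pre-defined-whitelist route is taken, one way to keep the rest searchable (purely a sketch, with an invented extra field name) would be to have the ingestion pipeline serialize all non-whitelisted keys into a single mapped text field, so a stored event would end up looking something like this:

```json
{
  "@timestamp": "2019-01-22T14:44:00Z",
  "channel": "EventBus",
  "level": "ERROR",
  "message": "Unable to deliver event",
  "extra": "{\"response\":{\"code\":503,\"headers\":{\"content-type\":\"text/html\"}},\"retries\":3}"
}
```

Only extra (a plain text field) needs to exist in the mapping; its content is full-text searchable but contributes exactly one field regardless of what producers send, at the cost of losing per-field aggregation on anything outside the whitelist.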

herron added a subscriber: herron.Jan 22 2019, 2:44 PM