Create a separate logstash ElasticSearch index for schemaed events
Closed, Resolved · Public

Description

In https://phabricator.wikimedia.org/T248987#6545103, @Krinkle noted that using nested objects in JSON data that gets imported into logstash isn't great. Best practice for logstash is to use flat data structures. The reasoning behind this is to avoid type conflicts e.g. where one datum might have a field as a string, and another a field with the same name as a nested object. If everything is flat, type conflicts are rarer.

But, logstash does work fine with nested types, as long as there are no name conflicts in the same ElasticSearch index. Since events are strictly schemaed, it is sometimes difficult to add new concrete fields to schemas when all that is desired is to capture some arbitrary context data. In Event Platform schemas, we use 'map types' for this. In JSON, map types look just like JSON objects, but in JSONSchema, we can differentiate between a regular nested object that could have any values with any type, and an object that has all values with a specific type.
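
For illustration, here's a hypothetical JSONSchema fragment (field names invented, not from a real production schema) showing the difference between a map type and a regular nested object:

```
{
  "properties": {
    "custom_data": {
      "description": "A map type: arbitrary keys, but every value must be a string.",
      "type": "object",
      "additionalProperties": { "type": "string" }
    },
    "some_nested_field": {
      "description": "A regular nested object: a fixed set of named, typed fields.",
      "type": "object",
      "properties": {
        "name":  { "type": "string" },
        "count": { "type": "integer" }
      }
    }
  }
}
```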

mediawiki/client/error events are ingested into logstash. Currently, these events have a tags field, which IIUC conflicts with tags as added by logstash itself. We're ok with renaming this field to avoid the conflict, but we'd like to continue using nested map types in this data (so we don't have to add new context-specific top-level fields to a generic event schema).

If event data that is ingested into logstash had its own index, we could more safely reason about the types of field names and avoid type conflicts.

Can we add a new ElasticSearch index for schemaed event data?

Event Timeline

Ottomata updated the task description. (Show Details)

Hey @Ottomata!

We've been working on a consolidated logging schema that might prove to be very helpful for this particular task. We'd love to talk to you about it; what is the best way to do this? We can set up a meeting or just share details in this phab task. Thanks!

Hiya! Interesting indeed. Maybe share some details and we can discuss? Happy to have a meeting too.

@herron For context, this relates to what we did with mediawiki exception/error messages, which have their own index and thus reduce the impact of Logstash blowing up when there are too many unique keys, conflicting key names, or value types (e.g. foo in one message, and foo.bar in another).

I recommended they ask you whether anything we ingest into Logstash from EventGate could also be given an overall index group, so that at worst it can only clash with other things coming from the same system, where all messages already follow a common schema, one that is known and reviewed ahead of time with this kind of concern in mind.

As I recall, moving mw logs to a separate index was done primarily to provide us with additional headroom, because we were exceeding the max fields-per-index limit in the logstash-yyyy.mm.dd daily indices.

Regarding conflicting types using the same field name, Kibana also tracks field types by the index prefix (in our case logstash-*). So while splitting off to a separate index could work around this on the ES side, we could still see type collision issues within Kibana.

Overall I tend to agree that adding an index for this could be helpful to keep field count under control (do you have a sense of how many additional fields we're looking at?) and to limit the impact if name/type collisions do occur. It wouldn't by itself be fully effective in avoiding type conflicts on existing field names, but should help ensure they are at least indexed.

With regard to schema, here is some high level documentation about the elastic common schema that we're moving towards https://www.elastic.co/guide/en/ecs/current/ecs-reference.html. Ultimately we want to converge on a standard set of field names and types.
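
For a rough sense of what that convergence looks like, here's a hypothetical log event using a handful of common ECS field names (values invented for illustration):

```
{
  "@timestamp": "2020-11-23T18:30:00.000Z",
  "ecs": { "version": "1.7.0" },
  "log": { "level": "error" },
  "message": "Something went wrong",
  "service": { "type": "mediawiki" },
  "url": { "full": "https://en.wikipedia.org/wiki/Main_Page" },
  "labels": { "wiki": "enwiki" }
}
```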

Happy to discuss further! FWIW we do have an "o11y office hours" open slot on Mondays at 11:30 Eastern, if that would work schedule-wise.

I'll try to make this on the 9th.

Not sure I can capture the whole discussion but I'll try:

  1. For Mediawiki client error logging, it would be nice to have those events presented alongside Mediawiki application logs. This suggests they should be in the same ES index, which means that we should try to make the MW events conform to Elastic Common Schema.
  2. For NEL, we have no control over the original events themselves. It's also somewhere between 'fine' and 'desirable' to have them in a separate ES index. So let's do that, using jsonschema_to_template to create the index template (which should also address T266906 and a few similar issues with using NEL data at present); see the sketch below.
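
For reference, a minimal sketch of what such a generated index template could look like: an explicit mapping derived from the event's JSONSchema, so field types are pinned up front rather than guessed from the first document of the day. The index pattern and field names here are illustrative assumptions, not the actual jsonschema_to_template output.

```
{
  "index_patterns": ["w3creportingapi-*"],
  "mappings": {
    "dynamic": false,
    "properties": {
      "meta": {
        "properties": {
          "dt":     { "type": "date" },
          "stream": { "type": "keyword" }
        }
      },
      "type": { "type": "keyword" },
      "url":  { "type": "keyword" },
      "age":  { "type": "long" }
    }
  }
}
```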

In the long run we don't want to have too many cases that look like #2, as maintaining new ES indexes incurs some toil cost. But a small handful of these (@colewhite suggested "10 indexes in the next 5 years") is acceptable. The Observability team should review new index proposals.

Also, all of the above is blocked on the ELK 7 migration, which is itself blocked on a new release of Kibana to address some performance issues...

Quick question: when the time comes, will it be possible to dump all the old NEL events out of the existing index and import them into the new index?

It's not a super intuitive process, but it is possible.

Change 657452 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: send w3creportingapi logs to indexes with custom schema

https://gerrit.wikimedia.org/r/657452

In a meeting with devs doing client error logging today, we realized that conforming to ECS before migrating to an ECS-only index is not possible; e.g. url as an object would conflict with existing url fields in the default logstash index. For now, we will move forward with changes to the mediawiki/client/error schema that don't conflict with the default index, and follow up with ECS migration/integration at a later date.

@jlinehan @Jdlrobson @Tgr @Krinkle @phuedx comment if that summary needs correction or clarification.

If I understand this correctly, client errors will be logged to the current logstash indexes and no longer need a schema.

If that's the case, we'll mark this task as complete when the w3creportingapi stream is migrated to its new indexes.

Change 657452 merged by Cwhite:
[operations/puppet@production] profile: send w3creportingapi logs to indexes with custom schema

https://gerrit.wikimedia.org/r/657452

colewhite claimed this task.

w3creportingapi logs are now indexed with their custom schema.

Krinkle reopened this task as Open. (Edited Feb 8 2021, 7:15 PM)

[…] For Mediawiki client error logging, it would be nice to have those events presented alongside Mediawiki application logs. This suggests they should be in the same ES index, […]

What does "presented alongside" mean? I don't expect JS and PHP errors to ever be displayed on the same Kibana dashboad as they have inherently store very different data and fields that cannot usefully be joined in my opinion. As far as I know is asking for or expecting this to be possible, and likely isn't something we'd attempt to do even if it were possible.

Client error logs do not come from the "mediawiki" application.

The problem this task was filed for is that new pipelines that ingest EventGate data into Logstash are causing conflicts with the default/mediawiki indexes in Logstash. The suggestion was for all EventGate-related data to be in its own Logstash index, which would make it easier for developers to avoid conflicts, since they would only have to avoid conflicts with other EventGate schemas, which are all maintained in the same repository with at least one common team supporting and overseeing those developments.

MediaWiki log messages use free-form key-value pairs coming from 1000+ different repositories currently. So long as this remains the case, client/error can and will conflict again in the near future, especially if we insist on using top-level keys with generic names that contain objects rather than strings.

I don't have an opinion on how to solve this. It seems to me like gradual adoption of ECS would be simpler in a separate index than to try and make things work in the hostile default/mediawiki index. Either a future generic index for all ECS stuff, or for the subset of it for all EventGate stuff. Just thinking out loud.

If the issue is dismissed, or considered acceptable to just keep dealing with week after week, we could decline the task.

I don't expect JS and PHP errors to ever be displayed on the same Kibana dashboard, as they inherently store very different data and fields that cannot usefully be joined, in my opinion.

They can be usefully joined over reqId (I'm not sure we are logging that for client errors now but it would make sense), to help debug weirder JS errors which are related to some unusual data export caused by PHP-side problems. It's a somewhat fringe use case though.

In any case I don't see how ECS would prevent conflicts. It's basically a type system, with types being arrays with predefined keys. We'd still have to make sure top-level fields always have the same type, which is not that different from the problem we have now. A more low-tech solution of simply having a registry of field names/types on some wiki page would work better, IMO.

Aye, yeah, at a global level when debugging it would be good to be able to explore/discover things in a way that allows you to query for anything on host:mw1234 regardless of which service or layer the data came from, for a certain time range. And similarly by request ID across different services like Varnish, Nginx, MediaWiki PHP, and JS client errors.

I don't imagine useful aggregations will be built across that in a dashboard as there'd be very few common fields apart from timestamp, message and whichever field we filtered by. But that's okay I guess. It's just to get a timeline with some raw messages to debug a specific issue.

As far as I know, it is actually very possible in Kibana to query across different indexes. For example, mediawiki error messages have their own index (logstash-deploy-yyyy.mm.dd), and other mediawiki core messages (which often use arbitrary key-value pairs that easily conflict with other services) have another dedicated index (logstash-mediawiki-yyyy.mm.dd). This was established in T234564.

Yet, from what I can tell, one can query across all of logstash-deploy-* (errors), logstash-mediawiki-* (misc), and other things in the default index like nginx/apache/ntp/scap/syslog on app servers. Kibana allows this both on dashboards and in the Discover feature.
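
(Under the hood this is just a multi-index search over a comma-separated list of index patterns, e.g. a _search against logstash-deploy-*,logstash-mediawiki-*,logstash-* with a body along these lines; the field names here are assumptions rather than checked against our actual mappings.)

```
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "host": "mw1234" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```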

In any case I don't see how ECS would prevent conflicts. It's basically a type system, […]

I chatted about this earlier today with @Ottomata. Indeed, it still leaves a lot of ambiguity over how to name top-level fields.

What does "presented alongside" mean?

The ability to issue a single query and have the results merged and presented in a single view.

For example, were we to place client-side errors in a separate index pattern (think eventplatform-*) and wanted to view both mediawiki and client-side errors in the "Discover App", one would need to open two tabs: one showing only the mediawiki logs and another showing only the client-side errors. Merging would be left to the user, switching between tabs and comparing timestamps to identify the flow.

As far as I know, it is actually very possible in Kibana to query across different indexes.

Indeed, it is possible for indexes of the same index pattern and mapping.

Analysis is impossible for fields with type conflicts across indexes in the same index pattern. We now have 18 fields in the logstash-* index pattern with this issue.

Every field that reaches ES is fully text-analyzed and indexed like language content, and also indexed as a keyword. This cuts us off from richer query expressions and contributes to heavy indexes. Lighter indexes mean more free memory for cache.
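
For reference, this is roughly what Elasticsearch's default dynamic mapping produces for a string field it has never seen before (a sketch; our actual template settings may differ): the field is analyzed as text and also indexed as a keyword sub-field.

```
{
  "some_new_field": {
    "type": "text",
    "fields": {
      "keyword": { "type": "keyword", "ignore_above": 256 }
    }
  }
}
```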

Each partition we add (think logstash-deploy-*) makes for 90 more open indexes. Each open index requires memory even if it is not being used. Fewer open indexes mean more free memory for cache.

We are regularly feeling the pain of an undefined mapping. There is no guidance defining what is correct input for a given field and each type conflict leads to data loss. The type of any given field is determined by the first document ingested with that field defined at 00:00 UTC each day. If two producers leverage the same field and are in conflict, only one can "win" and for that day the other log producer's logs are lost. Identifying type conflicts is a manual process. When type conflicts occur, alarms go off and initiate a response from SRE. Getting help to resolve type conflicts in a timely manner has been hit-or-miss.
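
As a hypothetical illustration of such a conflict: if the first document after 00:00 UTC maps a field as an object (first example below), a later document that sends a plain string under the same name (second example) is rejected by the mapper, and that producer's logs are lost for the rest of the day.

```
{ "error": { "code": 42, "message": "boom" } }

{ "error": "boom" }
```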

A more low-tech solution of simply having a registry of field names/types on some wiki page would work better, IMO.

We have 9404 fields (today) in the logstash-* index pattern with 18 currently in conflict. Defining and documenting each field would be a long and painful process and ultimately not free us from the duty of migrating log producers to adopt the new standard. With a pre-defined schema, we trade the process of identifying, documenting, and gathering consensus for each field for a more involved migration process. At the same time ECS gives us a workflow for schema changes and reactive documentation for free.

Indeed, it still leaves a lot of ambiguity over how to name top-level fields.

ECS is a permissive schema and is not intended to prevent type conflicts. Its purpose is to define and document fields, and to generate the schema installed in ES. If a new field is needed, then it should have a type definition and documentation outside of the log producer's codebase, so that other users can know what any field is and its appropriate type. The one exception is the "labels" top-level field, which is a permissive Keyword:Keyword object store.
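
For example (values invented), arbitrary context would go under labels as a flat map of keyword values rather than as new top-level fields:

```
{
  "message": "Uncaught TypeError: foo is undefined",
  "labels": {
    "wiki": "enwiki",
    "skin": "vector"
  }
}
```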

It seems to me like gradual adoption of ECS would be simpler in a separate index than to try and make things work in the hostile default/mediawiki index. Either a future generic index for all ECS stuff, or for the subset of it for all EventGate stuff. Just thinking out loud.

ECS-formatted log entries use the ecs-* index pattern and as such will not conflict with logstash-* field definitions because ECS logs are redirected to the new index pattern. We recommend all new log producers adopt ECS.

Please feel free to reach out if more information is needed.