Client-side error logging should use Elastic Common Schema (ECS) fields when possible
Open, Medium, Public

Description

We met with @Ottomata, who brought the work on client-side error logging to our attention. As suggested in that meeting, this task is filed as notice of upcoming changes to logstash that may affect that work.

As part of T234565, logstash will adopt Elastic Common Schema (ECS) as the schema for log events. Client error logging should use ECS-defined fields where possible, to reduce the chance of dropped fields and to ease the migration of the stream to the new schema once it is ratified.

Unfortunately, because of mapping conflicts with the current mapping configuration, ECS cannot be fully adopted until the legacy logstash cluster is decommissioned. Looking at jsonschema/mediawiki/client/error/1.2.0.yaml, only the url field appears to be affected.

Event Timeline

I think the http field might also be affected, and that one will be a bit trickier to reconcile.

Our Event Schema:
https://schema.wikimedia.org/repositories//primary/jsonschema/mediawiki/client/error/1.1.0

ECS http:
https://doc.wikimedia.org/ecs/#ecs-http

Mholloway renamed this task from "Client-side error logging should use ECS fields when possible" to "Client-side error logging should use Elastic Common Schema (ECS) fields when possible". Nov 16 2020, 4:27 PM
fdans triaged this task as Medium priority. Nov 16 2020, 4:35 PM
fdans moved this task from Incoming to Event Platform on the Analytics board.

Just talked with @colewhite in IRC.

We'll either need to

A. set up a logstash filter to transform our http object into the ECS http object and run that forever
or
B. Alter our http object common schema to match ECS's.

A. is easy to do now, but requires maintenance and special casing.

B. is hard to do, and requires a lot of coordination. But we could do it slowly, one schema at a time, and start with the ones we want to import into logstash. We'd make a fragment/http/2.0.0, or maybe a fragment/ecs/http/1.0.0, and then include it in mediawiki/client/error. To do this we'd need to make eventgate-wikimedia aware of this new convention and set the fields appropriately. Ugh, and if we hoped to eventually migrate ALL existing schemas to ECS's http, the Hive tables would have both http subschema fields (e.g. http.request_headers and http.request.headers), probably forever (unless we manually intervened).
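To make option A concrete, here is a minimal sketch of the kind of transform such a filter would perform. It is written in Python for illustration, not in actual Logstash filter syntax, and the function name to_ecs_http is hypothetical. It only handles the one field pair named in this thread (http.request_headers to http.request.headers); any other field mappings would have to be worked out against the ECS http definition.

```python
import copy


def to_ecs_http(event: dict) -> dict:
    """Return a copy of `event` with http.request_headers moved to
    http.request.headers (hypothetical helper, illustration only)."""
    out = copy.deepcopy(event)  # never mutate the caller's event
    http = out.get("http")
    if not isinstance(http, dict):
        return out  # no http object, nothing to transform

    if "request_headers" in http:
        legacy_headers = http.pop("request_headers")
        nested = http.setdefault("request", {}).setdefault("headers", {})
        # Assumption: if a header already exists in the nested layout,
        # keep that value and only fill in the missing ones.
        for name, value in legacy_headers.items():
            nested.setdefault(name, value)
    return out
```

The special casing mentioned for option A shows up here as the per-field handling: every field whose layout differs between the two schemas needs its own branch, and that code has to be kept in step with both schemas for as long as the filter runs.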

@jlinehan @Mholloway, thoughts?

I'm not sure what is best.

B. is hard to do, and requires a lot of coordination. But we could do it slowly, one schema at a time, and start with the ones we want to import into logstash. We'd make a fragment/http/2.0.0, or maybe a fragment/ecs/http/1.0.0, and then include it in mediawiki/client/error. To do this we'd need to make eventgate-wikimedia aware of this new convention and set the fields appropriately. Ugh, and if we hoped to eventually migrate ALL existing schemas to ECS's http, the Hive tables would have both http subschema fields (e.g. http.request_headers and http.request.headers), probably forever (unless we manually intervened).

What if we create an ECS-specific schema that has everything laid out exactly the way ECS wants it? From what I can tell, ECS is a one-schema-to-rule-them-all approach, so in *theory* one ECS schema would cover everything. We could then just have a client_error stream that uses the ECS schema.

Are we planning to have a level of compatibility between events going into Logstash and into other back-ends?

Interesting idea! However, there are some Event Platform specifics that we'd need to handle, mainly meta.stream, meta.dt, $schema, http.client_ip (not in this schema) and http.request_headers['user-agent']. These are all touched by EventGate and/or the Hive ingestion pipeline.

We can't do much about $schema and meta.* fields, but we could potentially refactor all schemas to conform to ECS for http.* and also any other future conventions we might need to adopt.

Refactoring http.* would be a lot of work, but not too bad. We'd probably have to have EventGate and Hive ingestion support both formats for a very long time, and fill in e.g. both request_headers['user-agent'] and request.headers['user-agent'] if they exist. We already do something similar to handle the differences in legacy EventLogging schemas, so I guess we can just keep tacking on more conditional logic. :/
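A rough sketch of what that extra conditional logic could look like, in Python for illustration only (this is not EventGate or Hive ingestion code, and the function name fill_both_header_layouts is hypothetical). It assumes "fill in both" means mirroring whichever header layout is present into the other one, and that the nested ECS-style value wins if the two disagree.

```python
def fill_both_header_layouts(http: dict) -> dict:
    """Return a copy of an http object with request headers present in both
    the legacy layout (request_headers) and the nested layout
    (request.headers). Hypothetical helper, illustration only."""
    legacy = dict(http.get("request_headers") or {})
    request = http.get("request") or {}
    nested = dict(request.get("headers") or {})

    # Assumption: on conflict, the nested (ECS-style) value takes precedence.
    merged = {**legacy, **nested}
    if not merged:
        return dict(http)  # no headers in either layout, leave as is

    out = dict(http)
    out["request_headers"] = dict(merged)
    out["request"] = {**request, "headers": dict(merged)}
    return out


if __name__ == "__main__":
    print(fill_both_header_layouts({"request_headers": {"user-agent": "UA"}}))
```

Whatever the precedence rule ends up being, the point of the sketch is that supporting both formats means every consumer can read either path, at the cost of carrying duplicated data and this kind of branching for as long as both layouts exist.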