
Client-side error logging should use Elastic Common Schema (ECS) fields when possible
Open, Low, Public

Description

We met with @Ottomata, who told us about the work on client-side error logging. We agreed to file this task as notice of upcoming changes to logstash that may affect that work.

As part of T234565, logstash will adopt Elastic Common Schema as the schema for log events. Client error logging should use ECS-defined fields when possible, both to reduce the likelihood of dropped fields and to ease the migration of the stream to the new schema once it is ratified.

Unfortunately, because of mapping conflicts with the current mapping configuration, ECS cannot be fully adopted until the legacy logstash cluster is decommissioned. Judging from jsonschema/mediawiki/client/error/1.2.0.yaml, it looks like only the url field is affected.

Event Timeline

I think the http field might also be affected, and that one will be a bit trickier to reconcile.

Our Event Schema:
https://schema.wikimedia.org/repositories//primary/jsonschema/mediawiki/client/error/1.1.0

ECS http:
https://doc.wikimedia.org/ecs/#ecs-http

Mholloway renamed this task from Client-side error logging should use ECS fields when possible to Client-side error logging should use Elastic Common Schema (ECS) fields when possible. Nov 16 2020, 4:27 PM
fdans triaged this task as Medium priority. Nov 16 2020, 4:35 PM
fdans moved this task from Incoming to Event Platform on the Analytics board.

I think the http field might also be affected, and that one will be a bit trickier to reconcile.

Just talked with @colewhite in IRC.

We'll either need to

A. set up a logstash filter to transform our http object into the ECS http object and run that forever
or
B. Alter our common http object schema to match ECS's.

A. is easy to do now, but requires maintenance and special casing.
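For illustration only, here is a rough sketch of the kind of reshaping option A implies. It is written in Python rather than as an actual logstash filter. The `LEGACY_TO_ECS` table and the `to_ecs_http` function are hypothetical names; the only mapping taken from this thread is request_headers to request.headers, and any other entries would need to be checked against the ECS http definition.

```python
from copy import deepcopy

# Hypothetical mapping from legacy http keys to ECS-style nested paths.
# Only request_headers -> request.headers is named in this thread;
# anything else added here would be an assumption.
LEGACY_TO_ECS = {
    "request_headers": ("request", "headers"),
}


def to_ecs_http(event):
    """Return a copy of event with legacy http keys moved to nested paths."""
    out = deepcopy(event)
    http = out.get("http")
    if not isinstance(http, dict):
        # No http object to reshape; return the copy unchanged.
        return out
    for legacy_key, path in LEGACY_TO_ECS.items():
        if legacy_key not in http:
            continue
        value = http.pop(legacy_key)
        node = http
        # Walk (and create) intermediate objects, e.g. http["request"].
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return out
```

Because the table is the only place a rename is declared, it is also where the "maintenance and special casing" would accumulate over time.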

B. is hard to do, and requires a lot of coordination. But we could do it slowly, one schema at a time, and start with the ones we want to import into logstash. We'd make a fragment/http/2.0.0, or maybe a fragment/ecs/http/1.0.0, and then include it in mediawiki/client/error. To do this we'd need to make eventgate-wikimedia aware of this new convention so that it sets the fields appropriately. Ungh, and if we hoped to eventually migrate ALL existing schemas to ECS's http, the Hive tables would have both http subschema fields (e.g. http.request_headers and http.request.headers), probably forever (unless we manually intervened).

@jlinehan @Mholloway, thoughts?

I'm not sure what is best.

B. is hard to do, and requires a lot of coordination. But we could do it slowly, one schema at a time, and start with the ones we want to import into logstash. We'd make a fragment/http/2.0.0, or maybe a fragment/ecs/http/1.0.0, and then include it in mediawiki/client/error. To do this we'd need to make eventgate-wikimedia aware of this new convention so that it sets the fields appropriately. Ungh, and if we hoped to eventually migrate ALL existing schemas to ECS's http, the Hive tables would have both http subschema fields (e.g. http.request_headers and http.request.headers), probably forever (unless we manually intervened).

What if we create an ECS-specific schema that has everything laid out exactly the way ECS would want it laid out? ECS from what I can tell is a one-schema-to-rule-them-all approach, so in *theory*, having one ECS schema would cover everything. We could then just have a client_error stream, which is using the ECS schema.

Are we planning to have a level of compatibility between events going into Logstash and into other back-ends?

Interesting idea! However, there are some Event Platform specifics that we'd need to handle, mainly meta.stream, meta.dt, $schema, http.client_ip (not in this schema) and http.request_headers['user-agent']. These are all touched by EventGate and/or the Hive ingestion pipeline.

We can't do much about $schema and meta.* fields, but we could potentially refactor all schemas to conform to ECS for http.* and also any other future conventions we might need to adopt.

Refactoring http.* would be a lot of work, but not toooooooo bad. We'd probably have to have EventGate and Hive ingestion support both formats for a very long time, and fill in e.g. both request_headers['user-agent'] and request.headers['user-agent'] if they exist. We already do something similar to handle the differences in legacy EventLogging schemas, I guess we can just keep tacking on more conditional logic. :/
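A minimal sketch of what "fill in both if they exist" could look like, in Python for illustration. This is not the actual EventGate code: `set_user_agent` and its `schema_http_fields` argument (the set of property names the event's schema declares under http) are hypothetical.

```python
def set_user_agent(event, schema_http_fields, user_agent):
    """Fill the user agent into each http layout the schema declares.

    schema_http_fields is a hypothetical set of property names found
    under the schema's http object, e.g. {"request_headers"} for the
    current layout or {"request"} for an ECS-style layout.
    """
    if "request_headers" in schema_http_fields:
        # Current Event Platform layout: http.request_headers['user-agent']
        http = event.setdefault("http", {})
        http.setdefault("request_headers", {})["user-agent"] = user_agent
    if "request" in schema_http_fields:
        # ECS-style layout: http.request.headers['user-agent']
        http = event.setdefault("http", {})
        request = http.setdefault("request", {})
        request.setdefault("headers", {})["user-agent"] = user_agent
    return event
```

Each additional convention adds another branch like these, which is the "tacking on more conditional logic" cost.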

@jlinehan @Ottomata I see we haven't touched this ticket since December. Is there anything we need to action here, or can we close it out?

I can't recall if we made any real decisions on what to do here. There is an issue with which Elasticsearch index is used for the event streams that go there: we need to make sure that no index has field naming conflicts. This means we can't use the Elasticsearch index that holds regular logs for Event Platform streams, because the http field (and maybe others?) conflicts. There was an idea of making a dedicated index for Event Platform streams, but I'm not sure if that was agreed upon.
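To make the conflict concrete, here is a hedged sketch that compares two simplified field-type mappings and reports paths where the types disagree. The mapping format (nested dicts with type names as leaves) is a simplification invented for this example, not the real Elasticsearch mapping format, and `flatten` and `find_conflicts` are hypothetical helpers.

```python
def flatten(mapping, prefix=""):
    """Flatten a nested {field: type-or-dict} mapping into dotted paths."""
    flat = {}
    for name, value in mapping.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            # A nested dict means the field itself is an object.
            flat[path] = "object"
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def find_conflicts(a, b):
    """Return {path: (type_in_a, type_in_b)} for paths whose types differ."""
    fa, fb = flatten(a), flatten(b)
    shared = sorted(fa.keys() & fb.keys())
    return {p: (fa[p], fb[p]) for p in shared if fa[p] != fb[p]}


# Illustrative mappings only: url as an object in one, a string in the other.
ecs_like = {"url": {"full": "keyword"}, "message": "text"}
client_error_like = {"url": "keyword", "message": "text"}
print(find_conflicts(ecs_like, client_error_like))
```

A field that is an object in one mapping and a scalar in the other cannot share an index, which is the argument for a dedicated index.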

I'd personally prefer not to change existing Event Platform schema fields to conform to ECS (BTW there is also a similar effort to rename some things, driven by the Arch and Enterprise teams), but we should do our best to conform as much as possible for new fields, especially those in Metrics Platform's 'monoschema'.

Thinking back to the last discussion, I was under the impression that it was better to update the eventstreams schema to fit ECS more closely sooner rather than later. The risk, IIRC, was that it would be more difficult to change later, once adoption had ramped up.

Some change had been proposed but it hasn't seen much love for a while: https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/647025

Are we past the point where the schema can be amended? Is there external schema work ("monoschema"?) pushing for a specific way of organizing log data?

There are still things we can do on our end to help, but I am out of the loop if the plan has evolved since December.

Right, I think the work on that just stalled and never got done. @jlinehan?

That patch will still have a conflict with the http field. That one is harder to resolve since it is used by a lot of other event schemas too. I'm remembering now: this patch was meant to get the client error logging schema as in line with ECS as possible.

ldelench_wmf lowered the priority of this task from Medium to Low. Aug 23 2021, 2:14 PM

Does T272238: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence affect whether we want to move forward with this?

At this time, the Observability team has no plans to abandon the ECS project.