Page MenuHomePhabricator

Enable 'skin' dimension using stream configuration
Closed, ResolvedPublic

Description

Context

We plan to have the capability to source data for the most commonly-used dimensions. For this to be possible, we will need to

  1. Provide a way for the stream configuration to signal which dimensions should receive data
  2. Provide a way for the client library to fulfill this request by e.g. calling methods which supply the required values

We will use the session length instrument and the 'skin' dimension as a POC.

Questions to answer
  • What does the stream configuration syntax for this look like?
  • What should the schema look like to accommodate this?

Event Timeline

sdkim updated the task description. (Show Details)
jlinehan renamed this task from [SPIKE] Session Length of Logged in/Logged out users to [SPIKE] Enable 'skin' dimension using stream configuration.Jan 21 2021, 6:20 PM
jlinehan triaged this task as Medium priority.
jlinehan updated the task description. (Show Details)

Straw concept:

Perhaps we could create one or more fragment/analytics/.../dimensions schema fragments that could be included in the relevant schemas, adding a top-level dimensions object with properties to be toggled by stream configuration? For example, skin is only a relevant concept for MediaWiki clients and not the apps, so we could create fragment/analytics/mediawiki/dimensions with a dimensions object with an optional, string-typed skin field, and add a $ref to it in analytics/session_tick.

Configuration could look like:

'wgEventStreams' => [
  [
    'stream' => 'mediawiki.client.session_tick',
    ...
    'dimensions' => [
      'skin',
    ],
  ],
],

Client libraries could be updated to understand the possible values included in the dimensions config and include them in a top-level dimensions object if present in the stream config.

Change 658677 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[schemas/event/secondary@master] Create fragment/analytics/dimensions/mediawiki

https://gerrit.wikimedia.org/r/658677

FYI, this is pretty similar to T263466: EventGate idea: use presence of schema properties in http.(request|response)_headers to automatically set header values in event data and T263672: Figure out where stream/schema annotations belong (for sanitization and other use cases), using stream configuration to make clients conditionally set values. E.g. I'd much rather be using stream config to tell eventgate to populate e.g. http.client_ip then using the existence of that field in the schema to do that.

We never really figured out a good smart way to do this in stream config. Is it worth trying to solve this for all these use cases, or is that too cumbersome?

We never really figured out a good smart way to do this in stream config. Is it worth trying to solve this for all these use cases, or is that too cumbersome?

Ya this is pretty much a pilot for solving the general case, so we can use this to understand how we'd tackle in general. I think the two main things are to fix a syntax for the stream config and to fix a fulfillment mechanism for providing that value in the client (in this case) or eventgate (in the case of e.g. http.client_ip etc.). I think it will help if we come together and draft a nice simple syntax for stream config, that will give us a plan to build towards. Speaking at least from the analytics perspective, streams are probably the most valuable and least-understood part of the new system, so stream config is a key interface. See also T269774: [MEP] Determine how stream configuration is authored and deployed. We can unify all these under an epic dedicated to stream config but this is a spike to demonstrate feasibility.

Change 659412 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[schemas/event/secondary@master] Add fragment/analytics/dimensions/mediawiki to analytics/test

https://gerrit.wikimedia.org/r/659412

@Mholloway in yesterday's EP sync, Jason and thought it would be worth some time to think about how to structure stream config in a slightly generic to support configuration of clients to modify events, so we can solve both this and T263466 together.

Something in stream config sorta like:

event_hydrations:
  producer_client: 
    mediawiki_skin:
      enabled: true
      field: dimensions.skin
  intake_service:
    request_headers:
      user-agent:
        enabled: true
        field: http.request_headers.user-agent
      x-client-ip:
        enabled: true
        field: http.client_ip

(actual naming and structure to be put through the bike shed design committee :) )

@Ottomata Nice. That makes sense to me.

Is there any advantage to having an enabled flag for each field over just including in/excluding from the config?

  • How should the client library be structured?

This will not require any structural change to the client libraries. We'll simply be adding a processing step, after sampling config evaluation and before event submission, to gather configured analytics dimensions data from the environment and add them to the event data object.

Is there any advantage to having an enabled flag for each field over just including in/excluding from the config?

I dunno! Maybe not!

For what it's worth, I think we're freer to change around stream config syntax at will than we are schemas (at least for web clients, not really for apps, but we're talking about a web-specific property here), so I don't think we need to block this on nailing down stream config syntax once and for all.

Re 'dimensions'; this sounds a lot like labels from ECS. Could this be simplified by just adding a labels map type field? Then you could set:

labels:
  mediawiki_skin: vector
  other_dimension: thing here

without having to worry about complex schema changes.

Having a single field for this might also help with automation of ingestion into time series DBs that use dimensions/labels like Druid and Prometheus. If this is for session_tick, we could probably make a prometheus bridge of some kind that would allow you to make a nice realtimey Grafana dashboard based on labels and some other metrics.

I would support using a map-typed labels field for this. As described in the ECS spec it fits our intended purpose rather well, I think, and it gets us around the question of what does and doesn't constitute metadata. What do you think, @jlinehan?

Mholloway renamed this task from [SPIKE] Enable 'skin' dimension using stream configuration to Enable 'skin' dimension using stream configuration.Feb 1 2021, 8:05 PM
Mholloway updated the task description. (Show Details)

Removed the [spike] label as this is really more of a bikeshed task. The implementation will be straightforward.

I think for this spike it's okay to just name it something and go from there. I think lablels is not very suggestive of either purpose or provenance in the way that meta(data) is, but this task should *not* block on that bikeshed, for sure. Also, colliding with ECS names could be complicated, as the error logging thing showed.

Also, this takes us back to where we started with attempting to do this with a field that required no schema change. If we recall there was concern from some analysts that we were giving up too much in terms of type-checking, so we opted to use optional fields in a schema instead, which is a good medium.

That said, for this PoC, we can just use test, it is a map type and it already exists! See https://schema.wikimedia.org/repositories//secondary/jsonschema/analytics/session_tick/current.yaml

Change 659412 abandoned by Mholloway:
[schemas/event/secondary@master] Add fragment/analytics/meta/mediawiki to analytics/test

Reason:
No need to add a field for now. We'll leverage the test field in analytics/session_tick for demo purposes.

https://gerrit.wikimedia.org/r/659412

Change 658677 abandoned by Mholloway:
[schemas/event/secondary@master] Create fragment/analytics/meta/mediawiki

Reason:
No need to add a field for now. We'll leverage the test field in analytics/session_tick for demo purposes.

https://gerrit.wikimedia.org/r/658677

Change 660952 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/EventLogging@master] Demo inclusion of client-provided data based on stream config

https://gerrit.wikimedia.org/r/660952

Change 660952 merged by jenkins-bot:
[mediawiki/extensions/EventLogging@master] Demo inclusion of client-provided data based on stream config

https://gerrit.wikimedia.org/r/660952

Change 661975 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/EventLogging@master] MEP: Demo support for including skin and login state via config

https://gerrit.wikimedia.org/r/661975