Define acceptable usage of the `meta` object in event schemas
Closed, Resolved, Public

Description

The fragment/common fragment in the primary schema repository defines a meta object and indicates that all schemas should include it. There has been persistent confusion about what this object is and is not for. Several subfields of meta are currently populated by EventGate, including:

  • meta.stream: The stream to which the event was produced
  • meta.id: A unique ID identifying the event
  • meta.dt: A timestamp reflecting when the event was received by EventGate
  • meta.request_id: A unique ID identifying the request that caused the event
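
As a rough illustration of the fields listed above, here is a minimal Python sketch of an event before and after the meta subfields are filled in. The helper name `fill_meta`, the stream name, and the `$schema` value are hypothetical; this models the behavior described here and is not EventGate's actual code.

```python
import uuid
from datetime import datetime, timezone


def fill_meta(event, stream, request_id=None):
    """Return a copy of `event` with the meta subfields listed above populated.

    Hypothetical helper: it models the described behavior, not EventGate itself.
    """
    meta = dict(event.get("meta", {}))
    meta["stream"] = stream
    # Keep any values the producer already supplied; otherwise generate them.
    meta.setdefault("id", str(uuid.uuid4()))
    meta.setdefault("dt", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"))
    meta.setdefault("request_id", request_id or str(uuid.uuid4()))
    return {**event, "meta": meta}


# Placeholder event; the schema URI and stream name are made up for illustration.
event = {"$schema": "/analytics/example/1.0.0", "action": "click"}
filled = fill_meta(event, stream="example.stream")
print(sorted(filled["meta"]))
```

The instrument only supplies `action`; everything under `meta` arrives without the caller having set it, which is the source of the scope question below.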

What is the intended scope of meta? Is it acceptable to add additional meta fields for analytics instrumentation — a meta.mediawiki_skin field for analytics "dimensions," for example (T271456)? If so, where should these be added? Do they need to be added in the primary schema repository, or can meta be supplemented from a schema or fragment in the secondary repo?

Event Timeline

meta is a vestigial historical compromise, and if we could get rid of it I would. It isn't impossible to get rid of it, it would just be a bit of work.

I'd prefer not to add more stuff into meta if we can avoid it.

[...] a meta.mediawiki_skin field for analytics "dimensions," for example (T271456)? If so, where should these be added? Do they need to be added in the primary schema repository, or can meta be supplemented from a schema or fragment in the secondary repo?

Most likely, information like the MediaWiki skin will be used for analytics / instrumentation purposes, so I think putting this in secondary is fine. We could put it in a similar namespace to the other fragment/mediawiki stuff though, e.g. fragment/mediawiki/skin?

meta is a vestigial historical compromise, and if we could get rid of it I would. It isn't impossible to get rid of it, it would just be a bit of work.

Ya, we're re-re-re-visiting this haha. Background: we're looking to move towards some vocabulary that differentiates between 'data' and 'metadata', where 'data' is the stuff that is set explicitly by an instrument, and 'metadata' is stuff that is set by other services such as the client libraries, EventGate, etc. The way we talk about this distinction today is either as a 'capsule' (a memory of the event capsule from the last system) or 'dimensions' (a BI term with lots of different interpretations that sounds more complicated than it is). I've heard from multiple people that it's a bit confusing to keep track of what we're talking about. OTOH, the data/metadata distinction is something most people are familiar with, and seems natural.

Given that we talk in terms of data/metadata, the meta field would be the ideal place to put these fields, since it's short and descriptive. Unfortunately it's also used for these other things that EventGate likes to have. But even if it weren't used for those other things, I think I could see us opting to use a field called 'meta' to organize the fields we're thinking of adding. So, what would we need to do to give meta's current use case a better home in a different field (or fields), if that is more appropriate? Or would there be real harm in homing them together?

BTW, here's the first reference to meta I can find: https://github.com/wikimedia/restevent/pull/5/files


Hm, I like the motivation here: somehow clearly delineating what can/should be set by instruments and what is set by libraries. I think in practice this is going to be hard, but we can do our best.

Can we use something other than meta? I think the term 'meta' or 'metadata' here is pretty overloaded and extra confusing, and if I were to try to make some differentiation between metadata and data, meta certainly would not respect that difference. Maybe meta.stream is metadata, but I'd say meta.domain and meta.dt are data. So yeah, it sucks.

Let's discuss more, but I think using a schema field to indicate what is set by what software isn't going to be easy: the line is always fuzzy, and there will always be many steps in the pipeline that might mutate an event. Would good documentation help? Some standardization in a field's description? Or maybe schema annotation, like in T263672: Figure out where stream/schema annotations belong (for sanitization and other use cases)? This issue feels a bit similar to Mikhael's schema namespacing idea. We want to make things less confusing to devs by drawing hard lines and making categories, but those lines and categories are never going to be truly coherent. Maybe tagging/annotating coupled with better documentation and UIs (in a data governance tool?) would be simpler?

Anyway, all that said, I'm not opposed to a fragment field in secondary somewhere that is for use by instrumentation client libraries only. Name to be bike shed I guess, I'd say avoid 'meta' and probably also 'dimension', but we can surely find something that would work and make sense.

I'm also not opposed to doing the work to get rid of meta altogether, I just don't think it's a priority. Perhaps better would be just to improve documentation and descriptions?

Can we use something other than meta? I think the term 'meta' or 'metadata' here is pretty overloaded and extra confusing

I think I'd disagree; in terms of comprehensibility, I think this distinction is pretty common. Can you give an example maybe? There is the thing-being-measured (data), and the things-about-the-thing-being-measured (metadata), including any properties of the data itself, but also contextual information about who, what, when, why, and how it was created. This intuition seems to match what we're trying to achieve, by having instruments focus on only the thing-being-measured.

Let's discuss more, but I think using a schema field to indicate what is set by what software isn't going to be easy, the line is always fuzzy, and there will always be many steps in the pipeline that might mutate an event.

This seems like a case where the distinction that's high-importance is the one for the end-user of "do I set this, or do you set this?" In that sense, I agree that doing anything beyond that (i.e., does EventGate set this? Or the client library?) in the schema is probably a recipe for trouble, given that, like you're saying, these contracts may change, or indeed a later stage may mutate a value from a prior stage. It isn't really in the end-user's interest to know or care about how the values get there specifically, so long as they do get there.

Would good documentation help? Some standardization in a field's description?

Since we will be using fragments, I think this is probably sufficient, since we'll write it once and only once.

Anyway, all that said, I'm not opposed to a fragment field in secondary somewhere that is for use by instrumentation client libraries only. Name to be bike shed I guess, I'd say avoid 'meta' and probably also 'dimension',

I think meta works well here, since the other metadata fields are already in meta, and again, my preference for data/metadata semantics. Is the reason to avoid meta to avoid namespace collisions?

Perhaps better would be just to improve documentation and descriptions?

I think descriptions are sufficient for the data lineage part, i.e. what software sets values for a field, but I don't think they're sufficient in practice to distinguish between fields that are set automatically by some process, which may be configured by the user via a stream config, and fields that are set by the instrument. There are a few areas where this information is helpful: 1) writing queries, i.e. column names, 2) reading materialized schema, 3) examining raw events. In these situations, it will be helpful to be able to quickly distinguish between what is set automatically and what is set manually, so having a sub-object for that is useful since it namespaces all the automatically set things. It also ensures that the automatic things are collated together, and not peppered throughout a schema. Keeping the name of this field short and descriptive is a priority then, which again is why I feel meta is a good candidate.

Can you give an example maybe? There is the thing-being-measured (data), and the things-about-the-thing-being-measured (metadata), including any properties of the data itself, but also contextual information about who, what, when, why, and how it was created.

I'd say meta-data is data about the data. The example I gave was meta.stream could probably be metadata. So could data ownership as you say. But things like timestamps and ids and domains and mediawiki skins are actual data, not data about the data.

I think meta works well here, since the other metadata fields are already in meta, and again, my preference for data/metadata semantics. Is the reason to avoid meta to avoid namespace collisions?

meta is already confusing and inaccurate. We can't easily change it without an extensive migration process. It is used by all schemas, including production ones, and I think any variances in it in different schemas will only add to confusion. Adding meta.mediawiki_skin to all schemas is not right, and only having it in some (how? with allOf merging + $refing?) but not others is also confusing.

In that sense, I agree that doing anything beyond that (i.e., does EventGate set this? Or the client library?) in the schema is probably a recipe for trouble, given that, like you're saying, these contracts may change, or indeed a later stage may mutate a value from a prior stage.

Another example of where this distinction is not clear: the MediaWiki PHP intentionally sets the http.request_headers['user-agent'] field, since it is essentially proxying the event produce request to EventGate on behalf of the original HTTP request from a client. So here, the 'client library' is setting this, but in other cases, EventGate is setting it. In other cases, the instrumentation might need to set this themselves, e.g. if there is some python instrumentation that is POSTing to EventGate using python requests library, the user agent will need to be set to something different than the default user agent that python requests sets.
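
To make the last case concrete, here is a minimal sketch of a Python producer overriding the default user agent. It uses the standard library's `urllib.request` rather than the requests library mentioned above, purely to stay self-contained (with requests, the equivalent is passing `headers=` to `requests.post`). The endpoint URL, the user agent string, and the choice to send a JSON array are all assumptions for illustration, and the request is only prepared, never sent.

```python
import json
import urllib.request

# Hypothetical endpoint and user agent string; substitute real values.
EVENTGATE_URL = "https://eventgate.example.org/v1/events"
USER_AGENT = "example-instrumentation/0.1 (contact@example.org)"


def build_request(event):
    """Prepare (but do not send) a POST with an explicit User-Agent header."""
    body = json.dumps([event]).encode("utf-8")
    return urllib.request.Request(
        EVENTGATE_URL,
        data=body,
        headers={"Content-Type": "application/json", "User-Agent": USER_AGENT},
        method="POST",
    )


req = build_request({"meta": {"stream": "example.stream"}})
# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```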

it will be helpful to be able to quickly distinguish between what is set automatically and what is set manually, so having a sub-object for that is useful since it namespaces all the automatically set things. It also ensures that the automatic things are collated together, and not peppered throughout a schema. Keeping the name of this field short and descriptive is a priority then, which again is why I feel meta is a good candidate.

I like this idea in theory. In practice I'm not sure how we can do it for all things that are set 'automatically', as sometimes a field may be set 'automatically' and in other cases it might not be! It depends on the use case of the client. Another example: meta.id. This id should uniquely identify an event. For instrumentation, this is probably fine to just be a random UUID, but for other cases, there might be some event generator that, given certain inputs, should generate the same event with a deterministic meta.id. Or take meta.request_id. This should be used for distributed tracing of the events an originating 'request' might generate. If downstream stream processing is involved, secondary events might be generated, and the meta.request_id should be set appropriately to the same values. E.g. an EditAttemptStep meta.request_id should be propagated to a successful mediawiki.revision-create meta.request_id and then maybe also to a resource_purge meta.request_id. What is 'responsible' for setting that field? Is it always 'client libraries'?
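
A minimal sketch of those two cases, assuming nothing about how any real producer does this: a random ID versus an ID derived deterministically from the event's inputs, and a request ID carried from an originating event to the events derived from it. The namespace, helper names, and `req-123` value are hypothetical, and the stream names are just the examples mentioned above.

```python
import uuid

# Hypothetical namespace for deriving deterministic event IDs.
EVENT_ID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "events.example.org")


def random_event_id():
    """Random UUID: adequate when each emitted event is unique by construction."""
    return str(uuid.uuid4())


def deterministic_event_id(stream, *key_parts):
    """Same inputs always give the same ID, so a regenerated event keeps its meta.id."""
    name = "|".join([stream, *map(str, key_parts)])
    return str(uuid.uuid5(EVENT_ID_NAMESPACE, name))


def derive_event(parent, stream, **fields):
    """Build a downstream event that inherits the parent's meta.request_id."""
    return {
        "meta": {
            "stream": stream,
            "id": random_event_id(),
            "request_id": parent["meta"]["request_id"],
        },
        **fields,
    }


edit_attempt = {"meta": {"stream": "EditAttemptStep", "id": random_event_id(),
                         "request_id": "req-123"}}
revision = derive_event(edit_attempt, "mediawiki.revision-create", rev_id=42)
purge = derive_event(revision, "resource_purge")
print(purge["meta"]["request_id"])
```

In both helpers the value is set by whichever code builds the event, which may or may not be a client library.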

There will always be many producers using different code and libraries to work with events, and I don't think we're going to be able to keep them all consistent, especially with respect to common field semantics. We can do our best for some, but the fewer we have to deal with the better.

Anyway, in summary, I think meta is an inaccurate name for what you want to do, and it is already taken (and inaccurate for what it is doing).

But, I do think that for schemas that will ONLY be produced by certain clients, in this case clients producing analytics instrumentation for WMF products, it could make sense to create a separate field that should not be touched by instrumentation code, but only by your supported client libraries. This is adding (more) coupling between schemas and code, but it is limited to the specific use case, instead of something shared by all of our schemas.

Can you give an example maybe? There is the thing-being-measured (data), and the things-about-the-thing-being-measured (metadata), including any properties of the data itself, but also contextual information about who, what, when, why, and how it was created.

I'd say meta-data is data about the data. The example I gave was meta.stream could probably be metadata. So could data ownership as you say. But things like timestamps and ids and domains and mediawiki skins are actual data, not data about the data.

IMO the data is the thing-being-measured. If the stream is measuring button clicks, then the button click is the data. The time of the click, user who clicked, page it was clicked on, skin at the time of the click, session id, stream for the data, etc., are all metadata. See https://en.wikipedia.org/wiki/Metadata. To use an analogy, EXIF is metadata about a photo. The photo is the data; the author, location, camera settings, etc. are metadata. I think in the past we have collected things like skin, session_id, etc. in the same way that we collect data -- i.e., the instrument/caller does it. But now -- again like EXIF -- the user is only responsible for "taking the picture", and the metadata will be added as desired by the "camera".

The interpretation of metadata purely as administrative facts about the data in order to aid the data handling apparatus (an example would be an HTTP header, for the most part) seems to be what you're using, but I think this is too narrow, or rather, it forms a subset of metadata.

I think meta works well here, since the other metadata fields are already in meta, and again, my preference for data/metadata semantics. Is the reason to avoid meta to avoid namespace collisions?

meta is already confusing and inaccurate. We can't easily change it without an extensive migration process. It is used by all schemas, including production ones, and I think any variances in it in different schemas will only add to confusion. Adding meta.mediawiki_skin to all schemas is not right, and only having it in some (how? with allOf merging + $refing?) but not others is also confusing.

Isn't it possible to "fork" the meta fragment and use a different one for the secondary repo? I don't think the proposal is to add new fields to all production schemas. These fields are just names, not actual entities. We all agree meta as it stands today is confusing and inaccurate, the proposal here is to use it in a way (at least in the secondary repo) that isn't those things, rather than make another confusing/inaccurate field because the preferred one (meta) was "taken?"

it will be helpful to be able to quickly distinguish between what is set automatically and what is set manually, so having a sub-object for that is useful since it namespaces all the automatically set things. It also ensures that the automatic things are collated together, and not peppered throughout a schema. Keeping the name of this field short and descriptive is a priority then, which again is why I feel meta is a good candidate.

I like this idea in theory. In practice I'm not sure how we can do it for all things that are set 'automatically', as sometimes a field may be set 'automatically' and in other cases it might not be! It depends on the use case of the client. Another example: meta.id. This id should uniquely identify an event. For instrumentation, this is probably fine to just be a random UUID, but for other cases, there might be some event generator that, given certain inputs, should generate the same event with a deterministic meta.id. Or take meta.request_id. This should be used for distributed tracing of the events an originating 'request' might generate. If downstream stream processing is involved, secondary events might be generated, and the meta.request_id should be set appropriately to the same values. E.g. an EditAttemptStep meta.request_id should be propagated to a successful mediawiki.revision-create meta.request_id and then maybe also to a resource_purge meta.request_id. What is 'responsible' for setting that field? Is it always 'client libraries'?

I'm not sure I follow. We define metadata to be fields meta.*, defined by some schema fragment. A stream configuration provides a request contract for metadata fields. Conformant clients must fulfill the stream configuration's request. In the design of the client library, this fulfillment will take place within the library automatically, and so the library will advise that these fields should not be set by callers, and setting them may result in undefined behavior. This doesn't prevent other clients (or callers who forego the client library) from fulfilling the stream configuration's request in their own way.
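
A minimal sketch of that contract, under assumptions: the stream name, the `requested_meta` config key, the provider table, and the `submit` function are all hypothetical, and raising an error when a caller sets `meta` is just one possible policy (the proposal above only says the result would be undefined).

```python
# Hypothetical stream configuration: which meta.* fields each stream requests.
STREAM_CONFIGS = {
    "example.button_click": {"requested_meta": ["mediawiki_skin", "session_id"]},
}

# Hypothetical providers the client library would use to fulfill a request.
PROVIDERS = {
    "mediawiki_skin": lambda ctx: ctx["skin"],
    "session_id": lambda ctx: ctx["session_id"],
}


def submit(stream, data, ctx):
    """Build an event: the caller supplies `data`, the library fills requested meta fields."""
    if "meta" in data:
        # One possible policy; the library only advises callers not to set these.
        raise ValueError("meta.* is managed by the client library")
    meta = {"stream": stream}
    for name in STREAM_CONFIGS[stream]["requested_meta"]:
        meta[name] = PROVIDERS[name](ctx)
    return {**data, "meta": meta}


ctx = {"skin": "vector", "session_id": "abc123", "unused": "ignored"}
event = submit("example.button_click", {"button": "save"}, ctx)
print(event)
```

A different client could fulfill the same stream configuration with its own code, as long as the resulting event carries the requested fields.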

There will always be many producers using different code and libraries to work with events, and I don't think we're going to be able to keep them all consistent, especially with respect to common field semantics. We can do our best for some, but the fewer we have to deal with the better.

The goal is to define a set of common semantics for the producer code used by WMF products, in particular the analytics clients we're building. It's possible to keep these consistent, and part of our mission. That is the level this proposal exists at, I don't think the proposal is to define behavior at a higher level than that.

Anyway, in summary, I think meta is an inaccurate name for what you want to do, and it is already taken

Again, disagree on both counts.

We are having a fun little philosophical argument, eh? :)

IMO the data is the thing-being-measured

I agree with this

The time of the click, user who clicked, page it was clicked on, skin at the time of the click, session id, stream for the data, etc., are all metadata.

I disagree with this.

To use an analogy, EXIF [...]

I think the analogy here doesn't quite fit. The photo data file would do what it is supposed to do without the EXIF information. Instrumentation events would not be usable without timestamps, nor without other extra data that's added to enable other functionality (ingestion into Hive tables, deduplication, distributed tracing, stats about MediaWiki skins, user agents, etc.).

The purpose of instrumentation event data and of production state change event data is to capture information about the event. I'd agree that some of what we add to events is not strictly what the event is defined to be about. For example we have user_edit_count in mediawiki/revision/create events (as well as a lot of other extra info) that is not strictly relevant to the revision-create, but it does add value and is useful. So, while I'd agree that fields like user_edit_count and mediawiki_skin are extra data, they are not 'metadata' in the sense that it is not data about the data.

I think this philosophical argument we are having about 'meta' is maybe due to a disagreement as to where the 'data about data' level-jumping line is. To me, the MediaWiki skin that is used at the time an event was emitted is at the same level as the other information captured in the event. A wholly different level would be things like the team at WMF that manages the stream, or the date the schema was first created, or the Phabricator ticket (T271456) in which the original work to add mediawiki_skin was tracked, etc.

In the Metadata Wikipedia article you linked (I did not read the whole thing, just the header), the metadata examples given are all data about data, not just 'extra data'.

If we were to say that any extra information in an event was 'metadata', then almost all fields could be considered metadata. E.g. in mediawiki/client/error, only the original exception message would be 'data', whereas everything else that we are capturing would be 'metadata'. I don't think this is right.

Isn't it possible to "fork" the meta fragment

Possible yes, but I really don't think we should do this. At the moment, meta + $schema are ubiquitously referenced from every schema we have. Making meta different in some schemas is a bad idea. meta.stream, meta.id, meta.dt (as well as others) are needed for backend internal functionality. If we fork meta, we lose the ability to easily keep track of how these fields are used throughout the schemas. If we ever need to change these, I wouldn't want to have to do so in multiple forks.

If we did the work to get rid of the meta field altogether, I'd only object to your desired use of meta from a philosophical/definitional/bikeshed point of view. But given that it is already used for critical things outside of analytics client libraries, it really should not be used in a specialized way for those analytics client libraries.

I think in the past we have collected things like skin, session_id, etc. in the same way that we collect data -- i.e., the instrument/caller does it. But now -- again like EXIF -- the user is only responsible for "taking the picture", and the metadata will be added as desired by the "camera".

Conformant clients must fulfill the stream configuration's request. In the design of the client library, this fulfillment will take place within the library automatically, and so the library will advise that these fields should not be set by callers, and setting them may result in undefined behavior.

It seems to me that what you want is to use a sub-object field as a clear distinction, from the caller's point of view, of which fields they should not have to think about setting in their code, and instead use configuration to enable automatic collection of standardized extra data in that sub-object. I see the value in this. It could be practically messy for the reasons I mentioned above, but if restricted specifically to use by analytics instrumentation client libraries, I think it would be ok.

I've heard from multiple people that it's a bit confusing to keep track of what we're talking about.

Makes sense. What I'm hearing is that you need a term to refer to 'fields not set by instrumentors'.

The way we talk about this distinction today is either as a 'capsule' (a memory of the event capsule from the last system) or 'dimensions' (a BI term with lots of different interpretations that sounds more complicated than it is).

My understanding of dimension is that it is not the term you are looking for. A dimension is just a field with limited cardinality, suitable for aggregations like grouping and counting, like 'domain' or 'page title', contrasted with a metric or a measure, like 'time spent on page'.
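
A small sketch of that contrast, with made-up events: `domain` acts as the dimension (low cardinality, used for grouping) and `time_spent_ms` as the measure (a number that gets aggregated). Both field names are hypothetical.

```python
from collections import defaultdict

# Made-up events for illustration only.
events = [
    {"domain": "en.wikipedia.org", "time_spent_ms": 1200},
    {"domain": "de.wikipedia.org", "time_spent_ms": 800},
    {"domain": "en.wikipedia.org", "time_spent_ms": 400},
]

counts = defaultdict(int)    # events per dimension value
total_ms = defaultdict(int)  # sum of the measure per dimension value
for e in events:
    counts[e["domain"]] += 1
    total_ms[e["domain"]] += e["time_spent_ms"]

# Aggregate the measure within each group defined by the dimension.
mean_ms = {domain: total_ms[domain] / counts[domain] for domain in counts}
print(dict(counts))
print(mean_ms)
```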

the data/metadata distinction is something most people are familiar with, and seems natural.

Hm. People may be familiar with the terms data and metadata, but I really don't think that which part of the event produce pipeline sets which fields is what makes that distinction.

Perhaps something more descriptive? Q: Are all the extra data you want to collect going to truly be dimensions? If so, going with something like the labels map field idea might work well. All the value types would be strings, and you wouldn't have to pre-define the field names (map keys) you want to collect in the schemas. You'd only have to write the code to set the keys and values in the labels map based on stream configuration.

If you really want these labels to never be touched by instrumentors, I suppose a more descriptive name would help? automatic_labels, extra_labels, or some other beautiful bike shed? :)
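
A minimal sketch of the labels map idea, assuming a stream config that simply lists label keys. The config shape, the `collect_labels` helper, and the label names are hypothetical; `automatic_labels` is one of the candidate names floated above.

```python
# Hypothetical stream config naming which labels the library should collect.
STREAM_CONFIG = {"labels": ["mediawiki_skin", "user_is_logged_in"]}


def collect_labels(stream_config, context):
    """Return a string-to-string map of the labels the stream config asks for."""
    labels = {}
    for key in stream_config.get("labels", []):
        if key in context:
            labels[key] = str(context[key])  # all label values are strings
    return labels


# `page_id` is available in the context but not requested, so it is not collected.
context = {"mediawiki_skin": "vector", "user_is_logged_in": True, "page_id": 7}
event = {"action": "click", "automatic_labels": collect_labels(STREAM_CONFIG, context)}
print(event["automatic_labels"])
```

Because the map's keys are not declared in the schema, adding a new label would only need a config change and a value in the context, not a schema change.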

Hm, alternatively, perhaps the 'capsule' idea is actually useful here. Legacy EventLogging schemas have the event field, which holds what is given to the now-deprecated mw.eventLog.logEvent function. From the point of view of users that are calling logEvent, they don't see the 'capsule' fields (which are now set by the client library, instead of the eventlogging-processor backend). Maybe this kind of distinction is good to continue to use for analytics events? There is a sub-object field in all analytics events that is where instrumentors are to stick their instrumentation data. Anything outside of that field is subject to use or augmentation by the client library and other downstream parts of the pipeline. I don't love event as the name of this field (it's a bit confusing having an event in an event), but if you wanted to keep using it for consistency I won't object.
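
A minimal sketch of that capsule layout, with hypothetical helper names and field values: instrument data lives under `event`, and everything outside it is fair game for the client library and later pipeline stages.

```python
def wrap_in_capsule(instrument_data, capsule_fields):
    """Place instrument data under `event`; everything else is library-managed."""
    if "event" in capsule_fields:
        raise ValueError("`event` is reserved for instrument data")
    return {**capsule_fields, "event": dict(instrument_data)}


def augment(capsuled, **extra):
    """A later pipeline stage may add top-level fields but leaves `event` alone."""
    if "event" in extra:
        raise ValueError("pipeline stages must not overwrite instrument data")
    return {**capsuled, **extra}


# Placeholder values for illustration.
capsuled = wrap_in_capsule({"action": "click"}, {"dt": "2021-01-01T00:00:00Z"})
augmented = augment(capsuled, user_agent="ExampleBot/1.0")
print(augmented)
```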

@Ottomata, @Mholloway and I had a chance to sit down and dive into this, notes from our discussion:

  • Likely undesirable to namespace all non-producer-managed fields (what we refer to as 'metadata' in above discussion) under any common namespace, meta.* or otherwise
    • Main reason: that namespace would need to be known to all elements of the data pipeline which set these field values
    • A body of pre-existing fields in the top level exists which would be undesirable to move/replicate, necessitating changes in behavior at various pipeline stages
    • Possible solution: indirection layer to define managed fields as 'entities' (e.g. user_agent) which are then mapped to a field name or list of possible field names, e.g. ['http.user_agent','meta.user_agent',...]. Various agents acting within the pipeline use this canonical mapping to know what field to operate on if they want to operate on user_agent
    • While interesting, we can find our way there naturally; this isn't the task to launch such an effort
  • Examination of e.g. https://schema.wikimedia.org/repositories//primary/jsonschema/fragment/mediawiki/common/current.yaml etc. showed the need for getting more awareness of what exists in the primary repository, if we should standardize on a single set of fields, how management of those fields should be shared between analytics and production use-cases, and how much we should re-use that existing work. This will need a closer look.
  • In terms of this task, the default remains to not extend meta. We had some other ideas that I will write up in a full response.
  • In terms of what we call things that are 'metadata' versus 'data', the distinction may still be useful to talk about, despite the fact that the 'meta' field is named what it is. However the convention would not necessarily be reflected in the field structure for now.
    • An option would be to use an approach where 'metadata' consists of all top-level fields except for a distinguished namespace event.* or data.* or similar, under which all the fields defined by the instrument (indeed, particular to its schema), would reside. It's not clear that this would give the same benefits, however, in terms of reminding users of provenance.
    • Perhaps all fields (aside from subfields of some structures such as http.*, etc) being top-level is the better approach. In that view, meta.* is a problematic artifact because it collects as sub-fields things that should (under this convention) be top-level.
      • This "all fields are top-level" seems to be a reasonable approach, especially under a regime like the one we expect, in which a meaningful amount (if not a majority) of the fields and their values will be 'metadata,' managed exclusively by automated processes, meaning that any namespace they did fall under would constitute the bulk of the entire event structure, if not the entirety.
  • We can assess the best way to design around meta in the short-term.
  • In the mid-term, we should consider identifying points of control in the data pipeline which use implicit conventions related to field names and schema structure, and better understand how this logic affects the design of our schema and conventions, and technical debt that we are accumulating.
  • The other mid-term goal is to consider the extent to which we should use a single set of conventions for all 'metadata' between production and analytics events, something we can consider as part of our schema fragment design process for the analytics events.
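
The indirection layer floated in the notes above (managed 'entities' mapped to candidate field names) could be sketched roughly as follows. This is only an illustration of the idea, not something that exists: the mapping table and helper names are hypothetical, and only the two field names given in the notes are listed.

```python
# Hypothetical canonical mapping from an entity to the fields that may hold it.
ENTITY_FIELDS = {
    "user_agent": ["http.user_agent", "meta.user_agent"],
}


def get_path(event, dotted):
    """Look up a dotted path such as 'http.user_agent'; return None if absent."""
    node = event
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def resolve_entity(event, entity):
    """Return (field_name, value) for the first candidate field present."""
    for field in ENTITY_FIELDS[entity]:
        value = get_path(event, field)
        if value is not None:
            return field, value
    return None, None


print(resolve_entity({"meta": {"user_agent": "ExampleBot/1.0"}}, "user_agent"))
```

A pipeline stage that wants to operate on user_agent would consult the mapping instead of hard-coding a field path.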
kzimmerman subscribed.

Moving to the backlog until we're ready to pick this up

LGoto triaged this task as Medium priority.

Based on recent schema discussions I think we can close this ticket. @jlinehan feel free to reopen if not.