Marked as invalid, using a different task for this
Wed, Mar 3
Wed, Feb 24
Wed, Feb 17
Wed, Feb 10
Mon, Feb 8
Feb 3 2021
- Likely undesirable to namespace all non-producer-managed fields (what we refer to as 'metadata' in above discussion) under any common namespace, meta.* or otherwise
- Main reason: that namespace would need to be known to all elements of the data pipeline which set these field values
- A body of pre-existing fields in the top level exists which would be undesirable to move/replicate, necessitating changes in behavior at various pipeline stages
- Possible solution: indirection layer to define managed fields as 'entities' (e.g. user_agent) which are then mapped to a field name or list of possible field names, e.g. ['http.user_agent','meta.user_agent',...]. Various agents acting within the pipeline use this canonical mapping to know what field to operate on if they want to operate on user_agent
- While interesting, we can find our way there naturally; this isn't the task to launch such an effort
- Examination of e.g. https://schema.wikimedia.org/repositories//primary/jsonschema/fragment/mediawiki/common/current.yaml etc. showed that we need a better picture of what already exists in the primary repository, whether we should standardize on a single set of fields, how management of those fields should be shared between analytics and production use-cases, and how much of that existing work we should re-use. This will need a closer look.
- In terms of this task, the default remains to not extend meta. We had some other ideas that I will write up in a full response.
- In terms of what we call things that are 'metadata' versus 'data', the distinction may still be useful to talk about, despite the fact that the 'meta' field is named what it is. However the convention would not necessarily be reflected in the field structure for now.
- An option would be to use an approach where 'metadata' consists of all top-level fields except for a distinguished namespace event.* or data.* or similar, under which all the fields defined by the instrument (indeed, particular to its schema), would reside. It's not clear that this would give the same benefits, however, in terms of reminding users of provenance.
- Perhaps all fields (aside from subfields of some structures such as http.*, etc) being top-level is the better approach. In that view, meta.* is a problematic artifact because it collects as sub-fields things that should (under this convention) be top-level.
- This "all fields are top-level" seems to be a reasonable approach, especially under a regime like the one we expect, in which a meaningful amount (if not a majority) of the fields and their values will be 'metadata,' managed exclusively by automated processes, meaning that any namespace they did fall under would constitute the bulk of the entire event structure, if not the entirety.
- We can assess the best way to design around meta in the short-term.
- In the mid-term, we should consider identifying points of control in the data pipeline which use implicit conventions related to field names and schema structure, and better understand how this logic affects the design of our schema and conventions, and technical debt that we are accumulating.
- The other mid-term goal is to consider the extent to which we should use a single set of conventions for all 'metadata' between production and analytics events, something we can consider as part of our schema fragment design process for the analytics events.
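A minimal sketch of the indirection layer floated in the notes above, for discussion only. Only the 'user_agent' entity and the 'http.user_agent' / 'meta.user_agent' field names come from the notes; ENTITY_FIELDS, get_path and resolve_entity are hypothetical names, and treating the list as a priority order is an assumption.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical canonical mapping: entity name -> candidate field paths.
# Assumption: paths are tried in order, first match wins.
ENTITY_FIELDS: Dict[str, List[str]] = {
    "user_agent": ["http.user_agent", "meta.user_agent"],
}


def get_path(event: Dict[str, Any], path: str) -> Optional[Any]:
    """Walk a dotted path through nested dicts; return None if any step is missing."""
    node: Any = event
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def resolve_entity(event: Dict[str, Any], entity: str) -> Optional[Tuple[str, Any]]:
    """Return (field_path, value) for the first candidate path present in the event."""
    for path in ENTITY_FIELDS.get(entity, []):
        value = get_path(event, path)
        if value is not None:
            return path, value
    return None


if __name__ == "__main__":
    event = {"meta": {"user_agent": "ExampleAgent/1.0"}}
    print(resolve_entity(event, "user_agent"))
```

With something like this, a pipeline stage asks for the entity and never needs to know which namespace a given schema happens to use.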
Feb 1 2021
I think for this spike it's okay to just name it something and go from there. I think labels is not very suggestive of either purpose or provenance in the way that meta(data) is, but this task should *not* block on that bikeshed, for sure. Also, colliding with ECS names could be complicated, as the error logging thing showed.
Can you give an example maybe? There is the thing-being-measured (data), and the things-about-the-thing-being-measured (metadata), including any properties of the data itself, but also contextual information about who, what, when, why, and how it was created.
I'd say meta-data is data about the data. The example I gave was meta.stream could probably be metadata. So could data ownership as you say. But things like timestamps and ids and domains and mediawiki skins are actual data, not data about the data.
BTW, here's the first reference to meta I can find: https://github.com/wikimedia/restevent/pull/5/files
Hm, I like the motivation here: somehow clearly delineating what can/should be set by instruments and what is set by libraries. I think in practice this is going to be hard, but we can do our best.
Can we use something other than meta? I think the term 'meta' or 'metadata' here is pretty overloaded and extra confusing.
I think I'd disagree. In terms of comprehensibility, I think this distinction is pretty common. Can you give an example maybe? There is the thing-being-measured (data), and the things-about-the-thing-being-measured (metadata), including any properties of the data itself, but also contextual information about who, what, when, why, and how it was created. This intuition seems to match what we're trying to achieve, by having instruments focus on only the thing-being-measured.
Nice! Interesting, the KaiOS app is JS? Cool.
@jlinehan maybe we should one day consider having language specific client libraries (JS, PHP, Java, etc.) , rather than app specific (MW, Android, iOS, KaiOS).
Jan 29 2021
meta is a vestigial historical compromise, and if we could get rid of it I would. It isn't impossible to get rid of it; it would just be a bit of work.
Jan 28 2021
@Ottomata @Mholloway Made a dedicated bikeshed task to capture syntax discussion, see: T273235: [Metrics Platform] Define stream configuration syntax relevant to v1 release
Having a separate repo might make it easier for us to adapt to any changes in operations/mediawiki-config over the coming years, as well as make it easier to add hooks etc and expose the repo for public browsing as in schema.wikimedia.org. For me at least, fewer repos is better mostly from a usability perspective of not needing to clone/keep track of more repositories in order to make a change, but here you've got to clone something either way (a standalone repo or operations/mediawiki-config). Having a small clean repo that only does one thing would probably make it easier for us to build an interface on top of it if we ever go that way, and it would be more approachable for users either way. If "deployment" consists of pulling the submodule update into mediawiki-config, that seems kind of neat as well. I'd vote separate repo.
Jan 27 2021
Uploaded WIP patch for engineers to discuss if desired at BUOD meeting. Not tested yet while I wrestle with my vagrant install but the point is clear. Will test, iterate patchsets and add tests from here.
We never really figured out a good smart way to do this in stream config. Is it worth trying to solve this for all these use cases, or is that too cumbersome?
Jan 21 2021
Jan 14 2021
Sampling session lengths should be done with a token that uses the same semantics as the sessions themselves, so this is a dependency of sampling the session tick data stream.
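A rough sketch of what session-scoped sampling could look like, for illustration only. It assumes the sampling token is simply the session ID and that sampling is a deterministic hash of that token; neither is specified in this thread, and in_sample is a hypothetical name.

```python
import hashlib


def in_sample(session_id: str, rate: float) -> bool:
    """Decide deterministically whether a session is in-sample.

    Assumption: session_id lives exactly as long as the session, so every
    tick from the same session gets the same decision.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    value = int.from_bytes(digest[:8], "big")
    # Integer threshold avoids float rounding at the edges (rate 0 and 1).
    return value < int(rate * 2**64)


if __name__ == "__main__":
    # All ticks of one session share the decision, so session lengths
    # computed from the sampled ticks are not truncated.
    session = "hypothetical-session-id"
    print([in_sample(session, 0.1) for _tick in range(3)])
```

Because the decision depends only on the token, a session is either fully in or fully out of the sample, which is what counting ticks per session needs.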
Jan 13 2021
Hey all :]
I looked a bit into the size and length of the session_tick data that we're collecting right now, to determine what sampling rate we'll need to use.
Thank you @mforns for crunching the numbers and writing all of this up
Dec 16 2020
Seems to be fixed for now. @jlinehan do you want to keep this task open to track your work on it (I assume you’ll eventually want to un-revert that change in some form), or is it okay to close?
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/645430 was reverted while we examine the source of the error. Sorry for the inconvenience.
Alright, then I’ll stop tinkering with my above changes for the time being :) thanks!
Thanks for creating this ticket, we're looking into this now and should resolve soon.
Dec 11 2020
For app specific event schemas, prefix with the app name:
For anything that might be shared across apps, don't prefix:
Dec 10 2020
B. is hard to do, and requires a lot of coordination. But we could do it slowly one schema at a time, and start with the ones we want to import into logstash. We'd make a fragment/http/2.0.0, or maybe a fragment/ecs/http/1.0.0, and then include it in mediawiki/client/error. To do this we'd need to make eventgate-wikimedia aware of this new convention and set the fields appropriately. Ungh, and if we hoped to eventually migrate ALL existing schemas to ECS's http, the Hive tables would have both sets of http subschema fields (e.g. http.request_headers and http.request.headers), probably forever (unless we manually intervened).