
Metrics Platform Schema: Define & Model Bespoke Metrics Data
Open, High, Public

Description

Goal: Look at how to model bespoke data for the Metrics Platform schema.

Impact:

  • Provides a way to identify non-standard fields so that users can still ingest and understand them without needing deep knowledge of the schema.
  • Provides a way to track and describe bespoke data fields that can be consumed by downstream processes without additional customization needed.

Success Criteria:

  • Model how bespoke data points will be collected in the Metrics Platform
  • Bespoke data points have clear descriptions and can be leveraged in analytics

Event Timeline

DAbad triaged this task as High priority. May 10 2021, 3:53 PM
DAbad renamed this task from Define Bespoke Metrics Data Capture to Metrics Platform Schema: Define & Model Bespoke Metrics Data. May 13 2021, 4:11 PM
DAbad updated the task description.

Hi all!
@DAbad @jlinehan Yesterday's Metrics Platform demo meeting was really helpful to better understand the project.
It generated some discussions in our team, and I'd like to share the gist of it (although part of it we already discussed in the demo):

  • We all agreed that the normalization and standardization of fields that are common in most analytics schemas is great for several reasons: reducing cognitive load when instrumenting, allowing for cross-schema referencing, unifying dimension formats... and will bring great value.
  • We also wondered whether the fact that bespoke fields are not specified in the schema might affect data discoverability and data lineage. We believe it would be good to have bespoke fields be somehow "schemaed" as well: for developers/analysts as historical documentation, and for related services (sanitization, validation, dashboarding tools, data governance tools, ...) as a programmatic reference. Please let us know your thoughts!

Thanks :]

A bit more elaboration on Marcel's second point: if the data is not schemaed, it is not possible for UI and data governance tools to 'discover' this data. That is, for data without explicit keys in the schema, we won't have any visible information about what data is in a stream; the stream data would have to be queried explicitly somewhere to show this, which isn't easily automatable. We also won't be able to track field lineage between datasets.

This isn't a blocker from me, but just wanted to make sure this point is considered when making this choice. :)
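
To illustrate the discoverability point above, here is a minimal sketch of what tooling could do if bespoke fields were declared in the schema. Everything in it is an assumption for illustration only: the schema title, the `custom_data` container, the field names, and the `discover_fields` helper are hypothetical and are not the actual Metrics Platform schema or API. It only shows that a tool could list fields and their descriptions from the schema alone, without querying stream data.

```python
# Hypothetical JSON-Schema-style fragment; all names are illustrative only.
SCHEMA = {
    "title": "analytics/example_instrument",
    "type": "object",
    "properties": {
        "action": {
            "type": "string",
            "description": "Standard field: the user action.",
        },
        # Bespoke fields declared explicitly, each with a description.
        "custom_data": {
            "type": "object",
            "description": "Container for bespoke fields.",
            "properties": {
                "panel_name": {
                    "type": "string",
                    "description": "Which panel was open.",
                },
                "scroll_depth": {
                    "type": "integer",
                    "description": "Percent of page scrolled.",
                },
            },
        },
    },
}


def discover_fields(schema, prefix=""):
    """Walk the schema and return {dotted field path: description}."""
    fields = {}
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if spec.get("type") == "object" and "properties" in spec:
            # Recurse into nested objects so bespoke fields are listed too.
            fields.update(discover_fields(spec, prefix=path + "."))
        else:
            fields[path] = spec.get("description", "")
    return fields


if __name__ == "__main__":
    for path, description in discover_fields(SCHEMA).items():
        print(f"{path}: {description}")
```

If the bespoke fields were instead sent as undeclared keys, a helper like this would only see the standard fields, which is the gap described above.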

I second the concerns that @mforns outlined. It would be worthwhile exploring ways to improve and make schema registration easier if that's part of the issue.