Implementation task coming out of spike T354955
Description
Write a script for capturing the number and description of custom data properties in instrumentation schemas.
Use Case
To establish a Metrics Platform baseline for the number of custom data properties in WMF instruments, given the following success criterion:
decrease the number of custom fields across instruments by X%
User Story
As a product manager, I want to know how the number of custom data properties used in WMF instrumentation schemas changes over time, in order to gauge the efficacy of Metrics Platform adoption.
Outcome
We know how many custom data properties are being used in WMF instruments at any given time, and how that number changes over time.
Acceptance Criteria
- A script is written to provide the following data points:
  - The name of each schema, with the count and names of its custom properties
  - The total number of custom properties being used
  - The total number of custom properties in Metrics Platform specific schemas
  - See https://gitlab.wikimedia.org/repos/data-engineering/custom-data-monitor
- For MVP, we maintain a spreadsheet that tracks this data over time (monthly? quarterly?) starting from July 2023.
- We have a mechanism for synthesizing/analyzing script data
Required
- Documentation
Technical Notes
In T354955#9512064, an example script shows how to convert YAML into an object that can be iterated over to include or exclude schemas with specific fragment references, and to count the number of properties in each schema.
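As a rough illustration of that approach (not the script from T354955#9512064), the sketch below works on schemas that have already been parsed from YAML into plain dicts, for example with PyYAML's `yaml.safe_load`. It assumes that a schema pulls in shared fields through `$ref` entries under `allOf`, and that any property declared directly in the schema counts as custom. The `MP_FRAGMENT_MARKER` value and all function names are hypothetical and would need to be checked against the real schema repositories.

```python
# Hypothetical marker: assumed substring of fragment references that
# identify a Metrics Platform schema. Adjust to the real fragment paths.
MP_FRAGMENT_MARKER = "/fragment/analytics/product_metrics/"


def fragment_refs(schema):
    """Return the $ref strings listed under the schema's allOf entries."""
    return [
        item["$ref"]
        for item in schema.get("allOf") or []
        if isinstance(item, dict) and "$ref" in item
    ]


def custom_properties(schema):
    """Return sorted names of properties declared in the schema itself.

    Assumption: properties inherited through $ref fragments are not
    custom; anything declared inline (top level or in allOf) is.
    """
    names = set(schema.get("properties") or {})
    for item in schema.get("allOf") or []:
        if isinstance(item, dict):
            names.update(item.get("properties") or {})
    return sorted(names)


def summarize(schemas):
    """Build the data points listed in the acceptance criteria.

    `schemas` maps a schema name to its parsed YAML content (a dict).
    """
    per_schema = {}
    for name, schema in schemas.items():
        props = custom_properties(schema)
        per_schema[name] = {
            "count": len(props),
            "properties": props,
            "metrics_platform": any(
                MP_FRAGMENT_MARKER in ref for ref in fragment_refs(schema)
            ),
        }
    return {
        "per_schema": per_schema,
        "total": sum(entry["count"] for entry in per_schema.values()),
        "total_metrics_platform": sum(
            entry["count"]
            for entry in per_schema.values()
            if entry["metrics_platform"]
        ),
    }
```

Separating Metrics Platform schemas by fragment reference keeps the include/exclude rule in one place, so it can be changed without touching the counting logic.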
Depending on how we want to make the results available, we may want to build a tool or interface that shows them over time. This can be spun off into its own task, unless we go the lo-fi route of outputting data to a spreadsheet, in which case the details of that solution should be included here as part of the second acceptance criterion (the spreadsheet).
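For the lo-fi route, one possible shape is a CSV file that gains one row per run and can be imported into a spreadsheet. This is only a sketch: the column names and the `append_snapshot` helper are hypothetical, and the actual spreadsheet layout is still to be decided.

```python
import csv
from pathlib import Path

# Hypothetical column layout for the tracking sheet.
HEADER = ["date", "total_custom_properties", "metrics_platform_custom_properties"]


def append_snapshot(csv_path, snapshot_date, total, total_metrics_platform):
    """Append one dated row of totals, writing the header on first use."""
    path = Path(csv_path)
    needs_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if needs_header:
            writer.writerow(HEADER)
        writer.writerow([snapshot_date, total, total_metrics_platform])
```

Running this once per reporting period (monthly or quarterly) would build up the history needed to compare against the baseline.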