
[SPIKE] Explore quarterly audit of the use of custom properties in instrumentation
Closed, Resolved · Public · 5 Estimated Story Points · Spike

Description

Background

From T352816:

As a way to measure the Metrics Platform's progress in achieving our goals, we determined a set of measurable success criteria:

  • decrease the number of custom fields across instruments by X%

We now need to establish tooling or processes to collect this information in an ongoing way, as well as to establish baselines.

AC

  • We understand how to continually audit the use of custom properties in instrumentation across the organisation
  • There is a publicly-accessible version of this audit
  • The process to do the audit is documented alongside the above
  • Sense-check the above with @nettrom_WMF

Notes

  1. @phuedx did such an audit as part of {TBD}, which can be found here: https://docs.google.com/spreadsheets/d/1d7KAvozivIECbFgzqM2qzwrAI44v3URbnbNQBF-P_Ns

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Jan 12 2024, 3:29 PM
phuedx set the point value for this task to 5. · Jan 16 2024, 10:46 AM

Props to @phuedx for talking through how we can capture this data in an automated way.

As a proof of concept, I wrote a quick Python script to read the current.yaml files in the jsonschema directory of the secondary repo:

import os

import yaml

custom_data_count = 0
path = "./jsonschema"

for root, directories, files in os.walk(path):
    # Prune "legacy" directories in place so os.walk() skips them entirely.
    directories[:] = [d for d in directories if d != "legacy"]

    for filename in files:
        if filename == "current.yaml":
            current_file = os.path.realpath(os.path.join(root, filename))
            with open(current_file, 'r') as file:
                current = yaml.safe_load(file)

            if current and 'properties' in current:
                custom_data_count += len(current['properties'])

print("total number of custom data properties for all instruments: " + str(custom_data_count))

The script above, if run in the base directory of the secondary repo, will print the total number of properties in the properties key of each schema directory's current.yaml file (as opposed to the materialized schemas):

[Screenshot of the script's output: Screenshot 2024-02-04 at 4.53.34 PM.png, 69 KB]

We can derive Metrics Platform specific counts by checking applicable schemas for references to Metrics Platform base schemas (i.e. filtering out legacy EventCapsule fragments, etc.), looping through the current object (created from current.yaml), e.g.:

for reference in current.get("allOf", []):
    # check for pattern matching against Metrics Platform and legacy fragments

For example, in the web_ui_actions schema, current.yaml looks like:

title: analytics/mediawiki/product_metrics/web_ui_actions
description: >-
  Logs when certain UI elements get visible and when user interacts with those
  on desktop and mobile.
$id: /analytics/mediawiki/product_metrics/web_ui_actions/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
allOf:
  - $ref: /analytics/product_metrics/web/base/1.1.0#
  - $ref: /fragment/analytics/web_accessibility/1.0.0#
properties:
  is_sidebar_collapsed:
    description: Is the sidebar collapsed?
    type: boolean
  viewport_size_bucket:
    description: Users screen resolution in CSS pixels.
    type: string
    enum:
      - '<320px'
      - 320px-719px
      - 720px-999px
      - 1000px-1199px
      - 1200px-2000px
      - '>2000px'

By programmatically checking for $ref: /analytics/product_metrics/web/base/1.1.0# as an example, we can determine which schemas use Metrics Platform core interaction base schemas and count the number of discrete custom data properties in the properties key. In this case, the total count would be incremented by 2: is_sidebar_collapsed and viewport_size_bucket.
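As a hedged sketch of that check (the /analytics/product_metrics/ prefix below is an assumption extrapolated from the example ref; the real patterns, and any legacy fragment paths to exclude, should be confirmed against the repo):

# Hypothetical prefix; confirm against the actual Metrics Platform base schema refs.
METRICS_PLATFORM_REF_PREFIX = "/analytics/product_metrics/"

def count_metrics_platform_properties(current):
    """Return the schema's custom property count if it references a
    Metrics Platform base schema, otherwise 0."""
    uses_metrics_platform = any(
        reference.get("$ref", "").startswith(METRICS_PLATFORM_REF_PREFIX)
        for reference in current.get("allOf", [])
        if isinstance(reference, dict)
    )
    return len(current.get("properties", {})) if uses_metrics_platform else 0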

In order to capture baselines, we should keep a running count of all custom data properties in total as well as custom data properties using Metrics Platform base schema fragments and compare them over time.

To validate the data, we can compare against the initial audits done in the past to see that we're in the right ballpark. And to be explicit about how we're deriving the data, we can have the script output the name of each schema and its corresponding count of custom properties.
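A sketch of that per-schema report, which also keeps the two running totals from the previous paragraph (it reuses the hypothetical count_metrics_platform_properties() helper from the sketch above):

import os

import yaml

total_count = 0
metrics_platform_count = 0

for root, directories, files in os.walk("./jsonschema"):
    directories[:] = [d for d in directories if d != "legacy"]

    for filename in files:
        if filename == "current.yaml":
            with open(os.path.join(root, filename), "r") as file:
                current = yaml.safe_load(file)

            schema_count = len(current.get("properties", {}))
            total_count += schema_count
            metrics_platform_count += count_metrics_platform_properties(current)

            # Name each schema by its title, falling back to its directory.
            print(current.get("title", root) + ": " + str(schema_count))

print("total custom data properties: " + str(total_count))
print("on Metrics Platform base schemas: " + str(metrics_platform_count))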

Options:

  • Create a tool to track these numbers by date (in some kind of lightweight database) and to run the script on a scheduled basis.
  • Lo-fi approach: we can run the script at some regular interval and capture the data in a publicly accessible spreadsheet by date to see how the cumulative number of custom data properties changes over time (see the sketch after this list).
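A minimal sketch of the lo-fi option: append a dated row to a CSV that can then be imported into the spreadsheet. The file name and column layout here are illustrative, and the counts would come from the script above:

import csv
import datetime

def append_audit_row(total_count, metrics_platform_count,
                     path="custom_property_audit.csv"):
    # Illustrative columns: date, total custom properties, and custom
    # properties on schemas that use Metrics Platform base fragments.
    with open(path, "a", newline="") as file:
        csv.writer(file).writerow([
            datetime.date.today().isoformat(),
            total_count,
            metrics_platform_count,
        ])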

> As a proof of concept, I wrote a quick Python script to read the current.yaml files in the jsonschema directory of the secondary repo:

Nice! You don't need to exclude migrated Legacy EventLogging schemas from the analysis though. The number of custom properties of such a schema is the number of keys in the properties.event object.
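For instance, a sketch of that count, assuming a migrated schema declares its custom fields as sub-properties of the event object:

def count_legacy_properties(current):
    # Migrated legacy EventLogging schemas nest their custom fields under
    # the `event` property; assuming they appear as its sub-properties.
    event = current.get("properties", {}).get("event", {})
    return len(event.get("properties", {}))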

> You don't need to exclude migrated Legacy EventLogging schemas from the analysis

duly noted!

Hi @nettrom_WMF - when you get a chance, if you have any opinions about this approach (see T354955#9512064), feel free to chime in since, once again, you are tagged in the ACs.