
Provide basic data quality metrics for page_content_change
Closed, ResolvedPublic

Description

We should be able to provide summary statistics for the page_content_change topic, so that we can establish a baseline for data correctness.

Details

Title: Add quality and drift data analysis
Reference: repos/data-engineering/mediawiki-event-enrichment!67
Author: gmodena
Source Branch: T340831-add-data-analysis
Dest Branch: main

Event Timeline

Key takeaways:

  • page_change_v1 shows stable throughput over time. We should be able to characterize "normal" traffic.
  • rc1_mediawiki_page_content_change contains spurious data (pipeline re-runs, duplicate events) that skews statistics.
  • processed time vs event time drift appears consistent across both datasets. This issue is not specific to these datasets and should be investigated upstream; see the sketch below.
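
As a first pass at quantifying the drift takeaway, something like the following Spark sketch could count, per day, events whose processing lagged event time by more than one day. The table name and the event_dt/processed_dt column names are placeholders, not the actual page_content_change schema:

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("drift-check").getOrCreate()

# Placeholder table and column names; the real schema/field names may differ.
events = spark.read.table("event.rc1_mediawiki_page_content_change")

drift = (
    events
    # Drift in seconds between processing time and event time.
    .withColumn(
        "drift_seconds",
        F.unix_timestamp("processed_dt") - F.unix_timestamp("event_dt"),
    )
    .groupBy(F.to_date("event_dt").alias("day"))
    .agg(
        F.count("*").alias("events"),
        # Events whose processing lagged event time by more than one day.
        F.sum((F.col("drift_seconds") > 86400).cast("long")).alias("events_drift_gt_1d"),
    )
    .orderBy("day")
)

drift.show()
```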

Metrics we should consider for alerting on quality regressions (a computation sketch follows the list):

  • absolute number of processed events (consumed, produced, produced vs consumed).
  • rate of change (day to day) of processed events (consumed, produced, produced vs consumed).
  • day-to-day variation in rate of change of processed events (consumed, produced, produced vs consumed).
  • processed time vs event time drift (number of events with drift > 1 day per period).
  • error type distribution over time (TBD, not enough data).
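
As a hedged sketch of how the first three count-based metrics could be derived from a per-day summary, assuming a hypothetical table with day, consumed, and produced columns (the real aggregation would come from the pipeline's own counters):

```
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical daily summary of consumed/produced event counts.
daily = spark.read.table("analytics.page_content_change_daily_counts")

w = Window.orderBy("day")

quality = (
    daily
    .withColumn("produced_vs_consumed", F.col("produced") / F.col("consumed"))
    # Day-to-day rate of change of consumed events.
    .withColumn("consumed_prev", F.lag("consumed").over(w))
    .withColumn(
        "consumed_rate_of_change",
        (F.col("consumed") - F.col("consumed_prev")) / F.col("consumed_prev"),
    )
    # Day-to-day variation in that rate of change.
    .withColumn(
        "rate_of_change_delta",
        F.col("consumed_rate_of_change")
        - F.lag("consumed_rate_of_change").over(w),
    )
)

# Flag days whose rate of change exceeds an arbitrary example threshold.
quality.filter(F.abs(F.col("consumed_rate_of_change")) > 0.25).show()
```

The 0.25 threshold above is only an example; any real alerting threshold would need to be tuned against the "normal" traffic baseline described in the key takeaways.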