We should be able to provide summary statistics of the page_content_change topic, so that we ensure a baseline for data correctness.
Description
Details
Title | Reference | Author | Source Branch | Dest Branch |
---|---|---|---|---|
Add quality and drift data analysis | repos/data-engineering/mediawiki-event-enrichment!67 | gmodena | T340831-add-data-analysis | main |
Related Objects
Event Timeline
gmodena opened https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_requests/67
Draft: Add quality and drift data analysis
Key takeaways:
- page_change_v1 shows stable throughput over time. We should be able to characterize "normal" traffic.
- rc1_mediawiki_page_content_change contains spurious data (pipeline re-runs, duplicate events) that skews statistics.
- processed time vs event time drift seems consistent across both datasets. This issue is not specific to these datasets, and should be investigated upstream.
Metrics we should consider for alerting on quality regressions:
- absolute number of processed events (consumed, produced, produced vs consumed).
- rate of change (day to day) of processed events (consumed, produced, produced vs consumed).
- day-to-day variation in rate of change of processed events (consumed, produced, produced vs consumed).
- processed time vs event time drift (number of events with drift > 1 day per period).
- error type distribution over time (TBD, not enough data).
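A minimal sketch of how the first four metrics could be computed, assuming daily event counts are already aggregated and each event carries an event timestamp and a processed timestamp. The function names, the input shapes and the 1-day default threshold handling are illustrative assumptions, not taken from the merge request.

```python
from datetime import datetime, timedelta


def produced_vs_consumed(produced, consumed):
    """Ratio of produced to consumed events per day (None when nothing was consumed)."""
    return [p / c if c else None for p, c in zip(produced, consumed)]


def daily_rate_of_change(daily_counts):
    """Relative day-to-day change of a series of daily counts, ordered by day."""
    return [
        (curr - prev) / prev if prev else None
        for prev, curr in zip(daily_counts, daily_counts[1:])
    ]


def variation_in_rate(rates):
    """Difference between consecutive rates of change; pairs with a missing rate are skipped."""
    return [
        b - a
        for a, b in zip(rates, rates[1:])
        if a is not None and b is not None
    ]


def count_drifted(events, max_drift=timedelta(days=1)):
    """Number of (event_time, processed_time) pairs whose drift exceeds max_drift."""
    return sum(
        1
        for event_time, processed_time in events
        if processed_time - event_time > max_drift
    )


# Hypothetical sample data, for illustration only.
consumed = [100, 110, 99]
produced = [100, 108, 99]
events = [
    (datetime(2023, 7, 1, 12, 0), datetime(2023, 7, 1, 12, 1)),  # 1 minute of drift
    (datetime(2023, 7, 1, 12, 0), datetime(2023, 7, 3, 12, 0)),  # 2 days of drift
]

ratios = produced_vs_consumed(produced, consumed)
rates = daily_rate_of_change(consumed)
variation = variation_in_rate(rates)
drifted = count_drifted(events)
```

An alert could then compare each value against a band derived from the stable page_change_v1 traffic; choosing those thresholds is left open here.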
gmodena merged https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_requests/67
Add quality and drift data analysis