
[Data Quality] SDS3.3 - Logging, Monitoring and Alerting Improvements for Data Quality Incidents
Open, Needs Triage, Public

Description

NOTE: The purpose of this Epic is to capture work done by the Data Engineering team to meet KR SDS3.3.
Key Result:

For each of the four core metric areas, at least one dataset is systematically logged and monitored, and staff receive alerts for data quality incidents as defined in data steward-informed SLOs.

Hypothesis:

If we define and document effective measures to ensure data quality, we will be able to validate our instruments/datasets. This is important for those making decisions based on the data: it lets them understand both the limitations of the data and the checks that guarantee a certain level of data quality.

Event Timeline

The way I look at this, there are three components to data quality:

  1. We need to instrument and ensure reliability of the platform.
  2. We need to support instrumentation and reliability of the applications/services/pipelines running on the platform (and not necessarily owned by us).
  3. We need to support instrumentation of the datasets produced by apps running on the platform. There can be issues here even if 1-2 meet their SLOs.

For the systems/services part of the platform, we simply plug into existing observability capabilities (Prometheus and Alertmanager) and access metrics/logs via Grafana and Logstash. The rest should follow from SLOs. We might still have gaps in instrumentation between different parts of the platform; that might require some SPIKE work, but I'd say it's a known unknown.
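For illustration, a minimal sketch of what plugging a pipeline component into that stack could look like, assuming the Python prometheus_client library and hypothetical metric names (the real exporters and metric names on the platform will differ):

```python
import time

from prometheus_client import Counter, Gauge, start_http_server

# Hypothetical metrics; anything exposed here can be scraped by Prometheus,
# graphed in Grafana, and wired into Alertmanager rules.
EVENTS_PROCESSED = Counter(
    "pipeline_events_processed_total", "Events processed by the pipeline"
)
CONSUMER_LAG = Gauge(
    "pipeline_consumer_lag_seconds", "Estimated consumer lag in seconds"
)

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
    while True:              # placeholder work loop
        EVENTS_PROCESSED.inc()
        CONSUMER_LAG.set(0.0)
        time.sleep(10)
```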

I don't have a good feeling yet for the most idiomatic way to automate, collect and display dataset metrics. What does it mean to instrument a dataset?
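One possible reading, as a sketch: instrumenting a dataset means computing a small, fixed set of summary statistics per snapshot/partition and treating those as the raw signal. Assuming a pandas DataFrame and a hypothetical page_id column:

```python
import pandas as pd

def dataset_summary(df: pd.DataFrame) -> dict:
    """Basic per-snapshot stats that could later feed health metrics (SLIs)."""
    return {
        "row_count": int(len(df)),
        # per-column fraction of missing values
        "null_fraction": df.isna().mean().round(4).to_dict(),
        # hypothetical column; pick whatever identifies entities in the dataset
        "distinct_page_ids": int(df["page_id"].nunique()),
    }
```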

Recently we worked on T340831: Provide basic data quality metrics for page_content_change. The data analysis bit was no problem, and if needed we could automate it to run on a given schedule. What was unclear was the best way of doing so (Airflow + Papermill, like we did in the past? A sketch follows the list below). Questions we could not answer within a timeboxed effort were:

  • Once summary stats are generated, where would we store them?
  • How do we go from summary stats to health metrics (equivalent to SLIs)?
  • Where and how would we report metrics? Is AQS + dashiki still a thing? Turnilo? Grafana?
  • How do we trigger alerts?
  • How do we make alerts actionable?
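On the automation question above, a minimal sketch of the Airflow + Papermill approach, assuming Airflow ≥ 2.4 with the papermill provider installed; the notebook paths, schedule, and parameters are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG(
    dag_id="page_content_change_quality",
    schedule="@daily",
    start_date=datetime(2023, 10, 1),
    catchup=False,
) as dag:
    # Re-run the ad hoc analysis notebook on a schedule; papermill injects
    # the parameters cell and writes an executed copy per run.
    run_quality_notebook = PapermillOperator(
        task_id="run_quality_notebook",
        input_nb="/path/to/page_content_change_quality.ipynb",
        output_nb="/path/to/output/page_content_change_quality_{{ ds }}.ipynb",
        parameters={"snapshot_date": "{{ ds }}"},
    )
```

This only answers the scheduling part; where the resulting stats land and how they become SLIs and alerts are still the open questions listed above.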

For Event Platform applications, and page content change in particular, the current approach is to rely on Prometheus for day-to-day ops and to provide Jupyter notebooks for ad hoc analysis (when Alertmanager or bug reports call for deeper investigation). IMHO there should be better ways to provide "quality as a service".
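One hedged option for the alerting gap: have the scheduled job push its dataset-level stats to a Prometheus Pushgateway, so existing Alertmanager routing can fire on them like any other metric. A sketch, assuming the prometheus_client library, a reachable Pushgateway, hypothetical metric/label names, and a stats dict shaped like the summary sketch above:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def publish_dataset_stats(stats: dict, dataset: str, gateway: str = "pushgateway:9091") -> None:
    """Push per-run dataset stats so Alertmanager rules can act on them."""
    registry = CollectorRegistry()

    row_count = Gauge(
        "dataset_row_count", "Rows in the latest snapshot", ["dataset"], registry=registry
    )
    null_fraction = Gauge(
        "dataset_null_fraction", "Per-column null fraction", ["dataset", "column"], registry=registry
    )

    row_count.labels(dataset=dataset).set(stats["row_count"])
    for column, fraction in stats["null_fraction"].items():
        null_fraction.labels(dataset=dataset, column=column).set(fraction)

    # One push job per dataset; Prometheus scrapes the gateway and alert rules
    # (e.g. row count dropping to zero) do the rest.
    push_to_gateway(gateway, job=f"data_quality_{dataset}", registry=registry)
```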

Ahoelzl renamed this task from SDS3.3 - Logging, Monitoring and Alerting Improvements for Data Quality Incidents to [Data Quality] SDS3.3 - Logging, Monitoring and Alerting Improvements for Data Quality Incidents.Oct 20 2023, 5:06 PM