
[Data Quality] SDS3.3 - Logging, Monitoring and Alerting Improvements for Data Quality Incidents
Open, Needs Triage, Public

Description

NOTE: The purpose of this Epic is to capture work done by the Data Engineering team to meet KR SDS3.3.
Key Result:

For each of the four core metric areas, at least one dataset is systematically logged and monitored, and staff receive alerts for data quality incidents as defined in data steward-informed SLOs.

Hypothesis:

If we define and document effective measures to ensure data quality, we will be able to validate our instruments/datasets. This is important for those making decisions based on the data: it lets them understand both the limitations of the data and the checks that guarantee a certain level of data quality.

Event Timeline

The way I look at this, there are three components to data quality:

  1. We need to instrument and ensure reliability of the platform.
  2. We need to support instrumentation and reliability of the applications/services/pipelines running on the platform (and not necessarily owned by us).
  3. We need to support instrumentation of the datasets produced by apps running on the platform. There can be issues here even if 1-2 meet their SLOs.

For the systems/services part of the platform, we simply plug into existing observability capabilities (Prometheus and Alertmanager) and access metrics/logs via Grafana and Logstash. The rest should follow from SLOs. We might still have gaps in instrumentation between different parts of the platform; that might require some SPIKE work, but I'd say it's a known unknown.
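For illustration, a minimal sketch of what plugging a pipeline component into that stack could look like, assuming the Python prometheus_client library and hypothetical metric names (the real exporters and metric names on the platform will differ):

```python
import time

from prometheus_client import Counter, Gauge, start_http_server

# Hypothetical metrics; anything exposed here can be scraped by Prometheus,
# graphed in Grafana, and wired into Alertmanager rules.
EVENTS_PROCESSED = Counter(
    "pipeline_events_processed_total", "Events processed by the pipeline"
)
CONSUMER_LAG = Gauge(
    "pipeline_consumer_lag_seconds", "Estimated consumer lag in seconds"
)

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
    while True:              # placeholder work loop
        EVENTS_PROCESSED.inc()
        CONSUMER_LAG.set(0.0)
        time.sleep(10)
```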

I don't have a good feeling yet for the most idiomatic way to automate, collect and display dataset metrics. What does it mean to instrument a dataset?
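One possible reading, as a sketch: instrumenting a dataset means computing a small, fixed set of summary statistics per snapshot/partition and treating those as the raw signal. Assuming a pandas DataFrame and a hypothetical page_id column:

```python
import pandas as pd

def dataset_summary(df: pd.DataFrame) -> dict:
    """Basic per-snapshot stats that could later feed health metrics (SLIs)."""
    return {
        "row_count": int(len(df)),
        # per-column fraction of missing values
        "null_fraction": df.isna().mean().round(4).to_dict(),
        # hypothetical column; pick whatever identifies entities in the dataset
        "distinct_page_ids": int(df["page_id"].nunique()),
    }
```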

Recently we worked on T340831: Provide basic data quality metrics for page_content_change. The data analysis bit was no problem, and if needed we could automate it to run on a given schedule. What was unclear was the best way of doing so (Airflow + Papermill, like we did in the past? A sketch follows the list below). Questions we could not answer within a timeboxed effort were:

  • Once summary stats are generated, where would we store them?
  • How do we go from summary stats to health metrics (equivalent to SLIs)?
  • Where and how would we report metrics? Is AQS + dashiki still a thing? Turnilo? Grafana?
  • How do we trigger alerts?
  • How do we make alerts actionable?
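On the automation question above, a minimal sketch of the Airflow + Papermill approach, assuming Airflow ≥ 2.4 with the papermill provider installed; the notebook paths, schedule, and parameters are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG(
    dag_id="page_content_change_quality",
    schedule="@daily",
    start_date=datetime(2023, 10, 1),
    catchup=False,
) as dag:
    # Re-run the ad hoc analysis notebook on a schedule; papermill injects
    # the parameters cell and writes an executed copy per run.
    run_quality_notebook = PapermillOperator(
        task_id="run_quality_notebook",
        input_nb="/path/to/page_content_change_quality.ipynb",
        output_nb="/path/to/output/page_content_change_quality_{{ ds }}.ipynb",
        parameters={"snapshot_date": "{{ ds }}"},
    )
```

This only answers the scheduling part; where the resulting stats land and how they become SLIs and alerts are still the open questions listed above.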

For Event Platform applications, and page content change in particular, the current approach is to rely on Prometheus for day-to-day ops and to provide Jupyter notebooks for ad hoc analysis (when Alertmanager or bug reports call for deeper investigation). IMHO there should be better ways to provide "quality as a service".
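One hedged option for the alerting gap: have the scheduled job push its dataset-level stats to a Prometheus Pushgateway, so existing Alertmanager routing can fire on them like any other metric. A sketch, assuming the prometheus_client library, a reachable Pushgateway, hypothetical metric/label names, and a stats dict shaped like the summary sketch above:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def publish_dataset_stats(stats: dict, dataset: str, gateway: str = "pushgateway:9091") -> None:
    """Push per-run dataset stats so Alertmanager rules can act on them."""
    registry = CollectorRegistry()

    row_count = Gauge(
        "dataset_row_count", "Rows in the latest snapshot", ["dataset"], registry=registry
    )
    null_fraction = Gauge(
        "dataset_null_fraction", "Per-column null fraction", ["dataset", "column"], registry=registry
    )

    row_count.labels(dataset=dataset).set(stats["row_count"])
    for column, fraction in stats["null_fraction"].items():
        null_fraction.labels(dataset=dataset, column=column).set(fraction)

    # One push job per dataset; Prometheus scrapes the gateway and alert rules
    # (e.g. row count dropping to zero) do the rest.
    push_to_gateway(gateway, job=f"data_quality_{dataset}", registry=registry)
```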

Ahoelzl renamed this task from SDS3.3 - Logging, Monitoring and Alerting Improvements for Data Quality Incidents to [Data Quality] SDS3.3 - Logging, Monitoring and Alerting Improvements for Data Quality Incidents.Oct 20 2023, 5:06 PM