Define SLOs for the intermediate table of Dumps 2.0
Closed, Resolved, Public

Description

After T358366: Consult with Product and Research team on schema and data retention expectations for wmf_dumps.wikitext_raw and T358373: [Dumps 2] Reconciliation mechanism to detect and fetch missing/mismatched revisions, we should define SLOs for the intermediate table wmf_dumps.wikitext_raw, taking into consideration the upstream SLOs, if any, and how far we can go with T358373.

AC:

Stretch goal:

  • update mw-page-content-change-enrich SLO

Link to draft Google Doc.

Event Timeline

I want to start drafting an SLO document this week and would like to validate the direction first.

The mediawiki_content_history iceberg table will serve as the public API boundary for the Dumps 2.0 architecture.

At a high level, we define system health as data landing in the Iceberg table meeting the following criteria (targets to be agreed upon):

  • Availability: minimal missing events (I am avoiding the term "completeness" on purpose)
  • Freshness: minimal lag from MediaWiki
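To make the two criteria concrete, here is a minimal sketch of how they could be expressed as ratio SLIs. Everything in it is an assumption for illustration: the `Revision` type, its field names, and the `max_lag` threshold are hypothetical and not part of the Dumps 2.0 codebase, and the real SLI definitions are still to be agreed upon.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Revision:
    """One expected revision. Both field names are hypothetical."""
    created_at: datetime            # when MediaWiki produced the revision
    landed_at: Optional[datetime]   # when it appeared in the table; None if missing


def availability_sli(revisions: List[Revision]) -> float:
    """Fraction of expected revisions that are present in the table."""
    if not revisions:
        return 1.0  # nothing expected, so nothing is missing
    present = sum(1 for r in revisions if r.landed_at is not None)
    return present / len(revisions)


def freshness_sli(revisions: List[Revision], max_lag: timedelta) -> float:
    """Fraction of landed revisions whose lag from MediaWiki is within max_lag."""
    landed = [r for r in revisions if r.landed_at is not None]
    if not landed:
        return 1.0  # no landed revisions to measure lag on
    fresh = sum(1 for r in landed if r.landed_at - r.created_at <= max_lag)
    return fresh / len(landed)


# Made-up sample: three revisions landed (one of them late), one is missing.
t0 = datetime(2024, 1, 1)
sample = [
    Revision(t0, t0 + timedelta(hours=1)),
    Revision(t0, t0 + timedelta(hours=5)),
    Revision(t0, None),
    Revision(t0, t0 + timedelta(minutes=30)),
]
availability = availability_sli(sample)
freshness = freshness_sli(sample, max_lag=timedelta(hours=2))
```

Missing revisions count only against availability here, which keeps the two SLIs independent; whether that is the right split is a design choice for the SLO document.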

We may add additional SLIs based on data quality (DQ), but for now, I’d prefer to keep the two frameworks separate.

Dumps 2.0 boundaries

In terms of systems, the Dumps 2.0 team boundary includes:

  • Spark jobs and airflow DAGs for batch processing and reconciliation
  • mediawiki-event-enrichment application (streaming flink app on dse k8s).

Reliability for these components should be covered by the Dumps 2.0 SLO, either as subcomponents or with dedicated SLOs.

Initially, I propose a single SLO that covers:

  • Spark and Airflow jobs (with distinct failure domains for DAGs/tasks where failures may degrade SLIs)
  • The Flink app (with its own failure domain and impact on SLIs)
  • The mediawiki_content_history table

Client-facing SLIs would apply only to the final table.

Dependencies

Dumps 2.0 systems have dependencies on third-party systems (outside team boundaries), not all of which have an SLO.

The fact that dependent systems don't have an SLO is not a blocker, but we should model their reliability expectation explicitly.

Could you expand on how we would do that?

As a starting point, Wikimedia's SLO template recommends:
"For dependencies without an SLO yet, or dependencies that habitually miss their SLO, assume that they maintain their historical performance, or worsen slightly but not dramatically"

And then we'll adjust after collecting data in the observation period. This approach is relatively common AFAIK in other parts of our stack.
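As a rough illustration of modeling those expectations explicitly, the sketch below takes a historical availability per dependency, applies a small "worsens slightly" margin, and multiplies the results to get a ceiling for our own target. It assumes every dependency is required and that failures are independent, which is a simplification. The dependency names, the numbers, and the margin are made up for the example.

```python
import math
from typing import Iterable


def assumed_availability(historical: float, margin: float = 0.001) -> float:
    """Assume a dependency keeps its historical availability, minus a small margin."""
    return max(0.0, historical - margin)


def serial_ceiling(availabilities: Iterable[float]) -> float:
    """Upper bound on our availability when all dependencies are required
    and their failures are independent (product of the availabilities)."""
    return math.prod(availabilities)


# Hypothetical dependencies with made-up historical availabilities.
historical = {"dependency_a": 0.999, "dependency_b": 0.995}
assumed = {name: assumed_availability(value) for name, value in historical.items()}
ceiling = serial_ceiling(assumed.values())
```

Under these assumptions, any availability target above `ceiling` could not be met even if our own components never failed, so it gives a starting bound that can be revised after the observation period.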

The ops week effort will pay off big time: we have a good 12-month history of observing system performance and can provide reasonable estimates for the dependencies listed above (cc @Ahoelzl).

gmodena updated the task description.
gmodena moved this task from Sprint Backlog to In Process on the Dumps 2.0 (Kanban Board) board.

Shared a link to the draft Google Doc in the description above.