
Create metrics for measuring WDQS/WCQS update lag
Closed, Resolved · Public · 8 Estimated Story Points

Description

As a product manager, I want to know the baseline and current lag time for WDQS/WCQS, so that I can report on how well the products are performing with respect to our annual OKR of getting update lag under 10 minutes.

This Grafana view allows us to see the performance of the different WDQS servers with respect to lag: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=8&orgId=1&refresh=1m&from=now-90d&to=now. However, it does not tell us what the average/effective lag is, as it reports each server independently. It also doesn't account for depooled servers that are allowed to catch up, or for data reloads, neither of which may affect users directly.

Until we have an effective way of accurately capturing this, the KR will be reported indirectly by counting the number of 10min+ lag events in a given time period. This will be an overestimate of actual update lag, for the reasons above relating to depooled servers and data reloads.
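The event-counting approach described above could be sketched as follows. This is a minimal illustration, not the actual reporting pipeline (which presumably queries the Grafana/Prometheus backend directly); the function name and the shape of the input series are assumptions.

```python
from datetime import timedelta

# Threshold matching the annual OKR of keeping update lag under 10 minutes.
LAG_THRESHOLD = timedelta(minutes=10)

def count_lag_events(lag_samples_seconds, threshold=LAG_THRESHOLD):
    """Count distinct 10min+ lag events in a time-ordered series of lag samples.

    A run of consecutive samples at or above the threshold counts as a single
    event, so back-to-back high readings are not double-counted.
    """
    events = 0
    above = False
    for lag in lag_samples_seconds:
        if lag >= threshold.total_seconds():
            if not above:
                events += 1
            above = True
        else:
            above = False
    return events

# Hypothetical per-minute lag samples (seconds): two separate excursions above 600s.
print(count_lag_events([120, 700, 800, 300, 650]))  # 2
```

Note that this counts events rather than measuring lag duration, which is why it overestimates the KR: a depooled server catching up would still register as an event even though no user saw the lag.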

AC:

  • There is a dashboard for WDQS/WCQS update lag time
  • Establish an update lag baseline going into the 2021-2022 fiscal year (so we can track the impact of our work)
  • Preferably, a report can be produced based on historical data

Event Timeline

Here's a proposal for an SLO dashboard that can use historical data to provide insight into past performance: https://grafana-rw.wikimedia.org/d/yCBd7Tdnk/wdqs-lag-slo

Let's decide on our SLO goal before we close this ticket out.

Also, can we have one set up for WCQS? Does it make more sense to set that up as part of the WCQS epic? Or is it something we can extend this dashboard to cover now, and only start tracking after we deploy to production?


WCQS has been added to the same dashboard as a parameter.