
Investigate what caused our WDQS update lag to drop and remain below SLO on 28/07
Closed, ResolvedPublic

Description

This ticket is to investigate what caused WDQS update lag to drop at the end of July. We can file another ticket for any resolution actions that need to be taken.

WDQS update lag SLO compliance dropped below our established 95% target around 28 July: https://grafana.wikimedia.org/d/yCBd7Tdnk/wdqs-wcqs-lag-slo?orgId=1&from=now-90d&to=now&var-cluster_name=wdqs&var-lag_threshold=600&var-slo_period=30d

AC:

  • identify reason for drop in WDQS update lag SLO

Event Timeline

Per a conversation with David:

This is due to two events that I think did not have any major user impact:

  • July 23 at 08:00: wdqs1004 suffered from a deadlock and was automatically depooled. The machine remained depooled until July 26 at 21:00 and took about 1.5 days to catch up (I can't remember whether it remained depooled while catching up, so it is possible that some user queries returned out-of-date results during that period).
  • August 9 at 15:00: I performed some maintenance operations on the streaming updater running in k8s@codfw due to T314835. Before doing so, Brian depooled codfw, so users were not impacted.

The first problem is due to one of the two failure modes we know can affect Blazegraph: memory pressure, which was mitigated using jvmquake, and a deadlock, for which we do not yet have a remediation in place.
The second problem is mainly due to the fact that our SLO calculation does not know which servers are pooled (related to https://phabricator.wikimedia.org/T238751), so even planned maintenance operations can burn our error budget.
Barring any new problems, the update lag SLO should be back to normal in a couple of weeks (due to the 30d time window; see the sketch below). I think it is already back to normal if you select 7d for "Period to calculate".
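To make the 30d-window effect concrete, here is a minimal sketch (not the actual Grafana/Prometheus query) of how a rolling-window SLO recovers. It assumes lag is sampled once a minute, compliance is simply the fraction of samples under the 600 s threshold (the var-lag_threshold in the dashboard URL), and the whole cluster's lag is treated as one series; the real calculation may differ.

```python
# Hypothetical sketch of a 30-day rolling SLO window, not the dashboard's query.
SAMPLES_PER_DAY = 24 * 60          # assume one lag sample per minute
WINDOW_DAYS = 30
WINDOW = WINDOW_DAYS * SAMPLES_PER_DAY
SLO_TARGET = 0.95
THRESHOLD = 600                    # seconds, from var-lag_threshold

def compliance(lag_samples, threshold=THRESHOLD):
    """Fraction of samples in the trailing 30d window with lag under the threshold."""
    window = lag_samples[-WINDOW:]
    return sum(1 for lag in window if lag < threshold) / len(window)

# A 95% target over 30 days tolerates 5% "bad" time: 0.05 * 30 * 24 = 36 hours.
budget_hours = (1 - SLO_TARGET) * WINDOW_DAYS * 24
print(f"error budget over {WINDOW_DAYS}d: {budget_hours:.0f} hours")

# Roughly 5 days of out-of-threshold lag (depool plus catch-up) far exceeds
# that budget, and compliance only recovers as those samples age out of the window.
history = [30.0] * WINDOW                          # 30 quiet days
history += [3600.0] * (5 * SAMPLES_PER_DAY)        # ~5 days above the threshold
history += [30.0] * (10 * SAMPLES_PER_DAY)         # 10 quiet days afterwards
print(f"30d compliance 10 days after the incident: {compliance(history):.3f}")
```

Running this prints a compliance of about 0.83 ten days after the incident, still under the 95% target, which matches the expectation that the 30d figure only returns to normal once the bad samples have fully rolled out of the window.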

No further action is needed at this time.

MPhamWMF claimed this task.