
Wikidata MaxLag above 10 for 1hr
Closed, Resolved (Public)

Description

I saw:

[1] Firing
Labels
alertname = Max Lag above 10 for 1 hour
metric = wikibase-queryservice
runbook = https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Alerts
severity = critical
team = wikidata
Annotations
description = Maxlag is above ten seconds for more than an hour
summary = Max Lag above 10 for 1 hour

around 18:40 UTC

It looks like an ongoing incident affecting the codfw queryservice servers is causing high lag on them. Because T238751 is still unresolved, that lag results in high maxlag.
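To illustrate the mechanism described above, here is a minimal sketch, assuming the lag that feeds into maxlag is the worst value among all query service servers in the clusters being checked. The function name, the data layout, and the lag figures are hypothetical and are not taken from the actual check.

```python
from typing import Dict, Iterable, List


def worst_lag(lag_by_cluster: Dict[str, List[float]], clusters: Iterable[str]) -> float:
    """Return the highest lag (in seconds) among servers in the given clusters.

    Assumption: the check aggregates by taking the maximum, so a single
    lagging cluster is enough to push the result up.
    """
    lags = [lag for cluster in clusters for lag in lag_by_cluster.get(cluster, [])]
    return max(lags, default=0.0)


# Hypothetical lag readings in seconds; codfw is the lagging cluster.
sample = {"eqiad": [2.0, 3.5], "codfw": [4200.0, 3900.0]}

print(worst_lag(sample, ["eqiad", "codfw"]))  # codfw dominates the result
print(worst_lag(sample, ["eqiad"]))           # codfw left out of the check
```

Under that assumption, leaving the lagging cluster out of the checked set is enough to bring the value back down, which is the idea behind temporarily removing codfw.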

Event Timeline

Change 764875 had a related patch set uploaded (by Addshore; author: Addshore):

[operations/puppet@production] Temp remove codfw

https://gerrit.wikimedia.org/r/764875

Addshore triaged this task as Unbreak Now! priority. Feb 22 2022, 7:20 PM
Addshore subscribed.

Marking as UBN, as this affects editing and the user experience on Wikidata.

Change 764875 merged by Ryan Kemper:

[operations/puppet@production] Temp remove codfw from wikidata updateQueryServiceLag check

https://gerrit.wikimedia.org/r/764875

Mentioned in SAL (#wikimedia-operations) [2022-02-22T19:25:32Z] <ryankemper> T302330 ryankemper@cumin1001:~$ sudo -E cumin '*mwmaint*' 'run-puppet-agent' (getting https://gerrit.wikimedia.org/r/c/operations/puppet/+/764875 out)

We might want to tweak the script so that normal (non-puppet) deployers can fix these things (e.g. by having it read its settings from mw-config or something similar).
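One way the suggestion above could look, as a rough sketch only: the set of clusters to check is read from a deployable config value rather than being fixed in puppet. JSON stands in for the real config format here, and the key name `queryServiceLagClusters` and the function name are hypothetical.

```python
import json
from typing import List


def clusters_from_config(config_text: str, key: str = "queryServiceLagClusters") -> List[str]:
    """Parse a JSON config document and return the list of clusters to check.

    Assumption: the config holds a list of cluster names under `key`.
    A missing key yields an empty list; a malformed value raises ValueError.
    """
    config = json.loads(config_text)
    clusters = config.get(key, [])
    if not isinstance(clusters, list) or not all(isinstance(c, str) for c in clusters):
        raise ValueError(f"{key} must be a list of cluster names")
    return clusters


# Hypothetical config after a deployer has temporarily dropped codfw.
print(clusters_from_config('{"queryServiceLagClusters": ["eqiad"]}'))
```

With something like this, taking a cluster out of the check would be a config deploy rather than a puppet change followed by a puppet run on the maintenance hosts.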

Addshore lowered the priority of this task from Unbreak Now! to High. Feb 23 2022, 12:12 PM

No longer UBN

Change 764830 had a related patch set uploaded (by Addshore; author: Addshore):

[operations/puppet@production] Revert "Temp remove codfw from wikidata updateQueryServiceLag check"

https://gerrit.wikimedia.org/r/764830

Gehel claimed this task.

Change 764830 merged by Bking:

[operations/puppet@production] Revert "Temp remove codfw from wikidata updateQueryServiceLag check"

https://gerrit.wikimedia.org/r/764830