
update status page latency for mw-on-k8s
Closed, Resolved (Public)

Description

Until today, the latency metric used was mtail-computed average Apache response time across the appservers (a mean of means).
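As a general illustration of why a mean of means can mislead (not a reconstruction of the actual mtail or Benthos computation), the sketch below compares an unweighted average of per-server means with a request-weighted average. The server counts and latencies are made-up numbers, and the function names are hypothetical.

```python
from statistics import mean

# Hypothetical per-server samples: (request_count, mean_latency_seconds).
# These values are invented purely for illustration.
servers = [
    (1000, 0.20),  # busy server, fast responses
    (10, 2.00),    # nearly idle server, slow responses
]


def mean_of_means(samples):
    """Average the per-server means, ignoring how many requests each served."""
    return mean(latency for _, latency in samples)


def request_weighted_mean(samples):
    """Average latency over all requests, weighting each server by its traffic."""
    total_requests = sum(count for count, _ in samples)
    total_time = sum(count * latency for count, latency in samples)
    return total_time / total_requests


if __name__ == "__main__":
    print(f"mean of means:         {mean_of_means(servers):.3f} s")
    print(f"request-weighted mean: {request_weighted_mean(servers):.3f} s")
```

With these numbers the nearly idle server pulls the mean of means far above what a typical request experiences, which is the kind of distortion that grows as traffic drains away from a pool of servers.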

This has been pretty inaccurate/unreliable for a while now during the transition to mw-on-k8s, but with today marking the 100% transition, it's definitely time to switch.

After some experimenting with queries and also taking into consideration the load they place on Thanos, we'll be moving to a similar metric computed by Benthos.

On the Statuspage side, we'll also be purging the history of the current latency metric so we can backfill it with the new metric.
This will cause a temporary (~20 minute?) lack of data in the metric, but that's fine.
(It *is* possible to do this user-invisibly, but that requires several rounds of back-and-forth updates to both the Statuspage admin UI and the statograph config file: in the UI, create a new, invisible metric; in statograph, add it to the config file and start populating data; wait a while; etc.)

Event Timeline

Change #1047138 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] statograph: Use benthos query to save thanos

https://gerrit.wikimedia.org/r/1047138

Change #1047138 merged by CDanis:

[operations/puppet@production] statograph: Use benthos query to save thanos

https://gerrit.wikimedia.org/r/1047138

Mentioned in SAL (#wikimedia-operations) [2024-06-18T17:21:09Z] <cdanis> resetting Wiki response time metric on wikimedia.statuspage.io following complete switch to k8s - T362323 T367894

CDanis claimed this task.
💙cdanis@alert1001.wikimedia.org ~ 🕜☕ sudo statograph -c /etc/statograph/config.yml  list_metrics                            
Metric 'Wiki response time' (id lyfcttm2lhw4) with most recent data at Tue, 18 Jun 2024 17:30:00 +0000 (@1718731800.0)

We're caught up. After several minutes of indexing delay, it also looks good on wikimediastatus.net.
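For anyone reading the `list_metrics` output above, a small sketch of how the "caught up" check could be scripted: it confirms that the printed date and the `@` epoch value agree, and computes how far behind a given "now" the most recent data point is. The two values are copied from the output above; the `lag_seconds` helper and the example "now" are assumptions for illustration, not part of statograph.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Values copied from the list_metrics output line.
line_date = "Tue, 18 Jun 2024 17:30:00 +0000"
line_epoch = 1718731800.0

# Both representations should describe the same instant.
parsed = parsedate_to_datetime(line_date)
from_epoch = datetime.fromtimestamp(line_epoch, tz=timezone.utc)


def lag_seconds(latest_epoch, now_epoch):
    """Seconds between the most recent data point and 'now' (hypothetical helper)."""
    return now_epoch - latest_epoch


if __name__ == "__main__":
    print("timestamps agree:", parsed == from_epoch)
    # Example 'now': four minutes after the latest data point (illustrative only).
    example_now = line_epoch + 240
    print("lag in seconds:", lag_seconds(line_epoch, example_now))
```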