update status page latency for mw-on-k8s
Closed, Resolved (Public)

Assigned To

	CDanis

Authored By

	CDanis
	Tue, Jun 18, 4:46 PM

Description

Until today, the latency metric used was mtail-computed average Apache response time across the appservers (a mean of means).

This metric has been inaccurate and unreliable for a while now during the transition to mw-on-k8s, and with today marking the 100% transition, it's definitely time to switch.
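For illustration, here is a minimal sketch of one way a mean of means can diverge from the true per-request mean: when servers handle very different request volumes, an unweighted average over servers over-represents the quiet ones. The request counts and latencies below are made up, and whether this is the actual cause of the inaccuracy seen here is an assumption.

```python
from statistics import mean

# Hypothetical per-server samples: (request_count, mean_latency_seconds).
# These numbers are invented for illustration only.
servers = [
    (1000, 0.20),  # busy server
    (10, 2.00),    # nearly idle server with slow responses
]


def mean_of_means(samples):
    """Unweighted average of per-server means; ignores request volume."""
    return mean(latency for _, latency in samples)


def weighted_mean(samples):
    """Request-weighted average; equals the mean over all requests."""
    total_requests = sum(count for count, _ in samples)
    total_time = sum(count * latency for count, latency in samples)
    return total_time / total_requests


naive = mean_of_means(servers)    # dominated by the idle server
overall = weighted_mean(servers)  # close to what most requests experienced

print(f"mean of means: {naive:.3f}s, per-request mean: {overall:.3f}s")
```

In this made-up case the mean of means is several times larger than the per-request mean, because the nearly idle server counts as much as the busy one.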

After some experimenting with queries and also taking into consideration the load they place on Thanos, we'll be moving to a similar metric computed by Benthos.

On the Statuspage side, we'll also be purging the history of the current latency metric so we can backfill it with the new metric.
This will cause a temporary (~20 minute?) lack of data in the metric, but that's fine.
(It *is* possible to do this invisibly to users, but that requires several back-and-forth updates to both the Statuspage admin UI and the statograph config file: in the UI, create a new, invisible metric; in statograph, add it to the config file and start populating data; wait a while; etc.)

Details

	Subject	Repo	Branch	Lines +/-
	statograph: Use benthos query to save thanos	operations/puppet	production	+1 -1


Related Objects

Status	Subtype	Assigned	Task
Stalled		None	T255792 Quibble runs core:unit tests twice!
Open		None	T328919 Upgrade to PHPUnit 10
Open		None	T338103 Micro-optimize ApiResult::isMetadataKey with str_starts_with once we support PHP8+
Open		None	T328921 Drop PHP 7.4 support from MediaWiki
Stalled		None	T334726 Use return type `never` in Wikibase
Open		None	T328922 Drop PHP 8.0 support from MediaWiki
Stalled		None	T319055 Upgrade to psr/container 2.x
Stalled	Feature	None	T364249 New upstream release for Pygments (2.18.0)
Stalled		Krinkle	T319432 Migrate WMF production from PHP 7.4 to PHP 8.1
Open		None	T291916 Tracking task for Bullseye migrations in production
Open		None	T368366 Upgrade K8s docker images to running in production on Buster with either Bullseye or Bookworm
Stalled		None	T356293 Migrate MW appservers' base images to bullseye
Open		None	T290536 Serve production traffic via Kubernetes
Open		Clement_Goubert	T362323 Move 100% of external traffic to Kubernetes
Resolved		CDanis	T367894 update status page latency for mw-on-k8s

Event Timeline

CDanis created this task. · Tue, Jun 18, 4:46 PM

Restricted Application added a subscriber: Aklapper. · Tue, Jun 18, 4:46 PM

Change #1047138 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] statograph: Use benthos query to save thanos

https://gerrit.wikimedia.org/r/1047138

gerritbot added a project: Patch-For-Review. · Tue, Jun 18, 4:55 PM

CDanis updated the task description. · Tue, Jun 18, 5:10 PM

Change #1047138 merged by CDanis:

[operations/puppet@production] statograph: Use benthos query to save thanos

https://gerrit.wikimedia.org/r/1047138

Mentioned in SAL (#wikimedia-operations) [2024-06-18T17:21:09Z] <cdanis> resetting Wiki response time metric on wikimedia.statuspage.io following complete switch to k8s - T362323 T367894

Stashbot mentioned this in T362323: Move 100% of external traffic to Kubernetes. · Tue, Jun 18, 5:21 PM

Maintenance_bot removed a project: Patch-For-Review. · Tue, Jun 18, 5:30 PM

💙cdanis@alert1001.wikimedia.org ~ 🕜☕ sudo statograph -c /etc/statograph/config.yml list_metrics
Metric 'Wiki response time' (id lyfcttm2lhw4) with most recent data at Tue, 18 Jun 2024 17:30:00 +0000 (@1718731800.0)

We're caught up. After several minutes of indexing delay, it also looks good on wikimediastatus.net.
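As a rough sketch of how the "caught up" check could be automated, the snippet below parses the epoch timestamp from the `list_metrics` output line shown above and computes how far the newest data point lags behind a given time. The helper names and the `now` value are hypothetical and not part of statograph; the output format is assumed to match the single line quoted in this task.

```python
import re
from datetime import datetime, timezone

# Output line copied from the list_metrics run above.
LINE = (
    "Metric 'Wiki response time' (id lyfcttm2lhw4) with most recent data at "
    "Tue, 18 Jun 2024 17:30:00 +0000 (@1718731800.0)"
)


def latest_datapoint(line):
    """Extract the '(@<epoch>)' suffix and return it as an aware UTC datetime."""
    match = re.search(r"\(@(\d+(?:\.\d+)?)\)", line)
    if match is None:
        raise ValueError("no epoch timestamp found in line")
    return datetime.fromtimestamp(float(match.group(1)), tz=timezone.utc)


def lag_seconds(line, now):
    """Seconds between 'now' and the most recent data point in the line."""
    return (now - latest_datapoint(line)).total_seconds()


latest = latest_datapoint(LINE)
# Hypothetical check time, chosen only for illustration.
now = datetime(2024, 6, 18, 17, 37, tzinfo=timezone.utc)
lag = lag_seconds(LINE, now)

print(f"latest data point: {latest.isoformat()}, lag: {lag:.0f}s")
```

What counts as an acceptable lag depends on the metric's submission interval, which is not stated in this task.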

CDanis added parent tasks: T362323: Move 100% of external traffic to Kubernetes, T290536: Serve production traffic via Kubernetes. · Tue, Jun 18, 5:37 PM
