
Re-create poolcounter instance in Beta Cluster (deployment-prep)
Closed, Resolved · Public

Description

Background

poolcounter05 and poolcounter06.deployment-prep.eqiad1.wikimedia.cloud were shut down in T370458: Remove or replace poolcounter06.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation), despite being actively used by MediaWiki and Thumbor.

https://codesearch.wmcloud.org/search/?q=deployment-poolcounter

deployment-prep/deployment-imagescaler.yaml
thumbor::poolcounter_server: deployment-poolcounter05.deployment-prep.eqiad.wmflabs
wmf-config/LabsServices.php
		'poolcounter' => [
			'deployment-poolcounter06.deployment-prep.eqiad.wmflabs',
		],

Found this via T332015, and a simple question on my side is: does anyone know why this VM needs to exist? Aside from "this is how it is in production", […], that isn't a very good answer.

It's not so much about the VM existing, as about it making services behave substantially differently. E.g. MediaWiki with and without serving stale content and coalescing parser invocations, MW CirrusSearch with and without throttling, and Thumbor (or MW image scaling) with or without throttling.

I imagine this is likely causing logspam at the moment, making it harder to diagnose other issues.
And in terms of testing, this will of course decrease the value of testing. E.g. thumbnails are generated concurrently without limits, which has often been a source of bugs in production and would benefit from behaving the same in Beta. Likewise, fast-serving stale ParserOutput is the kind of thing that can catch people off-guard. The earlier people notice this the better.
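For context, PoolCounter's semantics amount to a per-key concurrency limiter: at most `workers` concurrent holders per key, a bounded wait queue (`maxqueue`), and a wait `timeout`. Below is a minimal in-process Python sketch of those semantics — illustrative only, not poolcounterd's or MediaWiki's actual implementation:

```python
import threading
from collections import defaultdict

class PoolCounterSketch:
    """Illustrative analogue of PoolCounter's per-key limits:
    at most `workers` concurrent holders per key, at most `maxqueue`
    waiters; waiters give up after `timeout` seconds."""

    def __init__(self):
        self._cond = threading.Condition()
        self._holders = defaultdict(int)
        self._waiters = defaultdict(int)

    def acquire(self, key, workers=1, maxqueue=10, timeout=1.0):
        with self._cond:
            if self._holders[key] < workers:
                self._holders[key] += 1
                return "LOCKED"
            if self._waiters[key] >= maxqueue:
                return "QUEUE_FULL"
            self._waiters[key] += 1
            got = self._cond.wait_for(
                lambda: self._holders[key] < workers, timeout=timeout)
            self._waiters[key] -= 1
            if got:
                self._holders[key] += 1
                return "LOCKED"
            return "TIMEOUT"

    def release(self, key):
        with self._cond:
            if self._holders[key] > 0:
                self._holders[key] -= 1
                self._cond.notify_all()
```

Without such a limiter, a cache-miss storm means every request re-parses the same page concurrently; with it, one worker does the parse while the rest wait briefly or are served stale content.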

Details

Event Timeline

I am not the one that shut down that VM, I just offered a PoV in T370458: Remove or replace poolcounter06.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation). For that Beta VM to be shut down, it probably means that no one showed up willing to claim it (and work on migrating it to Debian bullseye). WMCS went ahead with their policy of deleting unclaimed VMs; see https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2024_Purge, where this is documented.

I wish I had a good solution, but I don't. Assuming that it is still true that no one is willing to claim the work of setting up a poolcounter VM in Beta, the only suggestion I can offer is to remove the VM from the configuration altogether. That should stop the logspam at least. As far as the different behavior goes, I can only acknowledge what you point out. I have no solution, other than suggesting to bring that up in T215217: deployment-prep (beta cluster): Code stewardship request. Maybe it can be put in the roadmap of the team.

While monitoring after the PHP 8.3 upgrade in Beta Cluster, I noticed tons of mwmaint and jobrunner processes failing or degrading due to the absence of PoolCounter, which reduces the value of testing there. (Not attaching to T401855, because this is a pre-existing issue.)

https://beta-logs.wmcloud.org/app/dashboards#/view/default
https://wikitech.wikimedia.org/wiki/OpenSearch_Dashboards#Beta_Cluster_Logstash

host: deployment-mwmaint03

Pool key 'CirrusSearch-Search:_elasticsearch' (CirrusSearch-Search): ⧼poolcounter-connection-error⧽
Pool key 'CirrusSearch-Search:_elasticsearch' (CirrusSearch-Search): ⧼poolcounter-connection-error⧽
Pool key 'CirrusSearch-Search:_elasticsearch' (CirrusSearch-Search): ⧼poolcounter-connection-error⧽

Screenshot 2025-09-09 at 23.23.17.png (attachment, 251 KB)

host: deployment-mediawiki14

Pool key 'enwiki:pcache:1789:|%23|:idhash:canonical:revid:23905' (ArticleView): ⧼poolcounter-connection-error⧽
Pool key 'enwiki:pcache:89445:|%23|:idhash:dateformat=default:revid:397049' (ArticleView): ⧼poolcounter-connection-error⧽
Pool key 'enwiki:pcache:167887:|%23|:idhash:canonical:revid:653712' (ArticleView): ⧼poolcounter-connection-error⧽

Screenshot 2025-09-09 at 23.25.35.png (attachment, 501 KB)

Mentioned in SAL (#wikimedia-releng) [2025-09-23T22:41:35Z] <Krinkle> Create deployment-poolcounter07 host (debian-12.0-bookworm with 2GB RAM, same as prod; 1 cpu instead of 2 cpu, unlike prod). ref T380881

Up and reachable from mediawiki and thumbor hosts.

krinkle@deployment-poolcounter07:~$ ps aux | grep pool
poolcou+   22078  0.0  0.0   4744  1424 ?        Ss   22:43   0:00 /usr/bin/poolcounterd -l 0.0.0.0
poolcou+   22755  0.0  0.8 1015508 16664 ?       Ssl  22:43   0:00 /usr/bin/poolcounter-prometheus-exporter
krinkle    25956  0.0  0.0   3332  1484 pts/0    S+   22:59   0:00 grep --color=auto pool
krinkle@deployment-poolcounter07:~$ echo 'STATS FULL' | nc -w1 localhost 7531 
uptime: 0 days, 0h 18m 17s
…
total_acquired: 0
total_releases: 0
…

krinkle@deployment-mediawiki13:~$ echo 'STATS FULL' | nc -w1 deployment-poolcounter07 7531 
uptime: 0 days, 0h 18m 28s
…
total_acquired: 0
total_releases: 0
…

krinkle@deployment-imagescaler04:~$ echo 'STATS FULL' | nc -w1 deployment-poolcounter07 7531 
uptime: 0 days, 0h 32m 5s
…
total_acquired: 0
total_releases: 0
…

Change #1190796 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] beta: Pool deployment-poolcounter07

https://gerrit.wikimedia.org/r/1190796

Change #1190796 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: Pool deployment-poolcounter07

https://gerrit.wikimedia.org/r/1190796

It is being used:

krinkle@deployment-mediawiki13:~$ echo 'STATS FULL' | nc -w1 deployment-poolcounter07 7531 
uptime: 0 days, 1h 6m 12s
total processing time: 39.891312s
average processing time: 0.814108s
…
total_acquired: 51
total_releases: 49
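As a rough sanity check, `total_acquired` minus `total_releases` approximates the number of locks currently held (2 in the output above). The key/value STATS output is easy to machine-read; an illustrative Python helper:

```python
def parse_stats(text):
    """Parse 'key: value' lines from poolcounterd STATS output into a dict;
    purely numeric values are converted to int."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        stats[key.strip()] = int(value) if value.isdigit() else value
    return stats
```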

Krinkle claimed this task.

The Logstash dashboard has cleared of these errors.

Previously, CirrusSearch jobs were hitting PoolCounter warnings over 200K times daily.

Screenshot 2025-09-24 at 01.09.38.png (attachment, 240 KB)