
CirrusSearch generates a massive amount of "poolcounter-connection-error" messages
Closed, Resolved (Public)

Description

Since rolling 1.42.0-wmf.7 (T350083), I have noticed the MediaWiki log bucket for poolcounter receiving a lot of messages:

mw_poolcounter_log.png (279×1 px, 23 KB)

That is 100-150k messages per minute.

They all seem to come from CirrusSearch, with a poolcounter-connection-error:

| normalized_message | Count |
| --- | --- |
| Pool key 'CirrusSearch-Search:_elasticsearch_enwiki' (CirrusSearch-Search): ⧼poolcounter-connection-error⧽ | 360,625 |
| Pool key 'CirrusSearch-Completion:_elasticsearch_enwiki' (CirrusSearch-Completion): ⧼poolcounter-connection-error⧽ | 359,992 |
| Pool key 'CirrusSearch-Completion:_elasticsearch' (CirrusSearch-Completion): ⧼poolcounter-connection-error⧽ | 359,918 |
| Pool key 'CirrusSearch-Search:_elasticsearch' (CirrusSearch-Search): ⧼poolcounter-connection-error⧽ | 359,778 |
| Pool key 'CirrusSearch-MoreLike:_elasticsearch' (CirrusSearch-MoreLike): ⧼poolcounter-connection-error⧽ | 359,197 |
| Pool key 'CirrusSearch-MoreLike:_elasticsearch_enwiki' (CirrusSearch-MoreLike): ⧼poolcounter-connection-error⧽ | 193,764 |
| Pool key 'CirrusSearch-Automated:_elasticsearch_enwiki' (CirrusSearch-Automated): ⧼poolcounter-connection-error⧽ | 146,143 |
| Pool key 'CirrusSearch-Automated:_elasticsearch' (CirrusSearch-Automated): ⧼poolcounter-connection-error⧽ | 136,755 |
| Pool key 'CirrusSearch-Prefix:_elasticsearch' (CirrusSearch-Prefix): ⧼poolcounter-connection-error⧽ | 30,883 |
| Pool key 'CirrusSearch-Prefix:_elasticsearch_enwiki' (CirrusSearch-Prefix): ⧼poolcounter-connection-error⧽ | 7,615 |

Event Timeline

hashar triaged this task as Unbreak Now! priority. Nov 30 2023, 3:39 PM

The rates of locks acquired, locks released, and processed requests dropped on Nov 29 around 9:15 and again on Nov 30 around 9:15, to the point of being almost flat. https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&from=now-2d&to=now&var-dc=codfw%20prometheus%2Fops

{F41548993 size=full}

Change 979079 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/core@master] Revert "PoolCounterConnectionManager: Add support for ipv6"

https://gerrit.wikimedia.org/r/979079

This almost certainly comes from https://gerrit.wikimedia.org/r/c/mediawiki/core/+/972724, which introduces the poolcounter-connection-error message. That change adds IPv6 support (T350615).

From discussions: the code tries to parse a host and a port, but our config only provides the host: https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/d5e0b595c3ad74fb24ec299f7ea2321855c2d1bf/wmf-config/ProductionServices.php#131
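
To illustrate the failure mode described above, here is a minimal sketch of a host/port splitter that tolerates a missing port and bracketed IPv6 literals. It is not the MediaWiki code (which is PHP) and not the eventual fix; the function names, the example hostnames, and the fallback port value are all assumptions made for illustration.

```python
from typing import Tuple

# Assumed fallback port for illustration; substitute whatever the service uses.
DEFAULT_PORT = 7531


def _parse_port(text: str) -> int:
    """Convert a port string to an int and check that it is in range."""
    if not text.isdigit():
        raise ValueError(f"invalid port: {text!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def split_host_port(address: str, default_port: int = DEFAULT_PORT) -> Tuple[str, int]:
    """Split 'host', 'host:port', '[v6]' or '[v6]:port' into (host, port).

    A missing port falls back to default_port instead of being an error,
    so a config that lists bare hostnames keeps working.
    """
    address = address.strip()
    if not address:
        raise ValueError("empty address")

    # Bracketed IPv6 literal, optionally followed by ':port'.
    if address.startswith("["):
        end = address.find("]")
        if end == -1:
            raise ValueError(f"missing closing bracket in {address!r}")
        host, rest = address[1:end], address[end + 1:]
        if rest == "":
            return host, default_port
        if not rest.startswith(":"):
            raise ValueError(f"unexpected text after bracket in {address!r}")
        return host, _parse_port(rest[1:])

    # Bare IPv6 literal: several colons and no brackets, so no port is expressible.
    if address.count(":") > 1:
        return address, default_port

    # Hostname or IPv4 address with an optional ':port' suffix.
    host, sep, port = address.partition(":")
    if not sep:
        return host, default_port
    return host, _parse_port(port)
```

With this shape, a host-only entry such as `pc.example` yields the default port, while `[2001:db8::1]:7000` yields the IPv6 address and the explicit port. The key design choice is that a missing port is treated as "use the default" rather than as a connection error.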

Change 979080 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/core@wmf/1.42.0-wmf.7] Revert "PoolCounterConnectionManager: Add support for ipv6"

https://gerrit.wikimedia.org/r/979080

Change 979079 merged by jenkins-bot:

[mediawiki/core@master] Revert "PoolCounterConnectionManager: Add support for ipv6"

https://gerrit.wikimedia.org/r/979079

Change 979080 merged by jenkins-bot:

[mediawiki/core@wmf/1.42.0-wmf.7] Revert "PoolCounterConnectionManager: Add support for ipv6"

https://gerrit.wikimedia.org/r/979080

Mentioned in SAL (#wikimedia-operations) [2023-11-30T16:23:39Z] <ladsgroup@deploy2002> Started scap: Backport for [[gerrit:979080|Revert "PoolCounterConnectionManager: Add support for ipv6" (T352444)]]

Mentioned in SAL (#wikimedia-operations) [2023-11-30T16:26:55Z] <ladsgroup@deploy2002> ladsgroup: Backport for [[gerrit:979080|Revert "PoolCounterConnectionManager: Add support for ipv6" (T352444)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-11-30T16:33:24Z] <ladsgroup@deploy2002> Finished scap: Backport for [[gerrit:979080|Revert "PoolCounterConnectionManager: Add support for ipv6" (T352444)]] (duration: 09m 45s)

Ladsgroup claimed this task.

The revert is deployed and things seem to be normal now.

Thanks, it looks like the PoolCounter lock rate has recovered. The errors emanating from CirrusSearch have vanished from Logstash, so we are all set.

Follow-ups to fix the code can be made on the original task, which I have reopened: T350615