
Add monitoring of poolcounter service
Closed, Resolved · Public

Description

Background

We had a partial "outage" of […] causing at least the Main Page not to be displayed, and anon users getting PoolCounter error messages instead.

[There] were no signs that the poolcounter daemons weren't working correctly.

The poolcounter.log on fluorine was being spammed with lots of "Queue full" messages.

Context

Translation from 2014 into 2023:

There were no alerts about poolcounter being degraded. "fluorine" is a former host for the role now carried by mwlog1002. A proposed source of information to build an alert is the "poolcounter" Logstash channel, where MediaWiki logs warnings/errors when it encounters an error.

The objective, however you choose to solve it, is for there to be an alert when MediaWiki is perceiving degraded service from PoolCounter.

Related:

  • {T84143}
  • {T83656}

Details

Event Timeline

rtimport raised the priority of this task to Medium. Dec 18 2014, 1:52 AM
rtimport set Reference to rt7108.

On Mon Mar 24 20:06:19 2014, ggrossmeier wrote:

From https://wikitech.wikimedia.org/wiki/Incident_documentation/20140113-Poolcounter#Actionables:

Determine a good enough value for check_graphite on MediaWiki.PoolCounter.Client.acquireForAnyone.tp99

Faidon's email from Jan 13th:
We've had check_graphite in our repo for about a month, currently used by a single check. It's pretty good, with lots of options, and it's able to construct Graphite functions by itself without requiring too much knowledge of Graphite's internals (which are vast; basically you can apply every kind of function to the time series data). You can even experiment with check_graphite locally by passing -U http://graphite.wikimedia.org/, as the URL is open to the public.
Finding good values to alert on isn't very easy, though. I tried experimenting with e.g. MediaWiki.PoolCounter.Client.acquireForAnyone.tp99 > 10 as an alert[1], but it seems that we routinely spike to such values for brief periods of time, cf.
https://graphite.wikimedia.org/render/?title=PoolCounter%20Client%20Average%20Latency%20%28ms%29%20log%282%29%20-1week&from=-1week&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle%28MediaWiki.PoolCounter.Client.acquireForAnyone.tp99%29
We could do e.g. a 95p on the 1-hour time series of 99ps[2] to exclude spikes, but it seems a bit... suboptimal, albeit better than what we have.
Also, it's not clear to me if there are circumstances in which acquireForMe would have issues/spike in latency but acquireForAnyone would not (or, similarly, release). Someone more familiar with PoolCounter should chime in.
Suggestions about the metric thresholds are also very much welcome :)
Regards,
Faidon
1: check_graphite -U http://graphite.wikimedia.org/ -t MediaWiki.PoolCounter.Client.acquireForAnyone.tp99 --from=-1hour -W 5 -C 10
2: check_graphite -U http://graphite.wikimedia.org/ -t MediaWiki.PoolCounter.Client.acquireForAnyone.tp99 --from=-1hour --percentile=95 -W 5 -C 10

Reference by ticket #7126 added by robh

Dependency by ticket #6775 added by ariel

If this needs someone with more poolcounter knowledge, I'm afraid that someone is not going to use RT... suggesting to create a Bugzilla instead and link it.

Status changed from 'new' to 'open' by RT_System

On Thu May 15 15:30:25 2014, dzahn wrote:

If this needs someone with more poolcounter knowledge, I'm afraid that someone is not going to use RT... suggesting to create a Bugzilla instead and link it.

The improvement to MW's poolcounter error messages is:
https://bugzilla.wikimedia.org/show_bug.cgi?id=63027
That's probably a first step before improving the monitoring of the service.

On Tue Jul 15 21:19:30 2014, ggrossmeier wrote:

On Thu May 15 15:30:25 2014, dzahn wrote:

If this needs someone with more poolcounter knowledge, I'm afraid that someone is not going to use RT... suggesting to create a Bugzilla instead and link it.

The improvement to MW's poolcounter error messages is:
https://bugzilla.wikimedia.org/show_bug.cgi?id=63027

That's probably a first step before improving the monitoring of the service.

I was re-reading the related incident page; it doesn't seem to me there was a problem with poolcounter itself, though?

On Mon, Sep 15, 2014 at 6:43 AM, Filippo Giunchedi via RT
<ops-requests at wikimedia> wrote:

I was re-reading the related incident page; it doesn't seem to me there was a problem with poolcounter itself, though?

Right. This issue/ticket is for adding monitoring of the service (not
improving the service itself).

Since this is about using check_graphite it seems, is this something for
Yuvipanda to pick up?

15:38 < YuviPanda> mutante: oh, sure, add me!

Up to someone not me :)
On Tue, Sep 23, 2014 at 3:34 PM, Daniel Zahn via RT
<ops-requests at wikimedia> wrote:

Since this is about using check_graphite it seems, is this something for
Yuvipanda to pick up?

--
Greg Grossmeier
Release Team Manager

On Tue Sep 23 17:32:17 2014, ggrossmeier wrote:

On Mon, Sep 15, 2014 at 6:43 AM, Filippo Giunchedi via RT
<ops-requests at wikimedia> wrote:

I was re-reading the related incident page; it doesn't seem to me there was a problem with poolcounter itself, though?

Right. This issue/ticket is for adding monitoring of the service (not
improving the service itself).

Agreed!

More to the ticket's point, there doesn't seem to be any agreement on what a healthy poolcounter service looks like (besides, of course, the process being up/down!). Anyway, I agree with Daniel that this seems best checked by someone with poolcounter knowledge.

On Wed Sep 24 09:20:12 2014, fgiunchedi wrote:

Anyway, I agree with Daniel that this seems best checked by someone with poolcounter knowledge.

I think that would ideally be Aaron Schulz talking with Yuvi about what to
check on graphite exactly.

Reference to ticket #7877 added by dzahn

akosiaris changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)". Jul 10 2015, 6:07 AM
akosiaris changed the edit policy from "WMF-NDA (Project)" to "All Users".

More to the ticket's point, there doesn't seem to be any agreement on what a healthy poolcounter service looks like (besides, of course, the process being up/down!)

https://wikitech.wikimedia.org/wiki/PoolCounter#Testing seems like a start

unlikely I'll be able to work on this anytime soon -> up for grabs

Krinkle added a project: Wikimedia-Incident.
Krinkle added subscribers: MaxSem, tstarling.
Krinkle subscribed.

MW already provides a log of all PoolCounter errors, including queue overflow, in the poolcounter channel. So this is presumably just a matter of monitoring configuration, which my team is not very familiar with.

Removing SRE tag as it is already tagged as "observability"

Krinkle renamed this task from "Fix monitoring of poolcounter service" to "Add monitoring of poolcounter service". Mar 14 2023, 9:15 PM
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.

We had a partial "outage" of […] causing at least the Main Page not to be displayed, and anon users getting PoolCounter error messages instead.

[There] were no signs that the poolcounter daemons weren't working correctly.

The poolcounter.log on fluorine was being spammed with lots of "Queue full" messages.

Translation from 2014 into 2023:

There were no alerts about poolcounter being degraded. "fluorine" is a former host for the role now carried by mwlog1002. A proposed source of information to build an alert is the "poolcounter" Logstash channel, where MediaWiki logs warnings/errors when it encounters an error.

The objective, however you choose to solve it, is for there to be an alert when MediaWiki is perceiving degraded service from PoolCounter. The above is one way to do that. Note that we have Prometheus metrics based on Logstash channel/severity (e.g. log_mediawiki_level_channel_doc_count for error/poolcounter).

Another way might be to (also or instead) monitor one of PC's own service metrics (https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter), which should be very similar (apart from problems with configuration, network, or MW's client). To fully capture even client problems, I suppose one could monitor for lack of traffic, e.g. alert if there's a drop in traffic even if the service seems healthy, but it might be easier to alert on the MW log metrics to cover that use case, if it's considered worthwhile.
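For illustration only, a minimal alerting-rule sketch along those lines, assuming the log_mediawiki_level_channel_doc_count metric is reachable from the alerting Prometheus and carries channel/level labels; the exact label values, threshold, and rule layout here are assumptions, not the rule that was eventually merged:

  # Sketch only: label names/values and the threshold are assumptions,
  # not taken from the operations/alerts change referenced below.
  groups:
    - name: poolcounter_client_errors
      rules:
        - alert: MediaWikiPoolCounterErrorsHigh
          # MediaWiki logs PoolCounter problems (e.g. "Queue full") to the
          # "poolcounter" Logstash channel; fire only when the error rate
          # stays elevated, to avoid paging on brief spikes.
          expr: sum(rate(log_mediawiki_level_channel_doc_count{channel="PoolCounter", level="ERROR"}[5m])) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "MediaWiki is logging an elevated rate of PoolCounter errors"
            dashboard: "https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter"

A fixed threshold like this would still need the kind of tuning discussed in Faidon's email above; holding the condition for some minutes (the for: clause) is one way to exclude the brief spikes he mentions.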

I frankly prefer to have an alert when a component isn't working, not when it's perceived as not working by one of its clients. We have enough data in the Prometheus exporter to be able to alert on most trends.
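As a rough sketch of that server-side approach: the metric names below are hypothetical placeholders for whatever the poolcounter exporter behind the Grafana dashboard actually exposes, so the rule that was merged will differ.

  # Sketch only: poolcounter_full_queues_total is a hypothetical metric
  # name, not confirmed exporter output.
  groups:
    - name: poolcounter_server
      rules:
        - alert: PoolCounterQueuesFull
          # Alert on the daemon's own view of saturation (requests rejected
          # because a queue is full), independent of MediaWiki's client logs.
          expr: sum(rate(poolcounter_full_queues_total[5m])) by (instance) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "PoolCounter on {{ $labels.instance }} is rejecting requests with full queues"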

Change 900962 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/alerts@master] sre: add alerting for poolcounter

https://gerrit.wikimedia.org/r/900962

Change 900962 merged by jenkins-bot:

[operations/alerts@master] sre: add alerting for poolcounter

https://gerrit.wikimedia.org/r/900962