
Add monitoring of poolcounter service
Closed, Resolved · Public

Description

Background

We had a partial "outage" of […] causing at least the Main Page not to be displayed, and anon users getting PoolCounter error messages instead.

[There] were no signs that the poolcounter daemons weren't working correctly.

The poolcounter.log on fluorine was being spammed with lots of "Queue full" messages.

Context

Translation from 2014 into 2023:

There were no alerts about poolcounter being degraded. "fluorine" is a former host for the role now carried by mwlog1002. A proposed source of information to build an alert is the "poolcounter" Logstash channel, where MediaWiki logs warnings/errors when it encounters an error.

The objective, however you choose to solve it, is for there to be an alert when MediaWiki is perceiving degraded service from PoolCounter.

Related:

  • {T84143}
  • {T83656}

Details

Event Timeline

rtimport raised the priority of this task to Medium. Dec 18 2014, 1:52 AM
rtimport set Reference to rt7108.

On Mon Mar 24 20:06:19 2014, ggrossmeier wrote:

From https://wikitech.wikimedia.org/wiki/Incident_documentation/20140113-Poolcounter#Actionables:

Determine a good enough value for check_graphite on MediaWiki.PoolCounter.Client.acquireForAnyone.tp99

Faidon's email from Jan 13th:
We've had check_graphite in our repo for about a month, currently used by a single check. It's pretty good, with lots of options, and it's able to construct Graphite functions by itself without requiring too much knowledge of Graphite's internals (which are vast; basically you can apply every kind of function to the time series data). You can even experiment with check_graphite locally by passing -U http://graphite.wikimedia.org/, as the URL is open to the public.
Finding good values to alert on isn't very easy, though. I tried experimenting with e.g. MediaWiki.PoolCounter.Client.acquireForAnyone.tp99 > 10 as an alert[1], but it seems that we routinely spike to such values for brief periods of time, cf.
https://graphite.wikimedia.org/render/?title=PoolCounter%20Client%20Average%20Latency%20%28ms%29%20log%282%29%20-1week&from=-1week&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle%28MediaWiki.PoolCounter.Client.acquireForAnyone.tp99%29
We could do e.g. a 95p on the 1-hour time series of 99ps[2] to exclude spikes, but it seems a bit... suboptimal, albeit better than what we have.
Also, it's not clear to me if there are circumstances in which acquireForMe would have issues/spike in latency but acquireForAnyone would not (or, similarly, release). Someone more familiar with PoolCounter should chime in.
Suggestions about the metric thresholds are also very much welcome :)
Regards,
Faidon
1: check_graphite -U http://graphite.wikimedia.org/ -t MediaWiki.PoolCounter.Client.acquireForAnyone.tp99 --from=-1hour -W 5 -C 10
2: check_graphite -U http://graphite.wikimedia.org/ -t MediaWiki.PoolCounter.Client.acquireForAnyone.tp99 --from=-1hour --percentile=95 -W 5 -C 10

Reference by ticket #7126 added by robh

Dependency by ticket #6775 added by ariel

If this needs someone with more poolcounter knowledge, I'm afraid that someone is not going to use RT... suggesting to create a Bugzilla instead and link it.

Status changed from 'new' to 'open' by RT_System

On Thu May 15 15:30:25 2014, dzahn wrote:

If this needs someone with more poolcounter knowledge, I'm afraid that someone is not going to use RT... suggesting to create a Bugzilla instead and link it.

The improvement to MW's poolcounter error messages is:
https://bugzilla.wikimedia.org/show_bug.cgi?id=63027
That's probably a first step before improving the monitoring of the service.

On Tue Jul 15 21:19:30 2014, ggrossmeier wrote:

On Thu May 15 15:30:25 2014, dzahn wrote:

If this needs someone with more poolcounter knowledge, I'm afraid that someone is not going to use RT... suggesting to create a Bugzilla instead and link it.

The improvement to MW's poolcounter error messages is:
https://bugzilla.wikimedia.org/show_bug.cgi?id=63027

That's probably a first step before improving the monitoring of the service.

I was re-reading the related incident page; it doesn't seem to me there was a problem with poolcounter itself, though?

On Mon, Sep 15, 2014 at 6:43 AM, Filippo Giunchedi via RT
<ops-requests at wikimedia> wrote:

I was re-reading the related incident page; it doesn't seem to me there was a problem with poolcounter itself, though?

Right. This issue/ticket is for adding monitoring of the service (not
improving the service itself).

Since this is about using check_graphite it seems, is this something for
Yuvipanda to pick up?

15:38 < YuviPanda> mutante: oh, sure, add me!

Up to someone not me :)
On Tue, Sep 23, 2014 at 3:34 PM, Daniel Zahn via RT
<ops-requests at wikimedia> wrote:

Since this is about using check_graphite it seems, is this something for
Yuvipanda to pick up?

--
Greg Grossmeier
Release Team Manager

On Tue Sep 23 17:32:17 2014, ggrossmeier wrote:

On Mon, Sep 15, 2014 at 6:43 AM, Filippo Giunchedi via RT
<ops-requests at wikimedia> wrote:

I was re-reading the related incident page; it doesn't seem to me there was a problem with poolcounter itself, though?

Right. This issue/ticket is for adding monitoring of the service (not
improving the service itself).

Agreed!

More to the ticket's point, there doesn't seem to be any agreement on what a healthy poolcounter service looks like (besides, of course, the process being up/down!). Anyway, I agree with Daniel that this seems best checked by someone with poolcounter knowledge.

On Wed Sep 24 09:20:12 2014, fgiunchedi wrote:

Anyway, I agree with Daniel that this seems best checked by someone with poolcounter knowledge.

I think that would ideally be Aaron Schulz talking with Yuvi about what to
check on graphite exactly.

Reference to ticket #7877 added by dzahn

akosiaris changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)". Jul 10 2015, 6:07 AM
akosiaris changed the edit policy from "WMF-NDA (Project)" to "All Users".

More to the ticket's point, there doesn't seem to be any agreement on what a healthy poolcounter service looks like (besides, of course, the process being up/down!)

https://wikitech.wikimedia.org/wiki/PoolCounter#Testing seems like a start

unlikely I'll be able to work on this anytime soon -> up for grabs

Krinkle added a project: Wikimedia-Incident.
Krinkle added subscribers: MaxSem, tstarling.
Krinkle subscribed.

MW already provides a log of all PoolCounter errors, including queue overflow, in the poolcounter channel. So this is presumably just a matter of monitoring configuration, which my team is not very familiar with.

Removing SRE tag as it is already tagged as "observability"

Krinkle renamed this task from "Fix monitoring of poolcounter service" to "Add monitoring of poolcounter service". Mar 14 2023, 9:15 PM
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.

We had a partial "outage" of […] causing at least the Main Page not to be displayed, and anon users getting PoolCounter error messages instead.

[There] were no signs that the poolcounter daemons weren't working correctly.

The poolcounter.log on fluorine was being spammed with lots of "Queue full" messages.

Translation from 2014 into 2023:

There were no alerts about poolcounter being degraded. "fluorine" is a former host for the role now carried by mwlog1002. A proposed source of information to build an alert is the "poolcounter" Logstash channel, where MediaWiki logs warnings/errors when it encounters an error.

The objective, however you choose to solve it, is for there to be an alert when MediaWiki is perceiving degraded service from PoolCounter. The above is one way to do that. Note that we have Prometheus metrics based on Logstash channel/severity (e.g. log_mediawiki_level_channel_doc_count for error/poolcounter).

Another way might be to (also or instead) monitor one of PC's own service metrics (https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter), which should be very similar (apart from problems with configuration, network, or MW's client). To fully capture even client problems, I suppose one could monitor for lack of traffic, e.g. alert if there's a drop in traffic even if the service seems healthy, but it might be easier to alert on the MW log metrics to cover that use case, if it's considered worthwhile.
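For illustration only, a minimal alerting-rule sketch along those lines, assuming the log_mediawiki_level_channel_doc_count metric is reachable from the alerting Prometheus and carries channel/level labels; the exact label values, threshold, and rule layout here are assumptions, not the rule that was eventually merged:

  # Sketch only: label names/values and the threshold are assumptions,
  # not taken from the operations/alerts change referenced below.
  groups:
    - name: poolcounter_client_errors
      rules:
        - alert: MediaWikiPoolCounterErrorsHigh
          # MediaWiki logs PoolCounter problems (e.g. "Queue full") to the
          # "poolcounter" Logstash channel; fire only when the error rate
          # stays elevated, to avoid paging on brief spikes.
          expr: sum(rate(log_mediawiki_level_channel_doc_count{channel="PoolCounter", level="ERROR"}[5m])) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "MediaWiki is logging an elevated rate of PoolCounter errors"
            dashboard: "https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter"

A fixed threshold like this would still need the kind of tuning discussed in Faidon's email above; holding the condition for some minutes (the for: clause) is one way to exclude the brief spikes he mentions.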

I frankly prefer to have an alert when a component isn't working, not when it's perceived as not working by one of its clients. We have enough data in the Prometheus exporter to be able to alert on most trends.
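As a rough sketch of that server-side approach: the metric names below are hypothetical placeholders for whatever the poolcounter exporter behind the Grafana dashboard actually exposes, so the rule that was merged will differ.

  # Sketch only: poolcounter_full_queues_total is a hypothetical metric
  # name, not confirmed exporter output.
  groups:
    - name: poolcounter_server
      rules:
        - alert: PoolCounterQueuesFull
          # Alert on the daemon's own view of saturation (requests rejected
          # because a queue is full), independent of MediaWiki's client logs.
          expr: sum(rate(poolcounter_full_queues_total[5m])) by (instance) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "PoolCounter on {{ $labels.instance }} is rejecting requests with full queues"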

Change 900962 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/alerts@master] sre: add alerting for poolcounter

https://gerrit.wikimedia.org/r/900962

Change 900962 merged by jenkins-bot:

[operations/alerts@master] sre: add alerting for poolcounter

https://gerrit.wikimedia.org/r/900962