"Workers" data from prometheus for mw app servers alternates strangely
Closed, Invalid · Public

Description

When refreshing the following graph, it seems to keep alternating from one minute to the next between two different pictures:

https://grafana.wikimedia.org/dashboard/db/mediawiki-application-servers?panelId=55&fullscreen&orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1264

A
Screen Shot 2018-10-14 at 03.00.51.png (1×2 px, 283 KB)
B
Screen Shot 2018-10-14 at 03.00.55.png (1×2 px, 298 KB)

Event Timeline

I initially thought it was just a change of color due to the order of the metrics being nondeterministic, but that's not it.

The closing metric visible in figure A is missing from figure B. The logging metric visible in figure B is missing from figure A.

It seems something inside Prometheus, or the collectors, is causing one of the two metrics to be absent at any given time, not even showing in historical data. A minute later, its history and current value are back, and the other metric's current and historical values are missing instead.

The prometheus.svc endpoint in eqiad and codfw is backed by two independent Prometheus servers scraping the same targets. What I suspect has happened is that one of the two servers "caught" workers in state closing or logging while the other didn't. This also suggests to me that the exporter doesn't report all the metrics it knows about all the time, which leads me to believe that mod_status behaves that way (i.e. when no workers are in state closing, that state is not reported at all).
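
To illustrate the suspected mechanism, here is a minimal sketch. It assumes that requests to the service endpoint alternate between the two backends and that each backend happens to hold a different set of series. The server names, the per-server state sets, and the strict round-robin behaviour are all assumptions for illustration, not details taken from the actual setup.

```python
from itertools import cycle

# Hypothetical series sets held by each of the two independent servers.
# Each one scraped at slightly different moments, so each "caught" a
# different short-lived worker state.
SERIES_BY_SERVER = {
    "prometheus-a": {"busy", "idle", "closing"},
    "prometheus-b": {"busy", "idle", "logging"},
}


def simulate_refreshes(n):
    """Return the set of states a dashboard would show on each of n refreshes,
    assuming requests alternate strictly between the two backends."""
    backends = cycle(sorted(SERIES_BY_SERVER))
    return [SERIES_BY_SERVER[next(backends)] for _ in range(n)]


if __name__ == "__main__":
    for i, states in enumerate(simulate_refreshes(4), start=1):
        print(i, sorted(states))
```

Under these assumptions, every other refresh shows closing (with its full history) and the refreshes in between show logging instead, which would match figures A and B.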

One potential fix would be for the exporter to report all metrics it knows about all the time, if my assumptions are correct.
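
A rough sketch of that idea, not the actual exporter code: keep a fixed list of known states and emit a zero for any state that the current scrape did not report. The state list and the helper names below are hypothetical. Only the apache_workers metric name and its state label come from this task.

```python
# Assumed list of worker states; the real exporter's list may differ.
KNOWN_STATES = ("idle", "busy", "closing", "logging")


def fill_missing_states(observed):
    """Return worker counts for every known state, defaulting to 0 for
    states that were absent from this scrape."""
    counts = {state: 0 for state in KNOWN_STATES}
    counts.update(observed)
    return counts


def render_exposition(counts):
    """Render the counts as Prometheus text-format sample lines."""
    return "\n".join(
        f'apache_workers{{state="{state}"}} {value}'
        for state, value in counts.items()
    )


if __name__ == "__main__":
    # Only busy and idle workers were seen in this (made-up) scrape.
    print(render_exposition(fill_missing_states({"busy": 5, "idle": 20})))
```

With every state always present, both Prometheus servers would hold the same set of series, so the graph should no longer depend on which server answers the query.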

jijiki triaged this task as Medium priority. Oct 23 2018, 3:08 PM

I ran the query sum by (state) (apache_workers) and I'm seeing only the states busy and idle for the last four weeks:

2022-12-05-155043_1265x834_scrot.png (834×1 px, 85 KB)

I'm tentatively declining this as invalid; feel free to reopen.