Maniphest T206939

"Workers" data from prometheus for mw app servers alternates strangely
Closed, Invalid · Public

Assigned To

None

Authored By

	Krinkle
	Oct 14 2018, 2:08 AM

Description

When refreshing the following graph, it seems to keep alternating from one minute to the next between two different pictures:

https://grafana.wikimedia.org/dashboard/db/mediawiki-application-servers?panelId=55&fullscreen&orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1264

A

B

Event Timeline

Krinkle created this task. · Oct 14 2018, 2:08 AM

Restricted Application added a subscriber: Aklapper. · Oct 14 2018, 2:08 AM

I initially thought it was just a change of color due to the order of the metrics being nondeterministic. But that's not it.

The closing metric visible in figure A is missing from figure B. The logging metric visible in figure B is missing from figure A.

It seems something inside Prometheus, or the collectors, is causing one of the two to constantly be deleted or absent, not even showing in historical data, and then a minute later, its history and current value are back, with another metric's current/historic values missing instead.

The prometheus.svc endpoint in eqiad and codfw is backed by two independent Prometheus servers scraping the same targets. What I suspect has happened is that one of the two servers caught workers in state closing or logging while the other didn't. This also suggests to me that the exporter doesn't always report every metric it knows about, which leads me to believe that mod_status behaves that way (i.e. when no workers are in state closing, that state is not reported at all).
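To illustrate the suspicion above, here is a minimal sketch of how two backends holding different series sets could produce alternating pictures. The series sets and the round-robin rotation are assumptions made for illustration only; the task does not say how requests to the endpoint are distributed.

```python
from itertools import cycle

# Hypothetical series sets held by two independent servers scraping the same target.
SERVER_A = {"busy", "idle", "closing"}
SERVER_B = {"busy", "idle", "logging"}


def answers(backends, n):
    """Return the series sets seen by n consecutive requests, assuming round-robin."""
    rotation = cycle(backends)
    return [next(rotation) for _ in range(n)]


def disagreements(a, b):
    """Return the states that only one of the two servers knows about."""
    return sorted(a ^ b)


if __name__ == "__main__":
    for i, series in enumerate(answers([SERVER_A, SERVER_B], 4)):
        print(i, sorted(series))
    print("differs on:", disagreements(SERVER_A, SERVER_B))
```

Under these assumptions, every other refresh shows closing without logging, and the refreshes in between show logging without closing, including for historical ranges.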

One potential fix would be for the exporter to report all metrics it knows about all the time, if my assumptions are correct.
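A rough sketch of that idea, assuming the exporter has a fixed list of states it can emit. The state names and helper functions below are hypothetical and are not taken from the actual exporter; only the apache_workers metric name and its state label come from this task.

```python
# Hypothetical list of worker states; the real exporter's label values may differ.
KNOWN_STATES = ("busy", "idle", "closing", "logging")


def fill_missing_states(observed, known_states=KNOWN_STATES):
    """Return a count for every known state, using 0 where none was observed."""
    filled = {state: 0 for state in known_states}
    filled.update(observed)
    return filled


def render_metrics(observed, metric="apache_workers"):
    """Render the counts as Prometheus text-format sample lines, one per state."""
    filled = fill_missing_states(observed)
    return [
        f'{metric}{{state="{state}"}} {count}'
        for state, count in sorted(filled.items())
    ]


if __name__ == "__main__":
    # Only busy and closing workers were observed during this scrape.
    for line in render_metrics({"busy": 3, "closing": 1}):
        print(line)
```

With every state always present, both servers would hold the same set of series regardless of which scrape happened to see a worker in a rare state.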

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board. · Oct 15 2018, 4:41 PM

jijiki triaged this task as Medium priority. · Oct 23 2018, 3:08 PM

Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. · Jan 26 2019, 10:30 PM

fgiunchedi moved this task from Inbox to Backlog on the observability board. · Jul 6 2020, 2:04 PM

lmata edited projects, added SRE Observability; removed observability. · Jul 12 2021, 2:22 AM

lmata moved this task from Inbox to Backlog on the SRE Observability board. · Jul 15 2021, 4:09 AM

lmata edited projects, added Observability-Metrics; removed SRE Observability. · Aug 9 2021, 1:12 AM

I've run the query sum by (state) (apache_workers) and I'm seeing only the states busy and idle for the last four weeks:

2022-12-05-155043_1265x834_scrot.png (834×1 px, 85 KB)
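For anyone wanting to repeat that check outside Grafana, here is a small sketch that builds an instant-query URL for the Prometheus HTTP API (/api/v1/query) and lists the states found in a response. The base URL is a placeholder and the sample payload is made up; only the query string comes from the comment above.

```python
from urllib.parse import urlencode

QUERY = "sum by (state) (apache_workers)"


def build_query_url(base_url, query=QUERY):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def states_in_response(payload):
    """Return the sorted state labels found in an instant-query JSON response."""
    results = payload.get("data", {}).get("result", [])
    return sorted({r["metric"]["state"] for r in results if "state" in r["metric"]})


if __name__ == "__main__":
    # Placeholder host, not a real endpoint.
    print(build_query_url("http://prometheus.example"))
    # Made-up payload in the shape the API returns for a vector result.
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"state": "busy"}, "value": [0, "12"]},
                {"metric": {"state": "idle"}, "value": [0, "38"]},
            ],
        },
    }
    print(states_in_response(sample))
```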

I'm tentatively declining as invalid, feel free to reopen.

lmata moved this task from Inbox to Done on the Observability-Metrics board. · Jan 16 2023, 5:42 PM

	F35838421: 2022-12-05-155043_1265x834_scrot.png
	Dec 5 2022, 2:51 PM

	F26590268: Screen Shot 2018-10-14 at 03.00.55.png
	Oct 14 2018, 2:08 AM

	F26590269: Screen Shot 2018-10-14 at 03.00.51.png
	Oct 14 2018, 2:08 AM