global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus
Closed, ResolvedPublic

Assigned To

Authored By

	CDanis
	Oct 3 2019, 8:29 PM

Description

As the Frontend Traffic dashboard presently exists, it reports an availability fraction -- just 1 - ((5XX responses) / (all requests)) -- for each site (eqiad, esams, ...) and cache type (text, upload).

That's all fine and good. But then it computes a 'global' availability number simply by adding together all the ratios from each site -- which have wildly different denominators -- and then subtracting 1 from the whole sum. This doesn't produce a meaningful number.

Here's an illustration, with the status quo in thick red and the correct computation in thick green:
https://grafana.wikimedia.org/d/uiBRF5hWk/xxx-cdanis-frontend-traffic-copy?panelId=3&fullscreen&orgId=1&from=1570101355987&to=1570110394094
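The gap between the two computations can also be seen with a small numeric sketch. The request counts below are made up purely for illustration, and `summed_error_ratios` is only one assumed reading of the dashboard's existing expression (combining per-site ratios by summation); the point is that ratios with very different denominators cannot be added, whereas summing the raw counts first gives a properly weighted figure.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Request counts for one site over some time window (hypothetical data)."""
    name: str
    total: int   # all requests
    errors: int  # 5XX responses


def availability(total: int, errors: int) -> float:
    """Availability fraction as defined in the description: 1 - 5XX / all."""
    return 1 - errors / total


def global_availability(sites: list[SiteCounts]) -> float:
    """Sum numerators and denominators across sites first, then take the ratio."""
    total = sum(s.total for s in sites)
    errors = sum(s.errors for s in sites)
    return availability(total, errors)


def summed_error_ratios(sites: list[SiteCounts]) -> float:
    """Assumed form of the flawed approach: add per-site ratios, ignoring traffic volume."""
    return 1 - sum(s.errors / s.total for s in sites)


# Made-up numbers: a busy site with few errors and a quiet site with many.
sites = [
    SiteCounts("eqiad", total=1_000_000, errors=1_000),
    SiteCounts("esams", total=10_000, errors=500),
]

print("ratio of sums :", global_availability(sites))
print("sum of ratios :", summed_error_ratios(sites))
```

With these inputs the sum of ratios is dominated by the low-traffic site, even though that site contributes only about 1% of all requests.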

Details

	Subject	Repo	Branch	Lines +/-
	prometheus global: add rules for correct global HTTP avail	operations/puppet	production	+16 -0


Related Objects

Mentioned In: T236367: Tune HTTP availability alerts

Event Timeline

CDanis created this task. Oct 3 2019, 8:29 PM

Restricted Application added a subscriber: Aklapper. Oct 3 2019, 8:29 PM

Change 540676 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] prometheus global: add rules for correct global HTTP avail

https://gerrit.wikimedia.org/r/540676

gerritbot added a project: Patch-For-Review. Oct 3 2019, 8:31 PM

CDanis triaged this task as Medium priority. Oct 3 2019, 8:35 PM

CDanis updated the task description.

fgiunchedi awarded a token. Oct 7 2019, 8:24 PM

Change 540676 merged by CDanis:
[operations/puppet@production] prometheus global: add rules for correct global HTTP avail

https://gerrit.wikimedia.org/r/540676

We probably want to let the new recording rule accumulate some data -- a week's worth? -- and then start using it in the dashboard.

Maintenance_bot removed a project: Patch-For-Review. Oct 9 2019, 5:10 PM

Paladox subscribed. Oct 9 2019, 6:24 PM

+1 on collecting at least a week's worth of data before changing dashboards and alerts.

ema moved this task from Backlog to Caching on the Traffic board. Oct 14 2019, 5:37 PM

I've updated the frontend-traffic dashboard to show the correctly computed global availability, and got rid of the summed value.

fgiunchedi mentioned this in T236367: Tune HTTP availability alerts. Oct 24 2019, 10:29 AM

fgiunchedi moved this task from Inbox to In progress on the observability board. Oct 28 2019, 2:17 PM

CDanis closed this task as Resolved. Oct 28 2019, 3:08 PM

CDanis claimed this task.
