
global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus
Closed, Resolved, Public

Description

As the Frontend Traffic dashboard presently exists, it reports an availability fraction -- just 1 - ((5XX responses) / (all requests)) -- for each site (eqiad, esams, ...) and cache type (text, upload).

That's all fine and good. But then it computes a 'global' availability number simply by adding together all the ratios from each site -- which have wildly different denominators -- and then subtracting 1 from the whole sum. This doesn't produce a meaningful number.

Here's an illustration, with the status quo in thick red and the correct computation in thick green:
https://grafana.wikimedia.org/d/uiBRF5hWk/xxx-cdanis-frontend-traffic-copy?panelId=3&fullscreen&orgId=1&from=1570101355987&to=1570110394094
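
A rough numeric sketch of the problem described above, not the dashboard's actual queries. The request counts are invented, `small_site` is a hypothetical site name, and the function names are made up for illustration. It also assumes one reading of the status quo: the per-site 5XX ratios are summed and the sum is subtracted from 1.

```python
from typing import Dict, Tuple

# Hypothetical per-site counts: (5XX responses, all requests).
# The numbers are invented; "small_site" is not a real site name.
SITE_COUNTS: Dict[str, Tuple[int, int]] = {
    "eqiad": (100, 1_000_000),
    "esams": (50, 500_000),
    "small_site": (10, 1_000),  # low traffic, high error fraction
}


def summed_ratio_availability(counts: Dict[str, Tuple[int, int]]) -> float:
    """Assumed status quo: add up per-site error ratios, then take 1 - sum.

    Each ratio has a different denominator, so the sum is not the
    fraction of anything in particular.
    """
    return 1 - sum(errors / total for errors, total in counts.values())


def pooled_availability(counts: Dict[str, Tuple[int, int]]) -> float:
    """Global availability: 1 - (all 5XX responses) / (all requests)."""
    total_errors = sum(errors for errors, _ in counts.values())
    total_requests = sum(total for _, total in counts.values())
    return 1 - total_errors / total_requests


if __name__ == "__main__":
    print("summed ratios:", summed_ratio_availability(SITE_COUNTS))
    print("pooled counts:", pooled_availability(SITE_COUNTS))
```

With these made-up counts, the low-traffic site's 1% error ratio dominates the summed version, even though it handles well under 0.1% of all requests. The pooled version weights every request equally.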

Event Timeline

Change 540676 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] prometheus global: add rules for correct global HTTP avail

https://gerrit.wikimedia.org/r/540676

CDanis triaged this task as Medium priority. Oct 3 2019, 8:35 PM
CDanis updated the task description.

Change 540676 merged by CDanis:
[operations/puppet@production] prometheus global: add rules for correct global HTTP avail

https://gerrit.wikimedia.org/r/540676

We probably want to let the new recording rule accumulate some data -- a week's worth? -- and then start using it in the dashboard

+1 on collecting at least a week's worth of data before changing dashboards and alerts

I've updated the frontend-traffic dashboard to include global availability correctly, and got rid of the summed value

CDanis claimed this task.