
ORES timeout error graph is incorrect
Closed, Invalid (Public)

Description

During this stress test, https://grafana-admin.wikimedia.org/dashboard/db/ores?orgId=1&from=1513256100000&to=1513257600000
the timeout error graph shows about 40 timeout errors, but the testing tool reported more than 13,000.

Event Timeline

There's one code path that can throw a TimeoutError without adding to this metric: the outer timeout in ores/util.py. It's interesting that we're hitting this code; I don't think that should be happening.

Meanwhile, I'll have it emit metrics.
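
For illustration only, here is a minimal sketch of what counting an outer timeout before re-raising it could look like. This is not the code in ores/util.py: `MetricsCollector`, `run_with_outer_timeout`, and the `timeout_error` metric name are hypothetical names made up for this example, and the in-memory counter stands in for whatever metrics client is actually used.

```python
import concurrent.futures
import time
from collections import Counter


class MetricsCollector:
    """Hypothetical in-memory stand-in for a real metrics client."""

    def __init__(self):
        self.counts = Counter()

    def incr(self, name):
        # Increment the named counter by one.
        self.counts[name] += 1


def run_with_outer_timeout(func, timeout, metrics):
    """Run func in a worker thread and wait at most `timeout` seconds.

    If the wait times out, record it in `metrics` and then raise
    TimeoutError, so the error can never escape uncounted.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # Count first, then raise: the metric and the error stay in sync.
        metrics.incr("timeout_error")
        raise TimeoutError("outer timeout of %ss exceeded" % timeout)
    finally:
        # Do not block on the (possibly still running) worker thread.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    metrics = MetricsCollector()
    try:
        run_with_outer_timeout(lambda: time.sleep(0.3), 0.05, metrics)
    except TimeoutError:
        pass
    print(metrics.counts["timeout_error"])
```

The point of the sketch is only that the counter is incremented in the same `except` branch that raises, so every TimeoutError surfaced to the caller has a matching metric.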

The last comment was wrong: I now see how the timeout is caught and metrics are recorded. I currently can't find any code path that explains the missing metrics.

Ladsgroup triaged this task as Medium priority. Nov 28 2018, 6:30 AM
ACraze added a subscriber: ACraze.

This issue is no longer valid after moving to Prometheus.