Investigation from T150269 and similar tasks over the past 24 months leads me to believe that something in our MediaWiki stack (load balancers, app servers, perhaps only the API servers, HHVM lock-ups, database slaves or masters) sometimes stalls for over 1000ms (1s) on an HTTP request that should typically take less than 100ms.
I'd like to start by completing task T131894: Collect Backend-Timing in Prometheus.
We would collect Backend-Timing headers from a Varnish tailer for all backend requests (i.e. excluding cache hits). We'd segment them at least by request method and cluster (e.g. text, upload, misc), so that we can, for example, plot p95 and p99 for all MediaWiki GET requests. We may also want to segment them by endpoint (e.g. for load.php).
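A minimal sketch of what such a tailer could do with each header, assuming the Apache-style `D=<microseconds>` duration field in Backend-Timing (and hypothetical label names; a real implementation would feed a Prometheus histogram rather than keep samples in memory):

```python
import re
import statistics
from collections import defaultdict

# Assumption: Backend-Timing carries the request duration in microseconds
# as an Apache-style "D=..." field, e.g. "D=123456 t=1484901134532214".
BACKEND_TIMING_RE = re.compile(r"\bD=(\d+)\b")

# Durations in milliseconds, keyed by (method, cluster) -- hypothetical labels.
samples = defaultdict(list)

def record(method: str, cluster: str, backend_timing: str) -> None:
    """Parse one Backend-Timing header value and record its duration."""
    m = BACKEND_TIMING_RE.search(backend_timing)
    if m:
        samples[(method, cluster)].append(int(m.group(1)) / 1000.0)

def percentile(values, p):
    """p-th percentile (1-99) using the 'inclusive' quantile method."""
    qs = statistics.quantiles(values, n=100, method="inclusive")
    return qs[int(p) - 1]

# Example: 100 fake GET/text samples, one of them a 1.2s stall.
for us in [50_000] * 98 + [90_000, 1_200_000]:
    record("GET", "text", f"D={us} t=1484901134532214")

get_text = samples[("GET", "text")]
print("p95:", percentile(get_text, 95), "ms")
print("p99:", percentile(get_text, 99), "ms")
```

With per-label percentiles like this, a single stalled request class (e.g. MediaWiki GETs on the text cluster) stands out instead of being averaged away across all traffic.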
Regardless of this, though, we need to get data on "simple" requests somehow. As with slow queries, measuring all requests together won't be very useful, since the problem I'm trying to investigate applies to "simple" requests that are supposed to consistently take <100ms, yet somehow take at least 1s for every 1 in 1000 requests. I've not been able to reproduce this manually, but ResourceLoader consistently records it in Graphite when measuring the time from start to end of ResourceLoaderModule::buildContent(): for modules that have no variance in their content or parameters and take <10ms at p99, 1 in 1000 requests takes 1, 2, or sometimes 25 seconds (!) for the exact same code.
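To illustrate why aggregate percentiles hide this (made-up numbers, not production data): a p99 over all requests looks perfectly healthy even when exactly 1 in 1000 requests stalls, because 0.1% of samples sit above the 99th percentile cut:

```python
import statistics

# Simulate 100,000 timings for a "simple" request: ~5ms normally,
# but every 1000th request stalls for 2000ms (illustrative numbers only).
timings = [2000.0 if i % 1000 == 999 else 5.0 for i in range(100_000)]

p99 = statistics.quantiles(timings, n=100, method="inclusive")[98]
slow = sum(1 for t in timings if t >= 1000.0)

print(f"p99 = {p99:.1f} ms")                          # looks healthy
print(f"requests over 1s: {slow} of {len(timings)}")  # yet 1 in 1000 stalls
```

This is why we'd want either a p99.9/max series or a dedicated "slow simple request" counter, rather than relying on p99 alone.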