A number of our most important performance metrics could benefit from an interface that shows the N most expensive / slow / frequent instances of its type (queries, pages, keys, regexes, etc.). It'd be nice to have a webapp for that.
|Open||None||T102899 Implement or find a generic leaderboard web interface|
|Resolved||Peter||T98563 Publicly expose (publish) slow parse logs for public Wikimedia wikis|
Logstash/Kibana dashboards may be a good candidate for doing this in a lightweight manner. We used to have a slow-parse dashboard in the previous version of Kibana that showed the slowest pages to parse (a trending count of how often parses exceeded the threshold, as well as a breakdown of individual wiki+pagetitle tuples). We can create similar ones for other things we track.
For catching slow queries, we can log to Logstash when a query's runtime passes a certain threshold (to avoid spamming the service). A leaderboard could then be added to Kibana showing the top occurrences of normalized messages.
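A minimal sketch of what that threshold-gated logging could look like (names like `SLOW_QUERY_THRESHOLD_MS`, `timed_query`, and `normalize` are hypothetical, not existing MediaWiki code; in production the logger would ship to Logstash rather than stderr):

```python
import logging
import re
import time
from contextlib import contextmanager

# Hypothetical threshold; the real value would be tuned per service.
SLOW_QUERY_THRESHOLD_MS = 500

# In production this logger would be wired to a Logstash handler.
logger = logging.getLogger("slow-query")

def normalize(sql):
    """Collapse literals so identical query shapes aggregate in Kibana."""
    return re.sub(r"\d+|'[^']*'", "?", sql)

@contextmanager
def timed_query(sql):
    """Time a query and emit a log event only when it passes the threshold,
    so fast queries never spam the logging service."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms >= SLOW_QUERY_THRESHOLD_MS:
            logger.warning("slow query (%.0f ms): %s", elapsed_ms, normalize(sql))
```

Because only over-threshold events are emitted, a simple "top N normalized messages by count" visualization in Kibana directly becomes the slow-query leaderboard.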
For catching spammy fast queries that take up a large percentage of total query time, we could use heavily sampled logging to Logstash. A leaderboard could be added to Kibana ranked by count × runtime of normalized messages (i.e. listing the messages with the highest such values).
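The count × runtime ranking amounts to summing sampled time per normalized query shape. A sketch, assuming the sampled events have been reconstructed into `(normalized_query, runtime_ms)` pairs (the function name `top_by_total_time` is made up for illustration):

```python
from collections import defaultdict

def top_by_total_time(samples, n=10):
    """Rank normalized queries by total sampled time (count x mean runtime).

    `samples` is an iterable of (normalized_query, runtime_ms) pairs, as
    might be derived from heavily sampled Logstash events.  Returns up to
    `n` tuples of (query, count, total_ms), highest total time first.
    """
    totals = defaultdict(lambda: [0, 0.0])  # query -> [count, total_ms]
    for query, runtime_ms in samples:
        totals[query][0] += 1
        totals[query][1] += runtime_ms
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [(query, count, total) for query, (count, total) in ranked[:n]]
```

This surfaces both kinds of offenders with one metric: a query that is individually fast but runs millions of times can outrank a rarely-run slow one.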
I think doing so would let us avoid keeping an ephemeral leaderboard in memory somewhere (e.g. redis or some C daemon).
Sorry to just blurt out my ideas without much context, but I <3 performance and wanted to share some thoughts. Usually a series of web interfaces is useful for such a task, each designed for drill-down analysis. First, the service-level investigation dashboard: for queries, given that we have decent instrumentation in services, structured logs (and hence metrics) can be derived in Elasticsearch and then queried as @aaron suggests. We could build a series of nice visualizations in Kibana or Grafana (preferably) to help us out there. This would give us service-level visibility. We should also attach instance-related metadata to the logs.
This would take us to the next series of 'web interfaces', for instance-level analysis. We could start with something like what Netflix uses (Atlas) for a higher-level view, and for each selected instance (on which performance regressions are visible) we could isolate it and fetch much more detailed views (see Netflix Vector). For instance-level analysis, the data sources could be profiled samples or trace snapshots (such as those gathered via /proc, eBPF, or even LTTng).
So the observation workflow could start with identification of slow queries (such as those on one of the MediaWiki services). Ideally, Grafana should alert us of such anomalies when a predetermined threshold is reached, and also identify which instances are affected. This would take us to an instance-selection web interface where we can pick an instance and spawn a dashboard containing views such as on-CPU flamegraphs, disk latency, etc., sourced from that instance. For example: suppose N instances in one region showed abnormal block I/O latency, affecting queries that involved disk access and thus slowing down Y% of total queries on the infra.
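The instance-selection step above reduces to filtering instances whose metric crosses the alert threshold. A trivial sketch under that assumption (the function name and data shape are hypothetical, not an existing API):

```python
def affected_instances(latency_by_instance, threshold_ms):
    """Return the instances whose measured latency (e.g. block I/O) exceeds
    a predetermined threshold -- the set a drill-down dashboard would offer
    for per-instance views like flamegraphs and disk latency charts."""
    return sorted(
        instance
        for instance, latency_ms in latency_by_instance.items()
        if latency_ms > threshold_ms
    )
```

In practice Grafana's own alerting would do this filtering against the metrics backend; the point is just that the alert output can directly seed the instance-selection interface.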
Of course, I am just suggesting; I may be wrong as well. Feel free to assign me some tasks for investigating possibilities, building tools, etc. I would be willing to develop some tools related to this effort :)