Implement or find a generic leaderboard web interface
Closed, Declined (Public)

Assigned To

None

Authored By

	ori
	Jun 17 2015, 11:54 PM

Description

A number of our most important performance metrics could benefit from an interface that shows the N most expensive / slow / frequent instances of its type (queries, pages, keys, regexes, etc.). It'd be nice to have a webapp for that.

Related Objects

		Status	Subtype	Assigned	Task
		Declined		None	T102899 Implement or find a generic leaderboard web interface
		Resolved		Peter	T98563 Publicly expose (publish) slow parse logs for public Wikimedia wikis

Event Timeline

ori created this task. Jun 17 2015, 11:54 PM

ori raised the priority of this task from to Needs Triage.

ori updated the task description.

ori added a project: Performance Issue.

ori subscribed.

Restricted Application added a subscriber: Aklapper. Jun 17 2015, 11:54 PM

ori edited projects, added Performance-Team; removed Performance Issue. Jun 18 2015, 7:53 PM

ori set Security to None.

Peter subscribed. Aug 19 2015, 8:22 PM

ori moved this task from Inbox, needs triage to Backlog: Maintenance, non-prioritized on the Performance-Team board. Sep 14 2015, 5:54 PM

Logstash/Kibana dashboards may be a good candidate for doing this in a lightweight manner. We used to have a slow-parse dashboard in the previous version of Kibana that showed the slowest pages to parse (a trend of how often parses go over the threshold, as well as a breakdown by individual wiki+pagetitle tuples). We can create similar ones for other things we track.

Gilles triaged this task as Low priority. Dec 7 2016, 5:57 PM

Krinkle moved this task from Backlog: Maintenance, non-prioritized to To-do: Goals, prioritized next 4 Quarters on the Performance-Team board. Sep 13 2017, 1:17 PM

For catching slow queries, we can use logging to logstash when the runtime passes a certain threshold (to avoid spamming the service). A leaderboard could be added to Kibana for the top occurrences of normalized messages.

For catching spammy fast queries that take up a large percentage of total query time, we could use heavily sampled logging to logstash. A leaderboard could be added to Kibana using count × runtime of normalized messages (i.e. listing the messages with the highest such values).

I think doing so can let us avoid having an ephemeral leaderboard kept in memory somewhere (e.g. redis or some C daemon).
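To make the "count × runtime of normalized messages" idea concrete, here is a rough Python sketch of the aggregation such a leaderboard would perform. It is only an illustration: in practice the aggregation would happen in Elasticsearch/Kibana, and the names `normalize`, `leaderboard` and `sample_rate`, as well as the literal-stripping regex, are assumptions made for this example rather than anything that exists in our logging setup.

```python
import re
from collections import defaultdict
from heapq import nlargest

# Hypothetical normalizer: matches quoted strings and standalone numbers.
_LITERAL = re.compile(r"'(?:[^'\\]|\\.)*'|\b\d+(?:\.\d+)?\b")


def normalize(query):
    """Replace string/numeric literals with '?' and collapse whitespace."""
    return " ".join(_LITERAL.sub("?", query).split())


def leaderboard(events, n=10, sample_rate=1.0):
    """Rank normalized queries by estimated total runtime.

    events: iterable of (query, runtime_seconds) pairs from sampled logs.
    sample_rate: assumed fraction of queries that were logged (0 < rate <= 1).
    Returns up to n tuples (normalized_query, est_count, est_total_runtime).
    """
    count = defaultdict(int)
    total = defaultdict(float)
    for query, runtime in events:
        key = normalize(query)
        count[key] += 1
        total[key] += runtime
    scale = 1.0 / sample_rate  # extrapolate from the sample to all traffic
    rows = [(key, count[key] * scale, total[key] * scale) for key in count]
    return nlargest(n, rows, key=lambda row: row[2])
```

Ranking by the summed runtime per normalized message is equivalent to count × mean runtime, so a cheap query that runs very often can outrank a rare slow one. Sorting the same rows by the count column instead would give the "top occurrences" leaderboard for the threshold-based slow-query log.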

Sorry to just blurt out my ideas without much context, but I <3 performance and wanted to share some thoughts. Usually a series of web interfaces is useful for such a task, each designed for one level of a drill-down analysis. First, the service-level investigation dashboard: for queries, considering we have decent instrumentation in services, structured logs and hence metrics can be derived in Elasticsearch and then queried as @aaron suggests. We can have a series of nice visualizations in Kibana or (preferably) Grafana to help us out there. This would give some service-level visibility. We should also have some instance-related metrics attached to the logs.

This will take us to the next series of 'web interfaces', for instance-level analysis. We could start with something like what Netflix uses (Atlas) for a higher-level view, and for each selected instance (on which performance regressions are visible) we can isolate it and fetch much more detailed views (see Netflix Vector). For instance-level analysis, the data sources could be profiled samples or trace snapshots (such as those gathered from proc/eBPF or even LTTng).

So the observation workflow can start with identifying slow queries (such as those on one of the MediaWiki services). Ideally Grafana should alert us to such anomalies when a predetermined threshold is reached. It would also identify the instances that are affected. This would take us to an instance-selection web interface where we can select an instance and spin up a dashboard containing views such as individual on-CPU flame graphs, disk latency, etc., sourced from that instance. For example, N instances in one region might show abnormal block I/O latency, affecting a query that involves disk access and thus slowing down Y% of total queries on the infra.

Of course, I am just suggesting. I may be wrong as well. Feel free to assign me some tasks for investigating possibilities/building tools etc. I would be willing to develop some tools related to this effort :)

Krinkle added a subtask: T98563: Publicly expose (publish) slow parse logs for public Wikimedia wikis. Mar 16 2018, 8:56 PM

Krinkle mentioned this in T189284: Stop serving slowparse logs from dumps distribution servers. Mar 16 2018, 9:02 PM

Gilles closed this task as Declined. Oct 14 2020, 8:44 PM
