
Uniform performance insight for different services (tracking)
Closed, Resolved, Public


Inspired by GOTO 2016 • What I Wish I Had Known Before Scaling Uber to 1000 Services (Matt Ranney).

Call graphs / Flame graphs

Around 21m00s he mentions that while tooling differs between programming languages and frameworks, it was quite useful to have a uniform performance insight by converting the output of those (different) tools to flame graphs, which give a more familiar feeling.

We currently do this only for the MediaWiki run-time (with Xenon for HHVM).

It'd be interesting to get similar statistics going for other run-times and services.

Some services for which we should consider adding a sampling profiler and aggregating the logs into daily flame graphs:

  • Varnish front-end and back-end. (Wikimedia VCL; operations/puppet)
  • RESTBase (Node.js)
  • Parsoid (Node.js)
  • EventLogging (Python)
  • Statsv (Python; analytics/statsv)
  • RCStream (Python; mediawiki/services/rcstream)
  • EventStreams (Node.js)
  • MediaWiki front-end JavaScript (maybe capture via headless Chrome as part of in operations/puppet)

It'd be great to be able to easily dig into any of these services, both for the teams that maintain them and for e.g. Performance-Team as part of routine inspection.
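As a rough illustration of the conversion step, here is a minimal sketch that aggregates sampled call stacks into the folded-stack text format that the Flame Graph tooling accepts (one line per unique stack, frames joined by `;`, followed by a sample count). The function name and the sample data are hypothetical; each runtime's profiler would need its own adapter to produce the frame lists.

```python
from collections import Counter


def fold_stacks(samples):
    """Aggregate sampled call stacks into folded-stack lines.

    Each sample is a list of frame names ordered root first, leaf last.
    Identical stacks are merged and counted.
    """
    counts = Counter(";".join(frames) for frames in samples if frames)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical samples, standing in for what a sampling profiler might emit.
samples = [
    ["main", "handle_request", "render"],
    ["main", "handle_request", "render"],
    ["main", "handle_request", "query_db"],
]

for line in fold_stacks(samples):
    print(line)
```

Because the folded format is plain text, samples collected from different services over a day can be concatenated and rendered with the same tooling, whatever language the service is written in.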

Distributed tracing

Aside from flame graphs for individual services (from sampling profilers and/or one-off full debug profiling), we should also look into ways to get high-level overviews.

These high-level overviews could also include information about backend services (e.g. from MediaWiki to MySQL, from RESTBase to Cassandra, etc.).
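To make the idea of a high-level overview more concrete, here is a small sketch that renders timing spans from several services as a text waterfall. The `Span` structure, its field names and the timings are assumptions made up for illustration; they do not describe any existing tracing system.

```python
from dataclasses import dataclass


@dataclass
class Span:
    service: str      # which service did the work
    start_ms: int     # offset from the start of the top-level request
    duration_ms: int  # time spent in this service, including its callees


def render_waterfall(spans, width=40):
    """Return one text line per span, with bars scaled to `width` columns."""
    end = max(s.start_ms + s.duration_ms for s in spans)
    lines = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        offset = s.start_ms * width // end
        length = max(1, s.duration_ms * width // end)
        lines.append(
            f"{s.service:<10} {' ' * offset}{'#' * length} {s.duration_ms}ms"
        )
    return lines


# Hypothetical request: MediaWiki calls RESTBase, which calls Cassandra.
spans = [
    Span("MediaWiki", 0, 100),
    Span("RESTBase", 10, 60),
    Span("Cassandra", 20, 30),
]

for line in render_waterfall(spans):
    print(line)
```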

Event Timeline

ori triaged this task as Medium priority. Oct 3 2016, 6:48 PM
ori moved this task from Inbox to Backlog: Maintenance on the Performance-Team board.

Two main tasks stemming from this:

  1. Gathering aggregated stack traces from different services in different programming languages. These will likely require different solutions depending on the runtime. We're not aiming for a single solution here, merely a universal output format, as supported by Flame Graph, to be published periodically on
  2. Distributed tracing across different services, especially to accurately measure latency when services call each other, and to be able to see how and in which service that time is being spent. E.g. through waterfall graphs that show how one service calls another.

If this second point includes in-process tracing, it will depend on point 1.

I'd like to prioritise point 1 for the time being. While we work on that, we can start thinking about the requirements for point 2: for example, how we'll associate the different service requests (e.g. by passing around a header?), and which utility libraries, if any, we might need to deploy inside these services in different languages.
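One way the "passing around a header" idea could look, as a minimal sketch: a service reuses the identifier it received, or creates one if it is the first hop, and attaches it to every request it makes to other services. The header name `X-Request-Id` and the helper name are assumptions for illustration, not an agreed convention, and the sketch assumes header names have already been normalised to that exact casing.

```python
import uuid

# Hypothetical header name; the real one would need to be agreed on.
TRACE_HEADER = "X-Request-Id"


def outgoing_headers(incoming):
    """Build headers for a downstream call that carry the request ID.

    `incoming` is a dict of the headers this service received.
    """
    request_id = incoming.get(TRACE_HEADER) or uuid.uuid4().hex
    return {TRACE_HEADER: request_id}


# First hop: no ID yet, so one is generated.
first = outgoing_headers({})
# Later hop: the same ID is passed along unchanged.
second = outgoing_headers(first)
print(first[TRACE_HEADER] == second[TRACE_HEADER])
```

With the same ID logged by every service along with its timings, the individual requests can be joined afterwards into a single trace.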

A few relevant links as shared in the triage meeting earlier today:

Possibly relevant to 2: just came across this yesterday

Krinkle renamed this task from Uniform performance insight for different services to Uniform performance insight for different services (tracking). Jan 18 2018, 5:49 PM
Krinkle added a project: Epic.
Krinkle claimed this task.