Inspired by GOTO 2016 • What I Wish I Had Known Before Scaling Uber to 1000 Services (Matt Ranney).
Call graphs / Flame graphs
Around 21m00s he mentions that while profiling tooling differs between programming languages and frameworks, it was quite useful to get a uniform view of performance by converting the output of those (different) tools to flame graphs, which present the data in a single, familiar format.
We currently do this only for the MediaWiki run-time (with Xenon for HHVM).
It'd be interesting to get similar statistics going for other run-times and services.
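For context, flame graph tooling such as Brendan Gregg's flamegraph.pl consumes a simple "folded stacks" text format: one line per unique stack, with frames joined by ";" (root first) followed by a sample count. A minimal sketch of the collapse step is below, assuming the upstream profiler already emits one semicolon-joined stack per sample per line; that input format is an assumption for illustration, not the output of any specific tool we run.

```python
import collections
import fileinput


def collapse(lines):
    """Count identical sampled stacks and yield flamegraph.pl's folded format."""
    counts = collections.Counter(line.strip() for line in lines if line.strip())
    for stack, count in counts.most_common():
        yield '%s %d' % (stack, count)


if __name__ == '__main__':
    # Read stack samples from stdin or from files named on the command line,
    # e.g. one day's worth of per-request sample logs.
    for line in collapse(fileinput.input()):
        print(line)
```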
Some services for which we should consider adding a sampling profiler and aggregating its logs into daily flame graphs (see the sketch after this list):
- Varnish front-end and back-end. (Wikimedia VCL; operations/puppet)
- RESTBase (Node.js)
- Parsoid (Node.js)
- EventLogging (Python)
- Statsv (Python; analytics/statsv)
- RCStream (Python; mediawiki/services/rcstream)
- EventStreams (Node.js)
- MediaWiki front-end JavaScript (maybe capture via headless Chrome as part of asset-check.py in operations/puppet)
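For the Python services above, one low-tech option would be an in-process sampler along the following lines. This is only a sketch (Unix-only, SIGPROF/CPU-time based, frame names not fully qualified) and not how Xenon or any existing profiler of ours works; its output feeds directly into the collapse step shown earlier.

```python
import collections
import signal

# Aggregated sample counts keyed by a folded "root;...;leaf" stack string.
samples = collections.Counter()


def _sample(signum, frame):
    # Walk from the interrupted frame up to the root and record one sample.
    stack = []
    while frame is not None:
        stack.append(frame.f_code.co_name)
        frame = frame.f_back
    samples[';'.join(reversed(stack))] += 1


def start(interval=0.01):
    # Fire SIGPROF roughly every `interval` seconds of CPU time.
    signal.signal(signal.SIGPROF, _sample)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)


def stop_and_dump(path):
    signal.setitimer(signal.ITIMER_PROF, 0)
    with open(path, 'w') as f:
        for stack, count in samples.items():
            f.write('%s %d\n' % (stack, count))
```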
It'd be great to be able to easily dig into any of these services, both for the teams that maintain them and for e.g. the Performance-Team as part of routine inspection.
Distributed tracing
Aside from flame graphs for individual services (from sampling profilers and/or one-off full debug profiling), we should also look into ways to get high-level overviews.
These high-level overviews could also include information about backend services (e.g. from MediaWiki to MySQL, from RESTBase to Cassandra, etc.).
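To illustrate the minimum that distributed tracing involves, the sketch below propagates a trace id across a service hop via HTTP headers and logs one line per span, which a collector could later assemble into a per-request call tree (Zipkin/Jaeger-style systems formalize exactly this). The header names and log format here are made up for the example and are not an existing convention of ours.

```python
import time
import uuid


def make_child_headers(incoming_headers):
    """Build headers for an outgoing backend request, continuing the trace.

    X-Trace-Id / X-Parent-Span-Id are illustrative names only.
    """
    trace_id = incoming_headers.get('X-Trace-Id') or uuid.uuid4().hex
    span_id = uuid.uuid4().hex
    headers = {
        'X-Trace-Id': trace_id,       # Same id for the whole request tree.
        'X-Parent-Span-Id': span_id,  # Child spans point back to this span.
    }
    return headers, trace_id, span_id


def log_span(trace_id, span_id, service, operation, start, end):
    # One structured log line per span; grouping by trace_id reconstructs
    # the cross-service call graph for a single request.
    print('trace=%s span=%s service=%s op=%s duration_ms=%.1f'
          % (trace_id, span_id, service, operation, (end - start) * 1000))


# Example: a MediaWiki-to-backend style hop, simulated locally.
headers, trace_id, span_id = make_child_headers({})
start = time.time()
time.sleep(0.01)  # stand-in for the actual backend call
log_span(trace_id, span_id, 'mediawiki', 'GET /page/html', start, time.time())
```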