There should be a framework for running controlled experiments on the application server fleet in production. The purpose of such a framework would be to make it easier and safer to evaluate the real-world impact of software or hardware changes on application server performance. Here is a sketch of the parts such a framework could consist of:
- A mechanism for applying a label to a subset of application servers, selected randomly or based on some machine attribute (one possible assignment scheme is sketched after this list).
- Labels are plumbed through to Puppet as facts, making it possible to vary appserver configuration by label (see the external-fact sketch below).
- Labels are also plumbed through to MediaWiki and other software running on the appserver. This could be done via an environment variable, a file on disk, or through etcd. Each approach has trade-offs: an environment variable is simple but only re-read on process restart; a file on disk survives restarts and is easy for other tools to read; etcd allows live updates at the cost of a runtime dependency. (A fallback chain covering these options is sketched after this list.)
- All structured log messages, application metrics, and system-level metrics (Prometheus, excimer samples, etc.) are annotated with the labels (see the annotation sketch after this list).
  - This might be tricky for statsd metrics, since the statsd protocol has no native concept of labels; the label would likely have to be encoded into the metric name, or sent via DogStatsD-style tag extensions where supported.
- Grafana dashboards parametrized by label are available, making it easy to compare a metric across the experiment groups. (The sketch at the end of this section shows the equivalent comparison as a raw Prometheus query.)
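
To make the above concrete, here are some rough sketches. None of this is existing tooling; every name, path, and hostname below is a placeholder. First, label assignment: hashing the hostname together with an experiment name gives a pseudo-random but stable split, so a host keeps its label for the lifetime of an experiment, and different experiments get independent assignments.

```python
import hashlib

EXPERIMENT = "cpufreq-governor"    # hypothetical experiment name
GROUPS = ["control", "treatment"]  # label values to assign

def label_for(hostname: str) -> str:
    """Map a hostname to a group, uniformly and deterministically.

    Salting the hash with the experiment name keeps assignments for
    different experiments statistically independent of each other.
    """
    digest = hashlib.sha256(f"{EXPERIMENT}:{hostname}".encode()).digest()
    return GROUPS[int.from_bytes(digest[:8], "big") % len(GROUPS)]

if __name__ == "__main__":
    for host in ("appserver1001", "appserver1002", "appserver1003"):
        print(host, "->", label_for(host))
```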
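Next, the Puppet plumbing. Facter picks up executable external facts: it runs programs found in `/etc/facter/facts.d/` and parses `key=value` lines from their stdout. A minimal sketch, assuming the assignment mechanism writes the label to `/etc/experiment-label` (both the fact name and the file path are assumptions, not existing conventions):

```python
#!/usr/bin/env python3
# Hypothetical executable external fact, installed (chmod +x) as
# /etc/facter/facts.d/experiment_label.py. Facter executes it and
# parses the key=value line printed below.
from pathlib import Path

LABEL_FILE = Path("/etc/experiment-label")  # assumed location

label = LABEL_FILE.read_text().strip() if LABEL_FILE.exists() else "none"
print(f"experiment_label={label}")
```

Manifests could then branch on `$facts['experiment_label']` to vary configuration per group.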
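For software running on the appserver, a small fallback chain covers the file and environment-variable options from the list above (the variable name and path are assumptions; an etcd lookup could slot in as a further source):

```python
import os
from pathlib import Path

LABEL_FILE = Path("/etc/experiment-label")  # same assumed path as above

def current_label(default: str = "none") -> str:
    """Return this host's experiment label, if any.

    An environment variable wins (easy to override for one process),
    then the file on disk (one stable value per host).
    """
    label = os.environ.get("EXPERIMENT_LABEL")
    if label:
        return label
    if LABEL_FILE.exists():
        return LABEL_FILE.read_text().strip()
    return default
```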
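Finally, annotation. The metric side is straightforward with the prometheus_client library; for structured logs, a `logging.Filter` can stamp every record with the label. Metric, logger, and field names here are illustrative:

```python
import logging

from prometheus_client import Counter

# An illustrative counter, partitioned by experiment group.
REQUESTS = Counter(
    "appserver_requests_total",
    "Requests served, partitioned by experiment group.",
    ["experiment_label"],
)

class ExperimentLabelFilter(logging.Filter):
    """Stamp every log record with the host's experiment label."""

    def __init__(self, label: str) -> None:
        super().__init__()
        self.label = label

    def filter(self, record: logging.LogRecord) -> bool:
        record.experiment_label = self.label
        return True

label = "control"  # in practice: current_label() from the sketch above

REQUESTS.labels(experiment_label=label).inc()

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"message": "%(message)s", "experiment_label": "%(experiment_label)s"}'
))
logger = logging.getLogger("appserver")
logger.addHandler(handler)
logger.addFilter(ExperimentLabelFilter(label))
logger.warning("example structured log line")
```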
Such a framework would help evaluate many kinds of changes:
- Hardware tunables, such as CPU frequency scaling behaviors (T315398)
- Tunable kernel parameters
- Various other tunable software parameters
- Different hardware configurations (different processors, different RAM, etc.)
- Different software versions (PHP 7.2 vs 7.4, etc.)
- Different implementations (library X vs library Y)
- Code changes
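
Whatever the change under test, the payoff is the same: once telemetry carries the label, evaluating any item on this list reduces to a grouped comparison. As a rough illustration, here is such a comparison as a query against the Prometheus HTTP API (the `/api/v1/query` endpoint is real; the server URL and metric names are assumptions):

```python
import requests

PROMETHEUS = "http://prometheus.example.net:9090"  # assumed server
QUERY = (
    "avg by (experiment_label) ("
    "rate(appserver_request_seconds_sum[5m])"
    " / rate(appserver_request_seconds_count[5m]))"
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    group = series["metric"].get("experiment_label", "unlabeled")
    _, value = series["value"]
    print(f"{group}: mean request latency {float(value):.4f}s")
```

A Grafana dashboard would wrap the same query in a template variable over `experiment_label`, so the label-annotated data serves both ad-hoc analysis and standing dashboards.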