
Framework for running experiments on a subset of the app server fleet
Open, LowPublic

Description

There should be a framework for running controlled experiments on the application server fleet in production. The purpose of such a framework would be to make it easier and safer to evaluate the real-world impact of software or hardware changes on application server performance. Here is a sketch of what the different parts of such a framework could be:

  • A mechanism for applying a label to a subset of application servers, selected randomly or based on some machine attribute.
  • Labels are plumbed through to Puppet as facts, making it possible to vary appserver configuration by label.
  • Labels are also plumbed through to MediaWiki and other software running on the appserver. This could be done via an environment variable, a file on disk, or through etcd. There are advantages and disadvantages to each approach.
  • All structured log messages, application metrics and system-level metrics (Prometheus, excimer samples, etc.) are annotated with the labels.
    • This might be a bit tricky to do with statsd metrics.
  • Dashboards are available in Grafana that are parametrized by label, making it easy to compare a metric across different groups.
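One possible shape for the labeling and plumbing bullets above, purely as a sketch: the labeling mechanism writes a Facter external fact on each selected host, Puppet branches on that fact, and the same value is exported to other software on the host as a flat file. All file paths, fact names, and label values here are hypothetical, not an agreed convention.

```shell
# Scratch directory as a stand-in; on a real host this would be
# /etc/facter/facts.d, which Facter scans for external facts.
FACTS_D="${FACTS_D:-$(mktemp -d)}"
mkdir -p "$FACTS_D"

# The labeling mechanism drops a YAML external fact on each selected host.
# Puppet then sees this as $facts['experiment_group'] and can vary
# appserver configuration based on it.
cat > "$FACTS_D/experiment.yaml" <<'EOF'
experiment_group: treatment
EOF

# Other software on the host (MediaWiki etc.) could read the same label
# from a flat file (or an environment variable) exported alongside the fact:
group=$(awk '/^experiment_group:/ {print $2}' "$FACTS_D/experiment.yaml")
echo "$group"
```

Whether the file-on-disk, environment-variable, or etcd route is best for the MediaWiki side is an open question, as noted above.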

There are many kinds of things that such a framework would help evaluate:

  • Hardware tunables, such as CPU frequency scaling behaviors (T315398)
  • Tunable kernel parameters
  • Various other tunable software parameters
  • Different hardware configurations (different processors, different RAM, etc.)
  • Different software versions (PHP 7.2 vs 7.4, etc.)
  • Different implementations (library X vs library Y)
  • Code changes

Event Timeline

ori updated the task description.

Summary of a conversation that ori, joe, and I had on IRC today:

  • You get some of this "for free" once the appservers are on k8s -- you can add labels to your pods that will be automatically propagated to logstash/prometheus
  • However, it would be valuable to have a framework like this beyond just the appservers or k8s services
    • For instance, Traffic has done a lot of that kind of experimentation on cp nodes with ad-hoc mechanisms in the past, same for some other teams
  • Any Puppet+Prometheus plumbing should be reusable, at least
    • In Prometheus, those same Puppet facts can be exported by node-exporter (by having Puppet generate a textfile for it), and then joined onto other metrics at query time
  • Logstash might be more difficult than Prometheus (although I don't know for sure; maybe there's an easy mechanism with a filter script)
    • Perhaps those tags could be injected via rsyslog (as configured via puppet)?
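The node-exporter textfile route discussed above could look roughly like this. The metric name, label names, and directory are illustrative only; the convention would need to be agreed on.

```shell
# Scratch directory as a stand-in; on a real host this would be the
# directory node-exporter scans via --collector.textfile.directory.
TEXTFILE_DIR="${TEXTFILE_DIR:-$(mktemp -d)}"
mkdir -p "$TEXTFILE_DIR"

# Puppet writes an info-style metric: the value is always 1 and the
# labels carry the experiment membership.
cat > "$TEXTFILE_DIR/experiment.prom" <<'EOF'
# HELP node_experiment_info Experiment group this host belongs to.
# TYPE node_experiment_info gauge
node_experiment_info{experiment="cpufreq_governor",group="treatment"} 1
EOF

# At query time the group label can then be joined onto any per-host
# metric, e.g. in PromQL:
#   node_cpu_seconds_total
#     * on (instance) group_left (group)
#     node_experiment_info
```

The `group_left (group)` join copies the experiment group onto every matching series, which is what would let Grafana dashboards be parametrized by label without touching the underlying exporters.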

Thank you to all involved so far! Agreed, something like this would be useful for the non-k8s parts of the infra too. I'm not exactly sure about the Logstash bits and how we'd attach tags/attributes (cc @colewhite)

Creating an rsyslog template or amending syslog_cee seems like the path of least resistance for injecting a Puppet-defined value into the log stream at the host level.

In the case of amending the syslog_cee template, we'd need to define a default value because this would affect all logs that use that template.

For reference:

...
constant(outname="my_label_name" value="my_custom_value" format="jsonf")
constant(value=", ")
...
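To illustrate where those lines would sit, here is a minimal sketch of an rsyslog list template with the constant spliced in. The template shape, property names, and the `experiment_group`/`treatment` values are illustrative; in practice the value would be rendered by Puppet from the host's fact, and (per the note above) a default value would be needed if the shared syslog_cee template is amended.

```
template(name="syslog_cee" type="list") {
    constant(value="@cee: {")
    property(outname="timestamp" name="timereported" dateFormat="rfc3339" format="jsonf")
    constant(value=", ")
    # Puppet-rendered experiment label; "treatment" is a placeholder value.
    constant(outname="experiment_group" value="treatment" format="jsonf")
    constant(value=", ")
    property(outname="message" name="msg" format="jsonf")
    constant(value="}")
}
```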
ori triaged this task as Low priority. Sep 8 2022, 2:31 PM

Just pinging this task as OKR season is upon us and this might be a useful and fun thing to sneak in.