There should be a framework for running controlled experiments on the application server fleet in production. The purpose of such a framework would be to make it easier and safer to evaluate the real-world impact of software or hardware changes on application server performance. Here is a sketch of the parts such a framework could consist of:
- A mechanism for applying a label to a subset of application servers, selected randomly or based on some machine attribute (one possible assignment scheme is sketched after this list).
- Labels are plumbed through to Puppet as facts, making it possible to vary appserver configuration by label (see the external-fact sketch below).
- Labels are also plumbed through to MediaWiki and other software running on the appserver. This could be done via an environment variable, a file on disk, or through etcd. Each approach has trade-offs: an environment variable is simple but only re-read on process restart; a file on disk survives restarts and is easy for other tools to read; etcd allows live updates at the cost of a runtime dependency. (A fallback chain covering these options is sketched after this list.)
- All structured log messages, application metrics, and system-level metrics (Prometheus, excimer samples, etc.) are annotated with the labels (see the annotation sketch after this list).
  - This might be tricky for statsd metrics, since the statsd protocol has no native concept of labels; the label would likely have to be encoded into the metric name, or sent via DogStatsD-style tag extensions where supported.
- Grafana dashboards parametrized by label are available, making it easy to compare a metric across the experiment groups. (The sketch at the end of this section shows the equivalent comparison as a raw Prometheus query.)
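
To make the above concrete, here are some rough sketches. None of this is existing tooling; every name, path, and hostname below is a placeholder. First, label assignment: hashing the hostname together with an experiment name gives a pseudo-random but stable split, so a host keeps its label for the lifetime of an experiment, and different experiments get independent assignments.

```python
import hashlib

EXPERIMENT = "cpufreq-governor"    # hypothetical experiment name
GROUPS = ["control", "treatment"]  # label values to assign

def label_for(hostname: str) -> str:
    """Map a hostname to a group, uniformly and deterministically.

    Salting the hash with the experiment name keeps assignments for
    different experiments statistically independent of each other.
    """
    digest = hashlib.sha256(f"{EXPERIMENT}:{hostname}".encode()).digest()
    return GROUPS[int.from_bytes(digest[:8], "big") % len(GROUPS)]

if __name__ == "__main__":
    for host in ("appserver1001", "appserver1002", "appserver1003"):
        print(host, "->", label_for(host))
```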
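Next, the Puppet plumbing. Facter picks up executable external facts: it runs programs found in `/etc/facter/facts.d/` and parses `key=value` lines from their stdout. A minimal sketch, assuming the assignment mechanism writes the label to `/etc/experiment-label` (both the fact name and the file path are assumptions, not existing conventions):

```python
#!/usr/bin/env python3
# Hypothetical executable external fact, installed (chmod +x) as
# /etc/facter/facts.d/experiment_label.py. Facter executes it and
# parses the key=value line printed below.
from pathlib import Path

LABEL_FILE = Path("/etc/experiment-label")  # assumed location

label = LABEL_FILE.read_text().strip() if LABEL_FILE.exists() else "none"
print(f"experiment_label={label}")
```

Manifests could then branch on `$facts['experiment_label']` to vary configuration per group.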
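For software running on the appserver, a small fallback chain covers the file and environment-variable options from the list above (the variable name and path are assumptions; an etcd lookup could slot in as a further source):

```python
import os
from pathlib import Path

LABEL_FILE = Path("/etc/experiment-label")  # same assumed path as above

def current_label(default: str = "none") -> str:
    """Return this host's experiment label, if any.

    An environment variable wins (easy to override for one process),
    then the file on disk (one stable value per host).
    """
    label = os.environ.get("EXPERIMENT_LABEL")
    if label:
        return label
    if LABEL_FILE.exists():
        return LABEL_FILE.read_text().strip()
    return default
```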
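Finally, annotation. The metric side is straightforward with the prometheus_client library; for structured logs, a `logging.Filter` can stamp every record with the label. Metric, logger, and field names here are illustrative:

```python
import logging

from prometheus_client import Counter

# An illustrative counter, partitioned by experiment group.
REQUESTS = Counter(
    "appserver_requests_total",
    "Requests served, partitioned by experiment group.",
    ["experiment_label"],
)

class ExperimentLabelFilter(logging.Filter):
    """Stamp every log record with the host's experiment label."""

    def __init__(self, label: str) -> None:
        super().__init__()
        self.label = label

    def filter(self, record: logging.LogRecord) -> bool:
        record.experiment_label = self.label
        return True

label = "control"  # in practice: current_label() from the sketch above

REQUESTS.labels(experiment_label=label).inc()

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"message": "%(message)s", "experiment_label": "%(experiment_label)s"}'
))
logger = logging.getLogger("appserver")
logger.addHandler(handler)
logger.addFilter(ExperimentLabelFilter(label))
logger.warning("example structured log line")
```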
Such a framework would help evaluate many kinds of changes:
- Hardware tunables, such as CPU frequency scaling behaviors (T315398)
- Tunable kernel parameters
- Various other tunable software parameters
- Different hardware configurations (different processors, different RAM, etc.)
- Different software versions (PHP 7.2 vs 7.4, etc.)
- Different implementations (library X vs library Y)
- Code changes
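
Whatever the change under test, the payoff is the same: once telemetry carries the label, evaluating any item on this list reduces to a grouped comparison. As a rough illustration, here is such a comparison as a query against the Prometheus HTTP API (the `/api/v1/query` endpoint is real; the server URL and metric names are assumptions):

```python
import requests

PROMETHEUS = "http://prometheus.example.net:9090"  # assumed server
QUERY = (
    "avg by (experiment_label) ("
    "rate(appserver_request_seconds_sum[5m])"
    " / rate(appserver_request_seconds_count[5m]))"
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    group = series["metric"].get("experiment_label", "unlabeled")
    _, value = series["value"]
    print(f"{group}: mean request latency {float(value):.4f}s")
```

A Grafana dashboard would wrap the same query in a template variable over `experiment_label`, so the label-annotated data serves both ad-hoc analysis and standing dashboards.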