Page MenuHomePhabricator

Properly support php7.4 across the observability stack
Closed, ResolvedPublic

Description

While metrics are emitted by every php version, we need to check/fix the following things:

  • mtail regex doesn't match the php 7.4 handler
  • flamegraphs from xhgui - do we split on php version already? we probably need to
  • logstash dashboards - they should already include a filter based on the php version, but better check

Export and collect phpfpm metrics with prometheus-phpfpm-exporter for all versions of php.

  • prometheus needs to collect php7.4 metrics
  • grafana dashboards should at the very least allow selecting which version of php to collect data for

Related Objects

StatusSubtypeAssignedTask
ResolvedNone
ResolvedJdforrester-WMF
ResolvedJdforrester-WMF
ResolvedJdforrester-WMF
ResolvedJdforrester-WMF
Resolved toan
ResolvedLucas_Werkmeister_WMDE
ResolvedJoe
ResolvedJdforrester-WMF
Resolved Ladsgroup
InvalidNone
ResolvedReedy
OpenNone
Resolvedtstarling
ResolvedJdforrester-WMF
ResolvedPRODUCTION ERRORLegoktm
Resolvedtstarling
ResolvedJoe
ResolvedJoe
ResolvedJoe
ResolvedJoe

Event Timeline

Joe triaged this task as High priority.Jul 8 2022, 10:41 AM

Change 811934 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mtail: fix regexes due to changes in apache configuration

https://gerrit.wikimedia.org/r/811934

Change 811934 merged by Giuseppe Lavagetto:

[operations/puppet@production] mtail: fix regexes due to changes in apache configuration

https://gerrit.wikimedia.org/r/811934

Prometheus already collects the metrics for all php versions as the php monitoring already proxies all results to the main port.

Adding an exporter per php version is relatively easy, we can do so that for each php version we launch one instance of php-fpm-exporter on a different port.

The downside of doing so is that then you need:

  • To add appropriate labels to the collector in prometheus/ops.pp, which will anyways be static, not dynamic
  • To ensure we never change the order of ports that correspond to each php version, else the labeling will just get mixed up

I don't really like this option because it makes what was supposed to be a semi-automatic process of installing new php versions add a couple manual steps.

An alternative would be modifying the php-fpm exporter to be able to query multiple backends and adding an appropriate label to those. I took a look at the code and it doesn't look very hard to do.

Adding an exporter per php version is relatively easy, we can do so that for each php version we launch one instance of php-fpm-exporter on a different port.

The downside of doing so is that then you need:

  • To add appropriate labels to the collector in prometheus/ops.pp, which will anyways be static, not dynamic
  • To ensure we never change the order of ports that correspond to each php version, else the labeling will just get mixed up

I don't really like this option because it makes what was supposed to be a semi-automatic process of installing new php versions add a couple manual steps.

An alternative would be modifying the php-fpm exporter to be able to query multiple backends and adding an appropriate label to those. I took a look at the code and it doesn't look very hard to do.

(following up from IRC) on balance I agree that adapting php-fpm-exporter to be able to target multiple backends is the right way to go

There is a plot twist: I realized that anything currently offered by php-fpm-exporter of any value to us is also exported by the php-fpm-statustext exporter. So I think we can just stop collecting the data soon, then uninstall the php-fpm exporter if needed.

The MediaWiki errors collected in logstash do indeed distinguish between php versions, as I verified by causing a fatal error in a php 7.4 request. That was correctly categorized in logstash.

What's left is, at this point, to remove prometheus-php-fpm-exporter from production, and to split flamegraphs.

@Krinkle do you think we need to split flamegraphs? I would tend to say the answer is "yes", but IIRC that required some more work than just a patch to mediawiki-config, so I'd check with you before proceeding :)

It'd be nice to have a comparison. But with T244776 and T200108 still unresolved, it's not immediately obvious how safe it would be to add another doubling for fragmentation.

If these use cases keep coming up, we may want to replace fake "entrypoint" frame with a more flexible "tag" feature that we append to the trace log (e.g. instead of /w/index.php;Foo;Bar 1 we'd have Foo;Bar 1 # index.php GET k8s php74 or some such, which doesn't require a new duplicate stream of samples to expand. Instead, they would be consumed by the arclamp processor and flushed to different flame graphs accordingly.

For now, I wouldn't mind skipping this for now. We have a number of other insights to appserver performance already.

What would help, though, is to perhaps have for the majority of web traffic (api_appserver,appserver) a 24-hour period back to only php72 after we've completed most of the rollout, to have a more stark contrast for the long tail of other metrics that we don't vary, so as to confirm or rule-out anything that might slowly change — unless we ramp up api_appserver and appserver in 7 days or less — in which case that'll provide enough week-over-week confidence in my opinion.