Page MenuHomePhabricator

Reliable metrics for idle/busy PHP-FPM workers
Closed, ResolvedPublic

Description

We use php-fpm-exporter to export Prometheus metrics about PHP-FPM's internal state, such as the number of active vs idle workers.

However, in order for php-fpm to serve its own status information (scraped by the exporter), at least one worker thread must be idle. This is often an issue in saturation events, where we lose PHP monitoring data entirely during the affected interval, on either a small number of appservers or on the whole fleet.

There's a couple options for getting around this:

  1. Create a dedicated statistics worker. We'd get a full export of all of php-fpm's current metrics, but this would also involve major modifications to the php-fpm source.
  2. Have workers report their idle/busy status via some out-of-band mechanism, in addition to the existing one.

It turns out #2 is pretty easy to implement. P11187 is a patch that simply updates each worker's argv with 'busy' or 'idle' whenever the state changes:

1Every 1.0s: systemctl status php7.2-fpm deployment-mediawiki-07: Tue May 12 19:35:56 2020
2
3โ— php7.2-fpm.service - The PHP 7.2 FastCGI Process Manager
4 Loaded: loaded (/lib/systemd/system/php7.2-fpm.service; enabled; vendor preset: enabled)
5 Drop-In: /etc/systemd/system/php7.2-fpm.service.d
6 โ””โ”€puppet-override.conf
7 Active: active (running) since Tue 2020-05-12 19:35:01 UTC; 54s ago
8 Docs: man:php-fpm7.2(8)
9 Main PID: 9824 (php-fpm7.2)
10 Status: "Processes active: 0, idle: 8, Requests: 41, slow: 0, Traffic: 2.3req/sec"
11 Tasks: 17 (limit: 4915)
12 CGroup: /system.slice/php7.2-fpm.service
13 โ”œโ”€9824 php-fpm: master process (/etc/php/7.2/fpm/php-fpm.conf)
14 โ”œโ”€9839 php-fpm: pool www: busy
15 โ”œโ”€9841 php-fpm: pool www: idle
16 โ”œโ”€9842 php-fpm: pool www: idle
17 โ”œโ”€9843 php-fpm: pool www: idle
18 โ”œโ”€9844 php-fpm: pool www: idle
19 โ”œโ”€9845 php-fpm: pool www: idle
20 โ”œโ”€9846 php-fpm: pool www: idle
21 โ””โ”€9847 php-fpm: pool www: idle
22
23

(The reason for the difference between the per-process reported state and the Status: "Processes active: 0, idle 8 state that php-fpm already provides to systemd is that the latter is only updated once every 10 seconds.)

Changing argv is implemented via PHP's own emulation of BSD's setproctitle() -- in this case, a memset() followed by some strncopy()s, all smallish. Very likely fast enough to do on every request.

Looking at history, fpm_requests.c doesn't change too often (a few times a year), so maintaining this patch on our own seems quite feasible -- but that being said, it still seems like a fine candidate for upstreaming, assuming it works out well for us.

From there we could have a trivial exporter that scanned php-fpm process argvs and exported a state gauge.

Event Timeline

Restricted Application added a subscriber: Aklapper. ยท View Herald TranscriptMay 12 2020, 10:09 PM
CDanis triaged this task as Medium priority.May 12 2020, 10:11 PM
@CDanis wrote:

(The reason for the difference between the per-process reported state and the Status: "Processes active: 0, idle 8 state that php-fpm already provides to systemd is that the latter is only updated once every 10 seconds.)

Is 10s is not frequent enough? Or is this data not reliable (e.g. can be stale, like our current), or is it not feasible to scrape?

@CDanis wrote:

(The reason for the difference between the per-process reported state and the Status: "Processes active: 0, idle 8 state that php-fpm already provides to systemd is that the latter is only updated once every 10 seconds.)

Is 10s is not frequent enough? Or is this data not reliable (e.g. can be stale, like our current), or is it not feasible to scrape?

Funnily, I didn't even notice systemctl's Status: line (set via sd_notify) until I was done implementing the argv-mutating version and on my second pass of writing up this task :)

This would likely work fine as well. 10s granularity is fine, but there's still something of a scraping issue; the best way I've found to get the data programmatically is to invoke systemctl:

โœ”๏ธ cdanis@mw1398.eqiad.wmnet ~ ๐Ÿ•ก๐Ÿบ systemctl show php7.2-fpm --property StatusText
StatusText=Processes active: 11, idle: 61, Requests: 145700715, slow: 2439, Traffic: 115req/sec

Another option would be to talk to systemd directly via its D-Bus protocol, but I think calling systemctl would be preferable.

Then we have to parse the line itself, although that seems easy, and I don't think the format string changes often.

So maybe that's the best thing to do here.

FWIW, I think I remember systemctl status php7.2-fpm to stall on a busy server, but I might remember incorrectly.

I took a brief peek at what flows to systemd from php-fpm on dbus:

$ sudo dbus-monitor --system path='/org/freedesktop/systemd1/unit/php7_2e2_2dfpm_2eservice
...

and saw that the status is passed as a statustext string, so we'd need to decode that too. It's ok if we want to either:

  • Listen to dbus, extract the status text and the timestamp of the event
  • run systemctl show php7.2-fpm -p StatusText and parse the status line. We lose the ability to save the notification timestamp, but we gain in simplicity I guess.

Discussed some with Joe on IRC and the consensus approach for now is to, for now, write a textfile exporter that parses the systemd status line, and perhaps later pursue having php-fpm-exporter fall back to parsing the systemd status line if its HTTP request times out.

I also submitted the argv-mutating code upstream as https://github.com/php/php-src/pull/5567 because I think it's otherwise useful for understanding what's happening on a server at a glance.

Change 601454 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] textfile exporter for php-fpm worker pool status

https://gerrit.wikimedia.org/r/601454

Mentioned in SAL (#wikimedia-operations) [2020-06-08T14:47:46Z] <cdanis> โœ”๏ธ cdanis@cumin1001.eqiad.wmnet ~ ๐Ÿ•šโ˜• sudo cumin A:mw-canary 'disable-puppet "cdanis deploying I25ab44c1 T252605"'

Change 601454 merged by CDanis:
[operations/puppet@production] textfile exporter for php-fpm worker pool status

https://gerrit.wikimedia.org/r/601454

Mentioned in SAL (#wikimedia-operations) [2020-06-08T14:53:08Z] <cdanis> โœ”๏ธ cdanis@cumin1001.eqiad.wmnet ~ ๐Ÿ•šโ˜• sudo cumin A:mw-canary 'enable-puppet "cdanis deploying I25ab44c1 T252605"'

Change 603511 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] expand phpfpm status text exporter to all appservers

https://gerrit.wikimedia.org/r/603511

Change 603511 merged by CDanis:
[operations/puppet@production] expand phpfpm status text exporter to all appservers

https://gerrit.wikimedia.org/r/603511

Change 606519 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] add alerting for Mediawiki PHP-FPM worker pool saturation

https://gerrit.wikimedia.org/r/606519

Change 606519 merged by CDanis:
[operations/puppet@production] add alerting for Mediawiki PHP-FPM worker pool saturation

https://gerrit.wikimedia.org/r/606519

CDanis closed this task as Resolved.EditedJun 18 2020, 10:05 PM

We now have an alert and a graph based on scraping the status string that php-fpm provides to systemd, which is reliable even in the face of extreme saturation.

I haven't heard back from php-fpm upstream in a while re: my pull request, but it sounds like they were thinking about building a dedicated /status-only worker, if that proved to not be too hard. If that did come together, it would let us delete the textfile exporter and solely use the preexisting php-fpm-exporter.

Change 607163 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] MW PHP-FPM worker saturation: make it page

https://gerrit.wikimedia.org/r/607163

Change 607163 merged by CDanis:
[operations/puppet@production] MW PHP-FPM worker saturation: make it page

https://gerrit.wikimedia.org/r/c/operations/puppet/ /607163