
Have Shellbox emit metrics
Open, Needs Triage, Public

Description

In addition to logs (T263545), Shellbox should also emit metrics to help measure and assess the health of the application. We should be able to use the PHP statsd client library and have it send to a sidecar that Prometheus scrapes.

Proposed metrics:

  • counter of requests per endpoint (e.g. Score, imagemagick, etc.)
    • not sure whether we need a separate counter for errors, or whether errors can be inferred from logging
  • timing of how long Shellbox takes to process each request, split per endpoint
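
As a rough illustration of the two proposed metrics, here is a minimal sketch of the StatsD lines a client could emit per request. It is written in Python only to keep it short; the metric names (`shellbox.requests.*`, `shellbox.request_duration.*`) and the `handle` and `emit` helpers are hypothetical and not taken from Shellbox.

```python
import time


def counter_line(endpoint: str, value: int = 1) -> str:
    """Format a StatsD counter increment, e.g. 'shellbox.requests.score:1|c'."""
    # Metric name is a placeholder chosen for this sketch.
    return f"shellbox.requests.{endpoint}:{value}|c"


def timing_line(endpoint: str, millis: float) -> str:
    """Format a StatsD timer sample in whole milliseconds."""
    return f"shellbox.request_duration.{endpoint}:{millis:.0f}|ms"


def handle(endpoint, work, emit):
    """Run `work()` and emit one counter and one timer line for `endpoint`.

    `emit` is any callable taking a string; a real client would send the
    line over UDP to the StatsD sidecar instead.
    """
    start = time.monotonic()
    try:
        return work()
    finally:
        # Emitted even if `work()` raises, so failed requests are counted too.
        elapsed_ms = (time.monotonic() - start) * 1000.0
        emit(counter_line(endpoint))
        emit(timing_line(endpoint, elapsed_ms))
```

Passing `list.append` as `emit` makes it easy to inspect what would be sent, for example `sent = []; handle("score", lambda: "ok", sent.append)`.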

In theory these metrics could also be collected/emitted by the MediaWiki client too.

Event Timeline

I think we will be running every single endpoint as a separate installation (to reduce the attack surface of any single container).

Apart from that, what you described can be covered by a single Prometheus histogram, to which we should add the following labels:

  • Status code of the response (if correctly codified in the code, this will also give us authorization errors and badly formed requests)
  • "endpoint" (that is, what program is being requested)
  • *maybe* some salient data about the request, such as whether it contained a pipe?
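
To make the single-histogram idea concrete, here is a minimal pure-Python sketch of a histogram keyed by the `endpoint` and status-code labels. The bucket bounds and the `LabelledHistogram` class are assumptions made for illustration; a real deployment would use a Prometheus or StatsD client rather than this hand-rolled structure.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket upper bounds, in seconds.
BUCKETS = (0.1, 0.5, 1.0, 5.0, 30.0)


class LabelledHistogram:
    """Duration histogram with one series per (endpoint, status) label pair."""

    def __init__(self, buckets=BUCKETS):
        self.buckets = tuple(sorted(buckets))
        # One slot per bucket plus a final slot for +Inf.
        self.counts = defaultdict(lambda: [0] * (len(self.buckets) + 1))
        self.sums = defaultdict(float)

    def observe(self, seconds, *, endpoint, status):
        key = (endpoint, str(status))
        # Index of the first bucket whose upper bound is >= seconds.
        idx = bisect_left(self.buckets, seconds)
        self.counts[key][idx] += 1
        self.sums[key] += seconds

    def cumulative(self, *, endpoint, status):
        """Cumulative counts per bucket ("less than or equal" semantics).

        The last element is the +Inf bucket, i.e. the total request count.
        """
        total, out = 0, []
        for count in self.counts[(endpoint, str(status))]:
            total += count
            out.append(total)
        return out
```

With this shape, the request count per endpoint is the `+Inf` bucket summed over status codes, and the error count is the same sum restricted to error statuses, so no separate error counter is needed.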

That's basically all we need to get info about the performance of Shellbox.

But if we're getting fancy, it would be great to also record the resource usage that PHP can report, such as the memory used by the execution of an external program. I guess that would require more coding on the Shellbox side, though, and we don't really need that on day 1.
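
For the resource-usage idea, the operating system already tracks CPU time and peak memory for child processes. PHP exposes this through `getrusage()`. The sketch below shows the same mechanism through Python's `resource` module on a Unix-like system. The `run_and_measure` helper is hypothetical and says nothing about how Shellbox would actually wire this in.

```python
import resource
import subprocess
import sys


def run_and_measure(argv):
    """Run an external command and report what the OS recorded for children."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    proc = subprocess.run(argv, capture_output=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "exit_code": proc.returncode,
        # CPU time is cumulative over waited-for children, so take the delta.
        "user_cpu_s": after.ru_utime - before.ru_utime,
        "system_cpu_s": after.ru_stime - before.ru_stime,
        # Peak resident set size is a high-water mark across all waited-for
        # children so far, not just this command (kilobytes on Linux).
        "max_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    print(run_and_measure([sys.executable, "-c", "pass"]))
```

Because the peak memory figure is a high-water mark rather than a per-command value, it is only directly attributable when each worker process runs one command at a time.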