
Set up a statsv-like endpoint for Prometheus
Open, Medium, Public

Description

In order to generate correct percentiles over a long period of time, and to have histograms and heatmaps, we need a simple way to send metrics to Prometheus storage. A statsv-like convenience endpoint would be a nice way to try this out.

Event Timeline

Gilles created this task.Nov 9 2017, 9:55 AM
Restricted Application added a subscriber: Aklapper. Nov 9 2017, 9:55 AM
Gilles updated the task description. (Show Details)Nov 9 2017, 9:59 AM
Gilles updated the task description. (Show Details)

Given the structural differences between Statsd/Graphite (push) and Prometheus (pull), and various other aspects (such as larger number of value types), I think it'd be wise to abandon the arbitrary key-setting methodology of Statsv.

Firstly, given the pull requirement, we'd either need to combine all incoming metrics in a single process of our own (like the current statsv), or use a Pushgateway.

Secondly, Prometheus tries very hard to encourage good and scalable models by requiring a known type for each metric, and it generally encourages specifying few labels, because they basically get expanded into a matrix of all possible combinations (e.g. foo=123{x=this, y=that} is not very different from manually doing in Graphite: foo.all=123, foo.x_this=123, foo.y_that=123, foo.x_this_y_that=123).
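To make the "matrix of combinations" point concrete, here is a small sketch (purely illustrative, not from statsv or Prometheus) that expands a labeled metric into the set of Graphite keys one would have to maintain by hand to get the same per-dimension aggregates:

```python
from itertools import combinations

def graphite_equivalents(name, labels):
    """Expand a labeled Prometheus-style metric into the Graphite keys
    needed for the same per-dimension aggregates (illustration only)."""
    keys = [f"{name}.all"]
    items = sorted(labels.items())
    # every non-empty subset of labels becomes its own Graphite key
    for r in range(1, len(items) + 1):
        for subset in combinations(items, r):
            suffix = "_".join(f"{k}_{v}" for k, v in subset)
            keys.append(f"{name}.{suffix}")
    return keys

print(graphite_equivalents("foo", {"x": "this", "y": "that"}))
# → ['foo.all', 'foo.x_this', 'foo.y_that', 'foo.x_this_y_that']
```

Note the combinatorial growth: with n labels you get 2^n series per metric, which is why Prometheus discourages high-cardinality labels.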

I think we should start with a restricted model where each key's type and allowed labels are declared ahead of time (e.g. in Puppet or Hiera). E.g. we should not allow external requests to override the type of existing keys or to set arbitrary labels, which could really mess up our data. That might actually be a good reason to set up a separate Prometheus instance for this. Ops already uses a Prometheus federation where data comes from multiple different servers ("namespaces").

Basically, very basic schema-like validation. It could be as simple as:

some_key_here:
  type: Counter # or Histogram, or Summary
  labels_required:
    - something
  labels_optional:
    - else

some_other_thing:
  type: Histogram
  labels_required:
    - foo
    - bar
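A minimal sketch of what enforcing such declarations could look like, assuming the whitelist above is loaded from config (the schema dict and function names here are hypothetical, not an existing statsv API):

```python
# Hypothetical whitelist, mirroring the YAML declaration above; in
# practice this would be loaded from Puppet/Hiera-managed config.
SCHEMA = {
    "some_key_here": {
        "type": "Counter",
        "labels_required": {"something"},
        "labels_optional": {"else"},
    },
    "some_other_thing": {
        "type": "Histogram",
        "labels_required": {"foo", "bar"},
        "labels_optional": set(),
    },
}

def validate(key, labels):
    """Return an error string, or None if the sample is acceptable."""
    decl = SCHEMA.get(key)
    if decl is None:
        return f"unknown metric key: {key}"
    missing = decl["labels_required"] - labels.keys()
    if missing:
        return f"missing required labels: {sorted(missing)}"
    extra = labels.keys() - decl["labels_required"] - decl["labels_optional"]
    if extra:
        return f"labels not declared for this key: {sorted(extra)}"
    return None

print(validate("some_other_thing", {"foo": "a", "bar": "b"}))  # → None
print(validate("some_other_thing", {"foo": "a"}))  # missing label error
```

The point is that external requests can only ever reference pre-declared keys and labels; nothing they send can create new series shapes.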
Krinkle triaged this task as Medium priority.Nov 13 2017, 9:35 PM

+1 to what @Krinkle said re: metrics and validation.

WRT Prometheus retention, a disclaimer: the model is different from what we are used to with Graphite. Prometheus' local storage isn't advertised for long-term retention of data, though I've seen reports of people storing and querying one year of data without problems, and Prometheus 2 AFAIUI makes it even less resource-intensive to store and expire long-term data. The limiting factor is essentially the memory needed to load all datapoints for a given time span. To give a concrete example: querying a single metric with 60s datapoints for one year is ~525k datapoints at ~1.3 bytes/datapoint, so definitely queryable.
Also note that, unlike Graphite, a single Prometheus instance will store all datapoints unsampled for its configured retention period (configured per instance, not per metric as in Graphite).

We currently have per-site prometheus instances that store all datapoints with short retention (90d as of today, but we are looking at extending it as space permits) and a global instance that pulls aggregated data from each per-site instance for long term retention. Here "long" means one year as of today, though we're looking at extending that to two years or more.

Prometheus also supports writing and reading to remote storage, so that's something we can explore too for longer-term retention of data. I'm not aware of any integration that uses distributed storage we already deploy (e.g. Cassandra, Swift, HDFS/Druid/Spark), though it should be technically possible to build one on top of those.

Hm, could we possibly use the EventLogging system (or something similar) for this? Incoming valid EventLogging data goes to a Kafka topic anyway. The data would then be available in Hive/Hadoop/Spark for historical querying (although we'd have to whitelist it from purging). The Kafka topic could then be consumed by some process that would then somehow emit to (or be pulled from) Prometheus. Perhaps a streaming aggregator of some kind? The proposed Stream Data Platform program (final name still TBD) next year might make this kinda stuff way easier.

@Ottomata That's an interesting idea. I can't think of a good reason not to use EventLogging for this on server-side (e.g. all of HTTP-ingestion, validation, storage, processing, consumption, etc.). One thing I think we may need to address though is the client-side production of those packets. Right now the typical way we generate EventLogging data from client-side JS in MediaWiki is in my opinion not well-suited for this use case. But, I think it makes total sense to work on improving that.

A few points come to mind:

  • Do we send a beacon for each key/value pair? (Good: Simple schema. Bad: Inefficient in terms of client CPU/JS and Network)
    • If we want batching, what are our options?
      • Can we declare an EventLogging schema with a property that must contain a list of objects following a certain shape? E.g. { event: { things: [ string, string, .. ] } } or { event: { things: [ { key: .., value: .. }, { key: .., value: .. } ] } }.
  • Do we code the whitelist of metric names and types into the schema? (Good: All validation in one place. Bad: The schema would get very large, especially given that the current EventLogging MW-JS downloads the schema to the client for validation, and statsv-eventlogging is also used outside the MediaWiki context, e.g. WebPageTest.)
    • We could build a simplified MW client that builds its event object without downloading the schema and without client-side validation. The EventLogging server would still validate it. However, maintaining the whitelist on meta.wikimedia.org and needing to include the revision ID in events could become a burden.
    • We could keep the schema very simple (any metric name, any type) and do the validation in the python processor that consumes eventlogging-statsv. This means doing validation in two places, and means that invalid submissions don't go to the "invalid" queue of EventLogging. But instead, we'd do our own validation and presumably write back to another topic before writing to statsd. I don't know if that is a good thing or a bad thing. It would mean we re-use EventLogging for ingestion and raw consumption, but still do our own validation logic.
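To make the two shapes under discussion concrete, here they are as plain dicts (field names are illustrative only, not an agreed schema):

```python
# Option A: one beacon per metric -- simple schema, more requests.
single = {
    "event": {"key": "frontend.navtiming.responseStart", "value": 123, "type": "ms"}
}

# Option B: one beacon carrying a batch -- fewer requests, but
# per-metric validation has to happen downstream of EventLogging.
batched = {
    "event": {
        "metrics": [
            {"key": "frontend.navtiming.responseStart", "value": 123, "type": "ms"},
            {"key": "frontend.navtiming.loadEventEnd", "value": 980, "type": "ms"},
        ]
    }
}

assert len(batched["event"]["metrics"]) == 2
```

The trade-off discussed above falls directly out of the shapes: Option A lets the EventLogging schema validate each metric, while Option B moves that work to whatever consumes the topic.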

the typical way we generate EventLogging data from client-side JS in MediaWiki is in my opinion not well-suited for this use case

Hm why? It is pretty similar to how statsv works, no? Is the problem the batching and/or URL character limit?

Do we code the whitelist of metric names and types into the schema?

Hm, slightly relevant are these new EventLogging Schema Guidelines. They are meant to be more compatible with Druid, but the ideas also extend to Prometheus. Druid's 'dimensions' sound just like Prometheus' 'labels' to me, in that they are supposed to be of limited cardinality.

That said, yeah, I'm not sure if we'd want a new schema for every single metric. Perhaps we should use EventBus for this instead. It works via HTTP POST instead of GET, and allows you to send an array of events. It also allows us to configure the same schema for multiple or wildcarded topics. E.g., we could allow some prometheus_metric schema to topics that match something like prometheus.*. The schemas are defined in local git repo (for now), and we also don't require validation on the client side.

the typical way we generate EventLogging data from client-side JS in MediaWiki is in my opinion not well-suited for this use case

Hm why? It is pretty similar to how statsv works, no? Is the problem the batching and/or URL character limit?

Okay, bit of a thought dump...

I'd say the problem is the lack of batching and the increased client cost. I assume we can keep the URL size down by making the schema very simple; wrapping in JSON vs. query parameters should have minimal overhead.

With statsv we currently go straight from a counter function call to a beacon URL like /beacon/statsv?foo=1g&bar=2ms&quux=3c. EventLogging JS, on the other hand, would download two modules (ext.eventLogging.core + schema.*, the latter including the JSON schema itself, which would be quite large if it includes the whitelist of valid metric names). It then also performs client-side validation, which further increases processing time. This is sort of okay for the current use of EventLogging, which rarely happens more than once during a page view. But for statsv metrics we'd easily send a burst of 20 or 30 at once, and more later during the session.
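For reference, decomposing a statsv-style beacon query string like the one above is trivial server-side; a rough sketch (the real statsv.py may differ in details such as accepted type suffixes):

```python
import re
from urllib.parse import parse_qsl

# value format is "<number><type suffix>", e.g. "2ms" or "3c",
# mirroring the /beacon/statsv?foo=1g&bar=2ms&quux=3c example above.
VALUE_RE = re.compile(r"^(\d+)([a-z]+)$")

def parse_beacon(query):
    """Yield (metric, value, type) triples from a statsv query string,
    silently skipping malformed pairs."""
    for key, raw in parse_qsl(query):
        m = VALUE_RE.match(raw)
        if m:
            yield key, int(m.group(1)), m.group(2)

print(list(parse_beacon("foo=1g&bar=2ms&quux=3c")))
# → [('foo', 1, 'g'), ('bar', 2, 'ms'), ('quux', 3, 'c')]
```

This compactness is exactly what the EventLogging wrapping would trade away for schema validation and Kafka integration.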

To avoid this overhead, we could consider bypassing the EventLogging JS in favour of constructing the event object directly and not doing client-side validation. We'd only need to hardcode the revision ID in the JS file (which isn't too bad, given we'd already hardcode it in extension.json otherwise; it'd just be in a different file now).

As for the schema itself, we can either have a simple event object like { metricName: .., metricValue: .. } or a multi-value object like { metrics: [ { key: .., value: .. }, .. ] }. Multi-value would enable batching, but means validation will need to happen elsewhere. Multi-value would also be incompatible with Druid, but I don't think that would be an issue for this. We'd only use EventLogging as a transportation method; storage would be in Graphite/Prometheus. Or if we want it in Kafka, we could make the new EventLogging-based statsv processor write back to Kafka after the steps for splitting batches and validation.

If we want to do validation in EventLogging, it would mean no batching. It would also mean that the schema will get pretty large, e.g. the metricName field would essentially become an enum with all whitelisted metric names. And maintaining it on Meta-Wiki and updating WikimediaEvents for every addition seems impractical. It would also not work well unless the schema page is protected, given it would otherwise allow people to modify the schema, submit beacons with a revision ID we don't yet use, and bypass validation that way. (Unless we'd somehow force which revisions we accept in the backend.)

This is also why I'd want to bypass the current method of client-side validation because it would end up sending the client a very large enum full of metric names related to other applications (eg. mobile apps, WebPageTest, Jenkins etc.) that don't relate to MediaWiki JS. I'd rather have the whitelist maintained in puppet for the statsv python process.

we could consider bypassing the EventLogging JS in favour of constructing the event object directly and not do client-side validation.

I think this is totally fine. Client side validation is nice to have just in order to inform the client of possible errors before the event is emitted. The client gets no feedback if the event is invalid at the server. However, if you don't care about client side validation anyway, this sounds fine to me. We don't do client side validation for EventBus, but in that case, the client does get feedback if their event is invalid. I guess we could make the client side validation an option in the EventLogging JS? I'm not very familiar with that side of things.

As for batching, to keep things simple it might be nice to not support this, at least at first. If we do 20-30 /beacon/statsv requests right now, 20-30 requests to /beacon/event will work just as well.

It would also mean that the schema will get pretty large, e.g. the metricName field would essentially become an enum with all whitelisted metric names.

Hm, do we really need this? I see why you'd want it to keep Prometheus sane, but could we just hope/require that we keep the list of metrics we send to statsv low? Yes, we have the problem that a malicious person could emit anything they wanted, but this is the case for statsv and EventLogging now too.

If we do need validation, I suppose an enum with names is fine, since we won't require the client to grab the schema for validation. Are metric values always the same type?

It would also not work well unless the schema page is protected given it would otherwise allow people to make modifications

This is also true of all EventLogging events. AFAIK it has never been an issue.

What if this looked something like:

  • Use EventLogging /beacon/event with prometheus metric schema on meta (with or without enums?)
  • No client side validation
  • EventLogging validates and produces to topic eventlogging_PrometheusMetric (name TBD)
  • Prometheus Kafka pull service that consumes all messages from eventlogging_PrometheusMetric topic since last pull (tracked with Kafka offset commits), perhaps discarding really old metrics via Kafka's new message timestamp info, and returns those to Prometheus.

Hm, or, one more idea:

  • Keep using /beacon/statsv
  • Prometheus Kafka pull service that consumes all messages from statsv topic since last pull and returns those to Prometheus. This service has a list of allowed prometheus keys and/or hardcoded schema and does validation on its own?

(I'm not sure if I like ^, but I'm leaving it here for discussion.)
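The validation/aggregation half of the second idea could be sketched like this, with Kafka consumption stubbed out as a plain list (in reality a Kafka consumer tracking offset commits; all names here are illustrative):

```python
from collections import defaultdict

# Hypothetical whitelist of allowed Prometheus keys, as proposed above.
ALLOWED_KEYS = {"frontend.clicks", "frontend.errors"}

# Aggregated counter state that a Prometheus scrape endpoint would expose.
counters = defaultdict(int)

def consume(messages):
    """Process statsd/statsv-format messages like 'frontend.clicks:1|c',
    dropping anything not on the whitelist. Returns the drop count."""
    dropped = 0
    for msg in messages:
        key, _, rest = msg.partition(":")
        value, _, mtype = rest.partition("|")
        if key not in ALLOWED_KEYS or mtype != "c":
            dropped += 1
            continue
        counters[key] += int(value)
    return dropped

dropped = consume(["frontend.clicks:1|c", "frontend.clicks:3|c", "evil.key:9|c"])
print(counters["frontend.clicks"], dropped)  # → 4 1
```

The interesting open question from the bullet above is offset handling: since Prometheus pulls on its own schedule, the service has to decide whether to aggregate everything since the last scrape or discard messages older than some threshold using Kafka's message timestamps.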

BTW, as part of https://wikitech.wikimedia.org/wiki/User:Ottomata/Stream_Data_Platform, I want to make what you are suggesting (and much more) very easy to do. EventLogging is our current client side event intake system, but that may change in the next year or so. :)

This feels a lot like reinventing the wheel to me. What about something like statsd_exporter? It accepts data in statsd format, and gives us the choice of having statsv write to it directly, or having it sit downstream from our existing statsd.

I'm sure I'm missing something, though!

What about something like statsd_exporter?

Oh. This sounds perfect to me!

This feels a lot like reinventing the wheel to me. What about something like statsd_exporter? It accepts data in statsd format, and gives us the choice of having statsv write to it directly, or having it sit downstream from our existing statsd.
I'm sure I'm missing something, though!

We've tried statsd_exporter before for Thumbor in T145867: Test making thumbor statsd metrics available from Prometheus and it worked OK, though note that the exporter requires maintaining a configuration mapping statsd keys to Prometheus names/labels. I suspect such configuration would have the schema/key validation problems @Krinkle has pointed out. I don't know enough about statsv, though, to know whether this would be a problem in practice.

If the scope is just adding Prometheus support to statsv.py I recommend starting by adding the Prometheus python client there (alongside statsd) and expose the resulting metrics over HTTP and see how that goes. Stricter schema validation, using EL, etc can come later.
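What "expose the resulting metrics over HTTP" amounts to is serving Prometheus' plain-text exposition format on a scrape endpoint. The Prometheus Python client does this for you (it ships an HTTP server and metric types); purely to illustrate what the scraped output looks like, here is a hand-rolled rendering sketch with a hypothetical metric name:

```python
def render_exposition(counters):
    """Render counter values in the Prometheus text exposition format,
    i.e. what a scrape of the endpoint would return."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

print(render_exposition({"statsv_requests_total": 42}))
# → # TYPE statsv_requests_total counter
#   statsv_requests_total 42
```

In the suggested setup, statsv.py would keep writing to statsd as it does today and additionally maintain equivalent Prometheus metric objects served on a port Prometheus can scrape.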

I was re-reading this task and T175087: Create a navtiming processor for Prometheus in the context of progressively moving away from statsd/graphite (T205870), some technical thoughts on how IMHO we could move this forward:

  • From an operational perspective the metric names/values are untrusted input, thus we need to have a separate Prometheus instance for "external" data as @Krinkle suggested
  • Having a separate instance also means we can tweak retention as needed, i.e. longer than standard 90d for production (or 13m for global)
  • Deploy statsd_exporter on webperf to get an idea of what the schema/mappings would look like. The benefit is that we already have puppetization/deployment for statsd_exporter, and we deploy it "inline" with respect to statsd traffic. In other words, all UDP traffic received by statsd_exporter is relayed as-is to statsd.eqiad.wmnet, so the exporter is effectively transparent.
  • The resulting Prometheus metrics will be exported over HTTP, but we can hold off on having Prometheus actually scrape them until we have a better schema/configuration for statsd_exporter in place.
  • Examining the Prometheus metrics will also inform which clients might need adjusting (e.g. PagePreviews uses the top level for all metrics, with no namespace: PagePreviewsApiResponse:52|ms, PagePreviewsPreviewShow:722|ms).
  • For metrics not matching any statsd_exporter mapping we could explicitly prefix them e.g. with unmapped_
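For reference, a statsd_exporter mapping of the kind the plan above implies might look roughly like this (a sketch in the exporter's YAML mapping-config style; the metric names and label are made up for illustration):

```yaml
mappings:
  # Map a statsd namespace onto one Prometheus metric with a label,
  # instead of one time series per dotted key.
  - match: "frontend.navtiming.*"
    name: "frontend_navtiming"
    labels:
      metric: "$1"
  # Anything not matched above could instead be dropped (or, per the
  # bullet above, renamed with an explicit unmapped_ prefix).
  - match: "."
    match_type: regex
    action: drop
```

This is where the schema/key validation concern resurfaces: the mapping file effectively becomes the whitelist, maintained in Puppet.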
fgiunchedi moved this task from Inbox to Radar on the observability board.Dec 9 2019, 12:03 PM