
Investigate how Kartotherian metrics are published and what they mean
Closed, Declined · Public

Description

While browsing Kartotherian metrics, I see a few metrics which I don't understand, or which seem to report aggregates that are too broad. For example, the kartotherian.req.* metrics seem to report an aggregate of all requests. This should probably be split by cluster (eqiad / codfw / maps-test) to make more sense. Worse, kartotherian.heap.* also seems to be aggregated, whereas heap mostly makes sense when viewed for a single instance. It also seems that some metrics are not sent using the correct type. We collect percentiles for heap, which does not seem to make sense. Heap should be a metric of type "gauge" and should not collect percentiles.
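To illustrate the type distinction above, here is a minimal sketch of the plain StatsD text protocol, in which a gauge is sent with the `g` suffix and a timer with `ms`. The helper `statsd_line` and the metric names `kartotherian.heap.used` and `kartotherian.req.osm-intl.15.png` are hypothetical and only serve the example; how Kartotherian actually emits its metrics is not shown in this task.

```python
def statsd_line(name: str, value: float, metric_type: str) -> str:
    """Format one metric as a StatsD text line: <name>:<value>|<type>."""
    allowed = {"g", "ms", "c"}  # gauge, timer, counter
    if metric_type not in allowed:
        raise ValueError(f"unsupported metric type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}"


# Hypothetical metric names, chosen for illustration only.
# Heap size is a point-in-time value, so it is sent as a gauge.
heap_line = statsd_line("kartotherian.heap.used", 512, "g")

# A request duration is a timer; percentiles make sense for this type.
req_line = statsd_line("kartotherian.req.osm-intl.15.png", 12.5, "ms")

print(heap_line)
print(req_line)
```

A gauge keeps only the last reported value per flush interval, so no percentile series are derived from it.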

Some investigation is needed to understand how those metrics are published, which ones make sense and which ones don't. We need to document what we want to achieve with those metrics and check that the implementation matches.

Event Timeline

Restricted Application added a subscriber: Aklapper.

The "req" metric is sent by tiles.js:

kartotherian.req.<style name>.<zoom>.<format> [.<scale>]

So for the typical tile, you could get "osm-intl", "15", "png", and possibly 2, or 1-3, or nothing for the scale. I could fairly easily introduce some global configuration (if it's not available already) to prepend a prefix, e.g. instead of kartotherian.req it would be kartotherian.eqiad.req.*, or even kartotherian.maps1001.eqiad.req.*
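A rough sketch of the naming scheme described above, with an optional configurable prefix. The function `req_metric_name` and its `prefix_parts` parameter are hypothetical; this is not the code in tiles.js, only an illustration of how the name could be assembled.

```python
from typing import Optional, Sequence


def req_metric_name(
    style: str,
    zoom: int,
    fmt: str,
    scale: Optional[str] = None,
    prefix_parts: Sequence[str] = (),
) -> str:
    """Build kartotherian[.<prefix>...].req.<style>.<zoom>.<format>[.<scale>]."""
    parts = ["kartotherian", *prefix_parts, "req", style, str(zoom), fmt]
    if scale is not None:
        # The scale component is only present for scaled tiles.
        parts.append(str(scale))
    return ".".join(parts)


# Current scheme, no prefix and no scale.
print(req_metric_name("osm-intl", 15, "png"))

# Hypothetical per-host, per-cluster prefix.
print(req_metric_name("osm-intl", 15, "png", scale="2",
                      prefix_parts=("maps1001", "eqiad")))
```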

We should consider the number of distinct metrics: currently 18 zooms * 6 formats * 3 styles * 5 scales, about 1600. With extra server info, that would push it to about 13000. Plus I suspect there are multiple series for p99, p95, p75...
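The arithmetic above can be checked with a short sketch. The multiplier of 8 for "extra server info" is an assumption, picked only because it lands near the 13000 quoted above; the comment does not say how many hosts or clusters were counted.

```python
# Dimensions quoted in the comment above.
zooms, formats, styles, scales = 18, 6, 3, 5

# Number of distinct metric names with the current naming scheme.
base_metrics = zooms * formats * styles * scales


def series_count(base: int, groups: int = 1, stats_per_metric: int = 1) -> int:
    """Total series = metric names * prefix groups * derived stats per metric."""
    return base * groups * stats_per_metric


print(base_metrics)                   # 1620
print(series_count(base_metrics, 8))  # 12960, assuming 8 prefix groups
```

Each derived statistic (p99, p95, p75 and so on) multiplies the total again through `stats_per_metric`.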

Another strange thing: if I sum the request rates for all zoom levels (as reported by kartotherian), I see around 1K requests per second. If I look at what is reported by Varnish, I see around 100 requests per second. Something is wrong (and the something wrong might be me).

A quick look at graphite1001 indicates that we already publish ~ 64k metrics for Kartotherian:

gehel@graphite1001:~$ find /var/lib/carbon/whisper/kartotherian/ -type f | wc -l 
63986

Most of the metrics published by kartotherian are "markers":

gehel@graphite1001:/var/lib/carbon/whisper$ find kartotherian/marker/ -type f | wc -l
57840

Roughly 40% of them have not been updated in the last 15 days:

gehel@graphite1001:/var/lib/carbon/whisper$ find kartotherian/marker/ -type f -mtime +15 | wc -l
22380
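For reference, a comparable count can be sketched in Python. The function `count_stale_files` is hypothetical, and it uses a plain age threshold in seconds, so its rounding differs slightly from `find -mtime +15`, which counts age in whole 24-hour periods.

```python
import os
import time
from typing import Optional, Tuple


def count_stale_files(root: str, days: float,
                      now: Optional[float] = None) -> Tuple[int, int]:
    """Return (stale, total) for regular files under root.

    A file is stale when its mtime is more than `days` days before `now`.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    stale = total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            total += 1
            if os.path.getmtime(os.path.join(dirpath, filename)) < cutoff:
                stale += 1
    return stale, total
```

With the figures above, 22380 stale files out of 57840 is about 38.7% of the marker metrics.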

It seems that not all markers follow the same pattern. This might indicate a rename of the metric. Some cleanup might make sense (unless historical data are useful for some analysis).

Gehel triaged this task as High priority. Nov 9 2016, 7:27 PM
Gehel added a project: Epic.
Gehel moved this task from Backlog to In progress on the Maps-Sprint board.

The following metrics can be deleted:

kartotherian.req.s*

The kartotherian.req.s* metrics have been deleted.

I'm learning things... The "rate" of a timer for statsite is timer_sum(&t->tm) / GLOBAL_CONFIG->flush_interval, which is the total amount of time recorded during the interval, in milliseconds, divided by the interval length in seconds. So it is a ratio representing the amount of work per second. The "sample_rate" of a timer is the number of events per second. And the "count" for a timer is the number of events per interval. This is starting to make sense.
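The three quantities described above can be sketched as follows. This is a plain Python restatement of the definitions in the comment, not statsite's own code; the function name `timer_summary` and the sample values are made up for illustration.

```python
from typing import Dict, Sequence


def timer_summary(durations_ms: Sequence[float],
                  flush_interval_s: float) -> Dict[str, float]:
    """Derive count, rate and sample_rate for one flush interval."""
    count = len(durations_ms)
    total_ms = sum(durations_ms)
    return {
        # Number of events in the interval.
        "count": count,
        # Milliseconds of recorded work per second of wall-clock time.
        "rate": total_ms / flush_interval_s,
        # Events per second.
        "sample_rate": count / flush_interval_s,
    }


# Hypothetical interval: 100 requests of 50 ms each, flushed every 10 s.
summary = timer_summary([50.0] * 100, 10.0)
print(summary)
```

In this made-up interval the "rate" is 500 while only 10 events per second occurred, so "rate" should not be read as a request count.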

Deskana subscribed.

This seems to be an "it would be nice to investigate and sort this out" task, which doesn't seem to make the cut given that the team is spinning down. As long as we have data for the defined KPIs, digging through the other data does not seem necessary. Accordingly, I am declining this.

If this is causing some technical issues for @Gehel that I have not understood, he can reopen and we can look at investigating.