Investigate how Kartotherian metrics are published and what they mean
Closed, Declined · Public

Assigned To

None

Authored By

	Gehel
	Nov 3 2016, 11:15 AM

Description

While browsing Kartotherian metrics, I see a few metrics which I don't understand, or which seem to report aggregates that are too broad. For example, the kartotherian.req.* metrics seem to report an aggregate of all requests. This should probably be split by cluster (eqiad / codfw / maps-test) to make more sense. Worse, kartotherian.heap.* also seems to be aggregated, whereas heap mostly makes sense when viewed for a single instance. It also seems that some metrics are not sent using the correct type. We collect percentiles for heap, which does not seem to make sense. Heap should be a metric of type "gauge" and should not collect percentiles.
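To illustrate the type distinction, here is a minimal sketch of how a gauge and a timer differ in the plain statsd line format (`<name>:<value>|<type>`). The metric names below are hypothetical placeholders, not the names Kartotherian actually emits, and this is not taken from the Kartotherian code.

```python
# Sketch of the statsd line protocol for different metric types.
# Metric names used with this helper are hypothetical examples.

STATSD_TYPE_SUFFIX = {
    "gauge": "g",    # last value wins; no percentiles are derived
    "timer": "ms",   # aggregated per flush interval (mean, percentiles, ...)
    "counter": "c",  # summed per flush interval
}


def statsd_line(name: str, value: float, metric_type: str) -> str:
    """Format a single statsd datagram payload for the given metric type."""
    return f"{name}:{value}|{STATSD_TYPE_SUFFIX[metric_type]}"


# A heap reading sent as a gauge, as suggested in the description.
heap_as_gauge = statsd_line("kartotherian.heap.example", 123456, "gauge")
# The same reading sent as a timer, which is what produces percentiles.
heap_as_timer = statsd_line("kartotherian.heap.example", 123456, "timer")
```

Sent as a timer, the server would compute percentiles over all heap samples from all instances in the flush interval, which is the aggregation described above.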

Some investigation is needed to understand how those metrics are published, which ones make sense and which ones don't. We need to document what we want to achieve with those metrics and check that the implementation matches.

Related Objects

Status	Assigned	Task
Declined	None	T149889 Investigate how Kartotherian metrics are published and what they mean
Resolved	Yurik	T150254 cleanup marker metrics published by Kartotherian
Resolved	Gehel	T150353 delete unused kartotherian marker metrics
Declined	Gehel	T150460 Configure maps cluster to send statsd metrics to the statsd endpoint in the same datacenter

Event Timeline

Gehel created this task. Nov 3 2016, 11:15 AM

Restricted Application added a project: Discovery-ARCHIVED. Nov 3 2016, 11:15 AM

Restricted Application added a subscriber: Aklapper.

The "req" metrics are sent by tiles.js:

kartotherian.req.<style name>.<zoom>.<format> [.<scale>]

So for a typical tile, you would get "osm-intl", "15", "png", and possibly "2", or "1-3", or nothing for the scale. I could fairly easily introduce some global configuration (if it's not available already) to add extra prefix components, e.g. instead of kartotherian.req it would be kartotherian.eqiad.req.*, or even kartotherian.maps1001.eqiad.req.*
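A rough sketch of the naming scheme with the optional prefix components, for illustration only. The function and its parameters are hypothetical; the real code lives in tiles.js and may assemble the name differently.

```python
from typing import Optional


def req_metric_name(
    style: str,
    zoom: int,
    fmt: str,
    scale: Optional[str] = None,
    cluster: Optional[str] = None,  # e.g. "eqiad" (assumed config value)
    host: Optional[str] = None,     # e.g. "maps1001" (assumed config value)
) -> str:
    """Build kartotherian[.<host>][.<cluster>].req.<style>.<zoom>.<format>[.<scale>]."""
    parts = ["kartotherian"]
    if host:
        parts.append(host)
    if cluster:
        parts.append(cluster)
    parts += ["req", style, str(zoom), fmt]
    if scale:
        parts.append(scale)
    return ".".join(parts)


current = req_metric_name("osm-intl", 15, "png")
per_cluster = req_metric_name("osm-intl", 15, "png", cluster="eqiad")
per_host = req_metric_name("osm-intl", 15, "png", scale="2",
                           cluster="eqiad", host="maps1001")
```

Here `current` matches the existing pattern, while `per_cluster` and `per_host` correspond to the kartotherian.eqiad.req.* and kartotherian.maps1001.eqiad.req.* variants mentioned above.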

We should consider the number of distinct metrics: the current count is 18 zooms * 6 formats * 3 styles * 5 scales, about 1600. With extra server info, that would push it to about 13000. Plus I suspect there are multiple series per metric for p99, p95, p75...
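The arithmetic can be checked with a short sketch. The dimension sizes are the ones quoted above; the prefix multiplier of 8 is only what the 13000 figure implies (13000 / 1620 is roughly 8), and the number of derived series per timer is a made-up placeholder, not a measured value.

```python
# Dimension sizes as quoted in the comment above.
ZOOMS, FORMATS, STYLES, SCALES = 18, 6, 3, 5

base_names = ZOOMS * FORMATS * STYLES * SCALES  # "about 1600"


def series_count(names: int, prefixes: int = 1, series_per_timer: int = 1) -> int:
    """Upper bound on stored series: names x prefix values x derived series."""
    return names * prefixes * series_per_timer


# Assumption: about 8 distinct prefix values, inferred from the 13000 estimate.
with_server_info = series_count(base_names, prefixes=8)

# Assumption: each timer is stored as several series (mean, p75, p95, p99, ...).
# The value 10 is purely illustrative.
with_percentiles = series_count(base_names, prefixes=8, series_per_timer=10)
```

This is an upper bound: a series only exists once at least one request with that combination of style, zoom, format and scale has been recorded.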

Another strange thing: if I sum the request rates for all zoom levels (as reported by Kartotherian), I see around 1K requests per second. If I look at what is reported by Varnish, I see around 100 requests per second. Something is wrong (and the something wrong might be me).

A quick look at graphite1001 indicates that we already publish ~ 64k metrics for Kartotherian:

gehel@graphite1001:~$ find /var/lib/carbon/whisper/kartotherian/ -type f | wc -l 
63986

Most of the metrics published by kartotherian are "markers":

gehel@graphite1001:/var/lib/carbon/whisper$ find kartotherian/marker/ -type f | wc -l
57840

More than a third of them (about 39%) have not been updated in the last 15 days:

gehel@graphite1001:/var/lib/carbon/whisper$ find kartotherian/marker/ -type f -mtime +15 | wc -l
22380
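For repeating this check without a shell, the same counts can be sketched in Python. This is a minimal equivalent of the `find ... -type f [-mtime +N] | wc -l` commands above; the function name is hypothetical, and the day arithmetic follows GNU find's rule of ignoring the fractional part of a file's age.

```python
import os
import stat
import time
from typing import Optional


def count_files(root: str, older_than_days: Optional[int] = None,
                now: Optional[float] = None) -> int:
    """Count regular files under root, optionally only those matching -mtime +N."""
    now = time.time() if now is None else now
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # mirror `-type f`: skip symlinks, sockets, etc.
            if older_than_days is not None:
                # find truncates the age to whole days before comparing,
                # so `-mtime +15` means an age of at least 16 full days.
                age_days = int((now - st.st_mtime) // 86400)
                if age_days <= older_than_days:
                    continue
            total += 1
    return total
```

For example, `count_files("kartotherian/marker/", older_than_days=15)` divided by `count_files("kartotherian/marker/")` gives the stale fraction; with the numbers above that is 22380 / 57840, or about 0.39.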

It seems that not all markers follow the same pattern. This might indicate a rename of the metric. Some cleanup might make sense (unless historical data are useful for some analysis).

Gehel created subtask T150254: cleanup marker metrics published by Kartotherian. Nov 8 2016, 12:34 PM

Gehel added a project: Maps-Sprint. Nov 9 2016, 7:08 PM

Gehel created subtask T150353: delete unused kartotherian marker metrics. Nov 9 2016, 7:16 PM

Gehel triaged this task as High priority. Nov 9 2016, 7:27 PM

Gehel added a project: Epic.

Gehel moved this task from Backlog to In progress on the Maps-Sprint board.

The following metrics can be deleted:

kartotherian.req.s*

The kartotherian.req.s* metrics have been deleted.

I'm learning things... For statsite, the "rate" of a timer is timer_sum(&t->tm) / GLOBAL_CONFIG->flush_interval, which is the total time recorded during the interval (in milliseconds) divided by the interval length (in seconds). So it is a ratio representing the amount of work per second. The "sample_rate" of a timer is the number of events per second. And the "count" for a timer is the number of events per interval. This is starting to make sense.
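A small worked example of those three values, written as a sketch based only on the definitions in the comment above. It is not statsite's code, and the function name and sample numbers are made up for illustration.

```python
from typing import Dict, Sequence


def timer_summary(durations_ms: Sequence[float],
                  flush_interval_s: float) -> Dict[str, float]:
    """Compute count, rate and sample_rate for one flush interval."""
    count = len(durations_ms)
    return {
        # number of events in the interval
        "count": count,
        # total time spent (ms) per second of wall clock: amount of work
        "rate": sum(durations_ms) / flush_interval_s,
        # events per second: the actual request rate
        "sample_rate": count / flush_interval_s,
    }


# Hypothetical interval: 100 requests of 10 ms each, flushed every 10 s.
summary = timer_summary([10.0] * 100, flush_interval_s=10.0)
# count = 100, rate = 100.0 (ms of work per second), sample_rate = 10.0 (req/s)
```

In this example "rate" is ten times "sample_rate" because the mean duration is 10 ms. Summing "rate" across metrics therefore gives milliseconds of work per second rather than requests per second, so it should not be compared directly with a request rate.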

Gehel created subtask T150460: Configure maps cluster to send statsd metrics to the statsd endpoint in the same datacenter. Nov 10 2016, 6:23 PM

Gehel created subtask T150466: publish kartotherian / tilerator metrics by cluster. Nov 10 2016, 6:48 PM

This seems to be an "it would be nice to investigate and sort this out" task, which doesn't make the cut given that the team is spinning down. As long as we have data for the defined KPIs, digging through the other data does not seem necessary. Accordingly, I am declining this.

If this is causing some technical issues for @Gehel that I have not understood, he can reopen and we can look at investigating.

Gehel closed subtask T150254: cleanup marker metrics published by Kartotherian as Resolved. Feb 23 2017, 1:44 PM

Gehel closed subtask T150353: delete unused kartotherian marker metrics as Resolved. Feb 23 2017, 2:01 PM

Gehel closed subtask T150460: Configure maps cluster to send statsd metrics to the statsd endpoint in the same datacenter as Declined. Aug 7 2018, 4:17 PM

MSantos removed a subtask: T150466: publish kartotherian / tilerator metrics by cluster. Nov 25 2020, 12:26 PM
