
Health monitoring for Kartotherian service
Closed, Resolved · Public · 5 Estimated Story Points

Description

We have a few dashboards for maps services, but none give quite the right information to track user-facing service reliability. For example, we should be able to track the count and proportion of static mapframe thumbnail requests which fail, or are missing mapdata.

Remaining work:

  • Verify service request metrics on maps-performances board against webrequest logs
    • The current metrics seem to be inaccurate: they could be as much as 3x higher than the actual request counts.
  • Panels showing the error rate per service and proportion of requests affected
    • Not making panels yet. Possibly inaccurate and not very useful prototype panel here.
    • Geoshapes have roughly a 15% error rate, snapshots are at 7%.

Event Timeline

We might have a challenge getting request error rates for the maps servers. Varnish isn't tracking this data in a way that can be filtered to just the maps backends.

If we used the envoy proxy for maps, these numbers would show up on https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1 . I don't know enough to make an argument for this being a good idea or not, however.

Change 835154 had a related patch set uploaded (by Awight; author: Awight):

[mediawiki/services/kartotherian@master] Include some documentation about the StatsD metrics emitted.

https://gerrit.wikimedia.org/r/835154

Note to self: need to compare webrequest counts against https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1 . The sample_rate metrics are particularly worrisome, and I don't understand why the rate is so unreasonably high.

awight renamed this task from [Stub] Health monitoring for Kartotherian service to Health monitoring for Kartotherian service. Sep 29 2022, 1:34 PM
awight removed awight as the assignee of this task.
awight updated the task description.
awight set the point value for this task to 5.

Sampling the webrequest table for Aug 15, 2022, 01:00-02:00 UTC:

with requests as (
SELECT
  uri_path, uri_query,
  http_method,
  http_status,
  content_type,
  cache_status, time_firstbyte, response_size
from webrequest
where
  uri_host='maps.wikimedia.org'
  and webrequest_source='upload'
  and year=2022 and month=8 and day=15 and hour=1
)

select
  count_if(regexp_like(uri_path, '^/osm-intl/.*\.png$')) as tile_png,
  count_if(uri_path = '/geoline') as geoline,
  count_if(uri_path = '/geoshape') as geoshape,
  count_if(uri_path = '/geopoint') as geopoint,
  count_if(regexp_like(uri_path, '^/img/')) as snapshot,
  count_if(substr(http_status, 1, 1) = '2') as status_2xx,
  count_if(substr(http_status, 1, 1) = '3') as status_3xx,
  count_if(substr(http_status, 1, 1) = '4') as status_4xx,
  count_if(substr(http_status, 1, 1) = '5') as status_5xx
from requests
tile_png  geoline  geoshape  geopoint  snapshot  status_2xx  status_3xx  status_4xx  status_5xx
15723346779591635311501174439430317510008927

Grafana "maps performances" dashboard for the same time window: https://grafana.wikimedia.org/goto/La4xRO4Vz?orgId=1

Roughly:
tile_png = 62.6 x 3600 = 225,000
geoline = 4.43 x 3600 = 16,000
geoshape = 6.3 x 3600 = 23,000
snapshot = (12.9 + 19.8) x 3600 = 118,000
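
The conversion above can be written as a minimal Python sketch. The per-second rates are the values read off the dashboard; the dictionary and function names are only illustrative.

```python
# Per-second request rates read off the Grafana dashboard for the sampled hour.
RATES_PER_SECOND = {
    "tile_png": 62.6,
    "geoline": 4.43,
    "geoshape": 6.3,
    "snapshot": 12.9 + 19.8,  # sum of the two values added in the estimate above
}

SECONDS_PER_HOUR = 3600


def hourly_count(rate_per_second: float) -> int:
    """Scale a per-second rate to an estimated number of requests per hour."""
    return round(rate_per_second * SECONDS_PER_HOUR)


hourly = {name: hourly_count(rate) for name, rate in RATES_PER_SECOND.items()}
for name, count in hourly.items():
    print(f"{name}: ~{count:,} requests/hour")
```

The unrounded results (225,360, 15,948, 22,680 and 117,720) agree with the rounded figures listed above.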

Rewriting the snapshot graph to use the "rate" metric, I still get a bad total:
sumSeriesWithWildcards(kartotherian.req.*.*.*.static.*.rate, 2, 3, 4, 6) -> a rate of 27,000

"count" also gives the 118,000 number seen in the current graph.

In other words: not going well so far, the webrequest numbers don't match up.

Change 835154 merged by jenkins-bot:

[mediawiki/services/kartotherian@master] Include some documentation about the StatsD metrics emitted.

https://gerrit.wikimedia.org/r/835154

Drilling down to a detailed count of snapshots, there were only 279k requests with HTTP status 2xx that should have counted as successes:

with requests as (
SELECT
  http_status,
  cache_status
from webrequest
where
  uri_host='maps.wikimedia.org'
  and webrequest_source='upload'
  and year=2022 and month=8 and day=15 and hour=1
  and regexp_like(uri_path, '^/img/')
)

select
  count(*) as snapshot,
  count_if(substr(http_status, 1, 1) = '2') as status_2xx,
  count_if(substr(http_status, 1, 1) = '3') as status_3xx,
  count_if(substr(http_status, 1, 1) = '4') as status_4xx,
  count_if(substr(http_status, 1, 1) = '5') as status_5xx
from requests
snapshot  status_2xx  status_3xx  status_4xx  status_5xx
311501    279293      10490       21718       0

At this point I stopped to remind myself of how Graphite's rollups work: https://github.com/statsd/statsd/blob/master/docs/graphite.md#storage-aggregation and check our overrides https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/graphite/base.pp . Slightly alarming text is probably nothing to worry about: "However, what about the .count metric also sent for timers? Each sample contains the count of occurrences per flush interval, so you want these samples summed-up, not averaged!". I think we're fine taking the sum of .count metrics.
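
To illustrate why the aggregation method matters for .count, here is a small sketch with made-up numbers. The sample values and the six-samples-per-bucket rollup are assumptions for illustration, not our actual retention config. Summing preserves the number of requests when fine-grained samples are rolled up, while averaging would understate it by a factor equal to the bucket size.

```python
# Hypothetical per-flush-interval counts, e.g. six 10-second samples in one minute.
samples = [12, 9, 15, 11, 10, 13]

# A "sum" rollup keeps the total number of requests in the coarser bucket.
rolled_up_sum = sum(samples)

# An "average" rollup would store the mean per-interval count instead,
# which is too small by a factor of len(samples) when read as a total.
rolled_up_avg = sum(samples) / len(samples)

print("sum rollup:", rolled_up_sum)
print("average rollup:", rolled_up_avg)
```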

So comparing with the statsd numbers in Graphite,

aggregateWithWildcards(kartotherian.req.*.*.*.static.*.count, 'sum', 2, 3, 4, 6)

we have 116k successful requests recorded.

kartotherian.req.*.*.*.static.*.sample_rate gives very suspicious metrics that I don't believe can be used directly, since each sample rate would have to be multiplied by the count to give the correct weighting:

image.png (835×1 px, 125 KB)
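
One way to read "multiplied by the count" is a count-weighted mean across the wildcard-matched series. The sketch below uses invented values (none of these numbers come from Graphite) and assumes each series exposes a count and a sample_rate for the same interval; weighted_rate is a hypothetical helper, not a Graphite function.

```python
from typing import Iterable, Tuple


def weighted_rate(series: Iterable[Tuple[float, float]]) -> float:
    """Count-weighted mean of (count, sample_rate) pairs."""
    pairs = list(series)
    total = sum(count for count, _ in pairs)
    if total == 0:
        return 0.0
    return sum(count * rate for count, rate in pairs) / total


# Hypothetical series: a busy one with a low rate and a quiet one with a high rate.
example = [(1000, 2.0), (10, 50.0)]

naive_mean = sum(rate for _, rate in example) / len(example)

print("count-weighted:", weighted_rate(example))
print("naive mean:", naive_mean)
```

With these made-up inputs the naive mean is dominated by the quiet series, which is the kind of distortion that makes the raw sample_rate graph hard to trust.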

Change 838143 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/services/kartotherian@master] [WIP] Mark portentially bogus code with FIXMEs

https://gerrit.wikimedia.org/r/838143

Change 838148 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/services/kartotherian@master] Remove dead code that converts coordinates twice

https://gerrit.wikimedia.org/r/838148

@MSantos found a mistake in my webrequest query; unfortunately, correcting it only makes the gap larger. We should be counting only the successful or cached requests (status 2xx and 3xx) that were served after a backend request, so responses served from any cache layer should be excluded.

Updated detail query:

with requests as (
SELECT
  http_status,
  cache_status
from webrequest
where
  uri_host='maps.wikimedia.org'
  and webrequest_source='upload'
  and year=2022 and month=8 and day=15 and hour=1
  and regexp_like(uri_path, '^/img/')
  and substr(http_status, 1, 1) in ('2', '3')
)

select
  count(*) as count,
  cache_status
from requests
group by cache_status
count   cache_status
141174  hit-local
391     pass
41105   miss
107094  hit-front
19      int-front

I think we're looking for statuses "pass" and "miss", which add up to 41,496 requests, roughly 1/3 as many requests as the 116k recorded by the backend!
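
The comparison can be reproduced with a short sketch. The per-status counts are copied from the query result above; treating only "pass" and "miss" as backend requests is the assumption stated in the text, and 116,000 is the approximate Graphite figure.

```python
# Snapshot requests with status 2xx/3xx, grouped by cache_status.
by_cache_status = {
    "hit-local": 141174,
    "pass": 391,
    "miss": 41105,
    "hit-front": 107094,
    "int-front": 19,
}

# Assumption: only "pass" and "miss" responses required a backend request.
BACKEND_STATUSES = ("pass", "miss")
reached_backend = sum(by_cache_status[status] for status in BACKEND_STATUSES)

# Approximate count recorded by the service in Graphite for the same hour.
graphite_count = 116_000

print("reached backend:", reached_backend)
print("share of Graphite count:", round(reached_backend / graphite_count, 2))
```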

I'm going to conclude the first point of this task as "verified not okay" and suggest that we follow up in later work.

Error rates

These are just counts of HTTP status 2xx or 3xx (successful) vs 4xx or 5xx (unsuccessful); they don't account for subtleties such as missing geoshapes in a snapshot.

Snapshots

SELECT
  count(*) as count,
  substr(http_status, 1, 1) in ('2', '3') as is_successful
from webrequest
where
  uri_host='maps.wikimedia.org'
  and webrequest_source='upload'
  and year=2022 and month=8 and day=15 and hour=1
  and regexp_like(uri_path, '^/img/')
group by
  substr(http_status, 1, 1) in ('2', '3')
count   is_successful
289783  true
21718   false

Roughly 7.0% of requests fail (21,718 of 311,501).

Geoshape

Same query but with regexp_like(uri_path, '^/geoshape')

count  is_successful
77872  true
13763  false

15.0% error rate (13,763 of 91,635 requests)

Geoline

regexp_like(uri_path, '^/geoline')

count  is_successful
60499  true
7296   false

10.8% error rate (7,296 of 67,795 requests)

Tiles

regexp_like(uri_path, '^/osm-intl/')

count    is_successful
1519574  true
56252    false

3.6% error rate (56,252 of 1,575,826 requests)

Maki markers

regexp_like(uri_path, '^/v4/marker/')

count  is_successful
59456  true
47     false

0.08% failure rate
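
For reference, a small sketch that computes each rate from the query results above by dividing failures by all requests (successful plus failed). The counts are copied from the tables; the names COUNTS and error_rate are illustrative.

```python
# (successful, failed) request counts per endpoint, from the query results.
COUNTS = {
    "snapshot": (289783, 21718),
    "geoshape": (77872, 13763),
    "geoline": (60499, 7296),
    "tiles": (1519574, 56252),
    "maki markers": (59456, 47),
}


def error_rate(successful: int, failed: int) -> float:
    """Share of all requests that failed, as a percentage."""
    total = successful + failed
    return 100.0 * failed / total if total else 0.0


for endpoint, (ok, failed) in COUNTS.items():
    rate = error_rate(ok, failed)
    print(f"{endpoint}: {rate:.2f}% of {ok + failed:,} requests failed")
```

Dividing by the total rather than by the successful count keeps the rate bounded at 100% and matches the "proportion of requests affected" wording in the task description.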

awight updated the task description.

Change 838143 abandoned by Thiemo Kreuz (WMDE):

[mediawiki/services/kartotherian@master] [WIP] Mark portentially bogus code with FIXMEs

Reason:

Resolved, see I6fe29c2.

https://gerrit.wikimedia.org/r/838143

Change 838148 merged by jenkins-bot:

[mediawiki/services/kartotherian@master] Remove dead code that converts coordinates twice

https://gerrit.wikimedia.org/r/838148