
Graphite query timeout from ExtensionDistributor
Closed, Resolved, Public

Description

I'm looking at 500s on the graphite hosts and noticed ExtensionDistributor querying graphite for 4w of data from all extensions multiple times a day (sometimes even every five minutes):

2021-10-25T07:55:42     6081006 10.192.0.102    uwsgi-handler/500       3076    GET     http://graphite.wikimedia.org/render/?target=sortByMaxima%28groupByNode%28summarize%28MediaWiki.extdist.extensions.%2A.%2A.sum%2C%224w%22%2C%22sum%22%2Ctrue%29%2C3%2C%22sum%22%29%29&format=json&from=-4w&until=now -       -       -       2620:0:861:102:10:64:16:167, 10.64.48.101       www.mediawiki.org - ExtensionDistributor  - Mediawiki Extension      -       -       -       -       10.192.0.102    5428d888-0d45-4e53-b7d6-4bd99d4e95fb    2620:0:861:102:10:64:16:167

The query times out (6s timeout), I believe due to a combination of the expression itself and the amount of data to fetch (~1k metric files). The unquoted query is 'sortByMaxima(groupByNode(summarize(MediaWiki.extdist.extensions.*.*.sum,"4w","sum",true),3,"sum"))'.
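For reference, the plain-text target can be recovered from the logged URL with a few lines of standard-library Python (a minimal decoding sketch for the access-log entry above):

```
# Decode the URL-encoded render request from the access log to recover the
# plain Graphite expression. Standard library only.
from urllib.parse import parse_qs, urlparse

logged_url = ('http://graphite.wikimedia.org/render/?target=sortByMaxima%28'
              'groupByNode%28summarize%28MediaWiki.extdist.extensions.%2A.%2A.'
              'sum%2C%224w%22%2C%22sum%22%2Ctrue%29%2C3%2C%22sum%22%29%29'
              '&format=json&from=-4w&until=now')

params = parse_qs(urlparse(logged_url).query)
print(params['target'][0])
# -> sortByMaxima(groupByNode(summarize(MediaWiki.extdist.extensions.*.*.sum,"4w","sum",true),3,"sum"))
print(params['from'][0], params['until'][0])
# -> -4w now
```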

I'm going to try bumping the timeout a little first.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-10-25T09:18:41Z] <godog> bounce graphite-web on graphite2003 to test timeout bump - T294220

Mentioned in SAL (#wikimedia-operations) [2021-10-25T09:52:19Z] <godog> bounce uwsgi graphite web on graphite2003 - T294220

I've temporarily applied another fix (still with a 6s timeout) https://gerrit.wikimedia.org/r/c/operations/puppet/+/734225 and things seem to have improved. I still see the occasional timeout for the query above. What's the use case for it? Is it a problem if the query occasionally fails?

> ExtensionDistributor querying graphite for 4w of data from all extensions multiple times a day (sometimes even every five minutes):

There are two lists, one for extensions and one for skins, but it should be cached in memcached for 24h. It's entirely possible this caching has broken subtly somehow; no one has really touched it since it was originally written in late 2015.
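For illustration, the intended behaviour is roughly the pattern below; this is a Python sketch only (the real code is PHP using MediaWiki's object cache), and the cache client and fetch function names are hypothetical:

```
# Rough sketch of the intended 24h caching around the Graphite call.
# `cache` stands in for a memcached-like client and
# `fetch_top_extensions_from_graphite` for the actual render request;
# both names are hypothetical.
DAY = 24 * 3600

def get_top_extensions(cache, fetch_top_extensions_from_graphite):
    key = 'extdist:popular-extensions'
    cached = cache.get(key)
    if cached is not None:
        return cached                                # served from cache for 24h
    data = fetch_top_extensions_from_graphite()      # may time out or fail
    if data is not None:
        cache.set(key, data, expire=DAY)             # only successes are cached
    return data
```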

> I've temporarily applied another fix (still with a 6s timeout) https://gerrit.wikimedia.org/r/c/operations/puppet/+/734225 and things seem to have improved. I still see the occasional timeout for the query above. What's the use case for it?

When it works, it displays a list of the top 15 most popular extensions at https://www.mediawiki.org/wiki/Special:ExtensionDistributor. The skin query probably has much less data and still works: https://www.mediawiki.org/wiki/Special:SkinDistributor

> Is it a problem if the query occasionally fails?

It gracefully falls back to hiding the info, which is not great but works. I think ExtensionDistributor's use of graphite as an API is unique in MediaWiki; maybe there's something better we can use?

I tried loading https://grafana.wikimedia.org/d/000000161/extension-distributor-downloads?orgId=1 and the panel for extensions (using roughly the same query) also timed out. I'm not sure if it's possible to optimize the query or record the data in a different way.

>> ExtensionDistributor querying graphite for 4w of data from all extensions multiple times a day (sometimes even every five minutes):

> There are two lists, one for extensions and one for skins, but it should be cached in memcached for 24h. It's entirely possible this caching has broken subtly somehow; no one has really touched it since it was originally written in late 2015.

Oh, but if the request fails, it won't cache anything and will just try again on the next request. We can probably add some negative caching.
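Extending the sketch above, negative caching would store a short-lived failure marker so subsequent page views back off instead of re-issuing the expensive query (TTL and sentinel value are illustrative, not taken from the actual patch):

```
# Negative-caching variant of the earlier sketch: on failure, cache a
# short-lived sentinel so the next requests skip Graphite for a while.
DAY = 24 * 3600
NEGATIVE_TTL = 10 * 60                               # hypothetical back-off window

def get_top_extensions(cache, fetch_top_extensions_from_graphite):
    key = 'extdist:popular-extensions'
    cached = cache.get(key)
    if cached == 'FAILED':                           # recent failure: skip Graphite
        return None
    if cached is not None:
        return cached
    data = fetch_top_extensions_from_graphite()
    if data is None:                                 # timeout or error
        cache.set(key, 'FAILED', expire=NEGATIVE_TTL)
        return None
    cache.set(key, data, expire=DAY)
    return data
```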

Change 734406 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/ExtensionDistributor@master] Add negative caching to Graphite stats requests

https://gerrit.wikimedia.org/r/734406

>> ExtensionDistributor querying graphite for 4w of data from all extensions multiple times a day (sometimes even every five minutes):

> There are two lists, one for extensions and one for skins, but it should be cached in memcached for 24h. It's entirely possible this caching has broken subtly somehow; no one has really touched it since it was originally written in late 2015.

Thank you for taking a look and the context!

>> I've temporarily applied another fix (still with a 6s timeout) https://gerrit.wikimedia.org/r/c/operations/puppet/+/734225 and things seem to have improved. I still see the occasional timeout for the query above. What's the use case for it?

> When it works, it displays a list of the top 15 most popular extensions at https://www.mediawiki.org/wiki/Special:ExtensionDistributor. The skin query probably has much less data and still works: https://www.mediawiki.org/wiki/Special:SkinDistributor

>> Is it a problem if the query occasionally fails?

> It gracefully falls back to hiding the info, which is not great but works. I think ExtensionDistributor's use of graphite as an API is unique in MediaWiki; maybe there's something better we can use?

> I tried loading https://grafana.wikimedia.org/d/000000161/extension-distributor-downloads?orgId=1 and the panel for extensions (using roughly the same query) also timed out. I'm not sure if it's possible to optimize the query or record the data in a different way.

AFAICS the current query will fetch the per-release data (e.g. MediaWiki.extdist.extensions.3D.REL1_29.sum), though there's also a general sum metric without the release (MediaWiki.extdist.extensions.3D.sum), which I'm assuming gets updated as well?

If so, then a query that loads less data could be the answer; I can indeed load sortByMaxima(groupByNode(summarize(MediaWiki.extdist.extensions.*.sum,"4w","sum",true),3,"sum")) in the graphite browser. Something else I'm not sure I understand is asking graphite for -4w of data: summarize() should already look back, so you could ask for a shorter timeframe in the HTTP API and the results should be the same (I have not verified this, though).
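For comparison, the lighter request against the render API would look roughly like this (a sketch using Python and requests; the target is the one quoted above, and from=-4w is kept since the shorter-window idea is unverified):

```
# Sketch of the lighter render request: per-extension aggregate metrics only
# (no REL* node in the path), so far fewer whisper files need to be read.
import requests

LIGHT_TARGET = ('sortByMaxima(groupByNode(summarize('
                'MediaWiki.extdist.extensions.*.sum,"4w","sum",true),3,"sum"))')

resp = requests.get(
    'https://graphite.wikimedia.org/render/',
    params={'target': LIGHT_TARGET, 'format': 'json',
            'from': '-4w', 'until': 'now'},
    timeout=10,                                      # arbitrary client-side timeout
)
resp.raise_for_status()
for series in resp.json():
    print(series['target'])
```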

Change 734406 merged by jenkins-bot:

[mediawiki/extensions/ExtensionDistributor@master] Add negative caching to Graphite stats requests

https://gerrit.wikimedia.org/r/734406

fgiunchedi claimed this task.

This is decidedly better now; tentatively resolving.

graphite1004:/var/log/apache2$ zgrep sortByMaxima other_vhosts_access.log.* | grep -c /200
152
graphite1004:/var/log/apache2$ zgrep sortByMaxima other_vhosts_access.log.* | grep -c /500
81