
Numerous Graphite queries stopped working (timeout after 6s)
Closed, Resolved · Public

Description

Since around 14 Dec (possibly earlier), I've noticed that many Graphite queries that used to work fine are now timing out.

I first reported this on IRC in the wikimedia-sre channel on 15 Dec:

<Krinkle> hm.. have we added restrictions to Graphite recently in terms of timeouts?
<Krinkle> https://grafana.wikimedia.org/d/000000430/resourceloader-modules-overview?orgId=1
<Krinkle> I can't seem to load the latency graphs here, showing an error each time due to "time out after 6.0 seconds"
<Krinkle> not sure why that's taking 6s though
<Krinkle> tried removing the transforms and reducing from 3d to 12h, but no dice
<Krinkle> query: 'MediaWiki.resourceloader_build.*.p99'

Filing this on Phab as well, as it appears to be a persistent issue.

For example, "High avg build rate" at:

Graphite encountered an unexpected error while handling your request.
Please contact your site administrator if the problem persists.

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/graphite/worker_pool/pool.py", line 113, in pool_exec
    job = queue.get(True, wait_time)
  File "/usr/lib/python3.9/queue.py", line 179, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/graphite/storage.py", line 105, in wait_jobs
    for job in self.pool_exec(jobs, timeout):
  File "/usr/lib/python3/dist-packages/graphite/worker_pool/pool.py", line 115, in pool_exec
    raise PoolTimeoutError("Timed out after %fs" % (time.time() - start))
graphite.worker_pool.pool.PoolTimeoutError: Timed out after 6.000199s

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python3/dist-packages/graphite/errors.py", line 101, in new_f
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/graphite/render/views.py", line 129, in renderView
    data.extend(evaluateTarget(requestContext, targets))
  File "/usr/lib/python3/dist-packages/graphite/render/evaluator.py", line 19, in evaluateTarget
    prefetchData(requestContext, pathExpressions)
  File "/usr/lib/python3/dist-packages/graphite/render/datalib.py", line 292, in prefetchData
    for result in STORE.fetch(pathExpressions, startTime, endTime, now, requestContext):
  File "/usr/lib/python3/dist-packages/graphite/storage.py", line 187, in fetch
    results = self.wait_jobs(jobs, settings.FETCH_TIMEOUT,
  File "/usr/lib/python3/dist-packages/graphite/storage.py", line 122, in wait_jobs
    raise Exception(message)
Exception: Timed out after 6.000234s for fetch for ['MediaWiki.resourceloader_build.*.sample_rate']

Event Timeline

I've tried simplifying the query in question, but it makes no difference. The exception raised from Graphite's Python code suggests the timeout happens at the lower level of an individual fetch, not in the overall query processing per se. Hence I'm guessing that the complexity of the combined transforms and functions is not the problem.
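For illustration, here is a minimal Python sketch (not Graphite's actual code, though it mirrors the pool_exec/wait_jobs frames in the traceback above) of why one slow fetch exhausts the shared wall-clock budget regardless of how simple the target expression is:

import queue
import threading
import time

FETCH_TIMEOUT = 6.0  # matches settings.FETCH_TIMEOUT in the traceback above

def pool_exec(jobs, timeout):
    # Run each job in a worker thread; drain results from a shared queue
    # with a single wall-clock budget for the whole batch.
    results = queue.Queue()
    for job in jobs:
        threading.Thread(target=lambda j=job: results.put(j()), daemon=True).start()
    start = time.time()
    for _ in jobs:
        wait_time = max(timeout - (time.time() - start), 0)
        try:
            yield results.get(True, wait_time)
        except queue.Empty:
            raise TimeoutError("Timed out after %fs" % (time.time() - start))

def slow_fetch():
    # Stands in for globbing and reading thousands of whisper files.
    time.sleep(7)
    return "data"

try:
    list(pool_exec([slow_fetch], FETCH_TIMEOUT))
except TimeoutError as e:
    print(e)  # Timed out after 6.0...s

In this pattern the budget is shared across all fetch jobs, so a single fetch that globs thousands of series can blow the entire timeout on its own.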

Krinkle updated the task description.

I took a quick look and the timeout seems related to the number of metrics matching MediaWiki.resourceloader_build.*.sample_rate (~5,000). For example, after narrowing the query to MediaWiki.resourceloader_build.wikibase*.sample_rate I can load 2 of 30 days of data. The change might be due to either (or both) of:

  • The Graphite Bullseye migration (i.e. graphite upgrade)
  • The cardinality of metrics matching MediaWiki.resourceloader_build.*.sample_rate has increased recently

We can test bumping FETCH_TIMEOUT to e.g. 12s and see how far that gets us.
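For reference, FETCH_TIMEOUT is a graphite-web setting (the settings.FETCH_TIMEOUT in the traceback above). A minimal sketch of the change in local_settings.py; the actual change went through puppet (see the Gerrit change below), so the file location and exact value may differ:

# local_settings.py (graphite-web; exact path is an assumption, it is
# puppet-managed here)
FETCH_TIMEOUT = 12  # wall-clock budget for fetching all matched series; was 6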

5,000 sounds about right; I think it was previously around 4,000.

# count metric directories under resourceloader_build:
/srv/carbon/whisper/MediaWiki/resourceloader_build$ ls -l | wc -l
5047
# of those, how many have a sample_rate.wsp not written to in over 10 days (i.e. stale):
/srv/carbon/whisper/MediaWiki/resourceloader_build$ find . -maxdepth 2 -mtime +10 -name 'sample_rate.wsp' | xargs -I '%' dirname % | wc -l
1224

The stale entries look normal and organic; fortunately nothing is exploding or unstably creating new metrics. The noise is mostly from extensions that were refactored or undeployed over the past two years. Results at P18389.

I've pruned these from graphite1004 and graphite2002; there are now ~3800 entries left.
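A minimal sketch of such a prune, mirroring the find(1) invocation above (not the actual commands run; the root path and the 10-day staleness threshold are taken from this task):

#!/usr/bin/env python3
# Hypothetical prune of metric directories whose sample_rate.wsp hasn't
# been written to in over 10 days. Dry-run by default.
import shutil
import time
from pathlib import Path

ROOT = Path("/srv/carbon/whisper/MediaWiki/resourceloader_build")
CUTOFF = time.time() - 10 * 86400  # mtime older than 10 days
DRY_RUN = True

stale = sorted(p.parent for p in ROOT.glob("*/sample_rate.wsp")
               if p.stat().st_mtime < CUTOFF)
print(f"{len(stale)} stale metric directories")
for d in stale:
    if DRY_RUN:
        print("would remove", d)
    else:
        shutil.rmtree(d)  # removes the whole metric directory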

I don't know whether the cardinality knowledge is cached for long (in my experience, changes to whisper data generally take effect immediately, or within a minute), but the same queries still seem to time out after 6 seconds.

Change 751686 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: bump FETCH_TIMEOUT

https://gerrit.wikimedia.org/r/751686

Change 751686 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: bump FETCH_TIMEOUT

https://gerrit.wikimedia.org/r/751686

Krinkle assigned this task to fgiunchedi.

Works for me. Thanks!