Add a graph of ORES Celery task queue length
Closed, Resolved (Public)

Description

We would connect to the Celery Redis instance and do:

ores1001.eqiad.wmnet:6379> LLEN celery

This number is interesting because it tells us how the task queue is behaving: whether it empties immediately, rides the upper bound, or fluctuates.
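For anyone who wants to poll this from a script rather than redis-cli, here is a minimal sketch that sends LLEN over a plain socket using the Redis wire protocol (RESP). The function names are made up for illustration, and it assumes the instance needs no AUTH and that the key is a list.

```python
import socket


def encode_command(*parts: str) -> bytes:
    """Encode a Redis command as a RESP array of bulk strings."""
    out = [f"*{len(parts)}\r\n".encode()]
    for part in parts:
        data = part.encode()
        out.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(out)


def parse_integer_reply(reply: bytes) -> int:
    """Parse a RESP integer reply such as b':3\\r\\n'."""
    if not reply.startswith(b":") or not reply.endswith(b"\r\n"):
        # Error replies start with '-', e.g. WRONGTYPE if the key is not a list.
        raise ValueError(f"unexpected reply: {reply!r}")
    return int(reply[1:-2])


def queue_length(host: str, port: int = 6379, key: str = "celery") -> int:
    """Return LLEN of `key` (assumption: no AUTH is required)."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(encode_command("LLEN", key))
        reply = b""
        while not reply.endswith(b"\r\n"):
            chunk = sock.recv(64)
            if not chunk:
                break
            reply += chunk
    return parse_integer_reply(reply)
```

Calling queue_length("ores1001.eqiad.wmnet") would then return the same number as the redis-cli command above.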

Event Timeline

The best way to handle this is to have a Prometheus exporter for Celery. I already found one: https://github.com/zerok/celery-prometheus-exporter
@akosiaris Do you think we can deploy this to prod?

Ladsgroup triaged this task as Medium priority. Jan 21 2019, 3:05 PM

We can probably get away with reusing https://github.com/oliver006/redis_exporter, which we already use. It has a check-keys parameter that allows us to count a list's elements. The implementation is a bit slower than in the next version, as it uses SCAN, but in my tests it returned within ~1s. As an example, I get the following in Prometheus:

redis_key_size{addr="localhost:6380",alias="",db="db0",key="foobar"} 4

after having executed LPUSH foobar 10 four times.
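To sanity-check the exporter output outside of Prometheus, a line like the one above can be parsed with a few lines of Python. This is only a sketch: parse_sample is a hypothetical helper, it handles just the simple "name{labels} value" form shown here (no timestamps, no unescaping of label values), and it is not how Prometheus itself parses the format.

```python
import re

# Matches: metric_name{label="value",...} <number>
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# Matches one label pair; the value may be empty or contain escaped characters.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Split one exposition-format line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labelled sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


line = 'redis_key_size{addr="localhost:6380",alias="",db="db0",key="foobar"} 4'
name, labels, value = parse_sample(line)
print(name, labels["key"], value)
```

For the ORES case the interesting sample would be the one whose key label is "celery".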

@Ladsgroup would that suffice? If yes, which is/are the key(s)?

That looks awesome! The key is "celery".

Change 486238 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ores: Enable checking of celery list key

https://gerrit.wikimedia.org/r/486238

Change 486238 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: Enable checking of celery list key

https://gerrit.wikimedia.org/r/486238

Change has been merged and deployed but up to now no data has been exported. The celery key currently looks empty so I guess this is expected?

Right. Only when we're approaching overload does the celery key contain entries. All workers need to be busy before we see anything.
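As a rough illustration of why the list stays empty most of the time, here is a toy model. It is an assumption-laden simplification, not how Celery is implemented: worker_slots stands for everything the workers can hold at once (running tasks plus anything they prefetch), and only the overflow beyond that remains in the Redis list.

```python
def visible_queue_length(outstanding_tasks: int, worker_slots: int) -> int:
    """Toy model: tasks beyond what the workers can hold stay in the list.

    Both parameters are hypothetical quantities chosen for illustration.
    """
    return max(0, outstanding_tasks - worker_slots)


# With 16 slots, nothing shows up until the 17th outstanding task.
for outstanding in (4, 16, 17, 40):
    print(outstanding, visible_queue_length(outstanding, worker_slots=16))
```

Under this model the graph sits at zero in normal operation and only rises once every worker is busy, which matches the empty key seen so far.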

akosiaris claimed this task.

Cool then. I think the last step before resolving this is creating a graph in https://grafana.wikimedia.org/d/000000255/ores. I just added a row there (collapsed by default) with the proper query. I can't currently test it, of course, so please have a look in case I made a mistake.

Thank you so much. I moved the panels to below scores. I hope you don't mind.

We'll know for sure during the next overload event.

Confirmed during an event on Jan 23rd.