
Set sensible thread limit to Blazegraph
Closed, Resolved · Public

Description

We had an incident where the thread count in Blazegraph rose to 9K and clogged the system:

https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=22&fullscreen&from=1538604561831&to=1538621054757

One thing we could do is add a filter that rejects new GET requests when the count of active threads is too high.

Another idea is to track thread usage per client and throttle clients that use too many threads. I am not sure whether this is feasible; it needs to be checked.
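
A minimal sketch of the per-client idea, under loud assumptions: it counts in-flight queries per client as a stand-in for thread usage (actual per-client thread accounting is the part that still needs checking), and `PerClientLimiter`, `tryStart`, `finish`, and the client key are hypothetical names, not anything from Blazegraph or the WDQS code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical limiter: in-flight queries per client, used as a proxy for threads. */
final class PerClientLimiter {
    private final int maxInFlightPerClient;
    private final ConcurrentHashMap<String, AtomicInteger> inFlight = new ConcurrentHashMap<>();

    PerClientLimiter(int maxInFlightPerClient) {
        this.maxInFlightPerClient = maxInFlightPerClient;
    }

    /** Reserves a slot and returns true if the client is under its limit. */
    boolean tryStart(String clientKey) {
        AtomicInteger count = inFlight.computeIfAbsent(clientKey, k -> new AtomicInteger());
        if (count.incrementAndGet() > maxInFlightPerClient) {
            count.decrementAndGet(); // undo the reservation, the client is over its limit
            return false;
        }
        return true;
    }

    /** Releases a slot; meant to be called from a finally block when the query ends. */
    void finish(String clientKey) {
        AtomicInteger count = inFlight.get(clientKey);
        if (count != null) {
            count.decrementAndGet();
        }
    }
}
```

The sketch never evicts idle clients from the map, so a real version would need some cleanup, and it says nothing about how a client is identified (IP, user agent, or something else).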

Event Timeline

Smalyshev created this task. Oct 4 2018, 4:27 AM
Restricted Application added a project: Wikidata. Oct 4 2018, 4:27 AM
Restricted Application added a subscriber: Aklapper.
Smalyshev updated the task description. Oct 4 2018, 4:28 AM

Bryan advises against setting hard limits on the executor, so the options for limiting thread growth are:

  • Not launching new queries if the thread count is too high
  • Limiting the number of simultaneous queries
  • Limiting PipelineOp.Annotations.MAX_PARALLEL
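
As an illustration of the second option only, here is a rough sketch that caps simultaneous queries with a `java.util.concurrent.Semaphore` and rejects instead of queueing. `ConcurrentQueryLimiter` and its methods are hypothetical names, and the sketch does not reflect how Blazegraph schedules queries internally.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical cap on the number of queries running at the same time. */
final class ConcurrentQueryLimiter {
    private final Semaphore permits;

    ConcurrentQueryLimiter(int maxConcurrentQueries) {
        this.permits = new Semaphore(maxConcurrentQueries);
    }

    /** Runs the query if a permit is free; otherwise fails fast without waiting. */
    <T> T run(Supplier<T> query) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("too many concurrent queries");
        }
        try {
            return query.get();
        } finally {
            permits.release(); // always give the permit back, even if the query throws
        }
    }

    /** Number of queries that could still be started right now. */
    int available() {
        return permits.availablePermits();
    }
}
```

Failing fast rather than blocking is a deliberate choice here: a blocked caller would itself hold a thread, which is the resource this is trying to protect.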
Smalyshev triaged this task as Normal priority. Nov 8 2018, 7:03 AM
Smalyshev moved this task from Backlog to Next on the User-Smalyshev board. Nov 8 2018, 6:40 PM
Smalyshev added a comment. Edited Nov 16 2018, 8:09 PM

Looking at the performance graphs, in regular operation the number of threads stays well under 1000 (the maximum over 3 months across all public servers is 1014). About 300 of those come from non-Executor services, so the normal count of executorService threads is around 700. I think that if we start to refuse service when the executorService thread count is around 2000-3000, it would give us a comfortable margin above the usual workload while not allowing it to get into too much trouble.
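
A minimal sketch of such a threshold check, not the merged patch: it assumes the executor is a `ThreadPoolExecutor`, uses `getPoolSize()` as the thread count, and answers with HTTP 503 when over the limit. `ThreadCountGate`, the status choice, and the way the count is obtained are all assumptions made for illustration.

```java
import java.util.concurrent.ThreadPoolExecutor;
import java.util.function.IntSupplier;

/** Hypothetical gate that refuses new queries once the executor has too many threads. */
final class ThreadCountGate {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503; // assumed response for refused queries

    private final IntSupplier threadCount;
    private final int maxThreads;

    ThreadCountGate(IntSupplier threadCount, int maxThreads) {
        this.threadCount = threadCount;
        this.maxThreads = maxThreads;
    }

    /** Builds a gate that reads the current pool size of the given executor. */
    static ThreadCountGate forExecutor(ThreadPoolExecutor executor, int maxThreads) {
        return new ThreadCountGate(executor::getPoolSize, maxThreads);
    }

    /** Status a filter would send before the query is launched. */
    int statusForNewQuery() {
        return threadCount.getAsInt() > maxThreads ? SERVICE_UNAVAILABLE : OK;
    }
}
```

With a limit of 2000, a normal load of about 700 executor threads passes, while a runaway count is refused before it adds more work. The gate only stops new queries; queries already running can still push the count somewhat above the limit.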

Change 474374 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Start refusing queries if executor has more than 2000 threads running

https://gerrit.wikimedia.org/r/474374

Smalyshev added a comment. Edited Nov 16 2018, 11:59 PM

Testing with the thread-limiting patch on wdqs1010, I see that the thread count never goes over 2700, and as soon as the load is removed, the service starts recovering within minutes, with no lingering effects.

I think we should try running with this patch and see if it limits the runaway scenario.

Another thing we might want to consider is that requests sit in the Jetty queue even before they reach QueryServlet. Under high load, many of these requests will already have been abandoned by their clients (disconnected, etc.) by the time they reach execution. I wonder whether we could check for this: if a request has been sitting in the queue for more than X time, we had better give up on it, because the client probably has done so anyway.
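
The queue-age check could be sketched as follows. This assumes something records a timestamp when the request is accepted, which is not shown here and is not an existing Jetty or WDQS feature as far as this thread establishes; `QueuedRequest` and `isStale` are hypothetical names, and X is left as a parameter.

```java
import java.time.Duration;

/** Hypothetical record of when a request entered the queue. */
final class QueuedRequest {
    private final long enqueuedAtNanos;

    /** @param enqueuedAtNanos a System.nanoTime() reading taken when the request arrived */
    QueuedRequest(long enqueuedAtNanos) {
        this.enqueuedAtNanos = enqueuedAtNanos;
    }

    /** True if the request has waited longer than maxQueueTime and should be dropped. */
    boolean isStale(long nowNanos, Duration maxQueueTime) {
        // Subtracting two nanoTime readings stays correct even if the counter wraps.
        return nowNanos - enqueuedAtNanos > maxQueueTime.toNanos();
    }
}
```

A servlet or filter would call `isStale(System.nanoTime(), limit)` just before starting the query and skip execution if it returns true.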

Smalyshev moved this task from Next to In review on the User-Smalyshev board. Nov 17 2018, 12:05 AM
Smalyshev moved this task from In review to Done on the User-Smalyshev board. Nov 17 2018, 12:27 AM

Change 474374 merged by jenkins-bot:
[wikidata/query/rdf@master] Start refusing queries if executor has more than 2000 threads running

https://gerrit.wikimedia.org/r/474374

Smalyshev closed this task as Resolved. Nov 21 2018, 6:11 AM
Smalyshev claimed this task.