Gerrit DoS
Closed, ResolvedPublic

Description

Gerrit ssh interactive commands can sometime no more be possible due to starved thread pool. This morning git operations on Gerrit over ssh ended up being quite slow, the reason is we had too many idling tasks starving the thread pool of eight workers. That is a denial of service.

From P6455

TaskStatusSinceCommand
d573f50aDec-03 15:26git-receive-pack '/mediawiki.git' (ppabcd)
09db7204Dec-07 14:40git-receive-pack '/mediawiki/extensions/BlueSpiceUEModulePDF.git' (itspiderman990123)
a8045f28Dec-08 20:08git-upload-pack '/mediawiki/extensions/GeoData.git' (umherirrender)
408e382607:44:51.705git-upload-pack '/mediawiki/extensions/ContentTranslation.git' (santhosh)
404078a907:45:24.207git-upload-pack '/pywikibot/core' (xqt)
6ae9db3207:51:40.616git-upload-pack '/pywikibot/core' (xqt)
7117981708:10:11.564git-upload-pack p/mediawiki/core.git
0aaa9f03waiting ....07:52:14.026git-upload-pack '/apps/android/wikipedia.git' (kaartic)
aadb53fbwaiting ....07:55:04.967git-receive-pack '/pywikibot/core' (xqt)
e7dae6c9waiting ....07:59:49.756git-receive-pack '/pywikibot/core' (xqt)

The reason is we apparently have solely 8 worker threads and an idle timeout set at 10 days:

sshd.threads8
sshd.idleTimeout864,000 seconds

The reason is Mina SSHD timed out ( https://gerrit.wikimedia.org/r/#/c/58126/ - T49004 ) which caused Zuul to no more receive events over gerrit stream-events after some time (T48917).

It is supposedly fixed since Gerrit 2.6 by @QChris since https://gerrit-review.googlesource.com/c/gerrit/+/44472 (79b91bfe).

sshd.threads defaults to 1.5 times the number of CPU available to the JVM. We have it set to 8, probably we could raise it a bit.

sshd.idleTimeout should definitely be lowered to avoid pilling up ghost connections over and over until the thread pool ends up starved. Careful monitoring of Zuul and gerrit stream-events should be conducted once changed.

Note sshd.batchThreads (set to 2) is taken from the sshd.threads. So we really have 6 workers for the interactive commands (see below for detail).


Low level details

$ gerrit show-queue --wide --by-queue

Queue: SSH-Interactive-Worker
3256276e              09:30:03.327      git-upload-pack p/operations/puppet
d2646b5b              09:30:05.841      git-upload-pack p/operations/puppet
524cfbdf              09:30:07.216      git-upload-pack p/operations/puppet
3251477d              09:30:07.690      git-upload-pack p/operations/puppet
------------------------------------------------------------------------------
  4 tasks, 6 worker threads
          ^^^
java/com/google/gerrit/sshd/CommandExecutorQueueProvider.java
public class CommandExecutorQueueProvider implements QueueProvider {
  ...
  public CommandExecutorQueueProvider(
      @GerritServerConfig Config config,
      ThreadSettingsConfig threadsSettingsConfig,
      WorkQueue queues) {

    poolSize = threadsSettingsConfig.getSshdThreads();
    # which is  sshdThreads = cfg.getInt("sshd", "threads", 2 * cores);
    # And thus is set at 8 with sshd.threads = 8

    batchThreads =
        config.getInt("sshd", "batchThreads", threadsSettingsConfig.getSshdBatchTreads());
        # we have sshd.batchThreads = 2

    if (batchThreads > poolSize) {
      poolSize += batchThreads;
    }

    int interactiveThreads = Math.max(1, poolSize - batchThreads);
    # 8 -2 = 6

    interactiveExecutor =
        queues.createQueue(interactiveThreads, "SSH-Interactive-Worker", Thread.MIN_PRIORITY);
    if (batchThreads != 0) {
      batchExecutor = queues.createQueue(batchThreads, "SSH-Batch-Worker", Thread.MIN_PRIORITY);
    } else {
      batchExecutor = interactiveExecutor;
    }
  }
hashar created this task.Dec 13 2017, 9:15 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 13 2017, 9:15 AM
hashar updated the task description. (Show Details)
hashar updated the task description. (Show Details)
hashar added a subscriber: demon.
hashar updated the task description. (Show Details)Dec 13 2017, 9:37 AM
Aklapper renamed this task from Gerrit to Gerrit DoS.Dec 13 2017, 11:45 AM
demon closed this task as Resolved.Jan 24 2018, 6:40 PM
demon claimed this task.

Fixed by doubling the thread pool from 8 -> 16.

demon triaged this task as Unbreak Now! priority.Jan 24 2018, 6:40 PM
demon changed the visibility from "Custom Policy" to "Public (No Login Required)".
Restricted Application added subscribers: Jay8g, TerraCodes. · View Herald TranscriptJan 24 2018, 6:40 PM
demon added a comment.Jan 24 2018, 6:40 PM

This happened today, so went ahead and pushed a fix (hence the after-the-fact-UBN!)

Separate question from Reliability side, would it be possible in the short future to serve gerrit from more than one machine at a time- is it a stateless service, aside from the database?

demon added a comment.Jan 24 2018, 6:52 PM

No, it's not stateless:

  • Git repositories on disk. We replicate, but it's not instantaneous enough for me to trust dual-writes.
  • There's a local-disk cache, plus in-memory caches. I'd love to combine the two to use like Memcached or Redis (cf: T152802) which would allow us to remove some of the local state

Ideally we'd also solve T165631 first too.

Shorter term: I'd love to see us get the slave at least usable enough to where it can serve read-only traffic. This would allow us to swap Puppet's git::clone{} calls to use the slave to reduce some master load. This depends on T176532.

Thanks, FYI T176532 should be ready to start working on it (purchase proxies) after this quarter finishes, when all misc database services and the current eqiad proxies have been failed over to new servers and upgraded/restarted.