gerrit-replica 502/OOM again
Closed, Resolved · Public

Event Timeline

Restricted Application added a subscriber: Aklapper.

gerrit-replica is 502ing again, probably because of OOM.

Jun 10 20:06:09 codesearch6 docker[28181]: 2020/06/10 20:06:09 Failed to git fetch /data/data/vcs-12e933bde61f91eb6f3be5a28027f9dcbe79a111, see output below
Jun 10 20:06:09 codesearch6 docker[11226]: 2020/06/10 20:06:09 vcs pull error (Extension:Flow - https://gerrit-replica.wikimedia.org/r/mediawiki/extensions/Flow.git): exit status 128
Jun 10 20:06:09 codesearch6 docker[11226]: Continuing...
Jun 10 20:06:09 codesearch6 docker[11226]: fatal: unable to access 'https://gerrit-replica.wikimedia.org/r/mediawiki/extensions/Flow.git/': The requested URL returned error: 502
Legoktm renamed this task from Flow isn't included in `deployed` search results to gerrit-replica 502/OOM again. (Jun 11 2020, 10:02 AM)
Legoktm triaged this task as High priority.
Legoktm added a project: Gerrit.
Legoktm added subscribers: thcipriani, hashar.

10:21 < mutante> !log restarting gerrit on gerrit-replica (gerrit2001) - java.lang.OutOfMemoryError: Java heap space

On gerrit2001, it seems to have run out of Java heap space at 10:13 UTC:

[2020-06-11 10:13:26,029] [HTTP-95987] ERROR com.google.gerrit.pgm.http.jetty.HiddenErrorHandler : Error in GET /r/mediawiki/skins/MonoBook.git/info/refs?service=git-upload-pack
java.lang.OutOfMemoryError: Java heap space

And restarted automagically?

[2020-06-11 10:22:56,113] [main] INFO  com.google.gerrit.server.cache.h2.H2CacheFactory : Enabling disk cache /var/lib/gerrit2/review_site/cache

The Java heap is set to 32G (-Xmx32g).

We had the issue earlier today apparently:

[2020-06-11 06:16:38,803] [HTTP-95844] WARN  org.eclipse.jetty.servlet.ServletHandler : Error for /r/mediawiki/extensions/TitleBlacklist.git/info/refs
java.lang.OutOfMemoryError: Java heap space

And I guess it kept happening until the service restarted somehow.
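To see how often this happened over a day, the log can be scanned for `java.lang.OutOfMemoryError` lines and each one paired with the timestamped entry just before it. The following is a minimal sketch, assuming the log keeps the `[timestamp] [thread] LEVEL ...` layout shown in the excerpts above; the function name `find_oom_events` is made up for illustration.

```python
import re
from datetime import datetime

# Matches the "[2020-06-11 10:13:26,029] [HTTP-95987] ..." prefix seen in the excerpts.
ENTRY = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\] \[([^\]]+)\]")
OOM = "java.lang.OutOfMemoryError"


def find_oom_events(lines):
    """Return (timestamp, thread) of the log entry preceding each OOM line.

    Each timestamped entry is reported at most once, even if several
    OutOfMemoryError lines follow it (for example in a stack trace).
    """
    events = []
    last = None
    for line in lines:
        match = ENTRY.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            last = (when, match.group(2))
        elif OOM in line and last is not None:
            events.append(last)
            last = None  # do not report the same entry twice
    return events


sample = [
    "[2020-06-11 06:16:38,803] [HTTP-95844] WARN  org.eclipse.jetty.servlet.ServletHandler : Error for /r/mediawiki/extensions/TitleBlacklist.git/info/refs",
    "java.lang.OutOfMemoryError: Java heap space",
    "[2020-06-11 10:13:26,029] [HTTP-95987] ERROR com.google.gerrit.pgm.http.jetty.HiddenErrorHandler : Error in GET /r/mediawiki/skins/MonoBook.git/info/refs?service=git-upload-pack",
    "java.lang.OutOfMemoryError: Java heap space",
]

for when, thread in find_oom_events(sample):
    print(when.isoformat(sep=" "), thread)
```

On a real host the `lines` argument could simply be an open file handle for the Gerrit error log.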

> And restarted automagically?

No, see comment above.

hashar claimed this task.

And it looks like a good number of requests originate from codesearch6.codesearch.eqiad1.wikimedia.cloud, which apparently tries to crawl the repository as fast as it can :-(

For reference:

[Attached image: usedMemory.png]

> And it looks like a good number of requests originate from codesearch6.codesearch.eqiad1.wikimedia.cloud, which apparently tries to crawl the repository as fast as it can :-(

It's set to crawl every hour, but because of how many repositories there are, it might actually be constantly crawling. I'll bump it up to every 90 minutes.
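A rough way to check the "constantly crawling" guess is to compare the time one full pass takes with the poll interval. This is only a sketch: the repository count and per-fetch time below are hypothetical placeholders, not measurements from codesearch6, and it assumes fetches run one after another.

```python
def crawl_duty_cycle(repo_count, seconds_per_fetch, poll_interval_minutes):
    """Fraction of each poll interval spent fetching, assuming sequential fetches.

    A value of 1.0 or more means a pass does not finish before the next
    one is due, so the crawler never goes idle.
    """
    cycle_seconds = repo_count * seconds_per_fetch
    return cycle_seconds / (poll_interval_minutes * 60)


# Hypothetical inputs: 2000 repositories, 2 seconds per fetch.
for interval in (60, 90):
    ratio = crawl_duty_cycle(2000, 2.0, interval)
    print(f"{interval} min interval: duty cycle {ratio:.2f}")
```

With these made-up inputs a 60-minute interval is oversubscribed while a 90-minute one leaves some idle time; real numbers would have to come from the codesearch host.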

We could also have codesearch crawl from a local copy of the repositories by replicating Gerrit repositories to the codesearch host. But I guess we would then want to promote codesearch to prod.

> We could also have codesearch crawl from a local copy of the repositories by replicating Gerrit repositories to the codesearch host. But I guess we would then want to promote codesearch to prod.

Is there documentation on how this would work? What requires it to be in prod?

> We could also have codesearch crawl from a local copy of the repositories by replicating Gerrit repositories to the codesearch host. But I guess we would then want to promote codesearch to prod.

> Is there documentation on how this would work? What requires it to be in prod?

One implication of going to production is that it wouldn't be able to speak to the outside, meaning it wouldn't be able to index GitHub, etc. I'm not against moving it to production, just flagging this problem.

> We could also have codesearch crawl from a local copy of the repositories by replicating Gerrit repositories to the codesearch host. But I guess we would then want to promote codesearch to prod.

> Is there documentation on how this would work? What requires it to be in prod?

Config documentation is https://gerrit.wikimedia.org/r/plugins/replication/Documentation/config.md
The Wikitech documentation is mainly about troubleshooting, but there may be something valuable there as well: https://wikitech.wikimedia.org/wiki/Gerrit#Replication

We currently replicate to GitHub and the Gerrit replica via the replication plugin. We use one worker thread to replicate to both, and this keeps up with traffic (with occasional delays on the order of minutes). One concern might be stability, e.g. if the one worker thread is tied up doing retries; however, we likely have room to add another replication worker.
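For illustration, an extra replication target would be one more `[remote "..."]` stanza in the plugin's `replication.config`. The sketch below just renders such a stanza; the remote name `codesearch`, the host and path in the URL, and the retry value are assumptions made up for the example, not our actual config. The keys `url`, `threads`, `mirror` and `replicationRetry` are options described in the config documentation linked above.

```python
def render_remote(name, url, threads=1, extra=None):
    """Render one [remote "<name>"] stanza in git-config syntax."""
    lines = [f'[remote "{name}"]', f"\turl = {url}", f"\tthreads = {threads}"]
    for key, value in (extra or {}).items():
        lines.append(f"\t{key} = {value}")
    return "\n".join(lines) + "\n"


# Hypothetical target: host and path are placeholders, not a real endpoint.
# The plugin substitutes the literal ${name} with each project name.
stanza = render_remote(
    "codesearch",
    "gerrit2@codesearch.example:/srv/git/${name}.git",
    threads=1,
    extra={"mirror": "true", "replicationRetry": 1},
)
print(stanza)
```

Giving the new remote its own `threads` setting is one way to keep its retries from tying up the worker that serves the existing targets, though whether that is needed would depend on how the current remotes are configured.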