Page MenuHomePhabricator

biterg.io Gerrit crawling probably stresses the server too much
Closed, ResolvedPublic

Description

While investigating the various Gerrit issue we have, notably with memory usage and garbage collection, I eventually found @Owl bot.

Every 30 seconds it does a query such as:

gerrit query limit:100 (status:open OR status:closed)
AND NOT (502978 OR 478688 OR 266726 OR 266725 OR 256050 OR 235852 OR 143269 OR 102114 OR 101842 OR 99101
OR 58858 OR 32902 OR 13930 OR 225009 OR 225033 OR 225080 OR 225087 OR 225099) --all-approvals --comments --format JSON --start 145500

On September 30th some change have been made and the limit is now set to 10. That would in turn mean the bot is crawling for wayy longer.

I have disabled the owl account entirely for now to check whether it improve Gerrit memory / gc pressure. Will reenable it eventually.

I would guess the limit should instead be raised. Probably to 200 or even more if that works.

Event Timeline

Mentioned in SAL (#wikimedia-releng) [2019-10-01T13:55:12Z] <hashar> Gerrit: disabled bot account "owl" , see whether that helps on the JVM gc issue T234328

Looking up for owl queries made over ssh using: grep 'owl.*gerrit\.query' /var/log/gerrit/sshd_log|cut -d\ -f-2

One request every 30 seconds:

[2019-10-01 12:24:07,060
[2019-10-01 12:24:35,913
[2019-10-01 12:25:06,550
[2019-10-01 12:25:36,660
[2019-10-01 12:26:03,846
[2019-10-01 12:26:37,820
[2019-10-01 12:27:10,626
[2019-10-01 12:27:39,347

No more queries for 7 minutes and it starts again:

[2019-10-01 12:34:48,070
[2019-10-01 12:35:31,996
[2019-10-01 12:36:08,491
[2019-10-01 12:36:44,865
[2019-10-01 12:37:16,262
[2019-10-01 12:37:46,711

And here is the corresponding CPU usage on Gerrit which shows a large drop when the bot has been stopped:

gerrit_cpu.png (308×707 px, 30 KB)

We lack proper metrics, but the bot causes Gerrit to load a ton of objects in memory which eventually triggers the garbage collector constantly and in turn uses a lot of CPU :/

Mentioned in SAL (#wikimedia-releng) [2019-10-01T13:55:12Z] <hashar> Gerrit: disabled bot account "owl" , see whether that helps on the JVM gc issue T234328

seems to have stopped GC from thrashing at any rate:

cpu2019-10-01.png (499×1 px, 38 KB)

Another option is to re-point the bitergia bot at https://gerrit-replica.wikimedia.org -- that would take a lot of load off of the primary gerrit server and (in theory) the replica should never be very out-of-date (order of magnitude of minutes at most).

@Aklapper, can you fix the configuration for bitergia to change from gerrit.wikimedia.org to gerrit-replica.wikimedia.org?

Thanks; I've raised an internal ticket with Bitergia to bring this issue to their attention and asked them t switch from gerrit.wm.o to gerrit-replica.wm.o (Note to myself: #69)

Thanks; I've raised an internal ticket with Bitergia to bring this issue to their attention and asked them t switch from gerrit.wm.o to gerrit-replica.wm.o (Note to myself: #69)

thanks! <3 bitergia

It seems like something may have changed about how they were crawling us last week sometime, causing a lot of load.

@Aklapper I have disabled the account to check whether the queries they did were causing too much load. Since that largely relaxed the CPU pressure I kept it disabled :-\

I would be interested to find out what kind of data they require for their analysis. Based on the query they are doing, it seems they just need the reviews infos / votes for all changes. Most probably that could be largely optimized by just cloning the repositories and then fetching the code review metadata.

Whenever a change is submitted by Gerrit and merged, it creates an entry under refs/notes/review. So one can:

clone and fetch:

git clone https://gerrit.wikimedia.org/test/gerrit-ping.git
git fetch origin refs/notes/review:refs/notes/review

Show the metadata which are now available locally:

$ git show --show-notes=review
commit c7cedd024b444219d4fc63a5210534dbec2771bb (origin/master, origin/HEAD)
Author: Antoine Musso <hashar@free.fr>
Date:   Tue Jan 8 21:38:15 2019 +0100

    Demo change for documentation
    
    Change-Id: I087c761076d9e34ac97412f3a281c464fdec753a

Notes (review):
    Verified+2: jenkins-bot
    Reviewed-on: https://gerrit.wikimedia.org/r/482886
    Project: test/gerrit-ping
    Branch: refs/heads/master

Which gives an audit trail of votes for the change.

Alternate command:

$ git notes --ref=review show HEAD
Verified+2: jenkins-bot
Reviewed-on: https://gerrit.wikimedia.org/r/482886
Project: test/gerrit-ping
Branch: refs/heads/master

The idea is that a single git fetch per repo would bring in all the metadata. They could then parse them locally which would make it way faster (saving time from the network round trip and Gerrit server side processing) :]

Bitergia asked internally:

can you double-check that the replica server (https://gerrit-replica.wikimedia.org/r/) is working? We got a not found error.

Bitergia asked internally:

can you double-check that the replica server (https://gerrit-replica.wikimedia.org/r/) is working? We got a not found error.

Hrm. I guess this depends on what requests they're making. The repos are all available there, but not the changes api it seems:

1(/^ヮ^)/*:・゚✧ curl https://gerrit.wikimedia.org/r/changes/operations%2fpuppet~310719/
2)]}'
3{
4 "id": "operations%2Fpuppet~production~I25089e194739e799e8f8c63fc74a70e25863524e",
5 "project": "operations/puppet",
6 "branch": "production",
7 "topic": "T135427",
8 "hashtags": [],
9 "change_id": "I25089e194739e799e8f8c63fc74a70e25863524e",
10 "subject": "Beta: Clean puppetmaster cherry-picks",
11 "status": "NEW",
12 "created": "2016-09-15 01:17:17.000000000",
13 "updated": "2019-01-16 19:32:11.000000000",
14 "submit_type": "FAST_FORWARD_ONLY",
15 "mergeable": false,
16 "insertions": 296,
17 "deletions": 0,
18 "unresolved_comment_count": 0,
19 "has_review_started": true,
20 "_number": 310719,
21 "owner": {
22 "_account_id": 2321
23 }
24}
25~
26(/^ヮ^)/*:・゚✧ curl https://gerrit-replica.wikimedia.org/r/changes/operations%2fpuppet~310719/
27Not Found~

One other question for bitergia folks: has anything changed about their requests on their side in the past week or so? It seems like this suddenly changed although I know I've seen the bot crawling us previously.

Indeed https://gerrit.wikimedia.org/r/ is empty since the replica does not offer a web interface for the users. But the git repositories are exposed nonetheless.

So when they are doing git clone / git fetches to https://gerrit.wikimedia.org/r/mediawiki/core.git , they should instead use:

https://gerrit-replica.wikimedia.org/r/mediawiki/core.git
              ^^^^^^^^

The changes crawling is done over ssh on port 29418. So they should just have to change the target host to gerrit-replica.wikimedia.org. Though they are account should be desactivated on that host as well, so we would have to set active again.

One other question for bitergia folks: has anything changed about their requests on their side in the past week or so? It seems like this suddenly changed although I know I've seen the bot crawling us previously.

Some changes have been done end of september, but overall I feel like we always had the issue. The crawling of all changes just put too much pressure, be it with a limit of 100 or a limit of 10 as has been recently changed.

@Aklapper I am definitely willing to assist them with the migration and answer whatever question they might have related to Gerrit. Either over IRC or a Google Meet :]

Bitergia currently run into:

$ ssh -p 29418 owl@gerrit-replica.wikimedia.org gerrit --version
Permission denied (publickey).

@hashar: Valcos sent you an email to potentially sort this out, if you might find some time (as I'm going to be less available for the next days)...?

Mentioned in SAL (#wikimedia-releng) [2019-10-09T20:01:47Z] <hashar> gerrit set-account owl --active # T234328

Sorry I have completely missed @Aklapper reply from last week and haven't paid attention to my emails. The reason was that I had marked the bot account owl inactive. That prevents ssh access and applies to both gerrit.wikimedia.org and gerrit-replica.wikimedia.org.

I have marked the account active and replied to the email.

Keeping this task open since we will want to monitor the CPU/JVM behavior on the Gerrit replica (gerrit2001.wikimedia.org).

Grafana overview: https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=gerrit2001&var-datasource=codfw%20prometheus%2Fops&var-cluster=misc

Even after I activated the account again on the master Gerrit, the replica still refuses connection over ssh:

owl - AUTH FAILURE FROM X.Y.Z.A inactive-account

I have flushed the replica accounts cache with: gerrit flush-caches --cache accounts. No idea whether it will work.

I could not find where the account is flagged as inactive. It is apparently not in All-Users.git. I also lack admin rights on the replica.

After flushing the accounts cache, the bot managed to ssh in \o/

It does queries such as:

gerrit.query.limit:10.(status:open.OR.status:closed).AND.NOT.(502978.OR.478688.OR.266726.OR.266725.OR.256050.OR.235852.OR.143269.OR.102114.OR.101842.OR.99101.OR.58858.OR.32902.OR.13930.OR.225009.OR.225033.OR.225080.OR.225087.OR.225099).--all-approvals.--comments.--format=JSON.--start=0

Might be worth checking with a limit of 100 instead.

Verified with Valerio. The issue was two folds:

  • account has been left deactivated
  • activating the account again on the master does not cause the replica to notice due to a local cache

The bot does crawl the replica over ssh now.

I have recommended to raise the limit from 10 to 100 or maybe even 250. That would make the crawling way faster.

There are some manual check done via the Gerrit UI, unfortunately it is not available on the replica ( https://gerrit-replica.wikimedia.org/r/ -> 404). But given they are manual requests, they can be done on the primary server instead ( https://gerrit.wikimedia.org/ ).

In the future, we will want to monitor the replica CPU usage at https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=gerrit2001&var-datasource=codfw%20prometheus%2Fops&var-cluster=misc

But it is most probably not going to be an issue since the replica is more powerful and handle way less tasks :]

On the replica:

fatal: gerrit: query: not found

Filled as T235251.

Meanwhile, I guess you can point back to the master gerrit.wikimedia.org and we will monitor its behavior.

Is this still a issue? We have a more powerful machine with more ram/cpu power now!

The account had been deactivated as part of the migration to a new server since we have restaured an old state of the account :-\ That got fixed though on Oct 24 2019 sometime after 22:00.

The bot is not actively crawling right now though. So lets keep this task open until the bot is back in action :]

The bot is less active in the recent days. We have upgraded the Gerrit server meanwhile and have lot more memory allocated to the JVM.

Lets just mark this as solved. If that turns out to be a concern again, we can just reopen it.