Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster)
Closed, Resolved, Public

Description

Gerrit seems to be approaching the limit of its current hardware.

CPU usage seems OK: it is rarely over 50%, and when it is, I think Gerrit is usually trying to garbage-collect memory rapidly while handling queued requests.

Disk size seems fine as well: there is plenty of room and I rarely see any IO wait.

Memory is where the problem is for that machine.

At some point, the size of our repos exceeded the size of the heap we're able to allocate. We currently have roughly 32GB of git repos on disk and 32GB of RAM in the machine (20GB heap).

I think ideally we'd be able to fit all of the caches + indexes + a good portion of our git repos into memory at the same time as all the other gerrit objects. Caches don't seem to consume a large amount of memory (<1 GB; 1.1GB persisted to disk); we have 4GB set aside for packfiles exclusively (would be nice to up this if we had space); indexes are 2GB.

We run at a 95th percentile of 18GB of RAM in use. The G1GC we're using keeps 10% headroom before triggering garbage collection (which explains the 18GB). We do, however, still occasionally hit 20GB of RAM in use, in spite of running a weekly git gc.
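The arithmetic behind those figures can be checked with a short sketch. It only restates numbers from the description above; the function and variable names are made up for illustration, and caches are counted at their stated upper bound of 1GB.

```python
# Figures quoted in the task description (GB).
HEAP_GB = 20
GC_RESERVE_PERCENT = 10  # share of the heap G1GC is said to keep free

# Known heap consumers from the description; caches are "<1 GB", so 1 is an upper bound.
KNOWN_CONSUMERS_GB = {"caches": 1, "packfiles": 4, "indexes": 2}


def gc_trigger_gb(heap_gb: int, reserve_percent: int) -> float:
    """Heap usage at which collection starts if reserve_percent is kept free."""
    return heap_gb * (100 - reserve_percent) / 100


def remaining_gb(heap_gb: int, consumers: dict) -> int:
    """Heap left for everything else once the known consumers are subtracted."""
    return heap_gb - sum(consumers.values())


print(gc_trigger_gb(HEAP_GB, GC_RESERVE_PERCENT))  # 18.0
print(remaining_gb(HEAP_GB, KNOWN_CONSUMERS_GB))   # 13
```

With a 20GB heap, that leaves about 13GB for git objects and everything else Gerrit keeps in memory, against roughly 32GB of repos on disk.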

I think it would be good to double the amount of RAM for Gerrit to 64GB.

On the day:

  • Rsync /srv/gerrit/git/, /srv/gerrit/plugins and /var/lib/gerrit2/review_site/ from cobalt to gerrit1001
  • Stop gerrit && disable puppet on gerrit1001
  • Merge mariadb::ferm_misc: allow connections from gerrit1001
  • Stop puppet on cobalt + gerrit2001
  • Merge Gerrit: Switch master from cobalt to gerrit1001
  • Merge Switch gerrit.wikimedia.org backend to gerrit1001
  • Stop gerrit on cobalt
  • Repeat the rsync commands above (rsync /var/lib/gerrit2/review_site to gerrit1001, and also rsync the lfs objects again)
  • Rename /var/lib/gerrit2/review_site/data/javamelody/r_cobalt to /var/lib/gerrit2/review_site/data/javamelody/r_gerrit1001 on gerrit1001.
  • Run puppet on gerrit1001 + cobalt
  • Start gerrit on gerrit1001
  • Hack DNS authdns-update to clone from gerrit-replica temporarily, deploy DNS change
  • Manually copy apache2 site config for gerrit.wm.org with scp from cobalt to gerrit1001, restart apache
  • Manually run command from list_mediawiki_extensions cron to create /var/www/mediawiki-extensions.txt
  • Run the online reindexer
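
The checklist syncs the data twice: once while Gerrit is still running on cobalt, and again after it has been stopped, so the second pass only has to transfer what changed in between. A minimal dry-run sketch of that ordering follows. The rsync arguments are the ones quoted in this task's timeline; the `systemctl stop gerrit` step is an assumption about how the service is stopped, and review_site is left out because its rsync target is not shown in the task.

```python
from typing import List

# Taken from the rsync commands quoted in this task's timeline.
RSYNC_MODULE = "rsync://gerrit1001.wikimedia.org/gerrit-data"
DATA_DIRS = ["git", "plugins"]

# Assumption: the service is stopped via systemd under the unit name "gerrit".
STOP_GERRIT = ["systemctl", "stop", "gerrit"]


def rsync_command(subdir: str) -> List[str]:
    """Build the rsync argv for one directory under /srv/gerrit."""
    return ["rsync", "-avp", f"/srv/gerrit/{subdir}/", f"{RSYNC_MODULE}/{subdir}/"]


def cutover_plan() -> List[List[str]]:
    """Warm sync while Gerrit runs, stop it, then sync again to pick up the delta."""
    warm_sync = [rsync_command(d) for d in DATA_DIRS]
    final_sync = [rsync_command(d) for d in DATA_DIRS]
    return warm_sync + [STOP_GERRIT] + final_sync


for argv in cutover_plan():
    print(" ".join(argv))  # dry run: print only, execute nothing
```

Nothing is executed here; the loop only prints the commands in the order they would be run on cobalt.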

Topic branch (open and merged) https://gerrit.wikimedia.org/r/q/topic:%22gerrit1001%22+(status:open%20OR%20status:merged)


Migration date is October 21st 2019.

https://lists.wikimedia.org/pipermail/wikitech-l/2019-October/092664.html


docs: https://wikitech.wikimedia.org/wiki/Gerrit#Migrating

Details

Due Date
Oct 21 2019, 7:00 PM
Repo               Branch      Lines +/-
operations/puppet  production  +11 -2
operations/puppet  production  +13 -6
operations/puppet  production  +1 -1
operations/puppet  production  +2 -4
operations/puppet  production  +1 -1
operations/puppet  production  +18 -12
operations/puppet  production  +2 -6
operations/puppet  production  +2 -2
operations/dns     master      +5 -4
operations/dns     master      +2 -2
operations/puppet  production  +2 -7
operations/puppet  production  +1 -0
operations/puppet  production  +1 -2
operations/puppet  production  +5 -4
operations/puppet  production  +3 -0
operations/puppet  production  +7 -3
operations/puppet  production  +33 -1
operations/puppet  production  +4 -2
operations/dns     master      +4 -0
operations/puppet  production  +1 -6
operations/puppet  production  +4 -4
operations/puppet  production  +1 -0
operations/puppet  production  +1 -1
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0

Event Timeline

Dzahn raised the priority of this task from Medium to High. (Oct 1 2019, 10:44 PM)
Dzahn closed subtask T231046: setup/install gerrit1001 as Resolved.

Change 540240 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit::migration: let gerrit-root users ssh to new gerrit servers

https://gerrit.wikimedia.org/r/540240

Change 540240 merged by Dzahn:
[operations/puppet@production] gerrit::migration: let gerrit-root users ssh to new gerrit servers

https://gerrit.wikimedia.org/r/540240

@thcipriani You and the other members of the gerrit-roots admin group can now ssh to gerrit1001.wikimedia.org.

Also, I added the rsync setup and synced /srv/gerrit/git from cobalt to gerrit1001.

Change 540237 merged by Dzahn:
[operations/puppet@production] gerrit::migration: add firewall hole for rsync over IPv6

https://gerrit.wikimedia.org/r/540237

> @thcipriani You and the other members of the gerrit-roots admin group can now ssh to gerrit1001.wikimedia.org.
>
> Also, I added the rsync setup and synced /srv/gerrit/git from cobalt to gerrit1001.

\o/

Confirmed that I can ssh in and I can see /srv/gerrit/git

Change 540244 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit::migration: add base::firewall, rm superfluous ipresolve line

https://gerrit.wikimedia.org/r/540244

Mentioned in SAL (#wikimedia-operations) [2019-10-01T23:21:28Z] <mutante> gerrit1001 - chown -R gerrit2:gerrit2 /srv/gerrit/git/ (T222391)

Change 540244 merged by Dzahn:
[operations/puppet@production] gerrit::migration: add base::firewall, rm superfluous ipresolve line

https://gerrit.wikimedia.org/r/540244

Mentioned in SAL (#wikimedia-operations) [2019-10-01T23:28:46Z] <mutante> cobalt (gerrit) rsyncing /srv/gerrit/plugins dir, push to new server gerrit1001 (T222391)

> Confirmed that I can ssh in and I can see /srv/gerrit/git

The rsync / firewall setup is fixed now. The following commands work to sync the ./git and ./plugins dirs (I also synced the plugins dir):

dzahn@cobalt:/srv/gerrit/git$ rsync -avp /srv/gerrit/git/ rsync://gerrit1001.wikimedia.org/gerrit-data/git/

dzahn@cobalt:/srv/gerrit/git$ rsync -avp /srv/gerrit/plugins/ rsync://gerrit1001.wikimedia.org/gerrit-data/plugins/

I still had to run

chown -R gerrit2:gerrit2 /srv/gerrit/git/
chown -R gerrit2:gerrit2 /srv/gerrit/plugins/

afterwards because, as usual, we ran into the issue that the UID of the gerrit2 user is not the same on both hosts. I'll fix that, though.
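
A mismatch like this can be spotted before syncing by comparing the numeric IDs on both hosts. The sketch below parses passwd(5)-formatted text and compares the gerrit2 entries. The sample lines are hypothetical (home directory and shell are placeholders); the IDs are the ones that appear in this task's log, where 114/119 is recorded for cobalt and 499/1001 is assumed to be the pre-fix pair on gerrit1001, based on the logged find commands.

```python
from typing import NamedTuple, Optional


class Ids(NamedTuple):
    uid: int
    gid: int


def ids_from_passwd(passwd_text: str, user: str) -> Optional[Ids]:
    """Return the numeric UID/GID for `user` from passwd(5)-formatted text."""
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if len(fields) >= 4 and fields[0] == user:
            return Ids(uid=int(fields[2]), gid=int(fields[3]))
    return None


# Hypothetical passwd entries: the home directory and shell are placeholders.
source_passwd = "gerrit2:x:114:119::/var/lib/gerrit2:/bin/bash\n"
target_passwd = "gerrit2:x:499:1001::/var/lib/gerrit2:/bin/bash\n"

src = ids_from_passwd(source_passwd, "gerrit2")
dst = ids_from_passwd(target_passwd, "gerrit2")
if src != dst:
    print(f"gerrit2 IDs differ: source={src}, target={dst}")
```

If the IDs match on both hosts, files synced with preserved numeric ownership should not need a chown afterwards.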

Change 540460 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: reinstall gerrit1001 with buster

https://gerrit.wikimedia.org/r/540460

Change 540460 merged by Dzahn:
[operations/puppet@production] install_server: reinstall gerrit1001 with buster

https://gerrit.wikimedia.org/r/540460

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

gerrit1001.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/201910021921_dzahn_70793_gerrit1001_wikimedia_org.log.

Completed auto-reimage of hosts:

['gerrit1001.wikimedia.org']

Of which those FAILED:

['gerrit1001.wikimedia.org']

Mentioned in SAL (#wikimedia-operations) [2019-10-02T21:03:56Z] <mutante> gerrit1001 changing UID of gerrit2 user to 114 and GID to 119 in /etc/passwd to match cobalt to avoid privilege issues after rsyncing data (T222391)

Mentioned in SAL (#wikimedia-operations) [2019-10-02T21:08:29Z] <mutante> gerrit1001 changing GID of gerrit2 user to 119 in /etc/group ; find / -uid 499 -exec chown gerrit2 {} \; find / -gid 1001 -exec chown gerrit2:gerrit2 {} \; (T222391)
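
Before running a filesystem-wide chown like the one logged above, it can help to list what would be affected. The following is a rough Python equivalent of `find <root> -uid <uid>`; the function name is made up for illustration, and it only reports paths without changing ownership.

```python
import os
from typing import Iterator


def paths_owned_by(root: str, uid: int) -> Iterator[str]:
    """Yield every path under `root` (including root itself) whose owner is `uid`."""
    if os.lstat(root).st_uid == uid:
        yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are reported themselves, not their targets
            if os.lstat(path).st_uid == uid:
                yield path
```

For example, `list(paths_owned_by("/srv/gerrit", 499))` would list everything under /srv/gerrit still owned by the old UID.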

Completed auto-reimage of hosts:

['gerrit1001.wikimedia.org']

Of which those FAILED:

['gerrit1001.wikimedia.org']

The reason it failed is "RuntimeError: Failed to reboot_host". It is unknown why, so I am just doing a manual reboot.

Puppet run finished.

Mentioned in SAL (#wikimedia-operations) [2019-10-02T21:17:06Z] <mutante> cobalt (gerrit) rsyncing /srv/gerrit/git and /srv/gerrit/plugins data to gerrit1001 again after reinstall and fixing gerrit2 UID/GID (T222391)

Change 539204 merged by Dzahn:
[operations/puppet@production] gerrit: add role on gerrit1001 and remove gerrit::migration

https://gerrit.wikimedia.org/r/539204

Change 540514 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] fix IPv6 address for gerrit1001, wrong row

https://gerrit.wikimedia.org/r/540514

Change 540514 merged by Dzahn:
[operations/dns@master] fix IPv6 address for gerrit1001, wrong row

https://gerrit.wikimedia.org/r/540514

Change 540703 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] gerrit: add role on gerrit1001 and remove gerrit::migration

https://gerrit.wikimedia.org/r/540703

Change 540707 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] move gerrit-new service IP to B network

https://gerrit.wikimedia.org/r/540707

Change 540710 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit1001: fix service IP to IP in B network

https://gerrit.wikimedia.org/r/540710

Change 540707 merged by Dzahn:
[operations/dns@master] move gerrit-new service IP to B network

https://gerrit.wikimedia.org/r/540707

Change 540710 merged by Dzahn:
[operations/puppet@production] gerrit1001: fix service IP to IP in B network

https://gerrit.wikimedia.org/r/540710

Change 540703 merged by Dzahn:
[operations/puppet@production] gerrit: add role on gerrit1001 and remove gerrit::migration

https://gerrit.wikimedia.org/r/540703

Mentioned in SAL (#wikimedia-operations) [2019-10-04T20:32:04Z] <mutante> gerrit1001 - scp /usr/share/java/mysql-connector-java.jar from cobalt into /usr/share/java/ on gerrit1001 and then symlink into /var/lib/gerrit2/review_site/lib/ (T222391)

Paladox updated the task description.
Dzahn set Due Date to Oct 21 2019, 7:00 PM. (Oct 10 2019, 8:11 PM)
Paladox updated the task description.

Change 545016 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: add gerrit migration role to gerrit1001

https://gerrit.wikimedia.org/r/545016

Change 545016 abandoned by Dzahn:
gerrit: add gerrit migration role to gerrit1001

Reason:
duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/545024

https://gerrit.wikimedia.org/r/545016

Mentioned in SAL (#wikimedia-operations) [2019-10-21T20:08:02Z] <mutante> rsynced /srv/gerrit/git from cobalt to gerrit1001 (T222391)

Mentioned in SAL (#wikimedia-operations) [2019-10-21T20:08:14Z] <mutante> rsynced /srv/gerrit/plugins from cobalt to gerrit1001 (T222391)

Mentioned in SAL (#wikimedia-operations) [2019-10-21T20:12:21Z] <mutante> rsyncing /var/lib/gerrit2/review_site from cobalt to gerrit1001 (T222391)

Change 535966 merged by Dzahn:
[operations/puppet@production] mariadb::ferm_misc: allow connections from gerrit1001

https://gerrit.wikimedia.org/r/535966

Mentioned in SAL (#wikimedia-operations) [2019-10-21T20:29:24Z] <mutante> running puppet on dbproxy10017 to apply ferm change for gerrit db from gerrit1001 (T222391)

Change 541110 had a related patch set uploaded (by Dzahn; owner: Paladox):
[operations/puppet@production] Gerrit: Switch master from cobalt to gerrit1001

https://gerrit.wikimedia.org/r/541110

Change 541110 merged by Dzahn:
[operations/puppet@production] Gerrit: Switch master from cobalt to gerrit1001

https://gerrit.wikimedia.org/r/541110

Mentioned in SAL (#wikimedia-operations) [2019-10-21T21:16:42Z] <cdanis> previous cumin invocation was to unblock gerrit migration; will be automatically restored to usual on next puppet run. T222391

Mentioned in SAL (#wikimedia-operations) [2019-10-21T21:21:30Z] <mutante> copied apache config for gerrit.wm.org site from cobalt to gerrit1001, restarted apache2, ran puppet again. gerrit back up (T222391)

Mentioned in SAL (#wikimedia-operations) [2019-10-21T21:27:53Z] <mutante> gerrit1001 manually running command from "list_mediawiki_extensions" cron (T222391)

Something to note/fix for future migrations: one option for pushing DNS changes while Gerrit is down is logged at https://tools.wmflabs.org/sal/log/AW3wKZvNfYQT6VcDX2pS

Change 535966 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mariadb::ferm_misc: allow connections from gerrit1001

https://gerrit.wikimedia.org/r/535966

Mentioned in SAL (#wikimedia-operations) [2019-10-21T23:11:42Z] <mutante> rsynced operations/puppet.git/objects from cobalt to gerrit1001 (and backup in /root) (T222391)

Mentioned in SAL (#wikimedia-operations) [2019-10-22T16:14:04Z] <thcipriani> stopping gerrit to run a fix for T222391

Change 545342 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: change gerrit master_host to gerrit1001, remove duplicate

https://gerrit.wikimedia.org/r/545342

This is mostly done and all boxes are checked.

I will only really close it, though, once T236114 is resolved, the old server is decommissioned (T236187), and we have tuned the config to make it use more RAM (T225166).

Change 545381 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: increase heap_size from 20G to 32G

https://gerrit.wikimedia.org/r/545381

Change 545381 had a related patch set uploaded (by Paladox; owner: Dzahn):
[operations/puppet@production] gerrit: increase heap_size from 20G to 32G

https://gerrit.wikimedia.org/r/545381

Change 545381 merged by Dzahn:
[operations/puppet@production] gerrit: increase heap_size from 20G to 32G

https://gerrit.wikimedia.org/r/545381

Mentioned in SAL (#wikimedia-operations) [2019-10-24T00:03:18Z] <mutante> restarting gerrit to increase heap_size from 20G to 32G (T225166 T222391)

Dzahn removed a subtask: Restricted Task.

Change 545342 merged by Dzahn:
[operations/puppet@production] gerrit: change gerrit master_host to gerrit1001, remove duplicate

https://gerrit.wikimedia.org/r/545342

Change 547619 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: allow rsync of home dirs for server migrations

https://gerrit.wikimedia.org/r/547619

Change 547619 merged by Dzahn:
[operations/puppet@production] gerrit: allow rsync of home dirs for server migrations

https://gerrit.wikimedia.org/r/547619

Dzahn changed the status of subtask T236187: decom cobalt from Stalled to Open. (Nov 1 2019, 6:12 PM)
Dzahn changed the status of subtask T236187: decom cobalt from Open to Stalled. (Nov 1 2019, 6:25 PM)
Dzahn changed the status of subtask T236187: decom cobalt from Stalled to Open. (Nov 5 2019, 11:49 PM)