
Gerrit VM to test data migration
Closed, Resolved · Public

Description

We need to test the migration of data from Gerrit schema 2.15 to Gerrit schema 2.16 using "real" data. Since the data is private, I can't do this on a labs machine. This task requests a Ganeti VM to be used for those tests. It will be reclaimed once the migration has been completed.

Site/Location: eqiad
Number of systems: 1
Service: Gerrit

Networking Requirements: Ability to easily copy data from gerrit1001, access to gerrit sql
Processor Requirements: 8 (migration is IO bound, but getting an understanding of timing on "production-like" hardware would be ideal)

Memory: 16G
Disks: 80G (32 GB worth of git data + overhead)
Other Requirements: None
Project Duration: 3 weeks (hopefully less)

Details

Related Gerrit Patches:
[operations/puppet@production] gerrit: allow multiple rsync destination hosts in migration class
[operations/puppet@production] ferm_misc/db: allow connections from gerrit1002 in ferm
[operations/puppet@production] gerrit: assign host gerrit1002 role::gerrit
[operations/puppet@production] gerrit: set gerrit host name and server list for gerrit1002/gerrit-test
[operations/puppet@production] acme_chief/gerrit: remove gerrit-new, add gerrit1002
[operations/puppet@production] install_server: update MAC address of gerrit1002
[operations/dns@master] add IPs for gerrit1002 in row C
[operations/puppet@production] site: replace gerrit-test with gerrit1002
[operations/puppet@production] install_server: rename gerrit-test to gerrit1002
[operations/puppet@production] install_server: add entries for gerrit-test
[operations/puppet@production] gerrit: use 'gerritro' readonly db user on test server
[operations/puppet@production] gerrit: make db_user configurable in Hiera
[operations/puppet@production] gerrit: adjust bacula backup behaviour to deal with multiple hosts
[operations/puppet@production] Initially assing spare role to gerrit-test.wikimedia.org
[operations/puppet@production] base/icinga: disable notifications and some monitoring for gerrit-test
[operations/dns@master] add IPv6 records for gerrit-test.wikimedia.org
[operations/dns@master] dns: move gerrit-test to public1-c-eqiad
[operations/dns@master] assign IPs for gerrit1002 and gerrit-test
[operations/puppet@production] installserver: add gerrit1002 with flat/VM partman recipe

Related Objects

Status    Subtype  Assigned    Task
Resolved           Dzahn
Resolved           Dzahn
Resolved           Dzahn
Open               None
Open               None
Resolved           Paladox
Open               None
Open               None
Open               None
Stalled            None
Open               None
Open               Paladox
Resolved           Paladox
Open               None
Open               None
Open               None
Stalled            None
Open               None
Open               thcipriani
Resolved           Dzahn

Event Timeline

Restricted Application added a project: Operations. Nov 25 2019, 6:59 PM
Restricted Application added a subscriber: Aklapper.
Paladox assigned this task to Dzahn. Nov 25 2019, 7:04 PM
hashar updated the task description. Nov 26 2019, 1:18 PM
hashar added a subscriber: hashar.

About "access to gerrit sql", would it be sufficient to do a database dump from production and load that in a MySQL server local to the test VM?

hashar triaged this task as Medium priority. Nov 26 2019, 1:19 PM
thcipriani updated the task description. Nov 27 2019, 6:41 PM

Lowered the memory request as that seems to be out-of-line with the usage of most Ganeti VMs, hopefully 16G would work for a Ganeti VM?

About "access to gerrit sql", would it be sufficient to do a database dump from production and load that in a MySQL server local to the test VM?

Would be sufficient; might be preferable. My understanding of 2.16 is that it moves all data out of the database into notedb.

Change 553437 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] assign IPs for gerrit1002 and gerrit-test

https://gerrit.wikimedia.org/r/553437

Dzahn added a comment. Nov 28 2019, 2:13 AM

Lowered the memory request as that seems to be out-of-line with the usage of most Ganeti VMs, hopefully 16G would work for a Ganeti VM?

I think that should work. While still on the larger side there is a precedent for that. And we know it's just a temporary thing.

sufficient to do a database dump from production and load that in a MySQL server local to the test VM?
Would be sufficient; might be preferable.

I agree, it actually sounds almost preferable because nothing can go wrong with the prod DB that way.

Change 553438 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: add gerrit1002 with flat/VM partman recipe

https://gerrit.wikimedia.org/r/553438

Change 553438 merged by Dzahn:
[operations/puppet@production] installserver: add gerrit1002 with flat/VM partman recipe

https://gerrit.wikimedia.org/r/553438

Change 553437 merged by Dzahn:
[operations/dns@master] assign IPs for gerrit1002 and gerrit-test

https://gerrit.wikimedia.org/r/553437

Dzahn added a comment. Dec 3 2019, 1:47 AM

Tried to create it but unfortunately:

Failure: prerequisites not met for this operation:
error type: insufficient_resources, error details:
Can't compute nodes using iallocator 'hail': Request failed: Group row_A (preferred): No valid allocation solutions, failure reasons: FailMem: 12

Tried to create it but unfortunately:

Failure: prerequisites not met for this operation:
error type: insufficient_resources, error details:
Can't compute nodes using iallocator 'hail': Request failed: Group row_A (preferred): No valid allocation solutions, failure reasons: FailMem: 12

Not that I know how to parse that error message, but it sounds like 16GB is still too much RAM?

The way I understand the message: the virtualization servers in group row_A lack free memory to allocate a VM. But maybe another group would have memory available?

You should be able to list the group and free memory with:

sudo gnt-node list -o name,group,mfree

You can try another group, else bring down memory ? :)

The old puppetdb hosts (puppetdb1001) should be ready to go away, @jbond merged the patches to stop broadcasting to it last week. It also has 16G RAM, so those would be freed up when that's done.

Or simply use one of the spare baremetal hosts temporarily?

Dzahn added a comment. Dec 4 2019, 10:51 PM

The way I understand the message: the virtualization servers in group row_A lack free memory to allocate a VM. But maybe another group would have memory available?
You should be able to list the group and free memory with:

sudo gnt-node list -o name,group,mfree

You can try another group, else bring down memory ? :)

There are only 2 groups, A and C. The lowest number on C is even lower than the lowest number on A.

Node                   Group MFree
ganeti1001.eqiad.wmnet row_C 27.2G
ganeti1002.eqiad.wmnet row_C 34.3G
ganeti1003.eqiad.wmnet row_C 19.5G
ganeti1004.eqiad.wmnet row_C 12.9G
ganeti1005.eqiad.wmnet row_A 22.5G
ganeti1006.eqiad.wmnet row_A 19.2G
ganeti1007.eqiad.wmnet row_A 19.2G
ganeti1008.eqiad.wmnet row_A 25.3G
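
The comparison between the two groups can be checked mechanically. The following is a minimal Python sketch, assuming the `gnt-node list -o name,group,mfree` output is available as text in exactly the layout pasted above; the helper names `parse_mfree` and `candidates` are hypothetical. It only compares raw free memory per node and does not model whatever additional constraints the `hail` allocator applies.

```python
from collections import defaultdict

# Output of `sudo gnt-node list -o name,group,mfree` as pasted in this task.
GNT_NODE_LIST = """\
Node                   Group MFree
ganeti1001.eqiad.wmnet row_C 27.2G
ganeti1002.eqiad.wmnet row_C 34.3G
ganeti1003.eqiad.wmnet row_C 19.5G
ganeti1004.eqiad.wmnet row_C 12.9G
ganeti1005.eqiad.wmnet row_A 22.5G
ganeti1006.eqiad.wmnet row_A 19.2G
ganeti1007.eqiad.wmnet row_A 19.2G
ganeti1008.eqiad.wmnet row_A 25.3G
"""


def parse_mfree(text):
    """Return {group: {node: free memory in G}} from the listing."""
    groups = defaultdict(dict)
    for line in text.strip().splitlines()[1:]:  # skip the header row
        node, group, mfree = line.split()
        # Assumption: every value carries a 'G' suffix, as in the sample.
        groups[group][node] = float(mfree.rstrip("G"))
    return dict(groups)


def candidates(groups, requested):
    """Per group, the nodes whose free memory alone covers the request."""
    return {
        group: sorted(n for n, free in nodes.items() if free >= requested)
        for group, nodes in groups.items()
    }


groups = parse_mfree(GNT_NODE_LIST)
# Lowest free-memory figure in each group.
lowest = {group: min(nodes.values()) for group, nodes in groups.items()}
print(lowest)
print(candidates(groups, 16))
```

The `lowest` mapping is the figure the comment refers to: the smallest free-memory value in each group.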
Dzahn added a comment. Dec 4 2019, 10:55 PM

The old puppetdb hosts (puppetdb1001) should be ready to go away, @jbond merged the patches to stop broadcasting to it last week. It also has 16G RAM, so those would be freed up when that's done.

That would be great.

Or simply use one of the spare baremetal hosts temporarily?

Unfortunately not that simple. It leads to hardware request tickets, discussion and waiting for approval to use them, even if temporary.
We could possibly ask to use phab1001, though, after the switch to phab1003 is completed.

Dzahn added a comment. Dec 5 2019, 6:29 PM

Just a thought.. If the 2 machines for contint are granted (T239880) first and are taken from available spare pool.. then possibly one of them could work as temp Gerrit test machine and then become contint?

Dzahn removed Dzahn as the assignee of this task. Dec 5 2019, 6:31 PM

I'm unassigning myself for now because I'll be on vacation and don't want this to be blocked on me for no reason. If anyone else can push this forward in the meantime, one way or the other, please do. Otherwise I'll take it back later.

Just a thought.. If the 2 machines for contint are granted (T239880) first and are taken from available spare pool.. then possibly one of them could work as temp Gerrit test machine and then become contint?

Would be great were it possible. It would give us a better idea of what production migration timing might be on realistic hardware.

Looks like we have a few options available:

  • Use some spare hardware that is already available
  • Setup a ganeti VM (also fine from my perspective)

Adding @herron to unstick us here since they're on clinic duty this week :)

Change 562564 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] dns: move gerrit-test to public1-c-eqiad

https://gerrit.wikimedia.org/r/562564

Change 562575 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] install_server: add entries for gerrit-test

https://gerrit.wikimedia.org/r/562575

Change 562564 merged by Herron:
[operations/dns@master] dns: move gerrit-test to public1-c-eqiad

https://gerrit.wikimedia.org/r/562564

Change 562575 merged by Herron:
[operations/puppet@production] install_server: add entries for gerrit-test

https://gerrit.wikimedia.org/r/562575

Change 562587 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] gerrit: assign host gerrit-test role::gerrit

https://gerrit.wikimedia.org/r/562587

herron added a comment. Tue, Jan 7, 7:22 PM

The gerrit-test.wikimedia.org VM has been created on row_C, and I've uploaded a patch to assign it role::gerrit: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/562587/

Change 562619 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] base/icinga: disable notifications and some monitoring for gerrit-test

https://gerrit.wikimedia.org/r/562619

Change 562622 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add IPv6 records for gerrit-test.wikimedia.org

https://gerrit.wikimedia.org/r/562622

Change 562622 merged by Dzahn:
[operations/dns@master] add IPv6 records for gerrit-test.wikimedia.org

https://gerrit.wikimedia.org/r/562622

Change 562619 merged by Dzahn:
[operations/puppet@production] base/icinga: disable notifications and some monitoring for gerrit-test

https://gerrit.wikimedia.org/r/562619

Change 562639 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: adjust bacula backup behaviour to deal with multiple hosts

https://gerrit.wikimedia.org/r/562639

Change 562790 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Initially assing spare role to gerrit-test.wikimedia.org

https://gerrit.wikimedia.org/r/562790

Change 562790 merged by Muehlenhoff:
[operations/puppet@production] Initially assing spare role to gerrit-test.wikimedia.org

https://gerrit.wikimedia.org/r/562790

Change 562639 merged by Dzahn:
[operations/puppet@production] gerrit: adjust bacula backup behaviour to deal with multiple hosts

https://gerrit.wikimedia.org/r/562639

Change 562965 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ferm_misc/db: allow connections from gerrit-test in ferm

https://gerrit.wikimedia.org/r/562965

Change 563284 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: make db_user configurable in Hiera

https://gerrit.wikimedia.org/r/563284

Change 563284 merged by Dzahn:
[operations/puppet@production] gerrit: make db_user configurable in Hiera

https://gerrit.wikimedia.org/r/563284

Change 563302 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: use 'gerritro' readonly db user on test server

https://gerrit.wikimedia.org/r/563302

Change 563302 merged by Dzahn:
[operations/puppet@production] gerrit: use 'gerritro' readonly db user on test server

https://gerrit.wikimedia.org/r/563302

Change 565392 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: rename gerrit-test to gerrit1002

https://gerrit.wikimedia.org/r/565392

Change 565392 merged by Dzahn:
[operations/puppet@production] install_server: rename gerrit-test to gerrit1002

https://gerrit.wikimedia.org/r/565392

Change 565395 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: replace gerrit-test with gerrit1002

https://gerrit.wikimedia.org/r/565395

Change 565395 merged by Dzahn:
[operations/puppet@production] site: replace gerrit-test with gerrit1002

https://gerrit.wikimedia.org/r/565395

Change 565399 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add IPs for gerrit1002 in row C

https://gerrit.wikimedia.org/r/565399

Mentioned in SAL (#wikimedia-operations) [2020-01-16T22:38:41Z] <mutante> ganeti1003 - deleting VM gerrit-test (T239151)

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: gerrit-test.wikimedia.org

  • gerrit-test.wikimedia.org (FAIL)
    • Downtimed host on Icinga
    • No management interface found (likely a VM)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Failed to shutdown, manual intervention required: Cumin execution failed (exit_code=2)
    • Set Netbox status on VM not yet supported: manual intervention required
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Change 565399 merged by Dzahn:
[operations/dns@master] add IPs for gerrit1002 in row C

https://gerrit.wikimedia.org/r/565399

Dzahn added a comment. Fri, Jan 17, 8:35 PM

IP situation fixed!

server:

gerrit1002.wikimedia.org has address 208.80.154.75
gerrit1002.wikimedia.org has IPv6 address 2620:0:861:3:208:80:154:75

service:

gerrit-test.wikimedia.org has address 208.80.154.78
gerrit-test.wikimedia.org has IPv6 address 2620:0:861:3:208:80:154:78
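
In both records above, the last four groups of the IPv6 address spell out the decimal octets of the IPv4 address. The following is a minimal Python sketch that checks this pattern against the two pairs quoted in this comment; the helper name `embed_ipv4` is hypothetical, and the prefix `2620:0:861:3` is simply read off the records rather than taken from any allocation policy.

```python
import ipaddress

# Address pairs copied from the DNS records quoted in this comment.
RECORDS = {
    "gerrit1002.wikimedia.org": ("208.80.154.75", "2620:0:861:3:208:80:154:75"),
    "gerrit-test.wikimedia.org": ("208.80.154.78", "2620:0:861:3:208:80:154:78"),
}

PREFIX = "2620:0:861:3"  # first four groups, as seen in the records


def embed_ipv4(prefix, ipv4):
    """Build an IPv6 address whose last four groups spell the IPv4 octets."""
    ipaddress.IPv4Address(ipv4)  # raises ValueError on malformed input
    # The decimal octets are written verbatim into the hex groups, so "208"
    # becomes the group 0x0208 rather than the numeric value 208.
    return ipaddress.IPv6Address(prefix + ":" + ipv4.replace(".", ":"))


for name, (v4, v6) in RECORDS.items():
    derived = embed_ipv4(PREFIX, v4)
    # Compare as address objects so textual formatting does not matter.
    print(name, derived, derived == ipaddress.IPv6Address(v6))
```

Because the octets are copied as text rather than converted to hexadecimal, the mapping is easy to read by eye, at the cost of not being a numeric encoding of the IPv4 address.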

Dzahn added a comment. Fri, Jan 17, 8:54 PM

recreating VM as gerrit1002 so that we can use gerrit-test as service name:

Creating new VM named gerrit1002.wikimedia.org in eqiad with row=C vcpu=1 memory=16 gigabytes disk=80 gigabytes link=public

Change 565708 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: update MAC address of gerrit1002

https://gerrit.wikimedia.org/r/565708

Change 565708 merged by Dzahn:
[operations/puppet@production] install_server: update MAC address of gerrit1002

https://gerrit.wikimedia.org/r/565708

Change 565715 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: set gerrit host name and server list for gerrit1002/gerrit-test

https://gerrit.wikimedia.org/r/565715

Change 565716 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] acme_chief/gerrit: remove gerrit-new, add gerrit1002

https://gerrit.wikimedia.org/r/565716

Change 565716 merged by Dzahn:
[operations/puppet@production] acme_chief/gerrit: remove gerrit-new, add gerrit1002

https://gerrit.wikimedia.org/r/565716

Change 565715 merged by Dzahn:
[operations/puppet@production] gerrit: set gerrit host name and server list for gerrit1002/gerrit-test

https://gerrit.wikimedia.org/r/565715

Change 562587 merged by Dzahn:
[operations/puppet@production] gerrit: assign host gerrit1002 role::gerrit

https://gerrit.wikimedia.org/r/562587

Dzahn added a comment. Tue, Jan 21, 8:29 PM

The VM is now usable. It has the role(gerrit) on it and no more puppet errors. It uses its own service name/IP:

https://gerrit-test.wikimedia.org

Shell access is automatically granted by the role to the same people who have it on the prod server.

Monitoring and backups should be disabled.

Gerrit is configured to only know about itself and not the other Gerrit servers.

There are 63G free on / including /srv

Dzahn added a comment. Tue, Jan 21, 8:31 PM

The mysql user has also been made configurable (along with backups / monitoring) and it is using:

    hostname = m2-master.eqiad.wmnet
    database = reviewdb
    username = gerritro

Note the 'gerritro' read-only user.

Dzahn added a comment. Tue, Jan 21, 8:33 PM

    heapLimit = 5g
    slave = false

    canonicalWebUrl = https://gerrit-test.wikimedia.org/r

[sshd]
    listenAddress = gerrit-test.wikimedia.org:29418

    listenAddress = [2620:0:861:3:208:80:154:78]:29418
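
The `[sshd]` section sets `listenAddress` twice, once for the service name and once for the IPv6 address. The following is a minimal Python sketch of reading such a fragment while keeping every repeated key; the function name `parse_config` is hypothetical, and the parser is an assumption-laden simplification that handles only plain `[section]` headers and `key = value` lines, not the full git-config syntax that Gerrit accepts.

```python
from collections import defaultdict

# Fragment copied from the comment above, without its line-number prefixes.
FRAGMENT = """\
[sshd]
    listenAddress = gerrit-test.wikimedia.org:29418

    listenAddress = [2620:0:861:3:208:80:154:78]:29418
"""


def parse_config(text):
    """Parse '[section]' headers and 'key = value' lines, keeping repeats."""
    config = defaultdict(lambda: defaultdict(list))
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            # Simplification: subsections such as [remote "x"] are not split.
            section = line[1:-1]
            continue
        key, _, value = line.partition("=")
        # Append rather than overwrite so that repeated keys survive.
        config[section][key.strip()].append(value.strip())
    return config


for address in parse_config(FRAGMENT)["sshd"]["listenAddress"]:
    print(address)
```

Values are collected in a list per key because Python's standard `configparser` either rejects a repeated option (strict mode) or keeps only the last one, which would hide one of the two listen addresses.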

Dzahn closed this task as Resolved. Tue, Jan 21, 8:34 PM
Dzahn claimed this task.
Dzahn added a comment. Tue, Jan 21, 8:41 PM

The gerrit acmechief TLS cert has been updated to contain "gerrit-test" in addition to gerrit and gerrit-replica. The "gerrit-new" name has been removed from it. This affected all Gerrit servers, including prod gerrit1001 which has the new cert now.

Change 562965 merged by Dzahn:
[operations/puppet@production] ferm_misc/db: allow connections from gerrit1002 in ferm

https://gerrit.wikimedia.org/r/562965

Change 566367 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: allow multiple rsync destination hosts in migration class

https://gerrit.wikimedia.org/r/566367

Change 566367 merged by Dzahn:
[operations/puppet@production] gerrit: allow multiple rsync destination hosts in migration class

https://gerrit.wikimedia.org/r/566367

Mentioned in SAL (#wikimedia-releng) [2020-01-21T21:48:46Z] <mutante> gerrit - rsyncing git data from gerrit1001 to gerrit1002 (T239151)

Mentioned in SAL (#wikimedia-releng) [2020-01-21T22:09:48Z] <mutante> gerrit - rsyncing 'git' and 'plugin' data dirs and /var/lib/gerrit2/review_site/ from gerrit1001 to gerrit1002 WITH --delete T239151