Bring up gerrit2002
Closed, Resolved | Public | 8 Estimated Story Points

Description

(Authored by @hashar)

The replica on gerrit2001 is a production service: it serves https://gerrit-replica.wikimedia.org/ and a few services rely on it (see https://wikitech.wikimedia.org/wiki/Gerrit/Replica). I think we started relying on it when Gerrit was running out of heap, and it nicely offloaded the primary.

Currently we have:

gerrit1001 (primary)
   |
   v
gerrit2001 (replica)

Both hosts have the Puppet role gerrit; the replica configuration is applied based on variables such as gerrit::is_replica.

We will want to add gerrit2002 as a replica. It will need the gerrit role and a few hiera settings to make it a replica. Gerrit replication destinations are configured in

hieradata/role/common/gerrit.yaml
profile::gerrit::replication:
    github:
<snip>
    replica_codfw:
        url: 'gerrit2@gerrit2001.wikimedia.org:/srv/gerrit/git/${name}.git'
        mirror: true
        replicateProjectDeletions: true
        replicateHiddenProjects: true
        defaultForceUpdate: true
        threads: 4
        replicationDelay: 5
        rescheduleDelay: 5

The defined remote replica_codfw can take multiple URLs (see https://gerrit.wikimedia.org/r/plugins/replication/Documentation/config.md), so we can probably do:

profile::gerrit::replication:
    replica_codfw:
        url:
            - 'gerrit2@gerrit2001.wikimedia.org:/srv/gerrit/git/${name}.git'
            - 'gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/${name}.git'

Or we'll create another target, which might make it easier to follow the replication to the different hosts. Notably, https://grafana.wikimedia.org/d/RFLS1GsWk/replication-upstream is what I use to track replication, and it's on a per-remote basis rather than a per-URL one. So we're probably better off copy-pasting :\
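The "copy paste" approach can be sketched as cloning the existing remote under a new name, so that each replica host gets its own remote and therefore its own series on a per-remote dashboard. This is a minimal Python sketch operating on the hiera data as a plain dictionary; the helper name add_replica_remote is hypothetical, the remote name replica_codfw2 is the one used later in this task, and only a subset of the remote's settings is shown.

```python
import copy


def add_replica_remote(replication, source_remote, new_remote, old_host, new_host):
    """Return a copy of `replication` with `new_remote` cloned from `source_remote`.

    Hypothetical helper: it only rewrites the host part of the URL and leaves
    every other setting of the cloned remote untouched.
    """
    if new_remote in replication:
        raise ValueError(f"remote {new_remote!r} already exists")
    updated = copy.deepcopy(replication)
    remote = copy.deepcopy(replication[source_remote])
    remote["url"] = remote["url"].replace(old_host, new_host)
    updated[new_remote] = remote
    return updated


# Subset of the settings shown in hieradata/role/common/gerrit.yaml.
replication = {
    "replica_codfw": {
        "url": "gerrit2@gerrit2001.wikimedia.org:/srv/gerrit/git/${name}.git",
        "mirror": True,
        "threads": 4,
    },
}

updated = add_replica_remote(
    replication,
    "replica_codfw",
    "replica_codfw2",
    "gerrit2001.wikimedia.org",
    "gerrit2002.wikimedia.org",
)
print(updated["replica_codfw2"]["url"])
```

Keeping one remote per host costs some duplicated settings, but removing the old replica later becomes a matter of deleting one remote.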

Then we have:

gerrit1001 (primary)
   |        \________________
   |                         \
   v                         v
gerrit2001 (replica)      gerrit2002 (replica)

Once the replication to gerrit2002 has completed (I think it might take 4 to 5 hours), we can switch the DNS entry for gerrit-replica.wikimedia.org from gerrit2001 to gerrit2002.

After the switch, gerrit2001 will probably still receive requests (this can be checked via /var/log/apache2/gerrit.wikimedia.org.https.access.log). There might be a long-running daemon pulling from it that has the resolved IP cached.
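To spot clients that keep hitting the old host after the switch, the access log can be summarised per client. The following is a rough Python sketch; it assumes the client address is the first whitespace-separated field of each line (as in the common Apache log formats), which may not match the log format actually configured on the Gerrit hosts. The sample lines are invented and use documentation-reserved addresses.

```python
from collections import Counter


def count_clients(log_lines):
    """Count requests per client, assuming the client address is the first field."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


# Invented sample lines, not taken from a real log.
sample = [
    '192.0.2.10 - - [05/Aug/2022:00:20:01 +0000] "GET /r/mediawiki/core.git/info/refs HTTP/1.1" 200 512',
    '192.0.2.10 - - [05/Aug/2022:00:25:01 +0000] "GET /r/mediawiki/core.git/info/refs HTTP/1.1" 200 512',
    '198.51.100.7 - - [05/Aug/2022:00:26:13 +0000] "GET /r/operations/puppet.git/info/refs HTTP/1.1" 200 498',
]

top_clients = count_clients(sample).most_common()
for client, hits in top_clients:
    print(client, hits)
```

On the real host the lines would come from reading /var/log/apache2/gerrit.wikimedia.org.https.access.log instead of the inline sample.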

After that gerrit2001 can be decommissioned.

Acceptance criteria
(cribbed from @hashar's notes)

  • apply the puppet role gerrit with gerrit::is_replica: true to gerrit2002
  • add gerrit2002 as a replica in the primary gerrit server's config (on gerrit1001)
  • Create working https://gerrit-replica-new.wikimedia.org (requires running gerrit service / webserver / certificate from acme_chief)
  • Replication is complete on gerrit2002
  • Switch the DNS entry for gerrit-replica.wikimedia.org from gerrit2001 to gerrit2002
  • Stop requests on gerrit2001
  • shut down and fully decom gerrit2001
  • Stretch: create a test to ensure replication is complete/make an alert
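
For the stretch goal, one possible check is to compare the refs of a repository on the primary with the refs on the replica. The sketch below is an assumption about how such a test could look, not an existing check: it parses `git ls-remote` style output (one "<sha><TAB><ref>" pair per line) and reports refs that are missing or stale on the replica. The function names and the sample hashes are made up for illustration.

```python
def parse_ls_remote(output):
    """Parse `git ls-remote` style output into a {ref: sha} dictionary."""
    refs = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, ref = line.split(None, 1)
        refs[ref.strip()] = sha
    return refs


def replication_diff(primary_refs, replica_refs):
    """Return (missing, stale) ref names as seen from the replica."""
    missing = sorted(set(primary_refs) - set(replica_refs))
    stale = sorted(
        ref
        for ref, sha in primary_refs.items()
        if ref in replica_refs and replica_refs[ref] != sha
    )
    return missing, stale


# Made-up hashes standing in for the output collected from each host.
primary_out = "aaaa111\trefs/heads/master\nbbbb222\trefs/changes/01/1/1\ncccc333\trefs/meta/config\n"
replica_out = "aaaa111\trefs/heads/master\ndddd444\trefs/meta/config\n"

missing, stale = replication_diff(parse_ls_remote(primary_out), parse_ls_remote(replica_out))
print("missing:", missing)
print("stale:", stale)
```

An alert could fire when either list stays non-empty for longer than the expected replication delay.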

Event Timeline

Option 1 has 2 or 3 sub-options.

1a) - don't tell Gerrit anything specifically, only change DNS for gerrit-replica.wikimedia.org to point to the new IP / gerrit2002, wait, hopefully it just works

1b) - tell Gerrit specifically that gerrit-replica.wikimedia.org is now a new host (only applies if there are hardcoded IPs or the gerrit2001.codfw.wmnet names in config), do NOT create gerrit-replica-new.

1c) - create gerrit-replica-new and switch Gerrit to replicate only to that, but still have the option to fall back to gerrit-replica for a while. Eventually switch gerrit-replica-new to gerrit-replica and shut down gerrit-replica-old

2) - tell Gerrit to start replicating to 2 replicas instead of just one. Create gerrit-replica-new and let it replicate to that AND keep replicating to the old host. Then verify everything is fine. Then switch gerrit-replica-new to gerrit-replica. Then shut down gerrit-replica-old. Might need code changes to allow more than 1 replica (in puppet and in gerrit config)

This task follows up on the discussion we had during the Release-Engineering-Team (The Decommission Mission 💀) meeting yesterday. I kind of missed the point that the goal is to replace only the Gerrit replica; I assumed we had to replace both the primary and the replica. I thus proposed a migration with a few more steps to ensure we ended up with a working primary.

dancy renamed this task from Decide how to bring up Gerrit2002 to Bring up Gerrit2002. Jul 19 2022, 9:39 PM
dancy updated the task description.
dancy updated the task description.

Change 815395 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] add gerrit-replica-new.wikimedia.org, point to 208.80.153.109

https://gerrit.wikimedia.org/r/815395

we can switch the DNS entry for gerrit-replica.wikimedia.org from gerrit2001 to gerrit2002.

Unfortunately this is not a CNAME or service name that points to either gerrit2001 or gerrit2002 like it is with other services.

In this case the gerrit server actually has 2 IPs. So on gerrit2001:

root@gerrit2001:/home/dzahn# ip a s 
..
2: eno1: ...
    inet 208.80.153.106/27 brd 208.80.153.127 scope global eno1
    inet 208.80.153.107/32 scope global eno1

and on gerrit2002 just "208.80.153.102" right now.
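
Checking whether a host already carries the second address can be done by parsing the `ip a s` output shown above. This is a small Python sketch under the assumption that IPv4 addresses appear on lines of the form "inet <address>/<prefix> ..."; the function name is hypothetical and the sample is the gerrit2001 output quoted in this comment.

```python
import re

# Matches "inet 208.80.153.106/27 ..." but not "inet6 ..." lines.
INET_RE = re.compile(r"^\s*inet\s+(\d+\.\d+\.\d+\.\d+)/(\d+)\b")


def ipv4_addresses(ip_output):
    """Return a list of (address, prefix_length) tuples found in `ip a s` output."""
    found = []
    for line in ip_output.splitlines():
        match = INET_RE.match(line)
        if match:
            found.append((match.group(1), int(match.group(2))))
    return found


sample = """2: eno1: ...
    inet 208.80.153.106/27 brd 208.80.153.127 scope global eno1
    inet 208.80.153.107/32 scope global eno1
"""

addresses = ipv4_addresses(sample)
print(addresses)
print("has second address:", len(addresses) > 1)
```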

So what we will need is:

  • reserve another name, "gerrit-replica-new" or whatever in DNS (but we can't just do that in the DNS repo alone, we also need to add placeholders in netbox to make sure nothing else automatically gets that IP, and keep DNS repo and netbox in sync).
  • use puppet to add the second IP to the interface on gerrit2002

This happens with code such as:

interface::alias { 'gerrit server':
    ipv4 => $ipv4,
    ipv6 => $ipv6,
}

in modules/profile/manifests/gerrit.pp

If we just add the gerrit role on the new host, we actually create an IP address conflict and break things. But if we set the values correctly in hieradata/hosts/ by hostname, we can have different values on gerrit2001 and gerrit2002.

class profile::gerrit(
    Hash                              $ldap_config       = lookup('ldap', Hash, hash, {}),
    Stdlib::IP::Address::V4           $ipv4              = lookup('profile::gerrit::ipv4'),
    Optional[Stdlib::IP::Address::V6] $ipv6              = lookup('profile::gerrit::ipv6'),

We have to go through all settings in hieradata/role/common/gerrit.yaml and check, for each one, whether it is OK to apply it on gerrit2002 (i.e. it does not cause conflicts or issues with replication, the secondary IP, etc.) or not.

The values that have to change on gerrit2002 need to go into a new file, hieradata/hosts/gerrit2002.yaml. They will override the defaults in gerrit.yaml.
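
The override behaviour described above can be illustrated with plain dictionaries. This Python sketch is a simplification of a hiera lookup (host-level keys win over role-level keys, key by key) and is not how Puppet implements it. The key profile::gerrit::ipv4 comes from the class signature above; which address is the role-level default and which one goes into the host file is an assumption made for illustration.

```python
def effective_settings(role_defaults, host_overrides):
    """Host-level keys win over role-level defaults, key by key."""
    return {**role_defaults, **host_overrides}


def conflicting_ips(settings_by_host, key="profile::gerrit::ipv4"):
    """Return {ip: [hosts]} for every value of `key` configured on more than one host."""
    hosts_by_ip = {}
    for host, settings in settings_by_host.items():
        hosts_by_ip.setdefault(settings[key], []).append(host)
    return {ip: hosts for ip, hosts in hosts_by_ip.items() if len(hosts) > 1}


# Illustrative values only; the assignment of addresses to files is assumed.
role_defaults = {"profile::gerrit::ipv4": "208.80.153.107"}
host_overrides = {"gerrit2002": {"profile::gerrit::ipv4": "208.80.153.104"}}
hosts = ["gerrit2001", "gerrit2002"]

without_override = {h: effective_settings(role_defaults, {}) for h in hosts}
with_override = {h: effective_settings(role_defaults, host_overrides.get(h, {})) for h in hosts}

print(conflicting_ips(without_override))  # same address on both hosts
print(conflicting_ips(with_override))     # no conflict once the host file exists
```

The first call shows the conflict that applying the role alone would create; the second shows it disappearing once the host-specific value is present.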

Change 815396 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit role and hiera settings for replica to gerrit2002

https://gerrit.wikimedia.org/r/815396

Change 815397 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] acme_chief: add gerrit2002 to hosts allowed to fetch TLS certs

https://gerrit.wikimedia.org/r/815397

Change 815398 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit2002 to firewall rules for cluster support

https://gerrit.wikimedia.org/r/815398

Change 815400 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit2002 to puppetized known_hosts file

https://gerrit.wikimedia.org/r/815400

Change 815401 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add hiera data for a second replica

https://gerrit.wikimedia.org/r/815401

It will need the gerrit role and a few hiera settings to be set to make it a replica.

Yes, I agree with the overall plan but it's more complicated here as well. gerrit2001 shows up in these places:

  • hosts allowed to pull TLS certs for gerrit/gerrit-replica.wikimedia.org

https://gerrit.wikimedia.org/r/c/operations/puppet/+/815397

  • puppetized known_hosts file on gerrit master server:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/815400

  • add to firewall rules to allow ssh between gerrit servers for cluster support

https://gerrit.wikimedia.org/r/c/operations/puppet/+/815398

  • add hiera data for a second replica (have not added code yet to actually read it)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/815401

  • actually add the puppet role on gerrit2002, together with custom hiera settings based on host name, so that we don't break things by adding the same IP address twice on different hosts (and other stuff)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/815396/

P.S.:

give all of you shell access to gerrit2002 before the role is fully on it, if useful

https://gerrit.wikimedia.org/r/c/operations/puppet/+/815402

Change 815402 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] admin/gerrit: add gerrit shell admins on gerrit2002

https://gerrit.wikimedia.org/r/815402

Change 815402 merged by Vgutierrez:

[operations/puppet@production] admin/gerrit: add gerrit shell admins on gerrit2002

https://gerrit.wikimedia.org/r/815402

we can switch the DNS entry for gerrit-replica.wikimedia.org from gerrit2001 to gerrit2002.

Unfortunately this is not a CNAME or service name that points to either gerrit2001 or gerrit2002 like it is with other services.

In this case the gerrit server actually has 2 IPs. So on gerrit2001:

root@gerrit2001:/home/dzahn# ip a s 
..
2: eno1: ...
    inet 208.80.153.106/27 brd 208.80.153.127 scope global eno1
    inet 208.80.153.107/32 scope global eno1

and on gerrit2002 just "208.80.153.102" right now.

Sorry, I had a false assumption. I can never remember why we have two IP addresses on the Gerrit hosts. Either we wanted to be able to move the service IP for the primary from one host to the other, or it was to support ssh on port 22 for:

  • OpenSSH to reach the server
  • Gerrit internal ssh service (currently on port 29418)

Short of digging into why we have two public IPs, I guess the easiest option is to allocate a second public IP to the new gerrit2002 host, which I guess is what you plan to do given that you have mentioned Netbox. Once the old gerrit2001 is decommissioned, its two public IPs can be reclaimed.

Ahmon has a nice improvement suggestion for ssh_known_hosts at https://gerrit.wikimedia.org/r/c/operations/puppet/+/815400/ which I have replied to. I think it should be done in a later change, after we have finished the migration.

I have reviewed / +1 all the patches proposed yesterday.

We should later revisit why the Gerrit hosts require two public IPs. I cannot find the reason for that and would have to dig further later.

I believe the last step is to get a second IP, and after that let's go!

Some deployment notes which can probably be moved to the task description:

Once Gerrit is up on the new host and Prometheus has polled it, it should show up on the replication dashboard at https://grafana.wikimedia.org/d/RFLS1GsWk/replication-upstream (well hopefully).

When the Gerrit configuration on the primary has been applied to add the new replica_codfw2 configuration, the plugin has to be reloaded by an Administrator. Those are members of the LDAP group gerritadmin. The reload command is:

ssh -p 29418 gerrit.wikimedia.org gerrit plugin reload replication

I am hoping the plugin reload would cause a full replication to start toward the new replica.

The replication progress can be watched through the above dashboard or on the primary from the logs:

ssh gerrit1001.wikimedia.org tail -F /var/log/gerrit/replication_log

If the replication did not start after the plugin reload, the easiest fix is to restart the Gerrit primary (ssh gerrit1001.wikimedia.org sudo systemctl restart gerrit). When the Gerrit primary starts, it triggers a full replication to all replicas. It takes a few hours to complete, but should start almost immediately against all replicas.

There is some documentation at https://wikitech.wikimedia.org/wiki/Gerrit/Administration#Replication
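
Besides the dashboard, the replication log mentioned above can be summarised with a small script. This Python sketch tracks which destinations have a "started" entry without a later "completed" entry. It assumes log lines contain "Replication to <url> started" and "Replication to <url> completed"; the first sample line follows the format of the log entry quoted later in this task, the other two sample lines and the exact wording of the completion message are assumptions.

```python
import re

LINE_RE = re.compile(r"Replication to (?P<url>\S+) (?P<event>started|completed)")


def pending_replications(log_lines):
    """Return the set of destination URLs that started but have not completed."""
    pending = set()
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        if match.group("event") == "started":
            pending.add(match.group("url"))
        else:
            pending.discard(match.group("url"))
    return pending


geowiki = "gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/analytics/geowiki.git"
core = "gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/mediawiki/core.git"

sample = [
    f"[2022-08-04 00:05:33,173] Replication to {geowiki} started..",
    f"[2022-08-04 00:05:34,000] Replication to {core} started..",  # invented line
    f"[2022-08-04 00:05:40,000] Replication to {geowiki} completed in 6827ms",  # invented line
]

still_running = pending_replications(sample)
print(sorted(still_running))
```

An empty result some time after the restart would suggest the full replication has finished, though it is only as reliable as the assumed log format.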

Change 815397 merged by Dzahn:

[operations/puppet@production] acme_chief: add gerrit2002 to hosts allowed to fetch TLS certs

https://gerrit.wikimedia.org/r/815397

Change 815398 merged by Dzahn:

[operations/puppet@production] gerrit: add gerrit2002 to firewall rules for cluster support

https://gerrit.wikimedia.org/r/815398

Change 815400 merged by Dzahn:

[operations/puppet@production] gerrit: add gerrit2002 to puppetized known_hosts file

https://gerrit.wikimedia.org/r/815400

Change 817841 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit2002 to list of migration dest hosts

https://gerrit.wikimedia.org/r/817841

https://puppet-compiler.wmflabs.org/pcc-worker1002/36495/gerrit2002.wikimedia.org/index.html

from https://gerrit.wikimedia.org/r/c/operations/puppet/+/817841

turned gerrit2002 into a monitored production host (Icinga monitoring for IPMI, dhclient, firewall, raid, ssh, etc.)

also added firewall rules and an rsyncd so that we can start syncing data over to it.

Change 817841 merged by Dzahn:

[operations/puppet@production] gerrit: turn gerrit2002 into a gerrit migration dest host

https://gerrit.wikimedia.org/r/817841

on gerrit2002 we now have, created by the migration class:

  • a group "gerrit2"
  • a user "gerrit2"
  • a directory /srv/gerrit
  • package rsync installed, /etc/default/rsync configured
  • 3 rsync fragments created: frag-gerrit-data frag-gerrit-home frag-gerrit-var-lib that allow syncing specific paths
  • /srv/home-gerrit1001.wikimedia.org/ created
  • host added to gerrit "contacts"

The 3 rsync modules (config fragments) are:

  • gerrit-data: path: /srv/gerrit
  • gerrit-home: path: /srv/home-gerrit1001.wikimedia.org
  • gerrit-var-lib: path: /var/lib/gerrit2/review_site

The hosts_allow line is gerrit1001.wikimedia.org localhost for all of them.

Finally the firewall rule that has been created via puppet->ferm->iptables is ACCEPT tcp -- gerrit1001.wikimedia.org anywhere tcp dpt:rsync.

This means an rsyncd is listening on the new host and, with the current settings, is ready to have data pushed to it from the prod gerrit server gerrit1001.
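
Pushing data into the three rsync modules listed above could look roughly like the commands built by this Python sketch. The source paths are the ones named in the SAL entries of this task; the module-to-source mapping, the -a flag and the optional --bwlimit value are assumptions about how the sync might be invoked, not the commands that were actually run. The script only builds and prints the commands, it does not execute them.

```python
# Assumed mapping of rsync module name to source path on gerrit1001.
MODULES = {
    "gerrit-data": "/srv/gerrit/",
    "gerrit-home": "/home/",
    "gerrit-var-lib": "/var/lib/gerrit2/review_site/",
}


def rsync_push_commands(dest_host, bwlimit_kbps=None):
    """Build one rsync command (as an argument list) per module, using daemon syntax."""
    commands = []
    for module, source in MODULES.items():
        cmd = ["rsync", "-a"]
        if bwlimit_kbps is not None:
            cmd.append(f"--bwlimit={bwlimit_kbps}")
        cmd += [source, f"{dest_host}::{module}/"]
        commands.append(cmd)
    return commands


commands = rsync_push_commands("gerrit2002.wikimedia.org", bwlimit_kbps=20000)
for cmd in commands:
    print(" ".join(cmd))
```

The double colon in the destination selects the rsync daemon module rather than a path over ssh, which matches the rsyncd and hosts_allow setup described above.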

Mentioned in SAL (#wikimedia-operations) [2022-07-28T18:28:19Z] <mutante> gerrit: rsyncing /home from prod gerrit1001 to /srv/home-gerrit1001.wikimedia.org on gerrit2002 new replica T243027 T313250

Krinkle renamed this task from Bring up Gerrit2002 to Bring up gerrit2002. Jul 28 2022, 8:46 PM

Change 815395 merged by Dzahn:

[operations/dns@master] add gerrit-replica-new.wikimedia.org, point to 208.80.153.104

https://gerrit.wikimedia.org/r/815395

new in DNS:

[authdns1001:~] $ host gerrit-replica-new.wikimedia.org
gerrit-replica-new.wikimedia.org has address 208.80.153.104
gerrit-replica-new.wikimedia.org has IPv6 address 2620:0:860:4:208:80:153:104
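
The same check can be scripted, for example to confirm that a name points at the expected addresses before and after a DNS switch. This is a hedged Python sketch: the function names are hypothetical, and the resolver is injectable so that the example below runs against canned answers mirroring the `host` output above instead of querying real DNS.

```python
import socket


def resolved_addresses(hostname, resolver=socket.getaddrinfo):
    """Return the set of addresses that `hostname` resolves to."""
    return {info[4][0] for info in resolver(hostname, None)}


def points_to(hostname, expected, resolver=socket.getaddrinfo):
    """True if every expected address is among the resolved ones."""
    return set(expected) <= resolved_addresses(hostname, resolver)


def fake_resolver(hostname, port):
    """Canned answers in getaddrinfo's return format, for the example only."""
    return [
        (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("208.80.153.104", 0)),
        (socket.AF_INET6, socket.SOCK_STREAM, 6, "", ("2620:0:860:4:208:80:153:104", 0, 0, 0)),
    ]


ok = points_to(
    "gerrit-replica-new.wikimedia.org",
    ["208.80.153.104", "2620:0:860:4:208:80:153:104"],
    resolver=fake_resolver,
)
print("resolves as expected:", ok)
```

Calling points_to without the resolver argument would use the system resolver, whose answers may be cached.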

Change 815396 merged by Dzahn:

[operations/puppet@production] gerrit: add hiera settings and IP for new replica gerrit2002

https://gerrit.wikimedia.org/r/815396

Dzahn changed the task status from Open to In Progress. Jul 29 2022, 8:47 PM
Dzahn triaged this task as High priority.

rsync of /srv/gerrit from gerrit1001 to gerrit2002 is still running (with a bandwidth limit). I will just let it run over the weekend; it's in a screen on gerrit1001 as root.

Mentioned in SAL (#wikimedia-operations) [2022-08-01T21:02:57Z] <mutante> gerrit2002 - mkdir /var/lib/gerrit2/review_site | gerrit1001 - rsyncing /var/lib/gerrit2/review_site/ to gerrit2002 T313250 T313972

Mentioned in SAL (#wikimedia-operations) [2022-08-02T19:11:23Z] <mutante> gerrit1001 - rsyncing /home/ to gerrit2002:/srv/home-gerrit1001.wikimedia.org T313250

Mentioned in SAL (#wikimedia-operations) [2022-08-02T20:38:01Z] <mutante> re-imaging gerrit2002 with buster - because it's on bullseye, needs git-fat and that has not been ported to python3 yet which blocks upgrading gerrit machines otherwise T313250 T243027 T279509

Mentioned in SAL (#wikimedia-operations) [2022-08-02T22:15:31Z] <mutante> gerrit - syncing data (/srv/gerrit /var/lib/gerrit2/review_site /home) again after gerrit2002 was reimaged with buster T313250 T313972

Icinga downtime and Alertmanager silence (ID=f988e085-2640-4894-8bf4-b5840e774f99) set by dzahn@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: in setup / flapping

gerrit2002.wikimedia.org

Change 820185 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] acme_chief: add gerrit-replica-new to SNI list

https://gerrit.wikimedia.org/r/820185

Change 820185 merged by Dzahn:

[operations/puppet@production] acme_chief: add gerrit-replica-new to SNI list

https://gerrit.wikimedia.org/r/820185

Change 815401 merged by Dzahn:

[operations/puppet@production] gerrit: add hiera data for a second replica

https://gerrit.wikimedia.org/r/815401

Mentioned in SAL (#wikimedia-operations) [2022-08-03T20:07:17Z] <mutante> gerrit - adding second replica T313250

Mentioned in SAL (#wikimedia-operations) [2022-08-04T00:03:28Z] <mutante> gerrit - service restart to deploy config change to add second replica T313250

Mentioned in SAL (#wikimedia-operations) [2022-08-04T00:06:37Z] <mutante> gerrit - [2022-08-04 00:05:33,173] Replication to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/analytics/geowiki.git started.. T313250

Change 820249 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: remove hiera data for old replica

https://gerrit.wikimedia.org/r/820249

https://gerrit-replica-new.wikimedia.org/ is now a working webserver (we got gerrit service to keep running and not flap and fixed a cert issue)

Change 820474 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Update the known host key for gerrit2002.wikimedia.org

https://gerrit.wikimedia.org/r/820474

Change 820474 merged by Jbond:

[operations/puppet@production] Update the known host key for gerrit2002.wikimedia.org

https://gerrit.wikimedia.org/r/820474

Change 820573 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerit: remove old replica on gerrit2001 from gerrit config

https://gerrit.wikimedia.org/r/820573

Change 820577 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] gerrit: switch gerrit-replica to new machine, remove replica-new

https://gerrit.wikimedia.org/r/820577

Change 820577 merged by Dzahn:

[operations/dns@master] gerrit: switch gerrit-replica to new machine, remove replica-new

https://gerrit.wikimedia.org/r/820577

Mentioned in SAL (#wikimedia-operations) [2022-08-04T23:06:58Z] <mutante> switching gerrit-replica.wikimedia.org to new machine gerrit2002, dropping gerrit-replica-new.wikimedia.org T313250

Change 820573 merged by Dzahn:

[operations/puppet@production] gerit: remove old replica on gerrit2001 from gerrit config

https://gerrit.wikimedia.org/r/820573

Mentioned in SAL (#wikimedia-operations) [2022-08-05T00:18:41Z] <mutante> restarting gerrit for config change - removing old replica T313250

Change 820249 abandoned by Dzahn:

[operations/puppet@production] gerrit: remove hiera data for old replica

Reason:

duplicate

https://gerrit.wikimedia.org/r/820249

Dzahn updated the task description.
Dzahn updated the task description.

gerrit2002.wikimedia.org is https://gerrit-replica.wikimedia.org and active.