(Authored by @hashar)
The replica on gerrit2001 is a production service: it serves https://gerrit-replica.wikimedia.org/ and a few services rely on it (see https://wikitech.wikimedia.org/wiki/Gerrit/Replica). I think we started relying on it when Gerrit was running out of heap, and it nicely offloaded the primary.
Currently we have:
  gerrit1001 (primary)
      |
      v
  gerrit2001 (replica)
Both hosts have the Puppet role gerrit; the replica configuration is applied based on variables such as gerrit::is_replica.
We will want to add gerrit2002 as a replica. It will need the gerrit role and a few Hiera settings to make it a replica. Gerrit replication destinations are configured in:
  profile::gerrit::replication:
    github:
      <snip>
    replica_codfw:
      url: 'gerrit2@gerrit2001.wikimedia.org:/srv/gerrit/git/${name}.git'
      mirror: true
      replicateProjectDeletions: true
      replicateHiddenProjects: true
      defaultForceUpdate: true
      threads: 4
      replicationDelay: 5
      rescheduleDelay: 5
The defined remote replica_codfw can take multiple URLs (see https://gerrit.wikimedia.org/r/plugins/replication/Documentation/config.md), so we can probably do:
  profile::gerrit::replication:
    replica_codfw:
      url:
        - 'gerrit2@gerrit2001.wikimedia.org:/srv/gerrit/git/${name}.git'
        - 'gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/${name}.git'
Or we could create another target, which might make it easier to follow the replication to the different hosts. Notably, https://grafana.wikimedia.org/d/RFLS1GsWk/replication-upstream is what I use to track replication, and it is on a per-remote basis rather than a per-URL one. So we are probably better off copy-pasting :\
Then we have:
  gerrit1001 (primary)
   |     \________________
   |                      \
   v                       v
  gerrit2001 (replica)   gerrit2002 (replica)
Once the replication to gerrit2002 has completed (I think it might take 4 to 5 hours), we can switch the DNS entry for gerrit-replica.wikimedia.org from gerrit2001 to gerrit2002.
After the switch, gerrit2001 will probably still receive requests (this can be checked via /var/log/apache2/gerrit.wikimedia.org.https.access.log). There might be a long-running daemon pulling from it that still has the resolved IP cached.
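To see who is still hitting gerrit2001 after the DNS switch, something like the following could summarise the access log per client. This is only a sketch: it assumes the client address is the first whitespace-separated field of each line (as in Apache's common and combined log formats), which may not match the log format actually configured on the host. The function names count_clients and top_clients are made up for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_clients(lines: Iterable[str]) -> Counter:
    """Count requests per client.

    Assumption: the client address is the first whitespace-separated
    field of each log line. Adjust if the host uses a custom LogFormat.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


def top_clients(path: str, limit: int = 10) -> List[Tuple[str, int]]:
    """Return the `limit` busiest clients found in the log file at `path`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_clients(handle).most_common(limit)
```

Running top_clients("/var/log/apache2/gerrit.wikimedia.org.https.access.log") on gerrit2001 some time after the switch should show whether a handful of clients still have the old IP cached.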
After that gerrit2001 can be decommissioned.
Acceptance criteria
(cribbed from @hashar's notes)
- Apply the Puppet role gerrit with gerrit::is_replica: true to gerrit2002
- Add gerrit2002 as a replica in the primary Gerrit server's config (on gerrit1001)
- Create a working https://gerrit-replica-new.wikimedia.org (requires a running Gerrit service, webserver, and certificate from acme_chief)
- Replication is complete on gerrit2002
- Switch the DNS entry for gerrit-replica.wikimedia.org from gerrit2001 to gerrit2002
- Stop requests on gerrit2001
- Shut down and fully decom gerrit2001
- Stretch: create a test to ensure replication is complete, or make an alert
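For the stretch item, one possible shape for a replication check is to compare the refs advertised by the primary and the replica for a given project. The sketch below is an assumption about how such a test could look, not an existing tool: it shells out to git ls-remote, the helper names are made up, and the refs visible to the caller depend on its permissions. On a busy primary a short-lived difference is expected (the replication config above uses a delay), so a real alert would want to retry before firing.

```python
import subprocess
from typing import Dict, Set


def parse_ls_remote(output: str) -> Dict[str, str]:
    """Parse `git ls-remote` output ("<sha><TAB><ref>" per line) into {ref: sha}."""
    refs: Dict[str, str] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, ref = line.split(None, 1)
        refs[ref.strip()] = sha
    return refs


def ls_remote(url: str) -> Dict[str, str]:
    """Fetch the refs advertised by the repository at `url`."""
    result = subprocess.run(
        ["git", "ls-remote", url],
        check=True, capture_output=True, text=True,
    )
    return parse_ls_remote(result.stdout)


def out_of_sync(primary: Dict[str, str], replica: Dict[str, str]) -> Set[str]:
    """Refs that are missing, extra, or at a different commit on the replica."""
    return {
        ref
        for ref in primary.keys() | replica.keys()
        if primary.get(ref) != replica.get(ref)
    }
```

Calling out_of_sync(ls_remote(primary_url), ls_remote(replica_url)) for a project's URL on each host returns an empty set when the advertised refs match.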