
Migrate the r/w LDAP servers to Bookworm and MDB storage
Open, Medium, Public

Description

Doing this via fresh installs also lets us address some other issues along the way (previously serpens/seaborgium had been dist-upgraded in place):

  • Rename away from the legacy naming scheme towards ldap-rw1001/ldap-rw2001
  • Move away from the deprecated BDB (T292942)
  • Create new ldap-rw1001/ldap-rw2001 VMs using Bookworm and set profile::openldap::storage_backend to "mdb" and configure them as a synchronisation pair
  • slapcat the existing data from serpens to an LDIF (ACLs, LDAP extensions are all distributed via Puppet)
  • slapadd the LDIF on ldap-rw1001 and let it sync towards ldap-rw2001 (see the sketch after this list)
  • Create four additional ldap-replica VMs running Bookworm and sync them against ldap-rw1001/2001
  • Test the new setup
  • When everything works as expected in the parallel setup, revert the new Bookworm hosts to a clean state
  • Set up a window (1-2 hours) during which no r/w changes are possible (disable Bitu temporarily, tell SREs to avoid LDAP changes, disable Horizon)
  • Repeat the same import as above with current data; if all is well:
  • Point ldap-rw.codfw.w.o to ldap-rw2001
  • Point ldap-rw.eqiad.w.o to ldap-rw1001
  • Depool all older readonly replicas in favour of the new bookworm ones
  • If there are unforeseen issues we can simply revert to serpens/seaborgium/old replicas
  • If all is well, decom serpens/seaborgium and the old replicas
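
A minimal sketch of the slapcat/slapadd steps above, assuming the suffix is dc=wikimedia,dc=org and using placeholder paths (slapadd has to run while slapd is stopped):

  # On serpens: dump the full database to an LDIF
  slapcat -b "dc=wikimedia,dc=org" -l /var/tmp/ldap-export.ldif

  # Copy the LDIF to ldap-rw1001 (e.g. via scp), then load it offline there
  systemctl stop slapd
  slapadd -b "dc=wikimedia,dc=org" -l /var/tmp/ldap-export.ldif
  chown -R openldap:openldap /var/lib/ldap
  systemctl start slapd

With slapd back up, syncrepl should take care of propagating the imported data to ldap-rw2001 and the new replicas.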

Event Timeline

Change 917363 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add partman config for ldap-rw* hosts

https://gerrit.wikimedia.org/r/917363

The current VMs are quite overdimensioned in terms of CPU cores; I'd go with 4G RAM, 4 CPUs and 20G disk space instead for ldap-rw1001/2001.

Change 917363 merged by Muehlenhoff:

[operations/puppet@production] Add partman config for ldap-rw* hosts

https://gerrit.wikimedia.org/r/917363

The current VMs are quite overdimensioned in terms of CPU cores; I'd go with 4G RAM, 4 CPUs and 20G disk space instead for ldap-rw1001/2001.

These seem reasonable to me

Change 917822 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ldap-rw[12]001 to site.pp

https://gerrit.wikimedia.org/r/917822

Change 917822 merged by Muehlenhoff:

[operations/puppet@production] Add ldap-rw[12]001 to site.pp

https://gerrit.wikimedia.org/r/917822

Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host ldap-rw1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host ldap-rw1001.wikimedia.org with OS bullseye completed:

  • ldap-rw1001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202305091108_jmm_2558619_ldap-rw1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host ldap-rw2001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host ldap-rw2001.wikimedia.org with OS bullseye completed:

  • ldap-rw2001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202305091229_jmm_2655238_ldap-rw2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

@MoritzMuehlenhoff I think this updated plan makes sense; my only concern is with our use of mirrormode on. I don't have a good understanding of how mirror mode interacts with having more than one replica. I tried to find some good documentation, but OpenLDAP's docs are pretty sparse: https://www.openldap.org/doc/admin24/replication.html.

Hi, couple of comments from my side.

The plan above has a couple of caveats that, unless spelled out more clearly, make it unfeasible.

Specifically, as @jhathaway says, mirrormode is a specific way to do multi-master: it only allows two masters, and they need to be paired with each other. Setting up true multi-master LDAP replication is a project that is quite possibly not worth doing (otherwise we would have done it years ago).
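
For reference, a minimal sketch of what such a MirrorMode pairing looks like in slapd.conf terms; the searchbase, bind DN and credentials are placeholders, and each of the two masters carries a distinct serverID while pointing its syncrepl consumer at its peer:

  # on ldap-rw1001; ldap-rw2001 would carry serverID 2 and point its provider at ldap-rw1001
  serverID   1
  syncrepl   rid=001
             provider=ldap://ldap-rw2001.wikimedia.org
             type=refreshAndPersist
             retry="60 +"
             searchbase="dc=wikimedia,dc=org"
             bindmethod=simple
             binddn="cn=repluser,dc=wikimedia,dc=org"
             credentials=CHANGEME
  mirrormode on
  # both masters also need the syncprov overlay so each can act as a provider
  overlay    syncprov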

So, a couple of suggestions:

  • The step "Setup replication between serpens and ldap-rw2001 and confirm all works fine" needs to become "Set ldap-rw2001 as a read-only replica (role::openldap::replica); set up replication between serpens and ldap-rw2001 and confirm all works fine".
  • Given that change, "Point ldap-rw.codfw.w.o to ldap-rw2001" is obviously no longer feasible at this point and needs to be moved to a later step.
  • Similarly, "Setup replication between seaborgium and ldap-rw1001 and confirm all works fine" needs to become "Set ldap-rw1001 as a read-only replica (role::openldap::replica); set up replication between seaborgium and ldap-rw1001 and confirm all works fine".
  • And of course "Point ldap-rw.eqiad.w.o to ldap-rw1001" needs to be moved to a later step as well.

With the two new hosts having the entire dataset, and confirmation that things work fine, the next step is:

  • Schedule a maintenance window, probably a couple of hours long. During this window it's best to avoid update/delete/add etc. operations; this will make it easier to roll back if things go bad. IIRC the main things here are to disable wikitech account creation for those couple of hours and to inform clinic duty not to perform LDAP operations. There are a couple of other processes that can issue r/w operations, e.g. password updates or emergency disabling/locking of wikitech/phabricator accounts. Adding @bd808 as they probably want to know about all of the above; I also welcome insights.

Regular read-only clients that want to perform read operations will be using the other replicas that exist (ldap-replica100[3-4] and ldap-replica200[5-6]). Most of our Puppet code already does exactly that, and we've gone to some lengths to avoid having clients talk to the masters unless needed. That being said, there is always the chance that something exists that talks to the masters when it shouldn't; hopefully it is something small that won't mind the downtime much.

In that window:

  1. Demote serpens to a read-only replica, i.e. switch the role of serpens from role::openldap::rw to role::openldap::replica and make sure that it continues to sync from seaborgium. At this point, write operations (if any) to codfw will be failing. Grafana says there should be exactly 0, but let's be clear about this.
  2. Promote ldap-rw2001 to r/w, i.e. switch the role of ldap-rw2001 from role::openldap::replica to role::openldap::rw. If my Hiera reading is correct, that will also switch ldap-rw2001 to sync from seaborgium; make sure that it does.
  3. The step "Point ldap-rw.codfw.w.o to ldap-rw2001" gets moved here.
  4. Make sure that seaborgium is now syncing from ldap-rw2001. It should, given the DNS change, but we will need a proper test: add an entry, then remove it, and confirm both changes replicate (see the sketch below).
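
A minimal sketch of that add/remove round-trip test; the bind DN, the LDIF file and the test entry DN are placeholders rather than the actual directory layout:

  # add a throwaway entry on the new r/w host (repl-smoke-test.ldif holds one disposable entry)
  ldapadd -x -H ldap://ldap-rw2001.wikimedia.org -D "$ADMIN_DN" -W -f repl-smoke-test.ldif

  # confirm it shows up on seaborgium
  ldapsearch -x -LLL -H ldap://seaborgium.wikimedia.org \
    -b "dc=wikimedia,dc=org" "(cn=repl-smoke-test)"

  # remove it again on ldap-rw2001 and re-run the search to confirm the delete replicated too
  ldapdelete -x -H ldap://ldap-rw2001.wikimedia.org -D "$ADMIN_DN" -W \
    "cn=repl-smoke-test,ou=groups,dc=wikimedia,dc=org"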

And we are half done; rinse and repeat for seaborgium/ldap-rw1001. Here we have the twist that eqiad will actually see write operations fail. Judging by the same Grafana link as above, in a one-hour period we are probably talking about (I am rounding up, so already exaggerating):

  • ~1 ADD operation (account/group creation)
  • ~1 DELETE operation. Not sure what these are about; I rounded up for safety, but they are an order of magnitude fewer than the ADDs.
  • ~6 MODIFY operations. We are talking mostly group/account modifications here, IIRC.
  • ~300 EXTENDED operations. I don't remember offhand what we are using extended operations for. I think it includes password changes, but since extended operations are a big bucket that covers everything not in the original protocol, it definitely includes other stuff as well.

The above numbers sound pretty acceptable to me, but let me know if I am wrong.
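
If anyone wants to cross-check those numbers on the servers themselves rather than in Grafana, the cn=Monitor backend exposes per-operation counters, assuming it is enabled and readable for the bind DN used:

  # completed operation counters per type (Add, Delete, Modify, Extended, ...)
  ldapsearch -x -LLL -H ldap://seaborgium.wikimedia.org -D "$ADMIN_DN" -W \
    -b "cn=Operations,cn=Monitor" -s one '(objectClass=*)' monitorOpCompleted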

  • Schedule a maintenance window, probably a couple of hours long. During this window it's best to avoid update/delete/add etc. operations; this will make it easier to roll back if things go bad. IIRC the main things here are to disable wikitech account creation for those couple of hours and to inform clinic duty not to perform LDAP operations. There are a couple of other processes that can issue r/w operations, e.g. password updates or emergency disabling/locking of wikitech/phabricator accounts. Adding @bd808 as they probably want to know about all of the above; I also welcome insights.

wikitech.wikimedia.org, Striker, and OpenStack Keystone can all write to the LDAP directory as the uid=novaadmin user (T218673). Developer account creations, which can happen from both wikitech and striker, are the most obvious write actions. There are, however, a number of other write paths, including at least:

  • Toolforge membership approval via striker
  • Toolforge tool creation via striker
  • Toolforge tool deletion via striker & cli tools
  • Developer account password changes via wikitech
  • Developer account email changes via wikitech
  • Developer account indefinite duration blocks via wikitech
  • Developer account indefinite duration unblocks via wikitech
  • Cloud VPS ssh public key changes via either wikitech or striker
  • Cloud VPS project creation via cli tools
  • Cloud VPS project membership changes via cli tools and Horizon

The easiest way I can think of to block all of those actions would be to temporarily change the uid=novaadmin user's password. We don't have anything else that would be close to a unified off switch for all of these LDAP write workflows.
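
Purely to illustrate the idea, a minimal sketch of that temporary lock-out; the novaadmin DN, the service name ldap-rw.eqiad.wikimedia.org and the admin bind DN are assumptions, and the Keystone caveat raised below still applies:

  # record the current userPassword value first so it can be restored afterwards,
  # then set a random throwaway password to lock novaadmin out temporarily
  ldappasswd -x -H ldap://ldap-rw.eqiad.wikimedia.org -D "$ADMIN_DN" -W \
    -s "$(pwgen -s 32 1)" "uid=novaadmin,ou=people,dc=wikimedia,dc=org"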

cc: @Andrew

The easiest way I can think of to block all of those actions would be to temporarily change the uid=novaadmin user's password. We don't have anything else that would be close to a unified off switch for all of these LDAP write workflows.

I believe that, at least for Keystone, this would result in the LDAP tree getting out of sync with its MariaDB database.

Thanks for all the input, much appreciated! I'll revise the plan and update the task in the next few days.

One other option would be to simply start with a fresh, parallel setup and skip Bullseye entirely:

  • Create new ldap-rw1001/ldap-rw2001 VMs using Bookworm and set profile::openldap::storage_backend to "mdb" and configure them as a synchronisation pair
  • slapcat the existing data from serpens to an LDIF (ACLs, LDAP extensions are all distributed via Puppet)
  • slapadd the LDIF on ldap-rw1001 and let it sync towards ldap-rw2001
  • Create four additional ldap-replica VMs running Bookworm and sync them against ldap-rw1001/2001
  • Test the new setup
  • When everything works as expected in the parallel setup, revert the new Bookworm hosts to a clean state
  • Set up a window (1-2 hours) during which no r/w changes are possible (disable Bitu temporarily, tell SREs to avoid LDAP changes, disable Horizon)
  • Repeat the same import as above with current data; if all is well:
  • Point ldap-rw.codfw.w.o to ldap-rw2001
  • Point ldap-rw.eqiad.w.o to ldap-rw1001
  • Depool all older readonly replicas in favour of the new bookworm ones
  • If there are unforeseen issues we can simply revert to serpens/seaborgium/old replicas
  • If all is well, decom serpens/seaborgium and the old replicas

One other option would be to simply start with a fresh, parallel setup and skip Bullseye entirely:

Yes, this would work too. And it's probably faster and safer, with a clear fallback plan. Good thinking!

One other option would be to simply start with a fresh, parallel setup and skip Bullseye entirely:

I like this option as well. Couple of questions:

  1. How concerned are we about the 1-2 hour window where LDAP will be read-only? Would it be worth a test window of, say, half an hour to ascertain whether a 1-2 hour window is workable? I would hate to be in the middle of the operation and find out that 1-2 hours causes unacceptable service outages.
  2. How do we test the new setup: slapcat from the new servers after loading and diff the two versions?

One other option would be to simply start with a fresh, parallel setup and skip Bullseye entirely:

I like this option as well. Couple of questions:

  1. How concerned are we about the 1-2 hour window where LDAP will be read-only? Would it be worth a test window of, say, half an hour to ascertain whether a 1-2 hour window is workable? I would hate to be in the middle of the operation and find out that 1-2 hours causes unacceptable service outages.

We can do a smoke test beforehand (also to test e.g. the code paths to put the IDM into r/o mode), but in general our reliance on LDAP being r/w is fortunately not that large:

  • No new signups of developer accounts or changes to accounts are possible (e.g. password changes). Our rate of daily signups is approximately two accounts; I don't have numbers for account changes
  • No changes to Cloud project configurations are possible (e.g. making someone a new project admin for an OpenStack project)
  • SREs are unable to offboard people or perform other LDAP group changes (e.g. adding someone to cn=wmf)
  2. How do we test the new setup: slapcat from the new servers after loading and diff the two versions?

I was thinking of making a quick script which does the following:

  • Compare the data for a few select groups/users between one of the original servers (seaborgium) and the new ldap-rw* and ldap-ro* servers
  • Compare the contextCSN between the new ldap-rw* and ldap-ro* servers and ensure that it is identical (to confirm that replication of the slapadd data completed); see the sketch below
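
A minimal sketch of those two checks; the suffix, the cn=wmf example entry, the bind options and the exact host list are assumptions:

  # 1) spot-check an entry on the old master vs. one of the new hosts
  diff <(ldapsearch -x -LLL -H ldap://seaborgium.wikimedia.org \
           -b "dc=wikimedia,dc=org" "(cn=wmf)" | sort) \
       <(ldapsearch -x -LLL -H ldap://ldap-rw1001.wikimedia.org \
           -b "dc=wikimedia,dc=org" "(cn=wmf)" | sort)

  # 2) contextCSN of the suffix on every new host; the values should be identical once
  #    the slapadd data has fully replicated (add -D/-W if anonymous reads are not allowed)
  for host in ldap-rw1001 ldap-rw2001 ldap-replica1005 ldap-replica1006 ldap-replica2007 ldap-replica2008; do
    echo "== ${host}"
    ldapsearch -x -LLL -H "ldap://${host}.wikimedia.org" \
      -s base -b "dc=wikimedia,dc=org" contextCSN
  done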

If there are other things, I'm all ears :-)

MoritzMuehlenhoff renamed this task from "Migrate the r/w LDAP servers to Bullseye" to "Migrate the r/w LDAP servers to Bookworm and MDB storage". Sep 8 2023, 10:55 AM

Change 955904 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add new LDAP replicas

https://gerrit.wikimedia.org/r/955904

Change 955904 merged by Muehlenhoff:

[operations/puppet@production] Add new LDAP replicas

https://gerrit.wikimedia.org/r/955904

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ldap-replica1005.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ldap-replica1005.wikimedia.org with OS bookworm completed:

  • ldap-replica1005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309081139_jmm_30763_ldap-replica1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ldap-replica1006.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ldap-replica1006.wikimedia.org with OS bookworm completed:

  • ldap-replica1006 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309081239_jmm_43729_ldap-replica1006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ldap-replica2007.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ldap-replica2007.wikimedia.org with OS bookworm completed:

  • ldap-replica2007 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309150754_jmm_1856603_ldap-replica2007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ldap-replica2008.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ldap-replica2008.wikimedia.org with OS bookworm executed with errors:

  • ldap-replica2008 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309150850_jmm_1867768_ldap-replica2008.out
    • The reimage failed, see the cookbook logs for the details

Change 959200 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Configure ldap-rw1001/2001 as LDAP servers

https://gerrit.wikimedia.org/r/959200

Change 959201 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Extend acmechief config with new names of Bookworm hosts

https://gerrit.wikimedia.org/r/959201

Change 959203 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Create DNS records for new LDAP Bookworm cluster

https://gerrit.wikimedia.org/r/959203

Change 959213 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add new Bookworm LDAP replicas

https://gerrit.wikimedia.org/r/959213

Change 959215 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] conftool: Add new LDAP replicas

https://gerrit.wikimedia.org/r/959215

Change 959203 merged by Muehlenhoff:

[operations/dns@master] Create DNS records for new LDAP Bookworm cluster

https://gerrit.wikimedia.org/r/959203

Change 959201 merged by Muehlenhoff:

[operations/puppet@production] Extend acmechief config with new names of Bookworm hosts

https://gerrit.wikimedia.org/r/959201

Change 959200 merged by Muehlenhoff:

[operations/puppet@production] Configure ldap-rw1001/2001 as LDAP servers

https://gerrit.wikimedia.org/r/959200

Change 961047 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Fix Hiera entries for ldap master nodes

https://gerrit.wikimedia.org/r/961047

Change 961047 merged by Muehlenhoff:

[operations/puppet@production] Fix Hiera entries for ldap master nodes

https://gerrit.wikimedia.org/r/961047

Change 961066 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] On Bookworm ship ppolicy.schema via Puppet

https://gerrit.wikimedia.org/r/961066

Change 961188 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] slapd: introduce new slapd.conf template for ldap >= 2.5

https://gerrit.wikimedia.org/r/961188

I got an alert about ldap-rw2001 failing its backups (probably expected during setup), but wanted to give a heads up.

Change 961796 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Only install ppolicy.schema with OpenLDAP < 2.5

https://gerrit.wikimedia.org/r/961796

Change 961796 merged by Andrew Bogott:

[operations/puppet@production] Only install ppolicy.schema with OpenLDAP < 2.5

https://gerrit.wikimedia.org/r/961796

Change 961066 abandoned by Muehlenhoff:

[operations/puppet@production] On Bookworm ship ppolicy.schema via Puppet

Reason:

Correct fix was https://gerrit.wikimedia.org/r/c/operations/puppet/+/961796

https://gerrit.wikimedia.org/r/961066

Change 961188 abandoned by Andrew Bogott:

[operations/puppet@production] slapd: introduce new slapd.conf template for ldap >= 2.5

Reason:

Same change accomplished in https://gerrit.wikimedia.org/r/c/operations/puppet/+/961796

https://gerrit.wikimedia.org/r/961188