etherpad.wikimedia.org/etherpad1001 is currently running jessie.
|operations/dns : master||remove etherpad1001.eqiad.wmnet|
|operations/puppet : production||install_server: remove etherpad1001 from DHCP|
|operations/dns : master||switch discovery record for etherpad from 1001 to 1002|
|operations/puppet : production||site: remove etherpad1001|
|operations/puppet : production||trafficserver/cache: switch backend for etherpad to etherpad1002|
|operations/dns : master||remove etherpad-new.wikimedia.org|
|operations/puppet : production||trafficserver/cache: add etherpad-new -> etherpad1002|
|operations/puppet : production||ssl: update TLS cert for etherpad, add etherpad1002, etherpad-new|
|operations/dns : master||add IP for etherpad1002|
|operations/puppet : production||install_server: add etherpad1002 to netboot/partman|
|operations/dns : master||add etherpad-new.wikimedia.org|
|operations/puppet : production||site: add etherpad role to etherpad1002|
|operations/debs/etherpad-lite : master||Rebuild for buster|
Hm, etherpad stores a lot of data in memory (because of ueberDB, see https://github.com/ether/etherpad-lite/issues/2826) before flushing it to the database, so it's not exactly written to be scaled out. That is, with two instances writing to the same database, people using this new URL might end up causing database corruption. So, since this seems to work, it's probably best to finish the migration as soon as possible and kill this public URL.
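To illustrate the concern, here is a minimal toy model of two instances that buffer writes in memory and flush them to a shared database later. It is not ueberDB's actual code; the class `WriteBehindCache` and the in-memory `db` dict are made up for this sketch, assuming only the "cache in memory, flush later" behaviour described above.

```python
class WriteBehindCache:
    """Toy model of one app instance that buffers writes before flushing.

    Hypothetical illustration only; not the ueberDB implementation.
    """

    def __init__(self, db):
        self.db = db          # shared backing store (a plain dict here)
        self.cache = {}       # this instance's private in-memory copy
        self.dirty = set()    # keys modified since the last flush

    def get(self, key):
        # Read through to the database only on a cache miss.
        if key not in self.cache:
            self.cache[key] = self.db.get(key, "")
        return self.cache[key]

    def append(self, key, text):
        # The write lands in memory only; the database is not touched yet.
        self.cache[key] = self.get(key) + text
        self.dirty.add(key)

    def flush(self):
        # Overwrite the database with this instance's view of each dirty key.
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()


db = {"pad": "hello"}
old_instance = WriteBehindCache(db)   # stands in for etherpad1001
new_instance = WriteBehindCache(db)   # stands in for etherpad1002

old_instance.append("pad", " from-old")
new_instance.append("pad", " from-new")

old_instance.flush()
new_instance.flush()   # last flush wins; the other instance's edit is lost

result = db["pad"]
```

In this model whichever instance flushes last silently overwrites the other's edit, which is why stopping one instance before using the other avoids the problem.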
I've removed the DNS record, and stopped and masked the service on etherpad1002 for now. Since we proved it works, let's just move over to etherpad1002.eqiad.wmnet, stopping etherpad1001 beforehand (to avoid the issues I alluded to). etherpad is best effort anyway, so it's OK to have even an extended downtime.
Pads that, per the logs, have been accessed on https://etherpad-new.wikimedia.org:
90D1o-quuUNWqCrt0CIV WMCS-2019-06-25 WMCS-2020-01-22 WMCS-2020-02-04 WMCS-2020-02-05 aXjrQTK8PD6bjj9TqK4Q
If those were accessed on https://etherpad.wikimedia.org at the same time as well, their content is undefined.
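A rough sketch of how one could check for such overlap, assuming each host's access log contains the request path and that pads are served under `/p/<padname>`. The log line layout, the helper `pads_in_log`, and the sample lines below are made up for illustration; they are not the actual logs.

```python
import re

# Pad URLs look like /p/<padname>; capture the name up to the next
# slash, whitespace, query string, or fragment.
PAD_RE = re.compile(r"/p/([^/\s?#]+)")


def pads_in_log(lines):
    """Return the set of pad names requested in the given log lines."""
    pads = set()
    for line in lines:
        match = PAD_RE.search(line)
        if match:
            pads.add(match.group(1))
    return pads


# Hypothetical sample lines, one list per host.
new_host_log = [
    "GET /p/WMCS-2020-02-04 HTTP/1.1",
    "GET /p/WMCS-2020-02-05/export/txt HTTP/1.1",
    "GET /static/js/pad.js HTTP/1.1",
]
old_host_log = [
    "GET /p/WMCS-2020-02-04 HTTP/1.1",
    "GET /p/some-other-pad HTTP/1.1",
]

# Pads seen on both hosts are the ones whose content may be undefined.
at_risk = pads_in_log(new_host_log) & pads_in_log(old_host_log)
```

This only shows that a pad was touched on both hosts, not that the accesses overlapped in time; comparing timestamps would be needed for that.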
@Dzahn, I've merged the remaining changes required to get the migration done. etherpad.wikimedia.org now uses etherpad1002. I checked a couple of pads and everything seems fine. Hopefully we have no corruption issues. etherpad1001 is now removed from site.pp and I've removed the etherpad-lite Debian package from it. I've also -2'ed the discovery record changes because of the issue above about the software not supporting scaling out. I guess what's left is to decommission and delete that VM.
cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: etherpad1001.eqiad.wmnet
- etherpad1001.eqiad.wmnet (FAIL)
- Downtimed host on Icinga
- No management interface found (likely a VM)
- Wiped bootloaders
- Shutdown issued. Verify it manually, verification not yet supported
- Set Netbox status on VM not yet supported: manual intervention required
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
ERROR: some step on some host failed, check the bolded items above
@akosiaris Excellent! Thanks for all that.
I should not have merged the varnish change that actually enabled etherpad-new.wm.org before letting you review, though. As you say, it is best effort, and I'm glad we could keep all the existing pads. I hope there is no corruption.
I finished the decom of the VM and removed it from DHCP and DNS after gnt-instance remove.