
Migrate etherpad1001 to Buster
Closed, Resolved | Public

Description

etherpad.wikimedia.org/etherpad1001 is currently running jessie.

Details

Related Gerrit Patches:
[operations/dns@master] remove etherpad1001.eqiad.wmnet
[operations/puppet@production] install_server: remove etherpad1001 from DHCP
[operations/dns@master] switch discovery record for etherpad from 1001 to 1002
[operations/puppet@production] site: remove etherpad1001
[operations/puppet@production] trafficserver/cache: switch backend for etherpad to etherpad1002
[operations/dns@master] remove etherpad-new.wikimedia.org
[operations/puppet@production] trafficserver/cache: add etherpad-new -> etherpad1002
[operations/puppet@production] ssl: update TLS cert for etherpad, add etherpad1002, etherpad-new
[operations/dns@master] add IP for etherpad1002
[operations/puppet@production] install_server: add etherpad1002 to netboot/partman
[operations/dns@master] add etherpad-new.wikimedia.org
[operations/puppet@production] site: add etherpad role to etherpad1002
[operations/debs/etherpad-lite@master] Rebuild for buster

Event Timeline

ArielGlenn triaged this task as Medium priority. (Jun 11 2019, 7:54 AM)
MoritzMuehlenhoff renamed this task from Migrate etherpad1001 to Stretch/Buster to Migrate etherpad1001 to Buster. (Jan 7 2020, 9:34 AM)
Dzahn added a subscriber: Dzahn. (Jan 7 2020, 6:15 PM)
Dzahn added a comment. (Jan 21 2020, 7:15 PM)

The following packages are used by the puppet role but so far missing on buster:

  • prometheus-etherpad-exporter
  • etherpad-lite

Change 566467 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/debs/prometheus-etherpad-exporter@master] Rebuild for Buster T224580

https://gerrit.wikimedia.org/r/566467

Change 566467 merged by Muehlenhoff:
[operations/debs/prometheus-etherpad-exporter@master] Rebuild for Buster T224580

https://gerrit.wikimedia.org/r/566467

Mentioned in SAL (#wikimedia-operations) [2020-01-22T08:45:12Z] <moritzm> upload prometheus-etherpad-exporter 0.2 to buster-wikimedia T224580

> The following packages are used by the puppet role but so far missing on buster:
>
>   • prometheus-etherpad-exporter

I rebuilt the exporter for Buster and uploaded it to apt.wikimedia.org.

Change 566517 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/debs/etherpad-lite@master] Rebuild for buster

https://gerrit.wikimedia.org/r/566517

Change 566517 merged by Alexandros Kosiaris:
[operations/debs/etherpad-lite@master] Rebuild for buster

https://gerrit.wikimedia.org/r/566517

Mentioned in SAL (#wikimedia-operations) [2020-01-22T14:18:54Z] <akosiaris> upload etherpad-lite_1.7.5-3 to apt.wikimedia.org buster-wikimedia/main T224580

> The following packages are used by the puppet role but so far missing on buster:
>
>   • prometheus-etherpad-exporter
>   • etherpad-lite

I think both are done now.

Dzahn awarded a token. (Jan 22 2020, 6:41 PM)

Change 566628 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add IP for etherpad1002

https://gerrit.wikimedia.org/r/566628

Change 566629 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add etherpad-new.wikimedia.org

https://gerrit.wikimedia.org/r/566629

Change 566631 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: add etherpad1002 to netboot/partman

https://gerrit.wikimedia.org/r/566631

Change 566634 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add etherpad role to etherpad1002

https://gerrit.wikimedia.org/r/566634

Change 566635 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: remove etherpad1001

https://gerrit.wikimedia.org/r/566635

Change 566636 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] trafficserver/cache: add etherpad-new -> etherpad1002

https://gerrit.wikimedia.org/r/566636

Change 566637 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] trafficserver/cache: switch backend for etherpad to etherpad1002

https://gerrit.wikimedia.org/r/566637

Change 566638 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove etherpad-new.wikimedia.org

https://gerrit.wikimedia.org/r/566638

Change 566628 merged by Dzahn:
[operations/dns@master] add IP for etherpad1002

https://gerrit.wikimedia.org/r/566628

Change 566631 merged by Dzahn:
[operations/puppet@production] install_server: add etherpad1002 to netboot/partman

https://gerrit.wikimedia.org/r/566631

>   • prometheus-etherpad-exporter
>   • etherpad-lite
>
> I think both are done now.

Wow that was so quick and not expected. Thanks a lot! :)

Change 566629 merged by Dzahn:
[operations/dns@master] add etherpad-new.wikimedia.org

https://gerrit.wikimedia.org/r/566629

Change 566634 merged by Dzahn:
[operations/puppet@production] site: add etherpad role to etherpad1002

https://gerrit.wikimedia.org/r/566634

Change 566899 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ssl: update TLS cert for etherpad, added etherpad1002

https://gerrit.wikimedia.org/r/566899

Change 566899 merged by Dzahn:
[operations/puppet@production] ssl: update TLS cert for etherpad, add etherpad1002, etherpad-new

https://gerrit.wikimedia.org/r/566899

Change 566906 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] switch discovery record for etherpad from 1001 to 1002

https://gerrit.wikimedia.org/r/566906

Change 566636 merged by Dzahn:
[operations/puppet@production] trafficserver/cache: add etherpad-new -> etherpad1002

https://gerrit.wikimedia.org/r/566636

Dzahn claimed this task. (Jan 24 2020, 1:27 AM)

Hm, etherpad stores a lot of data in memory (because of ueberDB, see https://github.com/ether/etherpad-lite/issues/2826) before flushing it to the database; it's not exactly written to be scaled out. That is, people using this new URL while the old instance is still serving pads might end up causing database corruption. So, since this seems to work, it's probably best to finish the migration as soon as possible and kill this public URL.
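The corruption risk described above can be illustrated with a toy model. The sketch below is not Etherpad or ueberDB code: `PadServer`, its methods, and the dict standing in for the database are all hypothetical. The only assumption carried over from the comment is that each instance buffers pad contents in memory and writes them back later.

```python
class PadServer:
    """Toy stand-in for one Etherpad instance with a write-back cache (hypothetical)."""

    def __init__(self, name, database):
        self.name = name
        self.database = database  # shared dict standing in for the real database
        self.cache = {}           # pad contents held in memory until flush()

    def read(self, pad_id):
        # Load from the database only the first time; afterwards serve from memory.
        if pad_id not in self.cache:
            self.cache[pad_id] = self.database.get(pad_id, "")
        return self.cache[pad_id]

    def append(self, pad_id, text):
        # Edits only touch the in-memory copy.
        self.cache[pad_id] = self.read(pad_id) + text

    def flush(self):
        # Write every cached pad back, overwriting whatever is stored.
        self.database.update(self.cache)


def simulate():
    """Two instances edit the same pad against one database, then both flush."""
    database = {"WMCS-2020-01-22": "agenda\n"}
    old = PadServer("etherpad1001", database)
    new = PadServer("etherpad1002", database)

    old.append("WMCS-2020-01-22", "note typed via etherpad\n")
    new.append("WMCS-2020-01-22", "note typed via etherpad-new\n")

    old.flush()
    new.flush()  # last writer wins: the edit made through `old` is overwritten
    return database["WMCS-2020-01-22"]
```

In this model the second flush silently discards the first instance's edit, which is the kind of lost update that makes running both hosts against the same database at once unsafe.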

Change 566638 merged by Alexandros Kosiaris:
[operations/dns@master] remove etherpad-new.wikimedia.org

https://gerrit.wikimedia.org/r/566638

Mentioned in SAL (#wikimedia-operations) [2020-01-24T09:29:43Z] <akosiaris> disable and mask etherpad-lite on etherpad1002 to avoid corruption issues. T224580

I've removed the DNS record and stopped and masked the service on etherpad1002 for now. Since we proved it works, let's just move over to etherpad1002.eqiad.wmnet, stopping etherpad1001 beforehand (to avoid the issues I alluded to). etherpad is best effort anyway; even an extended downtime is OK.

Pads that, per the logs, have been accessed on https://etherpad-new.wikimedia.org:

90D1o-quuUNWqCrt0CIV
WMCS-2019-06-25
WMCS-2020-01-22
WMCS-2020-02-04
WMCS-2020-02-05
aXjrQTK8PD6bjj9TqK4Q

If those were accessed on https://etherpad.wikimedia.org at the same time as well, their resulting content is undefined.
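A check for which pads were touched through both hostnames could be sketched as a set intersection over the two access logs. This is only an illustration: the log line format, the function names, and the assumption that pad IDs appear after `/p/` in the request path are mine, not taken from the actual logs on these hosts.

```python
import re

# Assumed: pad URLs look like /p/<pad-id>, optionally followed by more path or a query.
PAD_PATH = re.compile(r"/p/([^/?#\s]+)")


def pads_in_log(lines):
    """Collect the pad IDs that appear in an iterable of access-log lines."""
    pads = set()
    for line in lines:
        match = PAD_PATH.search(line)
        if match:
            pads.add(match.group(1))
    return pads


def pads_at_risk(old_host_lines, new_host_lines):
    """Return pad IDs seen in both logs, i.e. pads possibly edited via both instances."""
    return sorted(pads_in_log(old_host_lines) & pads_in_log(new_host_lines))
```

Feeding it the log lines from both hosts for the window in which etherpad-new was public would narrow the list above to the pads that actually need a manual look.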

Change 566637 merged by Alexandros Kosiaris:
[operations/puppet@production] trafficserver/cache: switch backend for etherpad to etherpad1002

https://gerrit.wikimedia.org/r/566637

Change 566635 merged by Alexandros Kosiaris:
[operations/puppet@production] site: remove etherpad1001

https://gerrit.wikimedia.org/r/566635

@Dzahn, I've merged the remaining changes required to get the migration done. etherpad.wikimedia.org now uses etherpad1002. I checked a couple of pads and everything seems fine. Hopefully we have no corruption issues. etherpad1001 is now removed from site.pp and I've removed the etherpad-lite Debian package from it. I've also -2ed the discovery record change because of the issue above about the software not supporting scaling out. I guess what's left is to decommission and delete that VM.

Change 566906 abandoned by Dzahn:
switch discovery record for etherpad from 1001 to 1002

Reason:
https://phabricator.wikimedia.org/T224580#5828981

https://gerrit.wikimedia.org/r/566906

Change 567113 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: remove etherpad1001 from DHCP

https://gerrit.wikimedia.org/r/567113

Change 567113 merged by Dzahn:
[operations/puppet@production] install_server: remove etherpad1001 from DHCP

https://gerrit.wikimedia.org/r/567113

Change 567115 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove etherpad1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/567115

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: etherpad1001.eqiad.wmnet

  • etherpad1001.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • No management interface found (likely a VM)
    • Wiped bootloaders
    • Shutdown issued. Verify it manually, verification not yet supported
    • Set Netbox status on VM not yet supported: manual intervention required
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Mentioned in SAL (#wikimedia-operations) [2020-01-24T22:21:51Z] <mutante> shutting down etherpad1001 - service fully migrated to etherpad1002 - running decom cookbook on ganeti VM (T224580)

Mentioned in SAL (#wikimedia-operations) [2020-01-24T22:31:44Z] <mutante> ganeti1003 - sudo gnt-instance remove etherpad1001.eqiad.wmnet (T224580)

Change 567115 merged by Dzahn:
[operations/dns@master] remove etherpad1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/567115

@akosiaris Excellent! Thanks for all that.

I should not have merged the varnish change that actually enabled etherpad-new.wm.org before letting you review, though. As you say, it is best effort, and I'm glad we could keep all the existing pads; I hope there is no corruption.

I finished the decom of the VM and removed it from DHCP and DNS after gnt-instance remove.

Dzahn closed this task as Resolved. (Jan 24 2020, 10:59 PM)

one more jessie removed