create bullseye VM for Etherpad upgrade (and upgrade it to 1.8.16)
Closed, Resolved, Public

Description

We want to upgrade Etherpad, but doing so also requires upgrading the distro version.

So this is about creating a VM for etherpad with Debian bullseye.

Since the current server is etherpad1002, this will be etherpad1003.

Turning this into an official request by pasting generated VM request ticket part from [https://wikitech.wikimedia.org/wiki/SRE/SRE_Team_requests#Virtual_machine_requests_(Production)]

We will follow the previous request for etherpad1002 which was T243475.

After that we can follow T224580 for the actual Etherpad upgrade and do it here or make a new ticket.


Cloud VPS Project Tested: n/a
Site/Location: eqiad
Number of systems: 1
Service: etherpad
Networking Requirements: internal
Processor Requirements: 1
Memory: 1GB
Disks: 15GB (used to be 10, but as Moritz has said on other tickets, we should not cause problems by making VMs too small either)
Other Requirements: -

Event Timeline

Dzahn changed the task status from Open to In Progress. Jan 31 2022, 8:03 PM
Dzahn triaged this task as Medium priority.
Dzahn added a project: Wikimedia-Etherpad.
Dzahn added a subscriber: akosiaris.
dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 1 --disk 15 --network private eqiad_C etherpad1003
Ready to create Ganeti VM etherpad1003.eqiad.wmnet in the ganeti01.svc.eqiad.wmnet cluster on row C with 1 vCPUs, 1GB of RAM, 15GB of disk in the private network.

..

START - Cookbook sre.ganeti.makevm for new host etherpad1003.eqiad.wmnet
Allocated IPv4 10.64.32.181/22
Set DNS name of IP 10.64.32.181/22 to etherpad1003.eqiad.wmnet
Allocated IPv6 2620:0:861:103:10:64:32:181/64 with DNS name etherpad1003.eqiad.wmnet
..

Change 758559 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] DHCP: add MAC address for etherpad1003

https://gerrit.wikimedia.org/r/758559

Change 758559 merged by Dzahn:

[operations/puppet@production] DHCP: add MAC address for etherpad1003, use bullseye installer

https://gerrit.wikimedia.org/r/758559

Mentioned in SAL (#wikimedia-operations) [2022-01-31T21:15:39Z] <mutante> installed bullseye on new VM etherpad1003, signing puppet certs for etherpad1003.eqiad.wmnet - puppet error expected until we add the role (T300568)

Change 758560 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add etherpad1003 with insetup role

https://gerrit.wikimedia.org/r/758560

Change 758560 merged by Dzahn:

[operations/puppet@production] site: add etherpad1003 with insetup role

https://gerrit.wikimedia.org/r/758560

Change 758561 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] switch etherpad.discovery.wmnet to etherpad1003

https://gerrit.wikimedia.org/r/758561

Change 758562 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add etherpad role to etherpad1003

https://gerrit.wikimedia.org/r/758562

Dzahn renamed this task from create bullseye VM for Etherpad upgrade to create bullseye VM for Etherpad upgrade (and upgrade it:). Jan 31 2022, 9:23 PM
Dzahn removed Dzahn as the assignee of this task.

@akosiaris

Looking at the old ticket when we upgraded to buster, I don't want to repeat the mistake and run Etherpad on 2 servers at a time because it stores things in memory etc.

So.. for right now.. we have a VM with bullseye, the "insetup" role and more patches waiting in Gerrit for adding the role, switching discovery DNS name (which this is meanwhile) and maybe more to be able to add the role but keep the service masked.

Also, looking through the puppet repo I see we still have a bunch of "etherpad-lite" related entries in modules/profile/templates/mariadb/grants/dumps-eqiad-m1.sql.erb and modules/profile/templates/mariadb/grants/dumps* but all the IPs in there are dbprov* or dbproxy* hosts, none are the actual Etherpad server.

Then looked at the TLS cert and we have:

alt_names: ["etherpad.discovery.wmnet","etherpad.svc.eqiad.wmnet","etherpad.svc.codfw.wmnet","etherpad.wikimedia.org","etherpad1001.eqiad.wmnet","etherpad2001.codfw.wmnet", "etherpad1002.eqiad.wmnet", "etherpad-new.wikimedia.org"]

so etherpad1003 isn't on there but also I wonder if it would just work with the etherpad.discovery.wmnet name which we use in the ATS config nowadays. The other names are just there from the past or "just in case". Either way I can clean that up.

(Or we can make a puppet patch to allow it to apply role while keeping the service masked, use the "etherpad-new.wm.org" that is still on the cert for bullseye..etc)

@Dzahn

I've pushed the new git-buildpackage upstream changes to Gerrit (bumped first to 1.8.14 and then to 1.8.16). I've bypassed code review for that part as it is upstream code and there isn't much point in reviewing tons of upstream code in Gerrit.

My own changes are at: https://gerrit.wikimedia.org/r/q/project:operations/debs/etherpad-lite+branch:master+status:open and are awaiting review.

I've already built the package resulting from those, uploaded it to apt1001 under bullseye-wikimedia, and tested it (both 1.8.14 and 1.8.16) in docker containers (using dirtyDB, not mysql). Both versions seem to work. If anything weird shows up, we'll have to catch it in production, I fear.

Looking at the old ticket when we upgraded to buster, I don't want to repeat the mistake and run Etherpad on 2 servers at a time because it stores things in memory etc.

Upstream has made some changes to the code and we can reevaluate, maybe running 2 servers is ok now. But this should happen in a controlled test environment, definitely not in production. So, good call.

So.. for right now.. we have a VM with bullseye, the "insetup" role and more patches waiting in Gerrit for adding the role, switching discovery DNS name (which this is meanwhile) and maybe more to be able to add the role but keep the service masked.

I was thinking we schedule a maintenance window, shut down etherpad1002 during it, merge the role and discovery changes, run puppet, and see if we need to fix anything or revert to etherpad1002. This is a best-effort service with no uptime guarantees or SLOs. It's acceptable to schedule a, say, 2-hour maintenance window (even longer is OK too) with full downtime.

Also, looking through the puppet repo I see we still have a bunch of "etherpad-lite" related entries in modules/profile/templates/mariadb/grants/dumps-eqiad-m1.sql.erb and modules/profile/templates/mariadb/grants/dumps* but all the IPs in there are dbprov* or dbproxy* hosts, none are the actual Etherpad server.

Yup, those are unrelated to this.

Then looked at the TLS cert and we have:

alt_names: ["etherpad.discovery.wmnet","etherpad.svc.eqiad.wmnet","etherpad.svc.codfw.wmnet","etherpad.wikimedia.org","etherpad1001.eqiad.wmnet","etherpad2001.codfw.wmnet", "etherpad1002.eqiad.wmnet", "etherpad-new.wikimedia.org"]

so etherpad1003 isn't on there but also I wonder if it would just work with the etherpad.discovery.wmnet name which we use in the ATS config nowadays. The other names are just there from the past or "just in case". Either way I can clean that up.

I think we don't need etherpad1003 in that cert now that we have the .svc.$::site.wmnet and the .discovery.wmnet domains in the cert. It's probably a historical thing that etherpad1002 is there.

This is scheduled for Thursday, Feb 10 2022, 9 to 10.30 UTC. Added to SRE calendar "vendor maintenance", mailed ops-l and reached out to movement comms via their Slack channel to figure out if it should be in tech news (after I received a comment that I should consider that since last time Etherpad went down we did get several questions about it).

Change 761273 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/debs/prometheus-etherpad-exporter@master] Update for bullseye

https://gerrit.wikimedia.org/r/761273

Change 761273 merged by Alexandros Kosiaris:

[operations/debs/prometheus-etherpad-exporter@master] Update for bullseye

https://gerrit.wikimedia.org/r/761273

Mentioned in SAL (#wikimedia-operations) [2022-02-09T10:03:58Z] <akosiaris> T300568 upload prometheus-etherpad-exporter_0.4_amd64 to apt.wikimedia.org bullseye-wikimedia/main

Mentioned in SAL (#wikimedia-operations) [2022-02-09T10:45:27Z] <akosiaris> T300568 upload prometheus-etherpad-exporter_0.5_amd64 to apt.wikimedia.org bullseye-wikimedia/main

Mentioned in SAL (#wikimedia-operations) [2022-02-09T23:48:30Z] <mutante> apt1001 - delete etherpad-lite for bullseye source package, built, uploaded and imported 1.8.16-2 in bullseye-wikimedia, now source and binary packages in APT, simulated install on etherpad1003 works T300568

Change 758562 merged by Dzahn:

[operations/puppet@production] site: add etherpad role to etherpad1003

https://gerrit.wikimedia.org/r/758562

Change 761669 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: move etherpad1002 back to insetup role

https://gerrit.wikimedia.org/r/761669

Change 758561 merged by Dzahn:

[operations/dns@master] switch etherpad.discovery.wmnet to etherpad1003

https://gerrit.wikimedia.org/r/758561

Change 761672 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] etherpad: fix process monitoring after version upgrade, node->nodejs

https://gerrit.wikimedia.org/r/761672

Change 761672 merged by Dzahn:

[operations/puppet@production] etherpad: fix process monitoring after version upgrade, node->nodejs

https://gerrit.wikimedia.org/r/761672

Change 761662 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: remove etherpad1002

https://gerrit.wikimedia.org/r/761662

Change 761661 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] DHCP: remove etherpad1002

https://gerrit.wikimedia.org/r/761661

Change 761680 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ssl: update cert for etherpad.discovery.wmnet

https://gerrit.wikimedia.org/r/761680

Change 761680 merged by Dzahn:

[operations/puppet@production] ssl: update cert for etherpad.discovery.wmnet

https://gerrit.wikimedia.org/r/761680

Change 761727 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] etherpad: allow setting listening IP in Hiera, use IPv6 on etherpad1003

https://gerrit.wikimedia.org/r/761727

Change 761727 merged by Dzahn:

[operations/puppet@production] etherpad: allow setting listening IP in Hiera, use IPv6 on etherpad1003

https://gerrit.wikimedia.org/r/761727

Mentioned in SAL (#wikimedia-operations) [2022-02-10T22:39:39Z] <mutante> etherpad - succesfully switched to etherpad1003 (bullseye) and etherpad 1.8.16 - on second attempt after making it listen on IPv6 to work behind envoy (T300568) - https://gerrit.wikimedia.org/r/c/operations/puppet/+/761727/

Change 761737 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] etherpad: make listening on IPv6 the default now

https://gerrit.wikimedia.org/r/761737

Change 761669 merged by Dzahn:

[operations/puppet@production] site: move etherpad1002 back to insetup role

https://gerrit.wikimedia.org/r/761669

Done. This is in use now in production and etherpad1002 does not have the etherpad role anymore.

Change 761737 merged by Dzahn:

[operations/puppet@production] etherpad: make listening on IPv6 the default now

https://gerrit.wikimedia.org/r/761737

What went wrong here at first:

When we switched from etherpad1002 to etherpad1003, Etherpad itself worked (curl http://etherpad.discovery.wmnet:9001 returned content), but requests through envoy failed (curl https://etherpad.discovery.wmnet:7443 returned a 503).

The reason for that was that envoy, having etherpad1003.eqiad.wmnet in its config file and etherpad1003 having an AAAA record in DNS, would try to connect via IPv6.

But Etherpad did not listen on IPv6 AND, unexpectedly, envoy also did not transparently fall back to IPv4 for some reason.

This did not happen on etherpad1002 because that server did not have an AAAA record.
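To illustrate the failure mode described above, here is a minimal Python sketch comparing a client that only ever tries the first resolved address with one that falls back to the next address. It is not envoy's actual resolution logic, which was not investigated in this task. The function names and the try_connect callback are hypothetical; the two addresses are the ones allocated by the makevm cookbook above.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

# (address family, IP string); a simplified stand-in for getaddrinfo() results.
Address = Tuple[int, str]


def connect_first_only(
    addresses: Iterable[Address],
    try_connect: Callable[[Address], bool],
) -> Optional[Address]:
    """Try only the first address; give up if it does not answer."""
    first = next(iter(addresses), None)
    if first is not None and try_connect(first):
        return first
    return None


def connect_with_fallback(
    addresses: Iterable[Address],
    try_connect: Callable[[Address], bool],
) -> Optional[Address]:
    """Try each address in order and return the first one that answers."""
    for address in addresses:
        if try_connect(address):
            return address
    return None


if __name__ == "__main__":
    # Assumed ordering: the AAAA result comes first, then the A result.
    etherpad1003 = [
        (socket.AF_INET6, "2620:0:861:103:10:64:32:181"),
        (socket.AF_INET, "10.64.32.181"),
    ]

    # Hypothetical backend that listens on IPv4 only, as Etherpad did at first.
    def ipv4_only_backend(address: Address) -> bool:
        return address[0] == socket.AF_INET

    print(connect_first_only(etherpad1003, ipv4_only_backend))
    print(connect_with_fallback(etherpad1003, ipv4_only_backend))
```

With an IPv4-only backend, the first function finds nothing to connect to while the second reaches the A record address, which matches the difference between what we saw and what we expected.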

So the options were:

  • remove AAAA record from etherpad1003, should have fixed it but did not want to do that because right now there are once again efforts to add v6 to everything, as always
  • investigate further why envoy does not try IPv4 if IPv6 does not work and there is also an A record for the same name that is used in the config (still interesting, probably)
  • instead of using the hostname etherpad1003.eqiad.wmnet in envoy config, change envoy puppetization / templates to allow us to use something like 127.0.0.1 (would potentially touch other services using envoy and require templating changes)
  • make Etherpad actually listen on IPv6

I picked the last option and found out if you set the "ip" parameter to empty value or "::" then Etherpad will actually listen on IPv6, but ONLY on IPv6, not on both IPv4 and IPv6, despite what some docs say. These docs also say rightfully the behaviour depends on the node version, which also changed as part of this upgrade.

Then, seeing that the Etherpad module class already had a parameter for this ip value, I just had to pass it through to the profile and a Hiera lookup. That way etherpad1003 got "::" while etherpad1002 got "0.0.0.0" as before.

And this fixed it. Now the connection between envoy and etherpad is "IPv6 only" but it just works and since all this is behind caching servers it does not matter for end users.
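For checking which address families a local service actually answers on, a small probe like the following could be run on the host. This is a sketch, not something used in this task; probe_families is a hypothetical helper, and it only tests the loopback addresses 127.0.0.1 and ::1.

```python
import socket
from typing import Dict


def probe_families(port: int, timeout: float = 1.0) -> Dict[str, bool]:
    """Report whether a TCP connect to the loopback address succeeds per family."""
    targets = (
        ("IPv4", socket.AF_INET, "127.0.0.1"),
        ("IPv6", socket.AF_INET6, "::1"),
    )
    results: Dict[str, bool] = {}
    for label, family, loopback in targets:
        try:
            with socket.socket(family, socket.SOCK_STREAM) as sock:
                sock.settimeout(timeout)
                sock.connect((loopback, port))
            results[label] = True
        except OSError:
            # Refused, timed out, or the family is unavailable on this host.
            results[label] = False
    return results


if __name__ == "__main__":
    # 9001 is the Etherpad port used in the curl check above.
    print(probe_families(9001))
```

If Etherpad really listens on IPv6 only, as described above, one would expect the IPv4 entry to be False and the IPv6 entry to be True on etherpad1003.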

Dzahn renamed this task from create bullseye VM for Etherpad upgrade (and upgrade it:) to create bullseye VM for Etherpad upgrade (and upgrade it to 1.8.16). Feb 10 2022, 11:37 PM
Dzahn closed this task as Resolved.
Dzahn claimed this task.

I picked the last option and found out if you set the "ip" parameter to empty value or "::" then Etherpad will actually listen on IPv6, but ONLY on IPv6, not on both IPv4 and IPv6, despite what some docs say. These docs also say rightfully the behaviour depends on the node version, which also changed as part of this upgrade.

Then, seeing that the Etherpad module class already had a parameter for this ip value, I just had to pass it through to the profile and a Hiera lookup. That way etherpad1003 got "::" while etherpad1002 got "0.0.0.0" as before.

And this fixed it. Now the connection between envoy and etherpad is "IPv6 only" but it just works and since all this is behind caching servers it does not matter for end users.

Because etherpad.discovery.wmnet is a CNAME to etherpad1003.eqiad.wmnet, it resolves both A and AAAA records and as such I think that we should have the service listen on both IPv4 and IPv6 to prevent any future issue. My 2 cents.
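As a generic illustration of what "listen on both IPv4 and IPv6" means at the socket level, here is a minimal Python sketch of a dual-stack listener. Etherpad runs on Node.js, so this is not its configuration and says nothing about how its "ip" setting behaves; open_listener is a hypothetical helper, and the fallback to IPv4 is an assumption made so the sketch also runs on hosts without IPv6.

```python
import socket


def open_listener(port: int = 0) -> socket.socket:
    """Return a listening TCP socket, dual-stack where the platform allows it."""
    if socket.has_dualstack_ipv6():
        try:
            # One AF_INET6 socket with IPV6_V6ONLY disabled accepts both
            # IPv6 clients and IPv4 clients (as IPv4-mapped addresses).
            return socket.create_server(
                ("::", port), family=socket.AF_INET6, dualstack_ipv6=True
            )
        except OSError:
            # IPv6 is compiled in but not usable on this host.
            pass
    # Fallback: plain IPv4 listener on all interfaces.
    return socket.create_server(("0.0.0.0", port))


if __name__ == "__main__":
    server = open_listener()
    port = server.getsockname()[1]
    # An IPv4 client can connect in either mode.
    with socket.create_connection(("127.0.0.1", port), timeout=1.0):
        print("IPv4 connect OK on port", port)
    server.close()
```

Port 0 asks the kernel for a free ephemeral port, which keeps the example from colliding with a running service.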

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: etherpad1002.eqiad.wmnet

  • etherpad1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

Change 761662 merged by Dzahn:

[operations/puppet@production] site: remove etherpad1002

https://gerrit.wikimedia.org/r/761662

@Volans Yes, it has been fixed by making etherpad listen on "::" with https://gerrit.wikimedia.org/r/c/operations/puppet/+/761727

Ack, thanks. Sorry, I misread the previous comment.