
Upgrade etherpad.wikimedia.org to v1.9.7
Closed, Resolved, Public

Description

Upgrade etherpad.wikimedia.org to the most recent version (v1.9.7).

Preparation work:

  • build new etherpad-lite Debian package for 1.9.7
  • build new prometheus-etherpad-exporter package
  • prepare new etherpad VM (bookworm, etherpad1004)
  • test etherpad-lite 1.9.7 on devtools
    • installation works (puppet run and installation for 1.9.7 + exporter)
    • mysql and proxy are missing
  • apply role(etherpad) to etherpad1004 and set profile::etherpad::service_ensure: stopped for etherpad1004
  • run puppet on etherpad1004 and verify successful installation
  • make Grafana dashboard compatible with multiple etherpad instances
  • announce maintenance windows some days in advance

Maintenance (switch from etherpad1003 to etherpad1004):

After maintenance:

  • stop etherpad1003 and decommission after grace period
  • apply etherpad role to replica in codfw

Details

Repo                                          Branch      Lines +/-
operations/puppet                             production  +8 -0
operations/puppet                             production  +1 -6
operations/puppet                             production  +2 -2
operations/puppet                             production  +4 -3
labs/private                                  master      +3 -3
operations/puppet                             production  +1 -5
operations/puppet                             production  +1 -1
operations/dns                                master      +1 -1
operations/puppet                             production  +0 -1
operations/puppet                             production  +1 -0
operations/puppet                             production  +8 -6
operations/puppet                             production  +3 -1
operations/puppet                             production  +49 -21
operations/puppet                             production  +1 -1
operations/puppet                             production  +2 -0
operations/puppet                             production  +10 -3
operations/debs/prometheus-etherpad-exporter  master      +6 -0
operations/debs/etherpad-lite                 master      +2 -3
operations/debs/etherpad-lite                 master      +10 -3
operations/debs/etherpad-lite                 master      +6 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

I think we might have a problem with the newer nodejs version and Debian Bullseye. The release notes of etherpad-lite 1.9.5 state:

This version deprecates NodeJS16 as it reached its end of life and won't receive any updates. So to get started with Etherpad v1.9.5 you need NodeJS 18 and above.

So we need nodejs (>= 18) in our Debian control file. I used the following file:

Source: etherpad-lite
Section: net
Priority: extra
Maintainer: Alexandros Kosiaris <akosiaris@wikimedia.org>
Build-Depends: debhelper (>= 11), nodejs (>= 18), npm (>= 9), libpq-dev
Standards-Version: 1.0

Package: etherpad-lite
Depends: nodejs (>= 18)
Architecture: all
Description: A web-based word processor that allows people to work
 together in real-time.
 .
 When multiple people edit the same document simultaneously, any changes are
 instantly reflected on everyone's screen. The result is a new and productive
 way to collaborate on text documents, useful for meeting notes, drafting
 sessions, education, team programming, and more.

But the build fails with:

The following packages have unmet dependencies:
 pbuilder-satisfydepends-dummy : Depends: nodejs (>= 18) but 12.22.12~dfsg-1~deb11u4 is to be installed

As far as I can tell Bullseye only has nodejs version 12.22. In our apt repo we also have 14.20 and 16.17 for bullseye. So I'm not fully sure how we get nodejs 18 to the bullseye hosts. Can we just backport nodejs 18 to bullseye-wikimedia? Or do we have to upgrade etherpad to bookworm?

I tried the build with --git-dist=bookworm and it worked fine; nodejs 18 is available for bookworm.
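To make the dependency failure above concrete, here is a minimal sketch that checks which of the nodejs versions mentioned in this thread would satisfy nodejs (>= 18). It only compares the leading upstream number, which is a simplification of Debian's real version ordering, and the helper names are hypothetical. The bookworm value is the 18.19 reported later in this task; the 14.20 and 16.17 entries are the abbreviated versions quoted above, not full package versions.

```python
import re


def upstream_major(debian_version: str) -> int:
    """Return the leading integer of a Debian version string.

    Simplification: an optional "epoch:" prefix is dropped and only the
    first numeric component is read. Real dpkg ordering is more involved.
    """
    without_epoch = debian_version.split(":", 1)[-1]
    match = re.match(r"\d+", without_epoch)
    if match is None:
        raise ValueError(f"no leading number in {debian_version!r}")
    return int(match.group())


def satisfies_minimum_major(candidate: str, minimum_major: int) -> bool:
    """True if the candidate's major version is at least minimum_major."""
    return upstream_major(candidate) >= minimum_major


# Versions as quoted in this task (labels are illustrative only).
candidates = {
    "bullseye (Debian)": "12.22.12~dfsg-1~deb11u4",
    "bullseye (our apt repo)": "14.20",
    "bullseye (our apt repo, newer)": "16.17",
    "bookworm": "18.19",
}

for label, version in candidates.items():
    verdict = "ok" if satisfies_minimum_major(version, 18) else "too old"
    print(f"{label}: {version} -> {verdict}")
```

Only the bookworm entry passes, which matches the build result with --git-dist=bookworm.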

As far as I can tell Bullseye only has nodejs version 12.22. In our apt repo we also have 14.20 and 16.17 for bullseye. So I'm not fully sure how we get nodejs 18 to the bullseye hosts. Can we just backport nodejs 18 to bullseye-wikimedia? Or do we have to upgrade etherpad to bookworm?

I think it's simplest to just move to bookworm and use the in-distro version. The same was done for testreduce, which also needed a more recent nodejs/npm stack.

I think we might have a problem with the newer nodejs version and Debian Bullseye. The release notes of etherpad-lite 1.9.5 state:

This version deprecates NodeJS16 as it reached its end of life and won't receive any updates. So to get started with Etherpad v1.9.5 you need NodeJS 18 and above.

Oh no! I missed that one.

As far as I can tell Bullseye only has nodejs version 12.22. In our apt repo we also have 14.20 and 16.17 for bullseye. So I'm not fully sure how we get nodejs 18 to the bullseye hosts. Can we just backport nodejs 18 to bullseye-wikimedia? Or do we have to upgrade etherpad to bookworm?
I tried the build with --git-dist=bookworm and it worked fine; nodejs 18 is available for bookworm.

Probably not worth it if bookworm works fine. It will probably be faster and easier to create a new VM with bookworm and apply the role with the package uploaded to apt.wikimedia.org for bookworm.

I created new VM etherpad1004 with bookworm. It currently has the "insetup" role applied and can be used. (T357159)

Change 999973 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add etherpad role to etherpad1004

https://gerrit.wikimedia.org/r/999973

Created etherpad-bookworm.devtools in wmcs, applied prod role there. Besides the obvious, missing etherpad-lite package, I noticed:

E: Unable to locate package prometheus-etherpad-exporter

The rest seems to just work.

Change 1002468 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/debs/etherpad-lite@master] bump changelog to 1.9.7

https://gerrit.wikimedia.org/r/1002468

Change 1002469 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/debs/etherpad-lite@master] bump nodejs and npm version

https://gerrit.wikimedia.org/r/1002469

Change 1002468 merged by Jelto:

[operations/debs/etherpad-lite@master] bump changelog to 1.9.7

https://gerrit.wikimedia.org/r/1002468

Change 1002469 merged by Jelto:

[operations/debs/etherpad-lite@master] bump nodejs and npm version

https://gerrit.wikimedia.org/r/1002469

Change 1002566 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/debs/etherpad-lite@master] fix typo, set debhelper-compat in control file

https://gerrit.wikimedia.org/r/1002566

Change 1002566 merged by Jelto:

[operations/debs/etherpad-lite@master] fix typo, set debhelper-compat in control file

https://gerrit.wikimedia.org/r/1002566

Mentioned in SAL (#wikimedia-operations) [2024-02-13T14:11:26Z] <jelto> import etherpad-lite 1.9.7-2 on apt host into bookworm-wikimedia - T316421

etherpad-lite 1.9.7 is now also available for bookworm. Thanks again to @MoritzMuehlenhoff and @akosiaris for troubleshooting issues with the build!

Next step is to rebuild prometheus-etherpad-exporter, which is also missing for bookworm as @Dzahn pointed out.

Change 1003007 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/debs/prometheus-etherpad-exporter@master] Release 0.7 prometheus-etherpad-exporter

https://gerrit.wikimedia.org/r/1003007

test instance etherpad-bookworm.devtools now has etherpad-lite 1.9.7-2 installed by puppet

Change 1003073 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: apply etherpad role on both eqiad and codfw

https://gerrit.wikimedia.org/r/1003073

Change 1003075 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: remove etherpad on bullseye machine

https://gerrit.wikimedia.org/r/1003075

Change 1003007 merged by Jelto:

[operations/debs/prometheus-etherpad-exporter@master] Release 0.7 prometheus-etherpad-exporter

https://gerrit.wikimedia.org/r/1003007

Mentioned in SAL (#wikimedia-operations) [2024-02-14T10:48:18Z] <jelto> import prometheus-etherpad-exporter 0.7 to bookworm-wikimedia on apt hosts - T316421

test instance etherpad-bookworm.devtools now has etherpad-lite 1.9.7-2 installed by puppet

and prometheus-etherpad-exporter 0.7 as well. The etherpad-lite package also installed nodejs 18.19 automatically.
We lack a proper mysql instance in our devtools project to do any more testing ([ERROR] ueberDB - Fatal MySQL error: Error: connect ECONNREFUSED 127.0.0.1:3306).

Next I'll create a small checklist for the switch from etherpad1003 (bullseye, 1.8.16) to etherpad1004 (bookworm, 1.9.7). We probably don't want two etherpad services running in parallel. So we should make sure to stop the etherpad process on the old machine before starting it on the new one. There is no puppet flag to enable or disable the process. So we can just disable puppet and stop the process on the old machine and then turn on etherpad on the new bookworm machine.

We probably don't want two etherpad services running in parallel.

We most definitely don't. Ever. etherpad caches things very heavily in memory, syncing semi-frequently to the database. Running multiple instances in parallel will lead to very weird effects.

It's pretty easy to see it happening in action using something like the following:

version: "3.7"
services:
  # single shared database backend for all three etherpad instances
  mariadb:
    image: mariadb:latest
    restart: always
    environment:
      MARIADB_USER: example-user
      MARIADB_PASSWORD: my_cool_secret
      MARIADB_ROOT_PASSWORD: my-secret-pw
      MARIADB_DATABASE: etherpad
  # first etherpad instance, exposed on host port 9001
  alpha:
    image: etherpad/etherpad:1.8.16
    restart: always
    environment:
      DB_TYPE: mysql
      DB_HOST: mariadb
      DB_PORT: 3306
      DB_NAME: etherpad
      DB_USER: example-user
      DB_PASS: my_cool_secret
    ports:
      - 9001:9001/tcp
  # second etherpad instance, exposed on host port 9002
  beta:
    image: etherpad/etherpad:1.8.16
    restart: always
    environment:
      DB_TYPE: mysql
      DB_HOST: mariadb
      DB_PORT: 3306
      DB_NAME: etherpad
      DB_USER: example-user
      DB_PASS: my_cool_secret
    ports:
      - 9002:9001/tcp
  # third etherpad instance, exposed on host port 9003
  gamma:
    image: etherpad/etherpad:1.8.16
    restart: always
    environment:
      DB_TYPE: mysql
      DB_HOST: mariadb
      DB_PORT: 3306
      DB_NAME: etherpad
      DB_USER: example-user
      DB_PASS: my_cool_secret
    ports:
      - 9003:9001/tcp

This sets up a cluster of 3 instances using the same backend. Open the 3 exposed instances on the 3 ports (9001, 9002, 9003) in separate tabs and watch how things are not in sync between the tabs when editing the same pad. Simulating a random failure in one of the 3 instances also makes it possible to witness interesting data loss scenarios.
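For readers who don't want to spin up containers, the same effect can be sketched with a toy model. This is not how etherpad stores pads internally (it is a deliberately simplified stand-in with hypothetical class names): each "instance" reads a pad once, edits its in-memory copy, and writes the whole copy back on flush.

```python
class SharedDatabase:
    """Stand-in for the single MySQL backend shared by all instances."""

    def __init__(self):
        self.pads = {}


class CachingInstance:
    """Toy model of one process: read once, edit in memory, flush later."""

    def __init__(self, name, db):
        self.name = name
        self.db = db
        self.cache = {}

    def load(self, pad_id):
        # Only hit the database the first time a pad is touched.
        if pad_id not in self.cache:
            self.cache[pad_id] = self.db.pads.get(pad_id, "")
        return self.cache[pad_id]

    def append(self, pad_id, text):
        # Edits only change the in-memory copy until flush() is called.
        self.cache[pad_id] = self.load(pad_id) + text

    def flush(self):
        # Write every cached pad back, overwriting whatever is stored.
        for pad_id, text in self.cache.items():
            self.db.pads[pad_id] = text


db = SharedDatabase()
alpha = CachingInstance("alpha", db)
beta = CachingInstance("beta", db)

alpha.append("notes", "edit via alpha\n")
beta.append("notes", "edit via beta\n")  # beta never sees alpha's edit

alpha.flush()
beta.flush()  # overwrites what alpha just wrote

print(repr(db.pads["notes"]))
```

After both flushes the database only contains the edit made through beta; the edit made through alpha is silently gone.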

and prometheus-etherpad-exporter 0.7 as well. The etherpad-lite package also installed nodejs 18.19 automatically.

:)

We lack a proper mysql instance in our devtools project to do any more testing ([ERROR] ueberDB - Fatal MySQL error: Error: connect ECONNREFUSED 127.0.0.1:3306).

For Phabricator we solved this by installing mariadb on localhost, even via puppet. profile::mariadb::generic_server is a simple profile that differs from production mariadb but does the job for testing.

    # in cloud, use a local db server
    if $::realm == 'labs' {
        class { 'profile::mariadb::generic_server':
            datadir => $database_datadir,
        }
    }

...

Hiera config in Horizon:

phabricator::mysql::slave: localhost

Change 1003493 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] etherpad: add $service_ensure parameter

https://gerrit.wikimedia.org/r/1003493

Change 1003493 merged by Jelto:

[operations/puppet@production] etherpad: add $service_ensure parameter

https://gerrit.wikimedia.org/r/1003493

Change 1003769 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] etherpad: install mariadb server in wmcs

https://gerrit.wikimedia.org/r/1003769

Change 1004049 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] etherpad: set profile::etherpad::service_ensure in wmcs

https://gerrit.wikimedia.org/r/1004049

Change 1003769 merged by Jelto:

[operations/puppet@production] etherpad: install mariadb server in wmcs

https://gerrit.wikimedia.org/r/1003769

Change 1004049 merged by Jelto:

[operations/puppet@production] etherpad: set profile::etherpad::service_ensure in wmcs

https://gerrit.wikimedia.org/r/1004049

A test instance running etherpad-lite 1.9.7 is available on https://etherpad.wmcloud.org/ now. Creating a new pad works and the wmf welcome text is shown.

I had some trouble with the local mysql setup. While adding profile::mariadb::generic_server to the etherpad test instance in WMCS I encountered a bug. The innodb_buffer_pool_size is calculated using the following line in the default config:

innodb_buffer_pool_size = <%= (Float(@memorysize.split[0]) * 0.75).round %>G

However, the etherpad test instance only had 1GB of memory. The formula takes the number from the memory size fact without its unit and always appends G, so it calculates a required buffer pool size of 750G (instead of 750M). This of course fails and interrupts the correct installation and configuration of the instance.

As a workaround I increased the RAM to 4GB, which results in 3G of buffer pool size. I also reconfigured the mariadb installation using sudo dpkg-reconfigure mariadb-server. I think there is no real benefit in fixing this bug, as the default template is marked as "do not use it" and no mysql instance uses 1GB of memory. I might add a comment to the config template, though, to spare anyone else unnecessary troubleshooting.
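The arithmetic behind the bug can be reproduced in a few lines. This is a sketch only: the first function mirrors the template line in Python, and the fact strings "1000.00 MiB" and "4.00 GiB" are assumed example values for the memory size fact, not values copied from the hosts. The unit-aware variant is hypothetical and not something that exists in the puppet repo.

```python
def buffer_pool_from_template(memorysize: str) -> str:
    """Mirror the template: take the number, ignore the unit, append 'G'.

    Python's round() handles exact .5 ties differently from Ruby's, so the
    examples below avoid tie values.
    """
    return f"{round(float(memorysize.split()[0]) * 0.75)}G"


# Assumed unit spellings of the memory size fact.
UNIT_TO_MIB = {"MiB": 1, "GiB": 1024}


def buffer_pool_unit_aware(memorysize: str) -> str:
    """Hypothetical variant: convert to MiB first, then take 75 percent."""
    value, unit = memorysize.split()
    mib = float(value) * UNIT_TO_MIB[unit]
    return f"{round(mib * 0.75)}M"


for fact in ("1000.00 MiB", "4.00 GiB"):
    print(fact, buffer_pool_from_template(fact), buffer_pool_unit_aware(fact))
```

With the small instance the template-style calculation yields 750G, while the unit-aware variant yields 750M; with 4 GiB the template result of 3G is plausible again, which is why the RAM increase worked around the problem.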

So after that, the etherpad test instance has a local mysql instance running. I also had some trouble setting ::passwords::etherpad_lite on the standalone puppetmaster in devtools (puppetmaster-1001.devtools.eqiad1.wikimedia.cloud). I tried different approaches but wasn't successful. So to unblock the testing I stopped puppet on the test instance and updated /etc/etherpad-lite/settings.json with the correct credentials. I'll have a chat with @Dzahn to properly set these credentials in WMCS as well.

The last part is fixing the prometheus-etherpad-exporter. It's failing with:

Traceback (most recent call last):
  File "/usr/bin/./prometheus-etherpad-exporter", line 183, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/bin/./prometheus-etherpad-exporter", line 173, in main
    start_http_server(int(port), addr=address)
  File "/usr/lib/python3/dist-packages/prometheus_client/exposition.py", line 169, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/prometheus_client/exposition.py", line 158, in _get_best_family
    infos = socket.getaddrinfo(address, port)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/socket.py", line 962, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -2] Name or service not known

I'll have a look at the exporter problem and then I'm confident to try a migration to the new bookworm etherpad host.

Jelto updated the task description.

The last part is fixing the prometheus-etherpad-exporter. It's failing with:

socket.gaierror: [Errno -2] Name or service not known

I'll have a look at the exporter problem and then I'm confident to try a migration to the new bookworm etherpad host.

That was an IPv6 error in WMCS. The exporter listens on :9198 by default. On WMCS this fails because there is no IPv6. When --listen is set to an available IPv4 address the exporter works just fine. So this should not be an issue in production.
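A quick way to check a candidate --listen value before handing it to the exporter is sketched below. The helper names are hypothetical and the exporter's actual argument parsing is not shown in this task, so treat the "host:port" splitting as an assumption. The resolution check uses the same socket.getaddrinfo call that appears in the traceback. Bracketed IPv6 literals are not handled.

```python
import socket


def split_listen(listen: str) -> tuple[str, int]:
    """Split a 'host:port' string; an empty host means no explicit address."""
    host, _, port = listen.rpartition(":")
    return host, int(port)


def can_resolve(host: str, port: int) -> bool:
    """True if getaddrinfo can turn host and port into socket addresses."""
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        return False
    return True


# The default from this task has no explicit host part.
print(split_listen(":9198"))

# An explicit IPv4 address, as used for the workaround on WMCS.
host, port = split_listen("127.0.0.1:9198")
print(host, port, can_resolve(host, port))
```

127.0.0.1 is only used here because it resolves without network access; on a real instance the address passed to --listen would be one of the host's own IPv4 addresses.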

I'll prepare and double-check all necessary changes for the checklist above and then we can try migrating to etherpad1004. I'm aiming for tomorrow or Thursday.

Change 999973 merged by Jelto:

[operations/puppet@production] site: add etherpad role to etherpad1004

https://gerrit.wikimedia.org/r/999973

Change 1005458 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] etherpad: make exporter and blackbox checks configurable

https://gerrit.wikimedia.org/r/1005458

Change 1005458 merged by Jelto:

[operations/puppet@production] etherpad: make exporter and blackbox checks configurable

https://gerrit.wikimedia.org/r/1005458

Change 1005946 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] etherpad: disable exporter auto restart when exporter is stopped

https://gerrit.wikimedia.org/r/1005946

Change 1005946 merged by Jelto:

[operations/puppet@production] etherpad: disable exporter auto restart when exporter is stopped

https://gerrit.wikimedia.org/r/1005946

Change 1005961 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] etherpad: stop etherpad service on etherpad1003

https://gerrit.wikimedia.org/r/1005961

Change 1005962 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] etherpad: start etherpad service on etherpad1004

https://gerrit.wikimedia.org/r/1005962

Change 1005963 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/dns@master] wmnet: switch etherpad to etherpad1004

https://gerrit.wikimedia.org/r/1005963

Change 1006004 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] etherpad: disable auto restart when etherpad is stopped

https://gerrit.wikimedia.org/r/1006004

Change 1006004 merged by Jelto:

[operations/puppet@production] etherpad: disable auto restart when etherpad is stopped

https://gerrit.wikimedia.org/r/1006004

Icinga downtime and Alertmanager silence (ID=cdbaf2d7-e922-40e0-b2c8-4f3374cb5b3d) set by jelto@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Upgrade etherpad and switch to bookworm

etherpad1003.eqiad.wmnet

Change 1005961 merged by Jelto:

[operations/puppet@production] etherpad: stop etherpad service on etherpad1003

https://gerrit.wikimedia.org/r/1005961

Change 1005962 merged by Jelto:

[operations/puppet@production] etherpad: start etherpad service on etherpad1004

https://gerrit.wikimedia.org/r/1005962

Change 1005963 merged by Jelto:

[operations/dns@master] wmnet: switch etherpad to etherpad1004

https://gerrit.wikimedia.org/r/1005963

Change 1006480 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] etherpad: fix exporter override and clear ExecStart first

https://gerrit.wikimedia.org/r/1006480

Change 1006480 merged by Jelto:

[operations/puppet@production] etherpad: fix exporter override and clear ExecStart first

https://gerrit.wikimedia.org/r/1006480

Change 1006523 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] prometheus::ops: WIP monitor active etherpad instance only

https://gerrit.wikimedia.org/r/1006523

Change 1003073 merged by Jelto:

[operations/puppet@production] site: apply etherpad role on both eqiad and codfw

https://gerrit.wikimedia.org/r/1003073

Change 1007331 had a related patch set uploaded (by Jelto; author: Jelto):

[labs/private@master] passwords: update etherpad labs

https://gerrit.wikimedia.org/r/1007331

Change 1007331 merged by Jelto:

[labs/private@master] passwords: update etherpad labs

https://gerrit.wikimedia.org/r/1007331

Change 1006523 merged by Jelto:

[operations/puppet@production] prometheus::ops: monitor active etherpad instance only

https://gerrit.wikimedia.org/r/1006523

Since the upgrade I believe that we are affected by https://github.com/ether/etherpad-lite/issues/5401. Wondering if a stale settings.json file got kept with padOptions.userName & userColor set to false instead of null.
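A minimal sketch of the suspected difference, assuming the relevant part of settings.json looks like the fragment below (the real file is larger and may contain comments, which json.loads would not accept). The function name is hypothetical; it only rewrites userName and userColor when they are exactly false and leaves other padOptions alone.

```python
import json


def normalize_pad_options(settings: dict) -> dict:
    """Replace a stale false with null for userName and userColor.

    The settings dict is modified in place and also returned.
    """
    pad_options = settings.get("padOptions", {})
    for key in ("userName", "userColor"):
        if pad_options.get(key) is False:
            pad_options[key] = None
    return settings


# Assumed fragment of a stale settings file.
stale = json.loads(
    '{"padOptions": {"userName": false, "userColor": false, "noColors": false}}'
)
fixed = normalize_pad_options(stale)
print(json.dumps(fixed["padOptions"]))
```

The two affected keys come out as null while unrelated boolean options such as noColors keep their value.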

Change 1007905 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] [etherpad] Set userName and userColor padOptions to null

https://gerrit.wikimedia.org/r/1007905

Change 1007905 merged by EoghanGaffney:

[operations/puppet@production] [etherpad] Set userName and userColor padOptions to null

https://gerrit.wikimedia.org/r/1007905

Icinga downtime and Alertmanager silence (ID=eff489f2-c167-46cc-8ac4-c471b433a777) set by jelto@cumin1002 for 10:00:00 on 1 host(s) and their services with reason: Shutdown and decommission old host

etherpad1003.eqiad.wmnet
Jelto added a subscriber: taavi.

Since the upgrade I believe that we are affected by https://github.com/ether/etherpad-lite/issues/5401. Wondering if a stale settings.json file got kept with padOptions.userName & userColor set to false instead of null.

Thanks for reporting the issue! This should be fixed by https://gerrit.wikimedia.org/r/1007905. Also thanks to @eoghan for deploying the fix.

Etherpad 1.9.7 now runs on the new bookworm host etherpad1004. In addition, etherpad2002 is a passive replica in codfw. The old host is decommissioned. Docs (including those for the build process) were updated, and future version upgrades should be less tricky.

I'll close the task. If there are any more issues please reopen the task.
Thanks again to @akosiaris, @MoritzMuehlenhoff and @taavi for all the help!

Change 1003075 abandoned by Dzahn:

[operations/puppet@production] site: remove etherpad on bullseye machine

Reason:

done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1008444

https://gerrit.wikimedia.org/r/1003075