
Mailman Downtime: Migrate mailman from lists1001 to lists1004
Open, High, Public

Description

Mailman will be down for approximately 2 hours on Tuesday, 18 June, from 10:00 UTC to 12:00 UTC to facilitate migration to a new host. Mailing list delivery and the web archives will be stopped for most of the window, and search availability may be intermittent for several hours. Mail sent during the window will be delayed and delivered later; it should not be lost.

The rough outline for migration is:

1: stop mail arriving inbound, wait for queues to clear out
2: migrate data, VIPs and service from old host to new host
3: run the required upgrade steps
4: test web UI on new host
5: allow mail to arrive inbound

More detailed step-by-step plan for migrating from the old host to the new host (lists1001 -> lists1004):

Prep:

  • Downtime all mailman hosts (cumin: sudo cookbook sre.hosts.downtime --hours 2 -r 'Mailman migration' -t T367521 'O:lists')
  • Merge puppet change to block incoming mail on lists1001 and lists1004 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1043799)
  • Stop puppet on all hosts (cumin: sudo cumin 'A:lists' 'sudo disable-puppet "Mailman migration"')
  • Ensure the queue is empty on lists1001; this should print 0 (lists1001: sudo find /var/lib/mailman3/queue/{in,out} -type f | wc -l)
  • Stop mailman on lists1001 (lists1001: sudo systemctl stop mailman3; sudo systemctl stop mailman3-web)
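
The queue check can also be polled rather than run by hand. The following is a minimal sketch, not part of the agreed plan: queue_count and wait_for_empty_queue are hypothetical helper names, and the retry count and delay are arbitrary defaults. It assumes the caller can read the queue directories (the plan uses sudo for this).

```shell
#!/bin/sh
# Sketch only. QUEUE_ROOT defaults to the queue path used in the plan.
QUEUE_ROOT="${QUEUE_ROOT:-/var/lib/mailman3/queue}"

# queue_count ROOT: print the number of regular files under ROOT/in and ROOT/out.
# (Hypothetical helper, not a Mailman command.)
queue_count() {
    # -type f skips the directories themselves, so an empty queue prints 0.
    find "$1/in" "$1/out" -type f | wc -l | tr -d ' '
}

# wait_for_empty_queue ROOT [TRIES] [DELAY_SECONDS]
# Polls until both queues are empty; returns 1 if they never drain.
wait_for_empty_queue() {
    root="$1"
    tries="${2:-30}"
    delay="${3:-10}"
    while [ "$tries" -gt 0 ]; do
        n=$(queue_count "$root")
        if [ "$n" -eq 0 ]; then
            echo "queue empty"
            return 0
        fi
        echo "$n message file(s) still queued"
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}
```

On lists1001 this would be called as wait_for_empty_queue "$QUEUE_ROOT" from a root shell, stopping mailman only once it returns successfully.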

(Now is a good time to check the management console on lists1001 and lists1004, maybe open a shell on both)

Migrate:

  • Ensure data is synced from lists1001 to lists1004/lists2001 (/usr/bin/rsync -avp --delete rsync://lists1001.wikimedia.org/var-lib-mailman3-sync /srv/mailman3)
  • Merge CR migrating VIPs from lists1001, and switching primary host to lists1004 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1036610)
  • Run puppet agent on lists1001, ensure VIPs are removed and exim4 config does not contain the lists VIPs for routing mail (lists1001: sudo grep 208.80.154.21 /etc/exim4/exim4.conf)
  • Merge CR to move the mariadb ferm rules from lists1001 to a reference to the primary host variable (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1046785)
  • Run puppet agent on lists1004, ensure VIPs are added and exim4 config contains the lists VIPs (lists1004: sudo grep 208.80.154.21 /etc/exim4/exim4.conf)
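
The two grep checks above have opposite expectations (absent on lists1001, present on lists1004), which is easy to misread during the window. A small sketch that makes the expectation explicit, assuming GNU grep; vip_in_config and expect_vip are hypothetical helper names, not existing tooling:

```shell
#!/bin/sh
# Sketch only: assert that an address is present in, or absent from, a config file.

# vip_in_config IP FILE: succeed if IP appears in FILE.
vip_in_config() {
    # -F treats the address as a fixed string (dots are not wildcards);
    # -w avoids matching a longer address such as 208.80.154.210.
    grep -qwF -- "$1" "$2"
}

# expect_vip STATE IP FILE, where STATE is "present" or "absent".
expect_vip() {
    state="$1"
    ip="$2"
    file="$3"
    # An unreadable file must not be mistaken for "absent".
    if [ ! -r "$file" ]; then
        echo "FAIL: cannot read $file" >&2
        return 2
    fi
    if vip_in_config "$ip" "$file"; then
        found=present
    else
        found=absent
    fi
    if [ "$found" = "$state" ]; then
        echo "OK: $ip is $found in $file"
    else
        echo "FAIL: $ip is $found in $file, expected $state" >&2
        return 1
    fi
}
```

Run as root, lists1001 would use expect_vip absent 208.80.154.21 /etc/exim4/exim4.conf and lists1004 would use expect_vip present 208.80.154.21 /etc/exim4/exim4.conf.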

Post-upgrade:

  • Run the following post-upgrade steps on the new host, lists1004:
    • mailman-web migrate
    • mailman-web compress
    • mailman-web collectstatic
    • mailman-web compilemessages
    • mailman-web rebuild_index (may not be needed, test if archive search works before running this)
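
The post-upgrade steps can be wrapped so that a failure stops the sequence instead of being scrolled past. This is a sketch under assumptions: run_post_upgrade, MAILMAN_WEB and DRY_RUN are hypothetical names introduced here, and the user the commands should run as is not stated in the plan, so that is left to the operator.

```shell
#!/bin/sh
# Sketch only. MAILMAN_WEB is the command to invoke; it defaults to the
# name used in the plan. Set DRY_RUN to any non-empty value to print the
# steps without running them.
MAILMAN_WEB="${MAILMAN_WEB:-mailman-web}"

run_post_upgrade() {
    # Same order as the list in the plan.
    for step in migrate compress collectstatic compilemessages; do
        echo "==> $MAILMAN_WEB $step"
        if [ -n "${DRY_RUN:-}" ]; then
            continue
        fi
        "$MAILMAN_WEB" "$step" || {
            echo "step '$step' failed; stopping" >&2
            return 1
        }
    done
    # rebuild_index is deliberately left out: the plan says to test
    # archive search first and run it by hand only if needed.
}
```

Running it once with DRY_RUN=1 shows the exact commands before anything is changed on lists1004.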

Restore:

  • Start mailman-web on lists1004 and verify (lists1004: sudo systemctl start mailman-web)
  • Test mail delivery locally (lists1004: echo "Mailman delivery test" | /usr/bin/mail -r "<FROM ADDRESS>" -s "Mailman delivery test post-migration" -a "Auto-Submitted: auto-generated" "<DESTINATION LIST>")
  • Merge puppet change to unblock incoming mail on lists1004 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1046786)
  • Re-enable puppet on all hosts (cumin: sudo cumin 'A:lists' 'sudo puppet agent --enable')

Rolling back:

We can undo this at any point before mail is allowed to arrive on the new host by reverting the puppet change that migrates the VIPs and service. After that point, some mail may have been accepted by exim but not yet delivered; we will deal with any such messages as they come up.

Event Timeline

Wrote this in tech news:

Mailing lists will be unavailable for roughly two hours on Tuesday from 10:00 UTC to 12:00 UTC. This is to facilitate migration to a new server and upgrade its software. [9]

@Quiddity please edit mercilessly :)

eoghan triaged this task as High priority. Fri, Jun 14, 3:21 PM
eoghan moved this task from Incoming to Work in Progress on the collaboration-services board.

Icinga downtime and Alertmanager silence (ID=f70cad25-fba3-40c1-a3c3-abe8534eca40) set by eoghan@cumin1002 for 2:00:00 on 3 host(s) and their services with reason: Mailman migration

lists[1001,1004,2001].wikimedia.org

Change #1047049 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] lists: Remove service IPs from lists1004

https://gerrit.wikimedia.org/r/1047049

Change #1047049 merged by EoghanGaffney:

[operations/puppet@production] lists: Remove service IPs from lists1004

https://gerrit.wikimedia.org/r/1047049

Change #1047054 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/dns@master] lists: Update DNS records to use host IP for lists1004

https://gerrit.wikimedia.org/r/1047054

Change #1047054 merged by EoghanGaffney:

[operations/dns@master] lists: Update DNS records to use host IP for lists1004

https://gerrit.wikimedia.org/r/1047054

Icinga downtime and Alertmanager silence (ID=33783771-f385-4d8a-9005-972d47cc403c) set by eoghan@cumin1002 for 1:00:00 on 3 host(s) and their services with reason: Mailman migration

lists[1001,1004,2001].wikimedia.org
eoghan updated the task description.