
Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM
Closed, ResolvedPublic

Description

This is the tracking task for migrating mailman from sodium to a production ganeti VM running the current stable release of mailman 2.x.

The following people are involved in this project: @Dzahn, @JohnLewis, @RobH & @faidon.

| # | task | ticket | done? |
| --- | --- | --- | --- |
| 1 | request new VM for staging/testing | T108065 | yes |
| 2 | install jessie on new VM | T108070 | yes |
| 3 | let JohnLewis sign L2 | T108057 | yes |
| 3 | give JohnLewis shell access on new VM and sudo to execute things as "list" and view log files | T108082 | yes |
| 4 | basic semi-manual mailman 2.1.8 setup on new VM | T108383 | yes |
| 5 | set up rsyncd on fermium (via puppet) to be able to copy files directly without agent forwarding | T109921 | yes |
| 6 | export list configs and archives from sodium, rsync them all over to fermium | T108071 | yes |
| 7 | write script to import lists | T109922 | yes |
| 8 | test importing of list configs and archives on fermium for all lists (public and private) | T108073 | yes |
| 9 | rename lists with invalid names | T109539, T109393 | yes |
| 10 | move hardcoded IP configuration (server and service name) to hiera to be able to run more than 1 mailman instance from puppet role | T109624 | yes |
| 11 | clean up mailman data directory on sodium (over 0.5 million held messages) | T109838, T83967 | yes |
| 12 | write this plan :) | T109467 | yes |
| 13 | go through all directories in /var/lib/mailman and decide whether they need to be imported or can be skipped | T109399 | yes |
| 14 | figure out which new service IP to use, v4 and v6, set it in hiera? | T108080 | yes |
| 15 | add public IP for fermium (DNS change, installserver/DHCP change) | T109923 | yes |
| 16 | reinstall OS (jessie) on fermium | T109924 | yes |
| 17 | apply regular mailman role on fermium | T109925 | yes |
| 18 | test ferm rules are sufficient | T104980 | yes |
| 19 | rsync all configs and archives one more time | T110129 | yes |
| 20 | import all lists with the script we wrote for that | T110131 | no |
| 21 | one day before: lower lists.wikimedia.org TTL to 5 min | T110132 | yes |
| 22 | announce scheduled downtime - need to debate and decide on a worst-case length. | T110133 | yes |
| 23 | right before the switch: lower TTL to 10 seconds | T110135 | n/a |
| 24 | hold lists.wikimedia.org with exim (disable puppet on sodium; apply locally rather than via operations/puppet unless we want to hold all emails to fermium as well for 'safety'?) | T110136 | invalid |
| 25 | shut down mailman on sodium | T110137 | yes |
| 26 | rsync one more time, this time only the diff since it was shut down | T110138 | yes |
| 27 | rsync exim spool directory | T110440 | yes |
| 28 | run ./bin/update and ./bin/check_perms | T113020 | yes |
| 29 | test sending individual mails from fermium | T110441 | yes |
| 30 | switch over service IP | T110139 | yes |
| 31 | re-enable exim on fermium | T113045 | yes |
| 31 | send follow-up email, announce changes with new mailman version, if any, that have user impact? | T110140 | yes |
| 32 | profit? maybe - revert ideas for worst cases? | | yes |

Not blockers, just follow-up:

| 33 | TTL back up to normal 1H | T110141 | no |
| 34 | shut down sodium, celebrate "no more lucid", close all resolved tickets | T110142 | |
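
The TTL steps (21, 23 and 33) rest on a simple timing rule: a lowered TTL only takes full effect once records cached under the old TTL have expired. A minimal sketch of that arithmetic, assuming the old TTL is the "normal 1H" from step 33 and that resolvers honor TTLs. The function name is hypothetical, and the window start is the one given in the downtime announcement quoted later in this task:

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_switch(ttl_lowered_at: datetime, old_ttl: timedelta) -> datetime:
    """Earliest time at which no resolver should still hold a record that
    was cached under the old TTL (assumes resolvers honor TTLs)."""
    return ttl_lowered_at + old_ttl


# Window start as announced; "one day before" comes from step 21.
window_start = datetime(2015, 9, 9, 14, 0, tzinfo=timezone.utc)
lowered_at = window_start - timedelta(days=1)

# Assumption: the TTL before lowering was the normal 1 hour.
safe_from = earliest_safe_switch(lowered_at, timedelta(hours=1))

# Once the old TTL has expired, a service IP change should be visible
# to clients within the new, lowered TTL.
new_ttl = timedelta(minutes=5)

print("old TTL fully expired by:", safe_from.isoformat())
print("worst-case propagation after switch:", new_ttl)
print("lowered early enough for the window:", safe_from <= window_start)
```

Lowering the TTL a full day ahead leaves a wide margin over the one hour that is strictly needed under these assumptions.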


Event Timeline

Dzahn added a subtask: Restricted Task.Aug 5 2015, 5:56 PM
Dzahn renamed this task from Mailman Upgrade (Jessie & Mailman 2.x) to Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM.Aug 5 2015, 6:38 PM

The setup should be resilient to entropy starvation, so we can enable STARTTLS. I am not sure if this requires any explicit action on our part, but one possible requirement is to have haveged running on the virtualization host and the virtio-rng kernel module loaded in guests. This allows guests to use the host's /dev/random.
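
A rough way to look at this from inside a guest is sketched below. It reads the kernel's entropy estimate and checks whether the virtio_rng module is listed as loaded. This is only an illustration, not something that was run as part of this task: the threshold is an arbitrary placeholder, a driver built into the kernel will not show up in /proc/modules, and whether haveged runs on the host cannot be seen from the guest at all.

```python
from pathlib import Path

# Standard Linux procfs locations.
ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")
PROC_MODULES = Path("/proc/modules")


def parse_entropy(text: str) -> int:
    """Parse the contents of entropy_avail (a single integer, in bits)."""
    return int(text.strip())


def module_loaded(proc_modules_text: str, name: str) -> bool:
    """True if `name` is the first field of any line in /proc/modules.
    A driver built into the kernel is not listed there, so False does
    not prove the driver is absent."""
    return any(
        line.split()[0] == name
        for line in proc_modules_text.splitlines()
        if line.strip()
    )


def report(min_bits: int = 1000) -> str:
    """Summarize the guest's entropy state. min_bits is an arbitrary,
    hypothetical threshold, not a recommended value; newer kernels
    report entropy_avail differently than older ones."""
    bits = parse_entropy(ENTROPY_AVAIL.read_text())
    has_rng = module_loaded(PROC_MODULES.read_text(), "virtio_rng")
    state = "ok" if bits >= min_bits else "low"
    return f"entropy_avail={bits} ({state}), virtio_rng loaded={has_rng}"


if __name__ == "__main__":
    if ENTROPY_AVAIL.exists() and PROC_MODULES.exists():
        print(report())
    else:
        print("no Linux procfs here; nothing to check")
```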

Dzahn removed a parent task: Restricted Task.Aug 21 2015, 7:35 PM
Dzahn added a subtask: Restricted Task.
Dzahn closed subtask Restricted Task as Resolved.Aug 28 2015, 6:38 PM
Dzahn updated the task description. (Show Details)

"Daniel Zahn 9:56 AM (31 minutes ago)

(back to just announcements on this list but this is one)

We have scheduled an upgrade of mailman (https://lists.wikimedia.org) for:

Wednesday, September 9, 2015 at 2:00:00 PM UTC ( 7:00 AM PDT, 16:00 CEST)"

Are you going to tell the readers??? To forestall all the 'The list is not working' messages...

Regards, Richard

> Are you going to tell the readers??? To forestall all the 'The list is not working' messages...

While I understand this concern, mailing all 500+ lists with the same message seemed excessive to me.

Can we add checking web/email interface i18n encoding to the "it works" checklist? Caused some issues in the past (not only on our mailman install)

> Can we add checking web/email interface i18n encoding to the "it works" checklist?

See T110131#1599716 for a related comment. I ran into issues with one of the listinfo templates (French) on the puppet level and fixed that by converting it to UTF-8, but only that one file. If you see more issues I'm glad to fix them, but for the migration day specifically I just want to promise that things are just like they were on sodium before. We can go from there. See the current status on P1944.
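
For anyone who wants to look for further templates with the same problem, a minimal sketch of such a check follows. It only tests whether a file's bytes decode as UTF-8. The function names and the `*.html` pattern are assumptions for illustration, and this is not the method used for the French template fix:

```python
from __future__ import annotations

from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8(root: Path, pattern: str = "*.html") -> list[Path]:
    """List files under root matching pattern (a hypothetical default)
    whose contents are not valid UTF-8."""
    return sorted(
        p for p in root.rglob(pattern)
        if p.is_file() and not is_valid_utf8(p.read_bytes())
    )


# The same word in two encodings: only the UTF-8 bytes pass the check.
print(is_valid_utf8("Français".encode("latin-1")))  # False
print(is_valid_utf8("Français".encode("utf-8")))    # True
```

A file that contains only ASCII passes regardless of the encoding it was saved in, so a clean result does not prove a template was authored as UTF-8.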

> Are you going to tell the readers??? To forestall all the 'The list is not working' messages...

> While I understand this concern, mailing all 500+ lists with the same message seemed excessive to me.

T110133 is the related task. (I would have included the ambassadors list, FWIW).

Unfortunately the migration didn't work out this time. We made some last-minute changes to the rsync/import scripts to use mv instead of rsync to make things faster, then got the syntax slightly wrong: a trailing / in the mv command meant things were moved into subdirectories where they shouldn't have been. In the end we ran out of time to rsync everything again before the end of the scheduled window.
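
The general pitfall can be reproduced in a few lines. The sketch below is an illustration only, not the actual import script: all paths are made up, and Python's `shutil.move` stands in for `mv`. When the destination already exists as a directory, the source is moved inside it instead of taking its place. The hypothetical `move_as` helper refuses to move in that situation.

```python
import shutil
import tempfile
from pathlib import Path


def move_as(src: Path, dst: Path) -> Path:
    """Move src so that it ends up exactly at dst; refuse to nest."""
    if dst.exists():
        raise FileExistsError(
            f"{dst} already exists; moving would nest {src.name} inside it"
        )
    dst.parent.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(src), str(dst)))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Hypothetical layout: an exported tree and an existing target dir.
    src = root / "export" / "lists"
    src.mkdir(parents=True)
    (src / "config.pck").write_text("placeholder")
    dst = root / "var" / "lists"
    dst.mkdir(parents=True)

    # Comparable to `mv export/lists var/lists/` with an existing target:
    # the source directory lands INSIDE the target directory.
    shutil.move(str(src), str(dst))
    nested = (dst / "lists" / "config.pck").exists()
    print("nested one level too deep:", nested)

    # The guarded variant refuses instead of nesting silently.
    other = root / "export" / "archives"
    other.mkdir(parents=True)
    try:
        move_as(other, dst)
        refused = False
    except FileExistsError:
        refused = True
    print("guard refused to nest:", refused)
```

Failing loudly when the target exists costs a little convenience but makes this kind of mistake visible before any data has moved.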

Change 237276 had a related patch set uploaded (by Dzahn):
mailman: no more importing to an import dir

https://gerrit.wikimedia.org/r/237276

Change 237276 merged by Dzahn:
mailman: no more importing to an import dir

https://gerrit.wikimedia.org/r/237276

Dzahn updated the task description. (Show Details)

Migration is over and things seem to be working (receiving emails, tailing logs on fermium shows a bit of activity), so I feel confident saying we're on Jessie now, on .18, and have an unused Lucid box in the datacenter. Daniel's final call for this ticks my boxes.

Dzahn closed subtask Restricted Task as Resolved.

All blockers closed. That closes the tracking ticket :)