
Upgrade mw* servers to Debian Stretch (using HHVM)
Closed, ResolvedPublic

Description

Bug to track the upgrade of the MediaWiki servers from Debian 8 Jessie to Debian 9 Stretch. It consists of:

The following preliminary steps need to be fulfilled:

  • Build HHVM for stretch-wikimedia
  • Build HHVM extensions for stretch-wikimedia (luasandbox, tidy, wikidiff2)
  • ICU has changed its ABI again (libicu52 in jessie, libicu57 in stretch); we could deploy a backport of libicu57 to jessie and do the migration there

Lilypond is not in stretch, but can be installed from stretch-backports.
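The ICU step above hinges on the ABI jump between releases. A minimal hedged sketch (the release-to-package mapping is taken from the versions noted above; the function name is illustrative):

```shell
# Map a Debian release to the libicu ABI package it ships,
# per the jessie/stretch versions noted in the checklist above.
icu_for_release() {
  case "$1" in
    jessie)  echo libicu52 ;;
    stretch) echo libicu57 ;;
    *)       echo unknown ;;
  esac
}
```

Because the soname changes, everything linked against libicu (HHVM and its extensions included) must be rebuilt or carried over via a backport, which is why the migration needs coordination.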

These clusters are complete:

  • mwdebug servers
  • application servers
  • API servers
  • job runners

Related Objects

Event Timeline

Legoktm subscribed.

I updated the steps based on the plan to use PHP 7 instead of HHVM.

I updated the steps based on the plan to use PHP 7 instead of HHVM.

There is no way we'll embark on both migrations at the same time.

Upgrading to stretch while remaining on HHVM is a fairly straightforward task that ops can perform "in the background". Upgrading to stretch AND PHP 7 would be a much larger project, requiring resources we don't currently have dedicated to it. For reference, we pitched this as a project for the current annual plan but it didn't make the cut, so if we want to perform this transition, we'll have to drop other things.

So a logical sequence of events I see is:

  • Migrate ICU version (this *is* painful and user noticeable, and will need coordination)
  • We upgrade all mw* servers to stretch, keep using HHVM
  • Once, or if, proper resources and a timeline are set, we swap HHVM for PHP7 on stretch.

That will need a completely separate ticket.

Understood, and undid my changes.

I'll take care of builds for stretch-wikimedia

Change 384713 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Fix setup of libapache2-mod-security2 on stretch

https://gerrit.wikimedia.org/r/384713

Change 384713 merged by Muehlenhoff:
[operations/puppet@production] Fix setup of libapache2-mod-security2 on stretch

https://gerrit.wikimedia.org/r/384713

mw2246 today reported a failure in logrotate:

/etc/cron.daily/logrotate:
Job for apache2.service failed because the control process exited with error code.
See "systemctl status apache2.service" and "journalctl -xe" for details.
error: error running shared postrotate script for '/var/log/apache2/*.log '
run-parts: /etc/cron.daily/logrotate exited with return code 1

The only useful thing that I found in the logs is the following:

root@mw2246:/var/log# grep apache2 syslog.1
Dec 27 06:25:01 mw2246 systemd[17882]: apache2.service: Failed at step NAMESPACE spawning /usr/sbin/apachectl: No such file or directory
Dec 27 06:25:01 mw2246 systemd[1]: apache2.service: Control process exited, code=exited status=226

root@mw2246:/var/log# systemctl status apache2
● apache2.service - The Apache HTTP Server
   Loaded: loaded (/lib/systemd/system/apache2.service; enabled; vendor preset: enabled)
   Active: active (running) (Result: exit-code) since Mon 2017-12-18 15:24:04 UTC; 1 weeks 1 days ago
  Process: 17882 ExecReload=/usr/sbin/apachectl graceful (code=exited, status=226/NAMESPACE)
 Main PID: 8330 (apache2)
    Tasks: 55 (limit: 6144)
   CGroup: /system.slice/apache2.service
           ├─3610 /usr/sbin/apache2 -k start
           ├─3611 /usr/sbin/apache2 -k start
           └─8330 /usr/sbin/apache2 -k start

Dec 23 06:25:02 mw2246 systemd[1]: Reloaded The Apache HTTP Server.
Dec 24 06:25:02 mw2246 systemd[1]: Reloading The Apache HTTP Server.
Dec 24 06:25:02 mw2246 systemd[1]: Reloaded The Apache HTTP Server.
Dec 25 06:25:02 mw2246 systemd[1]: Reloading The Apache HTTP Server.
Dec 25 06:25:02 mw2246 systemd[1]: Reloaded The Apache HTTP Server.
Dec 26 06:25:02 mw2246 systemd[1]: Reloading The Apache HTTP Server.
Dec 26 06:25:02 mw2246 systemd[1]: Reloaded The Apache HTTP Server.
Dec 27 06:25:01 mw2246 systemd[1]: Reloading The Apache HTTP Server.
Dec 27 06:25:01 mw2246 systemd[1]: apache2.service: Control process exited, code=exited status=226
Dec 27 06:25:01 mw2246 systemd[1]: Reload failed for The Apache HTTP Server.
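The status=226 in the log above is systemd's NAMESPACE exit step: the control process spawned for `apachectl graceful` failed while setting up the unit's mount namespace, typically because paths the namespace references were replaced or removed after the service started (for example by a package upgrade), so a full restart rather than a reload is what recovers it. A small hedged helper decoding the relevant codes from systemd's exit-status table (the wording of the messages is mine; the numeric codes are systemd's):

```shell
# Decode the systemd process exit statuses seen in logs like the one
# above; 226 (EXIT_NAMESPACE) and 203 (EXIT_EXEC) come from the
# exit-status table in systemd.exec(5).
explain_systemd_status() {
  case "$1" in
    226) echo "NAMESPACE: failed setting up the service's mount namespace" ;;
    203) echo "EXEC: the service binary could not be executed" ;;
    *)   echo "status $1: see the exit-status table in systemd.exec(5)" ;;
  esac
}
```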
Krinkle renamed this task from Migration of mw* servers to stretch to Upgrade mw* servers to Debian Stretch (using HHVM).Jan 10 2018, 10:58 PM
Krinkle updated the task description. (Show Details)

Change 425269 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Reimage mw1265 with stretch

https://gerrit.wikimedia.org/r/425269

Change 425269 merged by Muehlenhoff:
[operations/puppet@production] Reimage mw1265 with stretch

https://gerrit.wikimedia.org/r/425269

Change 425772 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Reimage mw1279 (API canary) with stretch

https://gerrit.wikimedia.org/r/425772

Change 425772 merged by Muehlenhoff:
[operations/puppet@production] Reimage mw1279 (API canary) with stretch

https://gerrit.wikimedia.org/r/425772

Mentioned in SAL (#wikimedia-operations) [2018-04-13T09:03:18Z] <moritzm> reimaging mw1276-mw1278 to stretch (T174431)

Mentioned in SAL (#wikimedia-operations) [2018-04-13T10:59:54Z] <moritzm> reimaging mw1261-mw1264 to stretch (T174431)

Change 427608 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Switch all mw hosts to stretch

https://gerrit.wikimedia.org/r/427608

Change 427608 merged by Muehlenhoff:
[operations/puppet@production] Switch all mw hosts to stretch

https://gerrit.wikimedia.org/r/427608

Change 428923 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Reimage mwdebug servers with stretch

https://gerrit.wikimedia.org/r/428923

While investigating cronspam from recent reimages I took a look at mw1247 (for example) and noticed it has two disks but no software RAID (T106381). I think we should also fix that while we're reimaging with stretch anyway.

Change 428923 merged by Muehlenhoff:
[operations/puppet@production] Reimage mwdebug servers with stretch

https://gerrit.wikimedia.org/r/428923

I checked and all the mw22* are getting RAID due to this:

mw22*) echo partman/mw-raid1.cfg ;; \

But mw216* hosts like mw2163, mw2164, and mw2165 are not getting RAID after reinstall. So you pointed this out at just the right moment, as I was starting to get to those in the list.
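The gap comes straight from shell glob matching in the partman recipe selection: `mw22*` matches mw22xx hosts only, so mw216x hosts fell through to the default. A hedged sketch of that logic (only `partman/mw-raid1.cfg` and the `mw22*`/`mw21[6-9][0-9]` patterns appear in this thread; the default recipe name here is assumed for illustration):

```shell
# Sketch of the netboot recipe selection: which partman config a
# hostname gets, including the widened mw21[6-9][0-9] pattern from
# the follow-up patch. The fallback recipe name is hypothetical.
partman_recipe() {
  case "$1" in
    mw22*)           echo partman/mw-raid1.cfg ;;  # the line quoted above
    mw21[6-9][0-9]*) echo partman/mw-raid1.cfg ;;  # added by the later patch
    *)               echo partman/flat.cfg ;;      # assumed default
  esac
}
```

With only the first pattern in place, `partman_recipe mw2163` hits the default branch, which is exactly why those hosts came back without software RAID.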

Change 428961 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: let mw21[6-9] have software RAID

https://gerrit.wikimedia.org/r/428961

Change 428961 merged by Dzahn:
[operations/puppet@production] install_server: let mw21[6-9][0-9] have software RAID

https://gerrit.wikimedia.org/r/428961

Mentioned in SAL (#wikimedia-operations) [2018-04-26T02:05:17Z] <mutante> mw2163 through mw2166: since the wmf-auto-reimage failed after OS but before puppet run due to "Failed to puppet_generate_certs" i manually logged in with install-console and signed puppet certs (T174431)

All mwdebug servers are now running stretch.

Script wmf-auto-reimage was launched by dzahn on neodymium.eqiad.wmnet for hosts:

['mw2229.codfw.wmnet', 'mw2231.codfw.wmnet', 'mw2240.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201805031807_dzahn_2697.log.

Completed auto-reimage of hosts:

['mw2231.codfw.wmnet', 'mw2229.codfw.wmnet', 'mw2240.codfw.wmnet']

and were ALL successful.

All (regular) codfw appservers are now on stretch.

All application servers are now running stretch (excluding job runners and API servers).

All API servers in eqiad are now running stretch.

All job runners in eqiad and codfw are now running stretch.

All appservers are running stretch now. (one of them is broken, creating subtask)

This now just needs to stay open for deployment and maintenance servers.

Can "Deployment servers" be checked off since the two tasks next to it are resolved?

Can "Deployment servers" be checked off since the two tasks next to it are resolved?

Thanks for the note, I just fixed that.

np~

MoritzMuehlenhoff claimed this task.
MoritzMuehlenhoff updated the task description. (Show Details)

Script runners are now also migrated to stretch, closing the task.