
Upgrade MediaWiki clusters to Debian Buster (debian 10)
Open, High, Public

Description

Our four MediaWiki clusters (application, API, jobrunners/videoscalers and Parsoid) need to be migrated to Debian Buster.

Progress: see https://docs.google.com/spreadsheets/d/1Ris18-joRFfd3OHjGJIraVUk-bpmIRORsPoms9D7BcM/edit?usp=sharing

Provisional plan for the migration:

  • Upgrade all current Stretch servers to ICU 63 (T264991)
  • Rebuild all our PHP 7.2 packages for Debian Buster (buster-wikimedia):
    • php7.2-cli
    • php7.2-common
    • php7.2-curl
    • php7.2-dba
    • php7.2-fpm
    • php7.2-gd
    • php7.2-gmp
    • php7.2-mysql
    • php7.2-opcache
    • php7.2-phpdbg
    • php7.2-readline
    • php7.2-xml
  • Build missing packages for Buster
    • ploticus
    • prometheus-nutcracker-exporter
    • prometheus-php-fpm-exporter
  • Fix puppet code to support Buster
    • ttf-alee replaced with fonts-alee
    • ttf-wqy-zenhei replaced with fonts-wqy-zenhei
    • add code to enable the PHP72 APT component on Buster
  • Reimage mwdebug1001 to Buster OR introduce mwdebug1003, so as not to interfere with development testing
    • first iteration done with testvm1001, decommissioned again
    • mwdebug1003 to be introduced in early December (T267248)
    • add PHP72 APT component on mwdebug1003
    • Reimage parse2001 to buster (parsoid)
    • Reimage mw2243 to buster (jobrunner)
  • Reimage mw1265 to Buster (weight=5)
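The Puppet fixes listed above boil down to choosing a package name per Debian release. A minimal Python sketch of that selection logic, assuming Stretch hosts keep the old ttf-* names and only Buster hosts get the fonts-* replacements; `RENAMED_ON_BUSTER` and `package_for_release` are hypothetical names, not taken from operations/puppet.

```python
# Hypothetical sketch: per-release names for the font packages mentioned in
# the migration plan. This is not the actual Puppet implementation.
RENAMED_ON_BUSTER = {
    "ttf-alee": "fonts-alee",
    "ttf-wqy-zenhei": "fonts-wqy-zenhei",
}


def package_for_release(name: str, codename: str) -> str:
    """Return the package name to install on the given Debian codename.

    Assumption: only Buster needs the rename; other releases (e.g. Stretch)
    keep the name unchanged. Packages without a rename pass through as-is.
    """
    if codename == "buster":
        return RENAMED_ON_BUSTER.get(name, name)
    return name


# Example: the font list for a Buster host.
buster_fonts = [package_for_release(p, "buster") for p in ("ttf-alee", "ttf-wqy-zenhei")]
```

Keeping the renames in one table means a further rename only touches data, not the selection logic.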

Q3

Details

Project | Branch | Lines +/- | Subject
operations/puppet | production | +0 -1 |
operations/puppet | production | +0 -43 |
operations/puppet | production | +3 -4 |
operations/puppet | production | +1 -1 |
operations/puppet | production | +1 -1 |
operations/puppet | production | +1 -1 |
operations/puppet | production | +1 -1 |
operations/puppet | production | +1 -1 |
operations/puppet | production | +1 -1 |
operations/puppet | production | +0 -24 |
operations/puppet | production | +0 -121 |
operations/puppet | production | +0 -150 |
operations/puppet | production | +0 -4 |
operations/puppet | production | +9 -2 |
operations/puppet | production | +7 -1 |
operations/puppet | production | +0 -2 |
operations/puppet | production | +0 -1 |
operations/puppet | production | +2 -0 |
operations/debs/prometheus-php-fpm-exporter | master | +14 -2 |
operations/puppet | production | +21 -28 |
operations/puppet | production | +5 -1 |
operations/puppet | production | +2 -2 |
operations/puppet | production | +8 -1 |
operations/puppet | production | +2 -1 |
operations/puppet | production | +5 -0 |
operations/puppet | production | +0 -8 |

Related Objects

Status | Subtype | Assigned | Task
Stalled | | tstarling |
Stalled | | None |
Stalled | | None |
Open | | None |
Stalled | | None |
Stalled | | None |
Stalled | | None |
Stalled | | None |
Open | | None |
Open | | None |
Resolved | | Jdforrester-WMF |
Open | | None |
Open | | None |
Resolved | | hashar |
Resolved | | Jdforrester-WMF |
Resolved | | Ladsgroup |
Resolved | | MoritzMuehlenhoff |
Resolved | | jijiki |
Resolved | | MoritzMuehlenhoff |
Resolved | | Trizek-WMF |
Resolved | | Dzahn |
Resolved | | Gilles |
Stalled | | Dzahn |
Resolved | Request | Papaul |
Resolved | | jijiki |
Declined | | None |
Resolved | | Dzahn |
Open | | None |
Resolved | | Dzahn |
Resolved | | Papaul |
Resolved | | Cmjohnson |
Open | | dancy |
Open | | None |
Resolved | Request | Cmjohnson |
Resolved | Request | Papaul |
Resolved | | Andrew |
Resolved | | ArielGlenn |
Resolved | | Dzahn |
Resolved | | Legoktm |
Open | | None |
Resolved | | Papaul |
Resolved | | Dzahn |
Declined | | Gilles |
Resolved | | Volans |
Resolved | | Dzahn |
Open | | None |

Event Timeline


Completed auto-reimage of hosts:

['mw1344.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1343.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1350.eqiad.wmnet']

and were ALL successful.

Change 664856 merged by Dzahn:
[operations/puppet@production] mcrouter: move mcrouter proxy for codfw B3 to mw2258

https://gerrit.wikimedia.org/r/664856

Change 664692 merged by Dzahn:
[operations/puppet@production] mcrouter: move mcrouter proxy for D6 from mw1367 to mw1368

https://gerrit.wikimedia.org/r/664692

Change 664690 merged by Dzahn:
[operations/puppet@production] mcrouter: move mcrouter proxy for A7 from mw1270 to mw1271

https://gerrit.wikimedia.org/r/664690

Change 664898 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mcrouter: move mcrouter proxy for B6 from mw1287 to mw1288

https://gerrit.wikimedia.org/r/664898

Change 664859 merged by Dzahn:
[operations/puppet@production] mcrouter: move mcrouter proxy for codfw C3 to mw2337

https://gerrit.wikimedia.org/r/664859

Change 664898 merged by Dzahn:
[operations/puppet@production] mcrouter: move mcrouter proxy for B6 from mw1287 to mw1288

https://gerrit.wikimedia.org/r/664898

Change 664691 merged by Dzahn:
[operations/puppet@production] mcrouter: move mcrouter proxy for C6 from mw1320 to mw1321

https://gerrit.wikimedia.org/r/664691
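The timeline suggests a pattern: before a host that serves as the mcrouter proxy for its rack is reimaged, the proxy role is moved to a neighbour (for example A7 from mw1270 to mw1271, after which mw1270 is reimaged). A minimal sketch of a pre-reimage check, assuming a simple rack-to-proxy mapping; `MCROUTER_PROXIES`, `racks_proxied_by` and `safe_to_reimage` are hypothetical names, not the Puppet data structure.

```python
from __future__ import annotations

# Hypothetical rack -> mcrouter proxy mapping after the eqiad moves merged above.
# codfw racks are omitted because their previous proxy hosts are not stated.
MCROUTER_PROXIES = {
    "A7": "mw1271",  # moved from mw1270
    "B6": "mw1288",  # moved from mw1287
    "C6": "mw1321",  # moved from mw1320
    "D6": "mw1368",  # moved from mw1367
}


def racks_proxied_by(host: str, proxies: dict[str, str]) -> list[str]:
    """Return the racks for which `host` is still the mcrouter proxy."""
    return sorted(rack for rack, proxy in proxies.items() if proxy == host)


def safe_to_reimage(host: str, proxies: dict[str, str]) -> bool:
    """Treat a host as safe to reimage only if it proxies no rack."""
    return not racks_proxied_by(host, proxies)
```

With this mapping, mw1270 passes the check while mw1271 does not, which matches the order of the reimages that follow.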

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1341.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102191705_dzahn_13294_mw1341_eqiad_wmnet.log.
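The wmf-auto-reimage log names in this task follow a recognisable layout: a YYYYMMDDHHMM timestamp, the user, a number, and (for single-host runs) the host name with dots replaced by underscores; multi-host runs such as the parse200x batch omit the host. A minimal parsing sketch, assuming that layout holds and that the number is a process ID (neither is documented here); `parse_log_name` and `ReimageLog` are hypothetical names. The timestamp is returned as a naive datetime because the time zone is not stated.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed layout, inferred from the paths in this task:
#   <YYYYMMDDHHMM>_<user>_<number>[_<host with dots as underscores>].log
LOG_NAME = re.compile(
    r"^(?P<ts>\d{12})_(?P<user>[a-z0-9-]+)_(?P<num>\d+)(?:_(?P<host>[a-z0-9_]+))?\.log$"
)


class ReimageLog(NamedTuple):
    started: datetime    # naive; time zone not stated in the task
    user: str
    number: int          # assumed to be the PID of the wmf-auto-reimage run
    host: Optional[str]  # None for multi-host runs


def parse_log_name(path: str) -> ReimageLog:
    """Split a wmf-auto-reimage log file name into its parts."""
    m = LOG_NAME.match(path.rsplit("/", 1)[-1])
    if m is None:
        raise ValueError(f"unexpected log name: {path}")
    # Host names contain no underscores, so mapping them back to dots is safe.
    host = m["host"].replace("_", ".") if m["host"] else None
    return ReimageLog(
        started=datetime.strptime(m["ts"], "%Y%m%d%H%M"),
        user=m["user"],
        number=int(m["num"]),
        host=host,
    )
```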

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1367.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102191710_dzahn_16276_mw1367_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw2272.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102191714_dzahn_16903_mw2272_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mw2272.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1341.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1367.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1261.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102191850_dzahn_7290_mw1261_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1270.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102191852_dzahn_7564_mw1270_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1287.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102191853_dzahn_7816_mw1287_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw2257.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102191855_dzahn_8119_mw2257_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mw1261.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw2257.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1270.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1287.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1262.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102192055_dzahn_7050_mw1262_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1320.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102192056_dzahn_7231_mw1320_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw2336.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102192101_dzahn_7945_mw2336_codfw_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1340.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102192104_dzahn_8364_mw1340_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw2336.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1339.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102192155_dzahn_19780_mw1339_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1262.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1320.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1333.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102192213_dzahn_26628_mw1333_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1340.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1342.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102192218_dzahn_27282_mw1342_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1317.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102192233_dzahn_29483_mw1317_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1339.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1333.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1342.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1317.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1316.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102221850_dzahn_17022_mw1316_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1315.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102221855_dzahn_21586_mw1315_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1349.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102221947_dzahn_7004_mw1349_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1316.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1315.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1314.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102222015_dzahn_4645_mw1314_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1312.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102222019_dzahn_8398_mw1312_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1349.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1279.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102222029_dzahn_17792_mw1279_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1314.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1312.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1279.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1286.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102222221_dzahn_30877_mw1286_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1410.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102222223_dzahn_620_mw1410_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1412.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102222224_dzahn_1715_mw1412_eqiad_wmnet.log.

@MoritzMuehlenhoff do you think it makes sense to keep one API server and one app server on Stretch a bit longer, so we can keep comparing performance? IIRC there may be some upcoming performance optimisations on the MediaWiki side.

Completed auto-reimage of hosts:

['mw1410.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1412.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1286.eqiad.wmnet']

and were ALL successful.

Change 635108 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] tcpircbot: allow deploy1002/2002, do not allow deploy1001/2001

https://gerrit.wikimedia.org/r/635108

Change 635108 merged by Dzahn:
[operations/puppet@production] tcpircbot: allow deploy1002/2002, do not allow deploy1001/2001

https://gerrit.wikimedia.org/r/635108

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['parse2001.codfw.wmnet', 'parse2002.codfw.wmnet', 'parse2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103291305_jiji_19021.log.

Change 675506 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):
[operations/puppet@production] install_server: switch parsoid servers to buster

https://gerrit.wikimedia.org/r/675506

Change 675506 merged by Effie Mouzeli:
[operations/puppet@production] install_server: switch parsoid servers to buster

https://gerrit.wikimedia.org/r/675506

I have reimaged parse2001 as a test, and it appears that puppet is unable to run successfully because:

Error: Execution of '/usr/bin/scap deploy-local --repo parsoid/deploy -D log_json:False' returned 70: 15:19:26 Fetch from: http://deploy1001.eqiad.wmnet/parsoid/deploy/.git
15:19:26 Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 347, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 147, in main
    getattr(self, stage)()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 291, in fetch
    git.fetch(self.context.cache_dir, git_remote)
  File "/usr/lib/python2.7/dist-packages/scap/git.py", line 374, in fetch
    git.clone(*cmd)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 1428, in __call__
    return RunningCommand(cmd, call_args, stdin, stdout, stderr)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 775, in __init__
    self.wait()
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 793, in wait
    self.handle_command_exit_code(exit_code)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 816, in handle_command_exit_code
    raise exc
ErrorReturnCode_128:

  RAN: /usr/bin/git clone --jobs 46 http://deploy1001.eqiad.wmnet/parsoid/deploy/.git /srv/deployment/parsoid/deploy-cache/cache

  STDOUT:


  STDERR:
Cloning into '/srv/deployment/parsoid/deploy-cache/cache'...
fatal: unable to access 'http://deploy1001.eqiad.wmnet/parsoid/deploy/.git/': Could not resolve host: deploy1001.eqiad.wmnet

15:19:26 deploy-local failed: <ErrorReturnCode_128>

  RAN: /usr/bin/git clone --jobs 46 http://deploy1001.eqiad.wmnet/parsoid/deploy/.git /srv/deployment/parsoid/deploy-cache/cache

  STDOUT:


  STDERR:
Cloning into '/srv/deployment/parsoid/deploy-cache/cache'...
fatal: unable to access 'http://deploy1001.eqiad.wmnet/parsoid/deploy/.git/': Could not resolve host: deploy1001.eqiad.wmnet


Error: /Stage[main]/Parsoid/Service::Node[parsoid]/Scap::Target[parsoid/deploy]/Package[parsoid/deploy]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo parsoid/deploy -D log_json:False' returned 70: 15:19:26 Fetch from: http://deploy1001.eqiad.wmnet/parsoid/deploy/.git
15:19:26 Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 347, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 147, in main
    getattr(self, stage)()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 291, in fetch
    git.fetch(self.context.cache_dir, git_remote)
  File "/usr/lib/python2.7/dist-packages/scap/git.py", line 374, in fetch
    git.clone(*cmd)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 1428, in __call__
    return RunningCommand(cmd, call_args, stdin, stdout, stderr)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 775, in __init__
    self.wait()
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 793, in wait
    self.handle_command_exit_code(exit_code)
  File "/usr/lib/python2.7/dist-packages/scap/sh.py", line 816, in handle_command_exit_code
    raise exc
ErrorReturnCode_128:

  RAN: /usr/bin/git clone --jobs 46 http://deploy1001.eqiad.wmnet/parsoid/deploy/.git /srv/deployment/parsoid/deploy-cache/cache

  STDOUT:


  STDERR:
Cloning into '/srv/deployment/parsoid/deploy-cache/cache'...
fatal: unable to access 'http://deploy1001.eqiad.wmnet/parsoid/deploy/.git/': Could not resolve host: deploy1001.eqiad.wmnet

15:19:26 deploy-local failed: <ErrorReturnCode_128>

  RAN: /usr/bin/git clone --jobs 46 http://deploy1001.eqiad.wmnet/parsoid/deploy/.git /srv/deployment/parsoid/deploy-cache/cache

  STDOUT:


  STDERR:
Cloning into '/srv/deployment/parsoid/deploy-cache/cache'...
fatal: unable to access 'http://deploy1001.eqiad.wmnet/parsoid/deploy/.git/': Could not resolve host: deploy1001.eqiad.wmnet


Notice: /Stage[main]/Parsoid/Service::Node[parsoid]/Base::Service_unit[parsoid]/File[/lib/systemd/system/parsoid.service]: Dependency Package[parsoid/deploy] has failures: true
Warning: /Stage[main]/Parsoid/Service::Node[parsoid]/Base::Service_unit[parsoid]/File[/lib/systemd/system/parsoid.service]: Skipping because of failed dependencies
Warning: /Stage[main]/Parsoid/Service::Node[parsoid]/Base::Service_unit[parsoid]/Exec[systemd reload for parsoid]: Skipping because of failed dependencies
Warning: /Stage[main]/Parsoid/Service::Node[parsoid]/Base::Service_unit[parsoid]/Service[parsoid]: Skipping because of failed dependencies
Notice: Applied catalog in 22.96 seconds
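The root cause is in the STDERR lines: git could not resolve deploy1001.eqiad.wmnet, so the clone failed before any data was fetched, and every Parsoid resource depending on the package was skipped. A minimal pre-flight check one could run on a target, assuming plain name resolution is what matters; `resolves` is a hypothetical helper, and it does not test whether the HTTP endpoint actually serves the repository.

```python
import socket


def resolves(host: str) -> bool:
    """Return True if `host` resolves to at least one address."""
    try:
        return bool(socket.getaddrinfo(host, None))
    except socket.gaierror:
        # Covers "Name or service not known" and temporary resolver failures,
        # i.e. the same condition git reported as "Could not resolve host".
        return False
```

Calling `resolves("deploy1001.eqiad.wmnet")` on parse2001 would be expected to return False at this point, matching the git error.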


@jijiki This is where deploy1001 appears:

deployment/parsoid/deploy-cache/.config:git_server: deploy1001.eqiad.wmnet

Editing that file should fix it.

Other options used in the past appear to include running scap with --refresh-config or deleting the cached .config file.

For more background, see also T197470, T197470#4414254, T162814 and T196663#4265139, as far as I can tell.
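The fix suggested above is to point the cached scap config at a deploy host that still exists. A minimal sketch of that edit as a text transformation, assuming the cached file uses simple `key: value` lines as the grep output suggests, and assuming deploy1002.eqiad.wmnet as the replacement (the earlier tcpircbot change moves from deploy1001 to deploy1002). `replace_git_server` is a hypothetical helper; running scap with --refresh-config, as mentioned above, may be the cleaner route.

```python
import re

# Matches a "git_server: <host>" line, as shown by the grep output above.
# [ \t]* (not \s*) so the pattern never swallows the newline after the value.
GIT_SERVER_LINE = re.compile(r"^(git_server:[ \t]*)(\S+)[ \t]*$", re.MULTILINE)


def replace_git_server(config_text: str, new_host: str) -> str:
    """Return config_text with the git_server value replaced by new_host.

    Raises ValueError if no git_server line is present, so a silent no-op
    cannot be mistaken for a successful fix.
    """
    updated, count = GIT_SERVER_LINE.subn(lambda m: m.group(1) + new_host, config_text)
    if count == 0:
        raise ValueError("no git_server line found")
    return updated
```

Applied to the contents of the .config file shown above, this changes only the git_server line and leaves every other key untouched.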

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['parse2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104081320_jiji_21421.log.

Completed auto-reimage of hosts:

['parse2001.codfw.wmnet']

and were ALL successful.

Change 680483 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] DHCP: switch mw1307 to use buster installer

https://gerrit.wikimedia.org/r/680483

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1402.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104162338_dzahn_27978_mw1402_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1403.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104162338_dzahn_28020_mw1403_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1307.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104162340_dzahn_28210_mw1307_eqiad_wmnet.log.

Change 680483 merged by Dzahn:
[operations/puppet@production] DHCP: switch mw1307 to use buster installer

https://gerrit.wikimedia.org/r/680483

The remaining three special cases that had been kept on Stretch have now been reimaged to Buster as well.

Decommissioned the mwdebug1003 VM.

Everything here is now completely done, except mwmaint1002, which will be done during the DC switchover.

Completed auto-reimage of hosts:

['mw1403.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1402.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1307.eqiad.wmnet']

and were ALL successful.