
Reallocate former image scalers
Open, Stalled, High, Public

Description

We have two hosts in eqiad (mw1297 and mw1298) and four hosts in codfw (mw2150, mw2151, mw2244 and mw2245) that were formerly used as image scalers. Once the current HHVM/stretch migration (and ideally the merge of job runners/video scalers) is complete, we can repurpose them for other mw* roles. Since they are currently unused, we could also take the opportunity to move them to other racks if that helps balance the rows.

  • mw1297 reinstalled
  • mw1297 reallocated -> API
  • mw1298 reinstalled
  • mw1298 reallocated -> thumbor -> T215332 (T221132)
  • mw2150 reinstalled
  • mw2150 reallocated -> jobrunner
  • mw2151 reinstalled (was jessie unlike others)
  • mw2151 reallocated -> jobrunner, added to conftool after it was missing
  • mw2244 reinstalled
  • mw2244 reallocated -> API
  • mw2245 reinstalled
  • mw2245 reallocated -> API

Event Timeline


Change 430518 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] rename wmf6936 from mw1297 to mwmaint1001

https://gerrit.wikimedia.org/r/430518

Change 430518 merged by Dzahn:
[operations/dns@master] rename wmf6936 from mw1297 to mwmaint1001

https://gerrit.wikimedia.org/r/430518

mwmaint1001 should be reinstalled as mw1297 and go back into the pool.

But this has to wait until after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/461492/ (and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465645/).

Change 465685 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mariadb: remove mwmaint1001 from prod-m5 SQL grants

https://gerrit.wikimedia.org/r/465685

Change 465686 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] network::constants: remove mwmaint1001

https://gerrit.wikimedia.org/r/465686

Change 465689 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] Revert "rename wmf6936 from mw1297 to mwmaint1001"

https://gerrit.wikimedia.org/r/465689

Change 465689 abandoned by Dzahn:
Revert "rename wmf6936 from mw1297 to mwmaint1001"

Reason:
cant rebase cleanly and for some reason "fatal: Couldn't find remote ref refs/changes/89/465689/2" for me right now

https://gerrit.wikimedia.org/r/465689

Change 465689 restored by Dzahn:
Revert "rename wmf6936 from mw1297 to mwmaint1001"

https://gerrit.wikimedia.org/r/465689

Change 466773 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] re-add mw1297 to site.pp and DHCP, formerly mwmaint1001

https://gerrit.wikimedia.org/r/466773

Change 466773 merged by Dzahn:
[operations/puppet@production] re-add mw1297 to site.pp and DHCP, remove mwmaint1001

https://gerrit.wikimedia.org/r/466773

Mentioned in SAL (#wikimedia-operations) [2018-10-11T22:30:02Z] <mutante> mwmaint1001 - shutting down after final backup of /home, renaming back to mw1297 in DNS and DHCP, and reinstalling (T192457)

Change 465689 merged by Dzahn:
[operations/dns@master] Revert "rename wmf6936 from mw1297 to mwmaint1001"

https://gerrit.wikimedia.org/r/465689

Mentioned in SAL (#wikimedia-operations) [2018-10-11T22:50:35Z] <mutante> netbox - renamed mwmaint1001 to mw1279, changed status to inventory, renamed in DNS - T192457

Mentioned in SAL (#wikimedia-operations) [2018-10-11T22:53:17Z] <mutante> netbox - correction, mwmaint1001 to status "Staged", following new lifecycle docs T192457

Script wmf-auto-reimage was launched by dzahn on neodymium.eqiad.wmnet for hosts:

['mw1297.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201810112309_dzahn_14010.log.

Script wmf-auto-reimage was launched by dzahn on neodymium.eqiad.wmnet for hosts:

['mw1297.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201810112318_dzahn_16644.log.

mw1297: done, renamed in DNS/DHCP, reinstalled, in Icinga again, renamed in netbox, changed netbox status to "Staged" per new lifecycle docs

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=mw1297

https://netbox.wikimedia.org/dcim/devices/653/

[mw1297:~] $ uptime
23:51:06 up 1 min,

Dzahn updated the task description. Oct 11 2018, 11:52 PM

Completed auto-reimage of hosts:

['mw1297.eqiad.wmnet']

and were ALL successful.

Dzahn updated the task description. Oct 11 2018, 11:53 PM

Change 465686 merged by Dzahn:
[operations/puppet@production] network::constants: remove mwmaint1001

https://gerrit.wikimedia.org/r/465686

Change 466947 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: fix mwmaint1001 -> mw1297 fixed address

https://gerrit.wikimedia.org/r/466947

Change 466947 merged by Dzahn:
[operations/puppet@production] DHCP: fix mwmaint1001 -> mw1297 fixed address

https://gerrit.wikimedia.org/r/466947

Mentioned in SAL (#wikimedia-operations) [2018-10-16T08:42:03Z] <moritzm> removed mwmaint1001 from debmonitor (T192457)

Change 465685 merged by Marostegui:
[operations/puppet@production] mariadb: remove mwmaint1001 from prod-m5 SQL grants

https://gerrit.wikimedia.org/r/465685

@Joe I think you already have a preference for what these should be used for, right?

Dzahn reassigned this task from Dzahn to Joe. Oct 27 2018, 1:13 AM
Dzahn added a subscriber: Dzahn.

I had this task in order to get the former mwmaint1001 back into the "spare" pool. That is done. Happy to also help with reinstalling the others, but you know which role you wanted them for. Feel free to assign it back after commenting.

jijiki added a subscriber: jijiki. Nov 2 2018, 1:44 PM

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['mw2151.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201901092334_dzahn_58052.log.

Mentioned in SAL (#wikimedia-operations) [2019-01-09T23:39:58Z] <mutante> mw2151 - change netbox status from active to staged - it's not actually active, it's role(spare) and was jessie (T192457)

Completed auto-reimage of hosts:

['mw2151.codfw.wmnet']

and were ALL successful.

Dzahn updated the task description. Jan 10 2019, 5:41 PM

Change 483476 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add mw2151 as another jobrunner host

https://gerrit.wikimedia.org/r/483476

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['mw1298.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201901102157_dzahn_89266.log.

Completed auto-reimage of hosts:

['mw1298.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['mw2244.codfw.wmnet', 'mw2245.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201901102307_dzahn_106668.log.

Completed auto-reimage of hosts:

['mw2244.codfw.wmnet', 'mw2245.codfw.wmnet']

and were ALL successful.

Dzahn updated the task description. Jan 11 2019, 12:12 AM

Change 483476 merged by Dzahn:
[operations/puppet@production] site: add mw2151 as another jobrunner host

https://gerrit.wikimedia.org/r/483476

Change 485968 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: mw2151 should not use spare role anymore

https://gerrit.wikimedia.org/r/485968

Change 485968 merged by Dzahn:
[operations/puppet@production] site: mw2151 should not use spare role anymore

https://gerrit.wikimedia.org/r/485968

@Dzahn https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/485968/ removed the spare role from mw2151, but the host is still installed with role(spare) and puppet is failing:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret mcrouter/mw2151.codfw.wmnet/mw2151.codfw.wmnet.crt.pem at /etc/puppet/modules/profile/manifests/mediawiki/mcrouter_wancache.pp:70:24 on node mw2151.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2019-02-25T08:49:09Z] <_joe_> generating mcrouter certificate for mw2151 T192457

Joe added a comment. Feb 25 2019, 8:55 AM

@Dzahn https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/485968/ removed the spare role from mw2151, but the host is still installed with role(spare) and puppet is failing:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret mcrouter/mw2151.codfw.wmnet/mw2151.codfw.wmnet.crt.pem at /etc/puppet/modules/profile/manifests/mediawiki/mcrouter_wancache.pp:70:24 on node mw2151.codfw.wmnet

Just solved by adding the certificate; instructions are at https://wikitech.wikimedia.org/wiki/Mcrouter

RobH added a subscriber: RobH. Mar 14 2019, 5:31 PM

Please note we may steal mw2245 for thumbor1005 use on T218323

During the HHVM updates I noticed that mw2151 is in site.pp as a jobrunner, but not listed in conftool-data.

Change 504791 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/mw: assign spare mw1297,mw1298 as API servers

https://gerrit.wikimedia.org/r/504791

Change 504793 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] conftool: add mw2151 as a jobrunner

https://gerrit.wikimedia.org/r/504793

During the HHVM updates I noticed that mw2151 is in site.pp as a jobrunner, but not listed in conftool-data.

https://gerrit.wikimedia.org/r/504793

Dzahn updated the task description. Apr 17 2019, 9:35 PM

Change 504794 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/mw/conftool: assign mw2150 as jobrunner, mw22244 as API server

https://gerrit.wikimedia.org/r/504794

Change 504793 merged by Dzahn:
[operations/puppet@production] conftool: add mw2151 as a jobrunner

https://gerrit.wikimedia.org/r/504793

During the HHVM updates I noticed that mw2151 is in site.pp as a jobrunner, but not listed in conftool-data.

Thanks! Added to conftool-data now.
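
The drift described here (a host assigned a role in site.pp but missing from conftool-data) is the kind of mismatch a small consistency check could catch. The following is only a minimal sketch, assuming the hostnames have already been extracted from both sources into plain Python structures. The function name `find_missing_from_conftool`, the input shapes, and the host `mw9999.codfw.wmnet` are hypothetical and not part of the actual puppet or conftool tooling.

```python
def find_missing_from_conftool(site_roles, conftool_hosts):
    """Return hosts that have a role in site.pp but no conftool-data entry.

    site_roles: dict of hostname -> role (hypothetical extract of site.pp)
    conftool_hosts: dict of cluster -> set of hostnames
                    (hypothetical extract of conftool-data)
    """
    known = set()
    for hosts in conftool_hosts.values():
        known.update(hosts)
    return sorted(host for host in site_roles if host not in known)


# Illustrative data only; mw9999 is a made-up host used for contrast.
site_roles = {
    "mw2151.codfw.wmnet": "jobrunner",
    "mw9999.codfw.wmnet": "jobrunner",
}
conftool_hosts = {
    "jobrunner": {"mw9999.codfw.wmnet"},
}

missing = find_missing_from_conftool(site_roles, conftool_hosts)
print(missing)
```

With the sample data, `missing` contains only `mw2151.codfw.wmnet`, mirroring the situation reported in this task.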

Please note we may steal mw2245 for thumbor1005 use on T218323

@RobH @jijiki I included that in my gerrit change https://gerrit.wikimedia.org/r/c/operations/puppet/+/504794 but as pointed out by joe while reviewing it, that is mixing eqiad and codfw?

Dzahn updated the task description. Apr 18 2019, 5:25 PM
RobH added a comment. Apr 18 2019, 5:26 PM

Please note we may steal mw2245 for thumbor1005 use on T218323

@RobH @jijiki I included that in my gerrit change https://gerrit.wikimedia.org/r/c/operations/puppet/+/504794 but as pointed out by joe while reviewing it, that is mixing eqiad and codfw?

Indeed, I was wrong and we cannot do that! We'll have to pick another mw system to steal for thumbor. Good catch!

Dzahn added a comment. Edited Apr 18 2019, 5:50 PM

13:40 <+icinga-wm> RECOVERY - mediawiki-installation DSH group on mw2151 is OK: OK https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist
13:49 < mutante> !log mw2151 - scap pull
..
14:02 <+logmsgbot> !log dzahn@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2151.codfw.wmnet,cluster=jobrunner,service=nginx

Dzahn updated the task description. Apr 18 2019, 6:04 PM
Dzahn updated the task description. Apr 18 2019, 6:17 PM

mw2150 was not in this ticket until now, but it was in site.pp as another spare under the "Former imagescalers" section. Added it to the ticket. Checking whether it has been reinstalled.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['mw2150.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201904181840_dzahn_4813.log.

Dzahn updated the task description. Apr 18 2019, 6:44 PM
Dzahn updated the task description.
Dzahn updated the task description. Apr 18 2019, 6:58 PM

Completed auto-reimage of hosts:

['mw2150.codfw.wmnet']

and were ALL successful.

Change 505315 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[labs/private@master] add fake certs for mw2150,mw2244,mw2245

https://gerrit.wikimedia.org/r/505315

Change 505315 merged by Dzahn:
[labs/private@master] add fake certs for mw2150,mw2244,mw2245

https://gerrit.wikimedia.org/r/505315

Change 504794 merged by Dzahn:
[operations/puppet@production] site/conftool: assign mw2150 jobrunner, mw2244,mw2245 API servers

https://gerrit.wikimedia.org/r/504794

Dzahn added a comment. Edited Apr 19 2019, 10:54 PM

17:42 < mutante> !log mw2150,mw2244,mw2245: initial puppet run, added to mw roles

18:53 < mutante> !log mw2244,mw2245,mw2150 - rebooting for known nutcracker issue after first install

18:53 <+icinga-wm> RECOVERY - Check systemd state on mw2244 is OK: OK - running: The system is fully operational

18:55 < mutante> !log mw2244,mw2245,mw2150 - scap pull

Dzahn updated the task description. Apr 19 2019, 11:17 PM

19:10 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2150.codfw.wmnet,service=nginx,cluster=jobrunner

19:16 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2244.codfw.wmnet,cluster=api_appserver

19:17 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2245.codfw.wmnet,cluster=api_appserver
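
The pooling entries above share a regular shape: an action string followed by a comma-separated selector. As a minimal sketch for anyone who wants to audit such entries, the snippet below splits one logged line into its parts. The regular expression and the function name `parse_conftool_action` are assumptions derived only from the log lines quoted in this task, not from conftool itself.

```python
import re

# Pattern inferred from the logmsgbot lines quoted in this task (assumption).
CONFTOOL_ACTION = re.compile(
    r"conftool action : (?P<action>[^;]+); selector: (?P<selector>\S+)"
)


def parse_conftool_action(line):
    """Split a logged conftool action into its action and selector fields.

    Returns None when the line does not look like a conftool action entry.
    """
    match = CONFTOOL_ACTION.search(line)
    if match is None:
        return None
    # Each selector item is a key=value pair, e.g. cluster=jobrunner.
    selector = dict(
        pair.split("=", 1) for pair in match.group("selector").split(",")
    )
    return {"action": match.group("action").strip(), "selector": selector}


line = (
    "!log dzahn@cumin1001 conftool action : set/pooled=yes; "
    "selector: name=mw2150.codfw.wmnet,service=nginx,cluster=jobrunner"
)
parsed = parse_conftool_action(line)
print(parsed)
```

Collecting the parsed selectors per host makes it easy to compare what was pooled against the checklist in the task description.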

Mentioned in SAL (#wikimedia-operations) [2019-04-23T23:25:49Z] <mutante> generating mcrouter certs for appservers, added mw1297.eqiad.wmnet (T192457)

Change 504791 merged by Dzahn:
[operations/puppet@production] site/mw/conftool: assign spare mw1297 as API server

https://gerrit.wikimedia.org/r/504791

Mentioned in SAL (#wikimedia-operations) [2019-04-24T18:47:22Z] <mutante> pooled mw1297 as a new API server (T192457)

Dzahn updated the task description. Apr 24 2019, 6:47 PM
Dzahn changed the task status from Open to Stalled. Apr 24 2019, 6:49 PM

This ticket is done except for one checkbox, which is covered by T215332 unless a different server is used; confirming that in T215332#5133171.

Dzahn updated the task description. Apr 24 2019, 6:50 PM
Dzahn moved this task from Backlog to Doing on the serviceops board.