Reallocate former image scalers
Closed, ResolvedPublic

Description

We have two hosts in eqiad (mw1297 and mw1298) and four hosts in codfw (mw2150, mw2151, mw2244 and mw2245) which were formerly used as image scalers. Once the current HHVM/stretch migration (and ideally the merge of job runners and video scalers) is complete, we can repurpose them for other mw* roles. Since they are currently unused, we could also take the opportunity to move them to other racks if that helps balance the rows.

  • mw1297 reinstalled
  • mw1297 reallocated -> API
  • mw1298 reinstalled
  • mw1298 reallocated -> jobrunner
  • mw2150 reinstalled
  • mw2150 reallocated -> jobrunner
  • mw2151 reinstalled (was jessie unlike others)
  • mw2151 reallocated -> jobrunner, added to conftool after it was missing
  • mw2244 reinstalled
  • mw2244 reallocated -> API
  • mw2245 reinstalled
  • mw2245 reallocated -> API
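
The final allocation in the checklist above can also be written as a small lookup table, which makes the split between roles easy to see. This is only an illustrative sketch: `ALLOCATION` and `hosts_by_role` are hypothetical names and are not part of any Wikimedia tooling.

```python
from collections import defaultdict

# Final role of each former image scaler, copied from the checklist above.
ALLOCATION = {
    "mw1297": "API",
    "mw1298": "jobrunner",
    "mw2150": "jobrunner",
    "mw2151": "jobrunner",
    "mw2244": "API",
    "mw2245": "API",
}


def hosts_by_role(allocation):
    """Group host names by their assigned role (hypothetical helper)."""
    grouped = defaultdict(list)
    for host, role in sorted(allocation.items()):
        grouped[role].append(host)
    return dict(grouped)


if __name__ == "__main__":
    for role, hosts in hosts_by_role(ALLOCATION).items():
        print(f"{role}: {', '.join(hosts)}")
```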

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2018-10-11T22:30:02Z] <mutante> mwmaint1001 - shutting down after final backup of /home, renaming back to mw1297 in DNS and DHCP, and reinstalling (T192457)

Change 465689 merged by Dzahn:
[operations/dns@master] Revert "rename wmf6936 from mw1297 to mwmaint1001"

https://gerrit.wikimedia.org/r/465689

Mentioned in SAL (#wikimedia-operations) [2018-10-11T22:50:35Z] <mutante> netbox - renamed mwmaint1001 to mw1297, changed status to inventory, renamed in DNS - T192457

Mentioned in SAL (#wikimedia-operations) [2018-10-11T22:53:17Z] <mutante> netbox - correction, mwmaint1001 to status "Staged", following new lifecycle docs T192457

Script wmf-auto-reimage was launched by dzahn on neodymium.eqiad.wmnet for hosts:

['mw1297.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201810112309_dzahn_14010.log.

Script wmf-auto-reimage was launched by dzahn on neodymium.eqiad.wmnet for hosts:

['mw1297.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201810112318_dzahn_16644.log.

mw1297: done, renamed in DNS/DHCP, reinstalled, in Icinga again, renamed in netbox, changed netbox status to "Staged" per new lifecycle docs

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=mw1297

https://netbox.wikimedia.org/dcim/devices/653/

[mw1297:~] $ uptime
23:51:06 up 1 min,

Completed auto-reimage of hosts:

['mw1297.eqiad.wmnet']

and were ALL successful.

Change 465686 merged by Dzahn:
[operations/puppet@production] network::constants: remove mwmaint1001

https://gerrit.wikimedia.org/r/465686

Change 466947 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: fix mwmaint1001 -> mw1297 fixed address

https://gerrit.wikimedia.org/r/466947

Change 466947 merged by Dzahn:
[operations/puppet@production] DHCP: fix mwmaint1001 -> mw1297 fixed address

https://gerrit.wikimedia.org/r/466947

Mentioned in SAL (#wikimedia-operations) [2018-10-16T08:42:03Z] <moritzm> removed mwmaint1001 from debmonitor (T192457)

Change 465685 merged by Marostegui:
[operations/puppet@production] mariadb: remove mwmaint1001 from prod-m5 SQL grants

https://gerrit.wikimedia.org/r/465685

@Joe I think you already have a preference for what these should be used for, right?

Dzahn subscribed.

I had this task assigned to get the former mwmaint1001 back into the "spare" pool. That is done. I am happy to help reinstall the others as well, but you know which role you wanted them for. Feel free to assign it back to me after commenting.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['mw2151.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201901092334_dzahn_58052.log.

Mentioned in SAL (#wikimedia-operations) [2019-01-09T23:39:58Z] <mutante> mw2151 - change netbox status from active to staged - it's not actually active, it's role(spare) and was jessie (T192457)

Completed auto-reimage of hosts:

['mw2151.codfw.wmnet']

and were ALL successful.

Change 483476 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add mw2151 as another jobrunner host

https://gerrit.wikimedia.org/r/483476

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['mw1298.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201901102157_dzahn_89266.log.

Completed auto-reimage of hosts:

['mw1298.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['mw2244.codfw.wmnet', 'mw2245.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201901102307_dzahn_106668.log.

Completed auto-reimage of hosts:

['mw2244.codfw.wmnet', 'mw2245.codfw.wmnet']

and were ALL successful.

Change 483476 merged by Dzahn:
[operations/puppet@production] site: add mw2151 as another jobrunner host

https://gerrit.wikimedia.org/r/483476

Change 485968 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: mw2151 should not use spare role anymore

https://gerrit.wikimedia.org/r/485968

Change 485968 merged by Dzahn:
[operations/puppet@production] site: mw2151 should not use spare role anymore

https://gerrit.wikimedia.org/r/485968

@Dzahn https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/485968/ removed the spare role from mw2151, but the host is still installed with role(spare) and puppet is failing:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret mcrouter/mw2151.codfw.wmnet/mw2151.codfw.wmnet.crt.pem at /etc/puppet/modules/profile/manifests/mediawiki/mcrouter_wancache.pp:70:24 on node mw2151.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2019-02-25T08:49:09Z] <_joe_> generating mcrouter certificate for mw2151 T192457

Just solved by adding the certificate, instructions are at https://wikitech.wikimedia.org/wiki/Mcrouter
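
A check like the following could flag hosts that would hit the same `secret()` failure before a role change is merged. It is a minimal sketch: the path pattern is inferred only from the single error message above, and `mcrouter_cert_secret`, `hosts_missing_cert`, and the `known_secrets` set are hypothetical, not an existing Puppet or Wikimedia API.

```python
def mcrouter_cert_secret(fqdn):
    """Return the secret name Puppet asked for in the error above.

    The pattern mcrouter/<fqdn>/<fqdn>.crt.pem is inferred from that one
    error message; it is an assumption that other hosts follow it.
    """
    return f"mcrouter/{fqdn}/{fqdn}.crt.pem"


def hosts_missing_cert(fqdns, known_secrets):
    """List hosts whose expected certificate is not in known_secrets."""
    return [h for h in fqdns if mcrouter_cert_secret(h) not in known_secrets]


if __name__ == "__main__":
    # Hypothetical state before the certificate was generated.
    known = set()
    print(hosts_missing_cert(["mw2151.codfw.wmnet"], known))
```

How `known_secrets` would be populated (for example, by listing the private secrets directory on the puppetmaster) is left open here, since the ticket does not describe that layout.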

Please note we may steal mw2245 for thumbor1005 use on T218323

During the HHVM updates I noticed that mw2151 is in site.pp as a jobrunner, but not listed in conftool-data.
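
This kind of drift, where a host has a role in site.pp but no entry in conftool-data, comes down to a set difference. The sketch below assumes the two host lists have already been extracted by some other means; it does not parse site.pp or conftool-data, and `missing_from_conftool` is a hypothetical helper.

```python
def missing_from_conftool(site_hosts, conftool_hosts):
    """Return hosts that have a role in site.pp but no conftool entry."""
    return sorted(set(site_hosts) - set(conftool_hosts))


if __name__ == "__main__":
    # Illustrative input reflecting the situation described above:
    # mw2151 is a jobrunner in site.pp but absent from conftool-data.
    site_jobrunners = ["mw2151.codfw.wmnet"]
    conftool_jobrunners = []
    print(missing_from_conftool(site_jobrunners, conftool_jobrunners))
```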

Change 504791 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/mw: assign spare mw1297,mw1298 as API servers

https://gerrit.wikimedia.org/r/504791

Change 504793 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] conftool: add mw2151 as a jobrunner

https://gerrit.wikimedia.org/r/504793

Change 504794 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/mw/conftool: assign mw2150 as jobrunner, mw2244 as API server

https://gerrit.wikimedia.org/r/504794

Change 504793 merged by Dzahn:
[operations/puppet@production] conftool: add mw2151 as a jobrunner

https://gerrit.wikimedia.org/r/504793

Thanks! Added to conftool-data now.

Please note we may steal mw2245 for thumbor1005 use on T218323

@RobH @jijiki I included that in my gerrit change https://gerrit.wikimedia.org/r/c/operations/puppet/+/504794 but as pointed out by joe while reviewing it, that is mixing eqiad and codfw?

Indeed, I was wrong and we cannot do that! We'll have to pick another mw system to steal for thumbor. Good catch!
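
The site mismatch caught in review can be checked mechanically from the host names. The sketch below assumes that the first digit of the host number encodes the site (1 for eqiad, 2 for codfw), which matches every host named in this task but is treated here as an assumption. `site_of` and `can_repurpose` are hypothetical helpers.

```python
import re

# Assumption: the first digit of the host number encodes the site,
# matching the hosts in this task (mw1297 -> eqiad, mw2151 -> codfw).
SITE_BY_DIGIT = {"1": "eqiad", "2": "codfw"}


def site_of(hostname):
    """Return the site for a short host name or FQDN such as mw2245."""
    short = hostname.split(".", 1)[0]
    match = re.fullmatch(r"[a-z]+(\d)\d{3}", short)
    if match is None or match.group(1) not in SITE_BY_DIGIT:
        raise ValueError(f"cannot derive site from {hostname!r}")
    return SITE_BY_DIGIT[match.group(1)]


def can_repurpose(donor, target):
    """True only if the donor hardware sits in the target's site."""
    return site_of(donor) == site_of(target)


if __name__ == "__main__":
    # mw2245 is in codfw while thumbor1005 would be in eqiad.
    print(can_repurpose("mw2245", "thumbor1005"))
```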

13:40 <+icinga-wm> RECOVERY - mediawiki-installation DSH group on mw2151 is OK: OK https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist
13:49 < mutante> !log mw2151 - scap pull
..
14:02 <+logmsgbot> !log dzahn@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2151.codfw.wmnet,cluster=jobrunner,service=nginx

mw2150 was not in this ticket until now, but it was in site.pp as another spare under the "Former imagescalers" section. Added to the ticket. Checking whether it has been reinstalled.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['mw2150.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201904181840_dzahn_4813.log.

Dzahn updated the task description.

Completed auto-reimage of hosts:

['mw2150.codfw.wmnet']

and were ALL successful.

Change 505315 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[labs/private@master] add fake certs for mw2150,mw2244,mw2245

https://gerrit.wikimedia.org/r/505315

Change 505315 merged by Dzahn:
[labs/private@master] add fake certs for mw2150,mw2244,mw2245

https://gerrit.wikimedia.org/r/505315

Change 504794 merged by Dzahn:
[operations/puppet@production] site/conftool: assign mw2150 jobrunner, mw2244,mw2245 API servers

https://gerrit.wikimedia.org/r/504794

17:42 < mutante> !log mw2150,mw2244,mw2245: initial puppet run, added to mw roles

18:53 < mutante> !log mw2244,mw2245,mw2150 - rebooting for known nutcracker issue after first install

18:53 <+icinga-wm> RECOVERY - Check systemd state on mw2244 is OK: OK - running: The system is fully operational

18:55 < mutante> !log mw2244,mw2245,mw2150 - scap pull

19:10 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2150.codfw.wmnet,service=nginx,cluster=jobrunner

19:16 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2244.codfw.wmnet,cluster=api_appserver

19:17 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2245.codfw.wmnet,cluster=api_appserver
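
For auditing which hosts were pooled and when, the logmsgbot lines above are regular enough to parse. This is a rough sketch based only on the line format visible in this ticket; `parse_conftool_action` is a hypothetical helper and not part of conftool.

```python
def parse_conftool_action(line):
    """Split a logged conftool action into its action and selector parts."""
    marker = "conftool action : "
    _, _, rest = line.partition(marker)
    if not rest:
        raise ValueError("not a conftool action line")
    action, _, selector_part = rest.partition("; selector: ")
    if not selector_part:
        raise ValueError("no selector found in line")
    # The selector is a comma-separated list of key=value pairs.
    selector = dict(item.split("=", 1) for item in selector_part.strip().split(","))
    return {"action": action.strip(), "selector": selector}


if __name__ == "__main__":
    line = (
        "19:10 <+logmsgbot> !log dzahn@cumin1001 conftool action : "
        "set/pooled=yes; selector: name=mw2150.codfw.wmnet,"
        "service=nginx,cluster=jobrunner"
    )
    print(parse_conftool_action(line))
```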

Mentioned in SAL (#wikimedia-operations) [2019-04-23T23:25:49Z] <mutante> generating mcrouter certs for appservers, added mw1297.eqiad.wmnet (T192457)

Change 504791 merged by Dzahn:
[operations/puppet@production] site/mw/conftool: assign spare mw1297 as API server

https://gerrit.wikimedia.org/r/504791

Mentioned in SAL (#wikimedia-operations) [2019-04-24T18:47:22Z] <mutante> pooled mw1297 as a new API server (T192457)

Dzahn changed the task status from Open to Stalled. Apr 24 2019, 6:49 PM

This ticket is done except for one checkbox, which depends on T215332 unless a different server is used for that. I am confirming this in T215332#5133171.

jijiki lowered the priority of this task from High to Medium. Jun 24 2019, 3:32 PM
jijiki moved this task from Doing 😎 to API Gateway 🥌 on the serviceops board.

@Dzahn We will be moving Thumbor to k8s (T233196), so we can repurpose the spare server for something else :)

Change 537658 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] allocate mw1298 as a jobrunner

https://gerrit.wikimedia.org/r/537658

Change 537658 merged by Dzahn:
[operations/puppet@production] site: allocate mw1298 as a jobrunner, add to conftool

https://gerrit.wikimedia.org/r/537658

Mentioned in SAL (#wikimedia-operations) [2019-09-19T17:55:11Z] <mutante> puppetmaster1001 - add mcrouter cert for mw1298.eqiad.wmnet (T192457)

Dzahn claimed this task.
Dzahn updated the task description.