
(Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet
Closed, Resolved (Public)

Description

This task tracks the setup of 20 new parse nodes in codfw: parse200[1-20] (parse2001 through parse2020).

Hostnames: parse200[1-20]
Racking Proposal:

Rack        A5  A8  B5  B8  C6  C5  D5  D8
mw servers   3   2   2   3   2   3   2   3
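
As a quick sanity check on the proposal, the sketch below expands the hostname range and confirms that the per-rack counts add up to the 20 hosts being installed. It reads the proposal as eight racks (A5, A8, B5, B8, C6, C5, D5, D8) holding 3, 2, 2, 3, 2, 3, 2, 3 servers. The names `RACK_PLAN` and `expand_hosts` are made up for illustration, and no host-to-rack assignment is implied because the task does not state one.

```python
# Rack -> number of servers, as read from the racking proposal.
RACK_PLAN = {
    "A5": 3, "A8": 2, "B5": 2, "B8": 3,
    "C6": 2, "C5": 3, "D5": 2, "D8": 3,
}


def expand_hosts(prefix="parse", first=2001, last=2020, domain="codfw.wmnet"):
    """Expand the parse200[1-20] shorthand into fully qualified hostnames."""
    return [f"{prefix}{n}.{domain}" for n in range(first, last + 1)]


hosts = expand_hosts()
planned = sum(RACK_PLAN.values())

# The plan should place exactly as many servers as there are hostnames.
print(f"{len(hosts)} hosts, {planned} rack slots planned")
```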

Networking/Subnet/VLAN/IP: single 1G production network connection.

  • - receive in system on procurement task T231255
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    (end of on-site specific steps)
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation

Event Timeline

Papaul triaged this task as Medium priority. Jan 18 2020, 12:21 AM
Papaul updated the task description.

parse200[1-7] racked and Netbox updated

@jijiki Are these going to be parsoid/PHP appservers? But we don't want to call them mw? Let's add the new name on https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions please.

Yes, I have updated wikitech

wiki_willy renamed this task from codfw: rack/setup/install parse200[1-20].codfw.wmnet to (Need by: TBD) codfw: rack/setup/install parse200[1-20].codfw.wmnet. Feb 26 2020, 1:47 AM

Change 575034 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DHCP: Add mgmt and production DNS for parse200[1-20]

https://gerrit.wikimedia.org/r/575034

There might be some uncertainty about naming here.

https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions says they are called "parsoid", but the ticket (and now the DHCP change) says they are "parse".

Please make sure with @jijiki which is the right one.

I edited the wiki to match parse*. Let's go ahead with parse in the interest of time; renaming everything would be quite some work.

Change 575034 merged by Papaul:
[operations/dns@master] DNS: Add mgmt and production DNS for parse200[1-20]

https://gerrit.wikimedia.org/r/575034

Change 575071 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Add MAC address for parse200[1-20]

https://gerrit.wikimedia.org/r/575071

Change 575071 merged by Dzahn:
[operations/puppet@production] DHCP: Add MAC address for parse200[1-20]

https://gerrit.wikimedia.org/r/575071

Change 575085 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] netboot: add parse* to use raid1-2dev partman recipe

https://gerrit.wikimedia.org/r/575085

Change 575085 merged by Dzahn:
[operations/puppet@production] netboot: add parse* to use raid1-2dev partman recipe

https://gerrit.wikimedia.org/r/575085

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002262201_pt1979_17219_parse2001_codfw_wmnet.log.
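
The reimage log paths in this timeline all follow the same visible pattern, so they can be split into fields when collecting results across hosts. The sketch below is only a reading of the file names as they appear here: it assumes the first field is a `YYYYMMDDHHMM` start time, the second is the launching user, and the last is the hostname with dots replaced by underscores. The meaning of the numeric third field is not stated in the task, so it is kept as an opaque number. `parse_log_name` is a hypothetical helper, not part of wmf-auto-reimage.

```python
import re
from datetime import datetime

# Pattern inferred from the log names in this task; not an official format.
LOG_NAME = re.compile(
    r"^(?P<stamp>\d{12})_(?P<user>[^_]+)_(?P<num>\d+)_(?P<host>.+)\.log$"
)


def parse_log_name(path):
    """Split a wmf-auto-reimage log path into its apparent fields."""
    name = path.rsplit("/", 1)[-1]
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised log name: {name}")
    return {
        "started": datetime.strptime(match["stamp"], "%Y%m%d%H%M"),
        "user": match["user"],
        "num": int(match["num"]),  # meaning not stated in the task
        # Assumption: underscores in the host part stand for dots.
        "host": match["host"].replace("_", "."),
    }


info = parse_log_name(
    "/var/log/wmf-auto-reimage/"
    "202002262201_pt1979_17219_parse2001_codfw_wmnet.log"
)
print(info["host"], info["started"].isoformat())
```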

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2002.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002262210_pt1979_18578_parse2002_codfw_wmnet.log.

Completed auto-reimage of hosts:

['parse2001.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['parse2002.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002262232_pt1979_25945_parse2003_codfw_wmnet.log.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2004.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002262233_pt1979_25998_parse2004_codfw_wmnet.log.

Completed auto-reimage of hosts:

['parse2003.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['parse2004.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2005.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002262300_pt1979_32678_parse2005_codfw_wmnet.log.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2006.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002262301_pt1979_32736_parse2006_codfw_wmnet.log.

Completed auto-reimage of hosts:

['parse2006.codfw.wmnet']

Of which those FAILED:

['parse2006.codfw.wmnet']

Change 575100 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add new parsoid nodes with spare role

https://gerrit.wikimedia.org/r/575100

Completed auto-reimage of hosts:

['parse2005.codfw.wmnet']

and were ALL successful.

@Volans I am trying to use the downtime cookbook to downtime a host before running the auto-reimage script, but I am getting the error below. What am I missing? Thanks

sudo cookbook sre.hosts.downtime -r new_install -t T243112 -H 1 parse2007.codfw.wmnet
START - Cookbook sre.hosts.downtime
Exception raised while executing cookbook sre.hosts.downtime:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/cookbook.py", line 409, in _run
    ret = self.module.run(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/downtime.py", line 56, in run
    remote_hosts = spicerack.remote().query(args.query)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 323, in query
    return RemoteHosts(self._config, hosts, dry_run=self._dry_run)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 373, in __init__
    raise RemoteError('No hosts provided')
spicerack.remote.RemoteError: No hosts provided

This can't work, for two reasons:

  • Icinga can only downtime hosts and services that are already defined, so it cannot downtime a new host before its installation
  • In the case above the cookbook fails because the host is not yet in PuppetDB, so the query parse2007.codfw.wmnet returns 0 hosts
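
The failure mode described above can be sketched in a few lines. This is not the spicerack implementation; it only mimics the behaviour visible in the traceback, where a query that matches nothing ends in a "No hosts provided" error. The `known_hosts` set stands in for what PuppetDB knows about, and `query_hosts`, `hosts_to_downtime`, and the local `RemoteError` class are hypothetical names.

```python
class RemoteError(Exception):
    """Local stand-in for the error type seen in the traceback."""


def query_hosts(query, known_hosts):
    """Return the known hosts matching the query, or fail if there are none."""
    matched = sorted(host for host in known_hosts if host == query)
    if not matched:
        # A host that has never run Puppet is unknown, so the result is empty.
        raise RemoteError("No hosts provided")
    return matched


def hosts_to_downtime(query, known_hosts):
    """Treat an unknown host as 'nothing to downtime' instead of an error."""
    try:
        return query_hosts(query, known_hosts)
    except RemoteError:
        return []


# Only already-installed hosts are known; parse2007 has not been imaged yet.
known = {"parse2001.codfw.wmnet", "parse2002.codfw.wmnet"}
print(hosts_to_downtime("parse2001.codfw.wmnet", known))
print(hosts_to_downtime("parse2007.codfw.wmnet", known))
```

For a first install there is simply nothing to downtime yet, which is why the second call returns an empty list rather than raising.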

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2009.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002270024_pt1979_17491_parse2009_codfw_wmnet.log.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2010.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002270025_pt1979_17609_parse2010_codfw_wmnet.log.

Completed auto-reimage of hosts:

['parse2009.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2011.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002270055_pt1979_24577_parse2011_codfw_wmnet.log.

Completed auto-reimage of hosts:

['parse2010.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['parse2011.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2012.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002270119_pt1979_30154_parse2012_codfw_wmnet.log.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2013.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002270127_pt1979_31123_parse2013_codfw_wmnet.log.

Completed auto-reimage of hosts:

['parse2012.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['parse2013.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2014.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002270208_pt1979_6723_parse2014_codfw_wmnet.log.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2015.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002270208_pt1979_6750_parse2015_codfw_wmnet.log.

Completed auto-reimage of hosts:

['parse2015.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2017.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002270231_pt1979_13452_parse2017_codfw_wmnet.log.

Completed auto-reimage of hosts:

['parse2014.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2016.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002270233_pt1979_13691_parse2016_codfw_wmnet.log.

Completed auto-reimage of hosts:

['parse2017.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['parse2016.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2018.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002270311_pt1979_21282_parse2018_codfw_wmnet.log.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2019.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002270312_pt1979_21883_parse2019_codfw_wmnet.log.

Completed auto-reimage of hosts:

['parse2018.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2020.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002270335_pt1979_28858_parse2020_codfw_wmnet.log.

Completed auto-reimage of hosts:

['parse2019.codfw.wmnet']

and were ALL successful.

All parse nodes are ready for service except parse200[7-8]. I think the problem is a wrong mgmt password; I will look into it at the DC tomorrow.

Completed auto-reimage of hosts:

['parse2020.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2007.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002271607_pt1979_1664_parse2007_codfw_wmnet.log.

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

parse2008.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202002271608_pt1979_1927_parse2008_codfw_wmnet.log.

Completed auto-reimage of hosts:

['parse2007.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['parse2008.codfw.wmnet']

and were ALL successful.

Papaul updated the task description.
Papaul added a subscriber: Joe.

@Dzahn @Joe all 20 servers ready for service

Change 575100 merged by Dzahn:
[operations/puppet@production] site: add new parsoid nodes with spare role

https://gerrit.wikimedia.org/r/575100

Change 578996 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] use new role(insetup) on a few hosts in setup

https://gerrit.wikimedia.org/r/578996

Change 578996 merged by Dzahn:
[operations/puppet@production] use new role(insetup) on a few hosts in setup

https://gerrit.wikimedia.org/r/578996

Change 579007 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add missing parse2010 to regex

https://gerrit.wikimedia.org/r/579007

Change 579007 merged by Dzahn:
[operations/puppet@production] site: add missing parse2010 to regex

https://gerrit.wikimedia.org/r/579007
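
The task does not show the node regex that missed parse2010, so the following is only an illustration of how a numeric host range can end up with a gap and how to check for one. Both patterns are invented for this sketch (`GAPPY_RE` deliberately skips the tens boundary), and Python's `re` is used here rather than the Puppet manifest syntax.

```python
import re

# Hypothetical pattern covering parse2001 through parse2020.
HOST_RE = re.compile(r"^parse20(0[1-9]|1[0-9]|20)\.codfw\.wmnet$")

# Hypothetical pattern with a gap: 1[1-9] skips parse2010.
GAPPY_RE = re.compile(r"^parse20(0[1-9]|1[1-9]|20)\.codfw\.wmnet$")


def missing_hosts(pattern, first=2001, last=2020):
    """List the hosts in the expected range that the pattern fails to match."""
    expected = [f"parse{n}.codfw.wmnet" for n in range(first, last + 1)]
    return [host for host in expected if not pattern.match(host)]


print(missing_hosts(HOST_RE))
print(missing_hosts(GAPPY_RE))
```

Enumerating the expected range and testing every name against the pattern catches this kind of omission before a host silently falls through to a default node definition.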

akosiaris updated the task description.

Re-opening. This was wrongly closed: the last 2 items in the checklist have not been completed.

@akosiaris There is no need to reopen the task, since this needs to be done by the service owner on another task and not on the racking/setup task. Once the server is in staged state, dc-ops can resolve the racking/setup task.

Thanks.

It should continue on T247441.

I agree with Papaul but we should probably update some template with the checkboxes somewhere.

Papaul updated the task description.

@Papaul Makes sense. I think we need to adapt the template a bit as @Dzahn mentioned. The main reason I reopened the task was that the last 2 bullet points

  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

appeared to not be done.

I'd suggest we make the first bullet point something along the lines of:

  • Create handing off task to service implementers.
  • Change to 'staged' in netbox.
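
The split of responsibilities discussed here can be written down as a small state model: dc-ops takes a host from planned to staged, and the service implementer takes it from staged to active on a separate task. The sketch below is not the Netbox API; `Status`, `ALLOWED`, `OWNER`, and `advance` are hypothetical names used only to make the hand-off explicit.

```python
from enum import Enum


class Status(Enum):
    PLANNED = "planned"
    STAGED = "staged"
    ACTIVE = "active"


# Which status may follow which, per the checklist in this task.
ALLOWED = {
    Status.PLANNED: {Status.STAGED},
    Status.STAGED: {Status.ACTIVE},
    Status.ACTIVE: set(),
}

# Who is expected to make the change into each status.
OWNER = {
    Status.STAGED: "dc-ops",
    Status.ACTIVE: "service implementer",
}


def advance(current, target):
    """Return the new status and who owns that change, or reject the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target, OWNER[target]


status, owner = advance(Status.PLANNED, Status.STAGED)
print(status.value, "set by", owner)
```

Under this reading the racking/setup task is complete once the host reaches staged, which matches the point at which dc-ops resolves it.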

@akosiaris I do agree with the change. I will pass the suggestion to the other members of the team.

Thanks.