
rack/setup/install ores2001-2009
Closed, Resolved · Public

Description

This task will track the racking, setup, and installation of 9 systems: ores2001-2009. These were ordered on T161723, and originally requested by @akosiaris on T142578.

The racking locations have been assumed by @RobH, and must be verified by @akosiaris before this can be properly handled by @Papaul.

Alex: Please confirm whether these should be racked with horizontal spread (across racks/rows) or whether there is some limitation to this service that requires a different kind of setup. I (@RobH) have assumed that we want to spread these out as much as possible. Also, please review the hostname proposal of oresXXXX. If this isn't going to work, please adjust this task's description with the acceptable hostname, and update the naming conventions.

Racking Proposal: There are 9 systems; spread them evenly between racks and rows. This works out to 2 per row, with one row taking 3 instead of 2. Please place these in 1Gbit networking racks, and otherwise place them where you have the most power, network, and rackspace availability per row. None of these new hosts should be in the same rack as one another, to increase horizontal redundancy.
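The even spread described above amounts to a simple round-robin over the rows. A minimal sketch, assuming four rows labelled A-D (the labels are placeholders, not this task's actual rack assignments):

```shell
# Hypothetical sketch: distribute ores2001-2009 round-robin across
# four rows. One row (A) ends up with 3 hosts, the rest with 2.
rows=(A B C D)
for n in $(seq 0 8); do
  host=$(printf 'ores20%02d' $(( n + 1 )))
  echo "$host -> row ${rows[$(( n % 4 ))]}"
done
```

With 9 hosts over 4 rows, the first row wraps around and receives a third host, matching the "2 per row, with one row having 3" note above.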

ores2001

  • - system received in from procurement task T161723.
  • - system racked according to racking proposal.
  • - bios/drac/serial setup/testing
  • - mgmt and production dns entries added (internal vlan)
  • - sub-task created in networking netops SRE project for network port setup, include all port info
  • - network port setup (description, enable, internal vlan)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

ores2002

  • - system received in from procurement task T161723.
  • - system racked according to racking proposal.
  • - bios/drac/serial setup/testing
  • - mgmt and production dns entries added (internal vlan)
  • - sub-task created in networking netops SRE project for network port setup, include all port info
  • - network port setup (description, enable, internal vlan)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

ores2003

  • - system received in from procurement task T161723.
  • - system racked according to racking proposal.
  • - bios/drac/serial setup/testing
  • - mgmt and production dns entries added (internal vlan)
  • - sub-task created in networking netops SRE project for network port setup, include all port info
  • - network port setup (description, enable, internal vlan)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

ores2004

  • - system received in from procurement task T161723.
  • - system racked according to racking proposal.
  • - bios/drac/serial setup/testing
  • - mgmt and production dns entries added (internal vlan)
  • - sub-task created in networking netops SRE project for network port setup, include all port info
  • - network port setup (description, enable, internal vlan)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

ores2005

  • - system received in from procurement task T161723.
  • - system racked according to racking proposal.
  • - bios/drac/serial setup/testing
  • - mgmt and production dns entries added (internal vlan)
  • - sub-task created in networking netops SRE project for network port setup, include all port info
  • - network port setup (description, enable, internal vlan)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

ores2006

  • - system received in from procurement task T161723.
  • - system racked according to racking proposal.
  • - bios/drac/serial setup/testing
  • - mgmt and production dns entries added (internal vlan)
  • - sub-task created in networking netops SRE project for network port setup, include all port info
  • - network port setup (description, enable, internal vlan)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

ores2007

  • - system received in from procurement task T161723.
  • - system racked according to racking proposal.
  • - bios/drac/serial setup/testing
  • - mgmt and production dns entries added (internal vlan)
  • - sub-task created in networking netops SRE project for network port setup, include all port info
  • - network port setup (description, enable, internal vlan)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

ores2008

  • - system received in from procurement task T161723.
  • - system racked according to racking proposal.
  • - bios/drac/serial setup/testing
  • - mgmt and production dns entries added (internal vlan)
  • - sub-task created in networking netops SRE project for network port setup, include all port info
  • - network port setup (description, enable, internal vlan)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

ores2009

  • - system received in from procurement task T161723.
  • - system racked according to racking proposal.
  • - bios/drac/serial setup/testing
  • - mgmt and production dns entries added (internal vlan)
  • - sub-task created in networking netops SRE project for network port setup, include all port info
  • - network port setup (description, enable, internal vlan)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation
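The "mgmt and production dns entries" step in each checklist above can be spot-checked by enumerating both names per host. A sketch, assuming the standard WMF <host>.mgmt.<site>.wmnet management-naming convention; the loop only prints the names (pass each to `host` to actually resolve them):

```shell
# Enumerate the production and mgmt DNS names to verify per host.
for n in $(seq 1 9); do
  h=$(printf 'ores20%02d' "$n")
  echo "${h}.codfw.wmnet ${h}.mgmt.codfw.wmnet"
done
```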


Event Timeline

Change 355270 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Add mgmt and production DNS entries for ores200[1-9]

https://gerrit.wikimedia.org/r/355270

Change 355270 merged by Dzahn:
[operations/dns@master] DNS: Add mgmt and production DNS entries for ores200[1-9]

https://gerrit.wikimedia.org/r/355270

[bast1001:~] $ for orescodfw in $(seq 1 9); do host ores200${orescodfw}.codfw.wmnet; done
ores2001.codfw.wmnet has address 10.192.0.12
ores2002.codfw.wmnet has address 10.192.0.18
ores2003.codfw.wmnet has address 10.192.16.63
ores2004.codfw.wmnet has address 10.192.16.64
ores2005.codfw.wmnet has address 10.192.32.173
ores2006.codfw.wmnet has address 10.192.32.174
ores2007.codfw.wmnet has address 10.192.48.88
ores2008.codfw.wmnet has address 10.192.48.89
ores2009.codfw.wmnet has address 10.192.48.90

@akosiaris can you please provide me with a partman recipe to use? Thanks

IIRC these boxes don't have a RAID controller, so let's go for a RAID1 with LVM. Seems like raid1-lvm.cfg (https://github.com/wikimedia/puppet/blob/production/modules/install_server/files/autoinstall/partman/raid1-lvm.cfg) is fine.

@akosiaris: I'd actually recommend we go with something that uses a /srv/ mount and ext4, with no swap, like most servers. But that is just so that more servers follow the same partitioning scheme, not due to an actual problem with the raid1-lvm.cfg recipe.

raid1-lvm puts the following:

#   - /   :   ext3, RAID1, 50GB
#   - swap:       RAID1, 1GB, on LVM
#   - free space for the rest under RAID1/LVM

raid1-lvm-ext4-srv-noswap has:

# * two disks, sda & sdb
# * layout:
#   - /   :   ext4, RAID1, 50GB
#   - /srv: ext4, RAID1/LVM, up to 80% of the total space
#   - free space for the rest under RAID1/LVM

Overall, the elimination of swap is one of my secondary projects, referenced on T156955.

Since we order our systems with enough memory, swap space on the disks is not needed. I'd recommend using raid1-lvm-ext4-srv-noswap unless Alex disagrees with me. There is discussion on that task about the use of swap, and it goes in both directions, but these systems have SATA disks and it seems they won't benefit from swap as much.

Mostly we should simply attempt to settle on more standard partitioning across the fleet.

raid1-lvm-ext4-srv-noswap works too. Fine by me.
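The "up to 80% of the total space" cap for /srv in the recipe comments works out as simple arithmetic. A sketch with assumed numbers (the 500 GB disk size is for illustration only, not this order's actual spec):

```shell
# Two 500 GB disks in software RAID1 give ~500 GB usable (assumed).
# / takes a fixed 50 GB; /srv grows to at most 80% of the total,
# per the raid1-lvm-ext4-srv-noswap comments quoted above.
disk_gb=500
root_gb=50
srv_max_gb=$(( disk_gb * 80 / 100 ))
echo "usable=${disk_gb}GB root=${root_gb}GB srv_max=${srv_max_gb}GB"
# prints: usable=500GB root=50GB srv_max=400GB
```

The remaining space past the /srv cap stays as free extent in the RAID1/LVM volume group, so it can be allocated later without repartitioning.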

Change 355501 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP/partman: Add DHCP and partman entries for ores200[1-9]

https://gerrit.wikimedia.org/r/355501

Change 355501 merged by Dzahn:
[operations/puppet@production] DHCP/partman: Add DHCP and partman entries for ores200[1-9]

https://gerrit.wikimedia.org/r/355501

Papaul updated the task description. (Show Details)

@akosiaris This is complete at my end. It is all yours.

The salt key for ores2004 wasn't accepted (noticed that when rolling out the sudo update), this has been fixed.

akosiaris changed the task status from Open to Stalled. Aug 30 2017, 3:25 PM

Stalling a bit while T169246 is ongoing

Change 387811 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] New ORES codfw cluster isn't provisioned yet

https://gerrit.wikimedia.org/r/387811

Is this unstalled now? The stated reason was "while T169246 is ongoing", but that ticket is resolved. Is it really resolved, though?

Change 396101 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add ores200[1-9] as spare systems

https://gerrit.wikimedia.org/r/396101

Change 396101 merged by Dzahn:
[operations/puppet@production] site: add ores200[1-9] as spare systems

https://gerrit.wikimedia.org/r/396101

So it looks like we need a new role class for a "regular ores server in production" (regular as opposed to ores::redis).

Because currently we only have ores::stresstest, which is a temporary role for the stress test (T169246), and ores::redis, used on oresrdb*.

So the question is how does that role differ from the stresstest role, if at all?

The stresstest role currently includes profile::ores::worker and profile::ores::web, is that also what a regular (non-stresstest) role for an ores server would look like?

Would it be ores::worker which includes the worker profile but not the web profile?

Change 399452 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ores: basic role for a worker-only node

https://gerrit.wikimedia.org/r/399452

> So looks like we need a new role class for "regular ores server in production" (regular as opposed to ores::redis).
> Because we have currently only ores::stresstest which is a temporary role for the stresstest (T169246) and ores::redis used on oresrdb*.

ores::stresstest will be deleted. ores::redis clearly stays.

> So the question is how does that role differ from the stresstest role, if at all?
> The stresstest role currently includes profile::ores::worker and profile::ores::web, is that also what a regular (non-stresstest) role for an ores server would look like?
> Would it be ores::worker which includes the worker profile but not the web profile?

Yup, pretty much the way you posted it.

akosiaris changed the task status from Stalled to Open. Dec 21 2017, 9:48 AM

Unstalling. Note that we are in the year-end deployment freeze; the boxes should not be put into production until we are out of it. Technically, this currently means the role should not be applied to them (we have no confctl-like switch for worker-style loads). The same applies to changeprop.

Change 399595 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Reimage ores200* as stretch

https://gerrit.wikimedia.org/r/399595

Change 399595 merged by Alexandros Kosiaris:
[operations/puppet@production] Reimage ores200* as stretch

https://gerrit.wikimedia.org/r/399595

Script wmf-auto-reimage was launched by akosiaris on neodymium.eqiad.wmnet for hosts:

['ores2001.codfw.wmnet', 'ores2002.codfw.wmnet', 'ores2003.codfw.wmnet', 'ores2004.codfw.wmnet', 'ores2005.codfw.wmnet', 'ores2006.codfw.wmnet', 'ores2007.codfw.wmnet', 'ores2008.codfw.wmnet', 'ores2009.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201712211635_akosiaris_10155.log.

Completed auto-reimage of hosts:

['ores2008.codfw.wmnet', 'ores2002.codfw.wmnet', 'ores2003.codfw.wmnet', 'ores2005.codfw.wmnet', 'ores2007.codfw.wmnet', 'ores2009.codfw.wmnet', 'ores2006.codfw.wmnet', 'ores2004.codfw.wmnet', 'ores2001.codfw.wmnet']

and were ALL successful.
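The nine-host list the reimage log reports follows directly from the naming pattern. An illustrative one-liner (bash brace expansion; this is not the actual wmf-auto-reimage invocation syntax):

```shell
# Regenerate the codfw host list from the ores200[1-9] pattern.
echo ores200{1..9}.codfw.wmnet
```

Note the completion log lists the same nine hosts, just in the order the reimages finished rather than numeric order.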

Change 399452 merged by Dzahn:
[operations/puppet@production] ores: basic role for a worker-only node

https://gerrit.wikimedia.org/r/399452

akosiaris updated the task description. (Show Details)

This is finally done; resolving.

Change 387811 abandoned by Awight:
New ORES codfw cluster isn't provisioned yet

https://gerrit.wikimedia.org/r/387811