es2011-es2019 racking and onsite setup tasks
Closed, Resolved (Public)

Description

This task outlines where the new es2011-2019 systems (ordered on T118174) should be racked and set up.

Our ES systems work in service groups of 3 systems per cluster, and we have three clusters to deploy with our new order. These new systems will replace the current es2001-2010, which we will decommission afterwards. (The old ES systems are all out-of-warranty R510s.) No single service cluster should be in fewer than two rows. Since we have 4 rows to work with, we'll put one system per service group per row. The current deployment of ES systems is only in rows A and B; we'll re-balance this going forward.

This plan leaves the a6/b6 racks for population last, and they likely won't have room until after the existing es2002, es2004, es2006, and es2009 are removed from those racks. An IRC conversation with @jcrespo notes that we don't need all 9 new systems racked immediately: we can rack a service group, transfer data from the older es systems, and then decommission those older es systems once the data is migrated to the new service group. As such, the racking of any systems into full racks should pause until the old ES systems in those racks are removed. (This will require some juggling with Jaime on what to rack first in those cases.)

hostname	rack	service group
es2011	b1	1
es2012	c1	1
es2013	d1	1
es2014	a1	2
es2015	c1	2
es2016	d1	2
es2017	a6	3
es2018	b6	3
es2019	d6	3
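
The row constraint above (no service group in fewer than two rows, one system per service group per row) can be sanity checked with a few lines of Python. This is only a sketch: the function names are made up for illustration and are not part of any existing tooling, and it assumes the row is the first letter of the rack name.

```python
from collections import defaultdict

# Racking plan from the table above: hostname -> (rack, service group).
PLAN = {
    "es2011": ("b1", 1),
    "es2012": ("c1", 1),
    "es2013": ("d1", 1),
    "es2014": ("a1", 2),
    "es2015": ("c1", 2),
    "es2016": ("d1", 2),
    "es2017": ("a6", 3),
    "es2018": ("b6", 3),
    "es2019": ("d6", 3),
}


def rows_per_group(plan):
    """Map each service group to the rows it occupies.

    Assumption: the row is the first letter of the rack name (b1 -> row b).
    """
    rows = defaultdict(list)
    for _host, (rack, group) in sorted(plan.items()):
        rows[group].append(rack[0])
    return dict(rows)


def violations(plan, min_rows=2):
    """Return service groups that span fewer than min_rows rows,
    or that place two systems in the same row."""
    bad = []
    for group, rows in rows_per_group(plan).items():
        distinct = set(rows)
        if len(distinct) < min_rows or len(distinct) != len(rows):
            bad.append(group)
    return bad


if __name__ == "__main__":
    print(rows_per_group(PLAN))
    print(violations(PLAN))
```

An empty list from violations(PLAN) means the proposed plan satisfies both rules.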

es201[1-9]

  • - rack and update racktables
  • - add mgmt dns entries (both asset tag and hostname)
  • - setup and test bios/ilom/redirection
  • - setup raid10 of disks in hardware raid
  • - switch ports setup (description/enable/vlan)
  • - add production dns entries (internal vlan)
  • - update install_server module for system (dhcp and netboot entries) standard db partitioning
  • - install OS - jessie
  • - sign/accept puppet/salt keys

Once you have the network port info on the task, feel free to ping @RobH in IRC to get the switch ports setup quickly.

Thanks!

Event Timeline

RobH claimed this task.
RobH raised the priority of this task from to Medium.
RobH updated the task description.
RobH added projects: procurement, SRE, DBA.
RobH edited projects, added ops-codfw; removed DBA, procurement.
RobH set Security to None.
RobH added subscribers: mark, emailbot, RobH and 4 others.

My proposed racking plan above should be reality checked by both @jcrespo (for ES configuration and review) and @Papaul (for power and outlet availability in each rack.)

@Papaul: If any of the proposed racks won't have enough power ports, power overhead (<8.6kW), or network ports, please update this task to reflect that, keeping in mind the existing es systems will be coming out as we roll in the new es systems.

RobH removed a parent task: Unknown Object (Task). Feb 5 2016, 5:36 PM
RobH added a subtask: Unknown Object (Task).
RobH mentioned this in Unknown Object (Task). Feb 5 2016, 5:40 PM

+1
with this proposal, the cloning will be like this:

es2001 -> es2011	b1	1
es2002 -> es2012	c1	1
es2004 -> es2013	d1	1
es2005 -> es2014	a1	2
es2006 -> es2015	c1	2
es2007 -> es2016	d1	2
es2008 -> es2017	a6	3
es2009 -> es2018	b6	3
es2010 -> es2019	d6	3

Service group 1 is read only (it only gets written when service groups are rebalanced), so no special care is needed there except migrating the existing data.

For service groups 2 and 3, the idea would be to always keep the master running and receiving the updates from eqiad. Current masters are es2006 and es2008, but they can be changed at any time.

In summary, the proposed order for racking:

es2001 -> es2011	b1	1
es2002 -> es2012	c1	1
es2004 -> es2013	d1	1

(migrate data, decom 3 old ones)

es2005 -> es2014	a1	2
es2007 -> es2016	d1	2
es2009 -> es2018	b6	3
es2010 -> es2019	d6	3

(migrate data, decom 4 old ones, failover master)

es2006 -> es2015	c1	2
es2008 -> es2017	a6	3

(decom 2 old ones)
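
The ordering above follows a simple rule: the read-only service group first, then the non-master hosts of the other groups, then the current masters last. A minimal Python sketch of that rule is below; the function name and data layout are illustrative only, and the masters are the ones named in this comment (they can be changed at any time).

```python
# Old host -> (new host, service group), as listed in the cloning table.
CLONES = {
    "es2001": ("es2011", 1),
    "es2002": ("es2012", 1),
    "es2004": ("es2013", 1),
    "es2005": ("es2014", 2),
    "es2006": ("es2015", 2),
    "es2007": ("es2016", 2),
    "es2008": ("es2017", 3),
    "es2009": ("es2018", 3),
    "es2010": ("es2019", 3),
}
READ_ONLY_GROUPS = {1}            # service group 1 is read only
MASTERS = {"es2006", "es2008"}    # current masters of groups 2 and 3


def racking_phases(clones, read_only_groups, masters):
    """Split the old hosts into three phases:
    1. hosts in read-only groups,
    2. non-master hosts in the remaining groups,
    3. masters of the remaining groups (replaced last, after failover).
    """
    phase1, phase2, phase3 = [], [], []
    for old, (_new, group) in clones.items():
        if group in read_only_groups:
            phase1.append(old)
        elif old in masters:
            phase3.append(old)
        else:
            phase2.append(old)
    return [phase1, phase2, phase3]


if __name__ == "__main__":
    for number, hosts in enumerate(racking_phases(CLONES, READ_ONLY_GROUPS, MASTERS), 1):
        print(number, [(old, CLONES[old][0]) for old in hosts])
```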

RobH reassigned this task from RobH to Papaul. Feb 11 2016, 11:08 PM

@Papaul:

I should have requested that you do the following for the new systems as you rack them, so I've updated the task summary to list them. (All the checkbox install steps)

Once you have the network port info on the task, feel free to ping me in IRC and I can get the switch ports setup quickly.

All 9 new es systems are racked.

Papaul renamed this task from es2011-es2020 racking and onsite setup tasks to es2011-es2019 racking and onsite setup tasks. Feb 12 2016, 5:36 PM

Switch ports information

es2011 ge-1/0/9 rack B1
es2012 ge-1/0/0 rack C1
es2013 ge-1/0/4 rack D1
es2014 ge-1/0/5 rack A1
es2015 ge-1/0/1 rack C1
es2016 ge-1/0/5 rack D1
es2017 ge-6/0/19 rack A6
es2018 ge-6/0/19 rack B6
es2019 ge-6/0/19 rack D6
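
Several hosts share a port name (for example ge-6/0/19), which is fine as long as they sit in different racks. A small Python sketch to confirm that no (rack, port) pair is listed twice follows. It assumes port names only need to be unique within a rack, and the helper names are made up for illustration.

```python
from collections import Counter

# Switch port list exactly as posted above.
PORTS = """\
es2011 ge-1/0/9 rack B1
es2012 ge-1/0/0 rack C1
es2013 ge-1/0/4 rack D1
es2014 ge-1/0/5 rack A1
es2015 ge-1/0/1 rack C1
es2016 ge-1/0/5 rack D1
es2017 ge-6/0/19 rack A6
es2018 ge-6/0/19 rack B6
es2019 ge-6/0/19 rack D6
"""


def parse_ports(text):
    """Parse 'host port rack RACK' lines into (host, port, rack) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        host, port, _label, rack = line.split()
        entries.append((host, port, rack))
    return entries


def duplicate_ports(entries):
    """Return (rack, port) pairs assigned to more than one host.

    Assumption: a port name is only unique within its own rack.
    """
    counts = Counter((rack, port) for _host, port, rack in entries)
    return sorted(pair for pair, seen in counts.items() if seen > 1)


if __name__ == "__main__":
    print(duplicate_ports(parse_ports(PORTS)))
```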

port descriptions set, enabled, and vlans set.

I would like @Volans to at least do one full install.

The db installation recipe currently has one bug that keeps it from being fully unattended. I describe the fix here: https://gerrit.wikimedia.org/r/#/c/267328/ (at the bottom).

@jcrespo: The onsite setup for es201[1-9] is complete. Since you asked for Volans to do a full install, he can work on es2019.
@Volans let me know if you have any questions.

Change 271131 had a related patch set uploaded (by Dzahn):
dhcp: add es201[1-8]

https://gerrit.wikimedia.org/r/271131

Volans did a reimage of another server, so I can handle these ones. I am missing the MAC of the last server (es2019) and proper IPs for the non-mgmt interfaces to proceed.

You are missing the MAC address of es2019 and the production IP because I left that for Volans to do. Since he is no longer doing the re-imaging, I will work on it.

Usually the onsite person does the system install and the puppet cert and salt key signing; once that is done, they hand the system(s) to the person responsible for implementation. I have no problem with you doing the install if you want, but just let me know so I don't start the install on my end while you are also doing it.

Thanks.

I have been discussing with Jaime on IRC that db.cfg is not really unattended, since he is getting the "confirming partitioning write to disk" message. I reproduced the same type of message in a lab environment; see the image below. With the line
d-i partman-auto/confirm boolean true
you get this message. When you change the line to
d-i partman/confirm boolean true
you don't get the message.

image001.png (470×707 px, 55 KB)
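
To catch this kind of typo before a reimage, a partman recipe could be checked for the confirm question with a short script. The Python sketch below is only an illustration based on the behaviour reported in this comment; the function names and the two sample strings are made up, and it looks at nothing beyond the single partman/confirm line.

```python
# Sample preseed fragments: the line that still prompts and the corrected one.
BROKEN = "d-i partman-auto/confirm boolean true\n"
FIXED = "d-i partman/confirm boolean true\n"


def preseed_keys(text):
    """Return {question: (type, value)} for every 'd-i' line in a preseed file.

    Comment lines and lines for other owners are ignored.
    """
    keys = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 3)
        if len(parts) >= 3 and parts[0] == "d-i":
            value = parts[3] if len(parts) > 3 else ""
            keys[parts[1]] = (parts[2], value)
    return keys


def confirms_partitioning(text):
    """True if the recipe answers partman/confirm with 'boolean true'."""
    return preseed_keys(text).get("partman/confirm") == ("boolean", "true")


if __name__ == "__main__":
    print(confirms_partitioning(BROKEN))
    print(confirms_partitioning(FIXED))
```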

Change 271280 had a related patch set uploaded (by Jcrespo):
Fix typo that prevented from a fully unattended install

https://gerrit.wikimedia.org/r/271280

Change 271280 merged by Jcrespo:
Fix typo that prevented from a fully unattended install

https://gerrit.wikimedia.org/r/271280

Change 271319 had a related patch set uploaded (by Dzahn):
dhcpd: fix MAC of es2014, add es2019

https://gerrit.wikimedia.org/r/271319

Change 271319 merged by Dzahn:
dhcpd: fix MAC of es2014, add es2019

https://gerrit.wikimedia.org/r/271319

Papaul updated the task description.

@jcrespo Installation complete on all new es servers.

RobH closed subtask Unknown Object (Task) as Resolved. Jul 11 2016, 5:32 PM