rack/setup/install mw2251-mw2260
Closed, Duplicate (Public)

Description

This task will track the racking, setup, and installation of the 10 new mw appservers for codfw. The task will have two parts, first determining where to rack the systems, and the second part will be implementation.

racking location determination / old system decom determination

There is an open question on where these will rack. Presently, rack A4 has a lot of space, since a large number of old appservers were decommissioned and removed from it last summer and replaced with new systems in racks A3 and A4. A4 still has fewer systems than A3, so these could help balance that.

Additionally, the other mw racks in codfw include: B3, B4, C3, & C4. These racks are full of older mw systems. The 10 oldest mw systems in production use @ codfw are mw2075-mw2084. Half of these are in A4, the other half in B3. FYI: The codfw mw cluster is out of warranty up to host mw2134. Hosts mw2135+ are still under warranty past January 2017 (or further). So racks A4 (5 hosts), B3 (all out of warranty), and B4 (2/3rds out of warranty) have some out-of-warranty hosts in them. Row C has only under-warranty mw hosts.

Do we need to rack these 10 newly leased systems in different racks than the other newest hosts in A4? If not, continuing to fill it up will keep our racking of mw systems in codfw in sequence. This isn't a hard requirement, but it does help age out systems gracefully from the racks with ease. Unless these new systems must be located away from this summer's mw order, I'd (@RobH) suggest filling out A4.

We'll also need to decide how many older hosts to decommission. If we place these into A4, there is plenty of space, so they'll go in the bottom of the rack, where no existing hosts are placed (since the decom of older hosts this summer). If we want to place the hosts in rows B or C, we'll need to decom or move hosts around, and we'll need to figure that out in advance of racking. If we want them in A4 or row D (no mw systems in row D yet), then we don't need to determine this before the new servers are racked.

Assigning this task to @Joe for his input on where to rack (if it matters) and what hosts to decom (if needed for racking.) Please detail/comment and assign back to @RobH, thanks!

implementation

  • receive in systems normally per parent task T151779
  • mgmt dns entries (asset tag + hostnames) and production dns (hostname, internal vlan)
  • rack according to determination made in part 1 "racking location determination" of this task
  • bios/drac setup/testing
  • update or create sub task with network port info
  • install_module updates
  • install OS (jessie)
  • accept/sign puppet/salt
  • service implementation

Event Timeline

Assigning this task to @Joe for his input on where to rack (if it matters) and what hosts to decom (if needed for racking.) Please detail/comment and assign back to @RobH, thanks!

RobH updated the task description.

My suggestion would be that these 10 new systems should replace mw2075 - mw2090 functionally, and specifically:

  • 3 servers to replace the 5 API appservers mw2075-79 so in row A
  • 4 servers to replace 6 jobrunners mw2080-2085 so in row B
  • 3 servers to replace 4 imagescalers mw2086-2090 so in row B

I just want to maintain consistency in the spread of appservers, as they're pretty well balanced in codfw.

@Joe are those systems already decommissioned?

@Papaul: The systems he listed are not currently offline, as they show in monitoring.

@Joe: That sounds reasonable to me. Will the servers they are replacing be able to come offline in advance of the servers being racked, or do they need to stay online until the new systems are fully online? (If they can come offline in advance, it makes @Papaul's job of racking a lot easier, since he can just re-use all the network/power and sometimes rails.)

Since we are replacing 5 servers with 3, 6 with 4, and 4 with 3, we can take down the minimum number needed per row, if we need the capacity in codfw.

So in row A, we'd only take down 3 of the 5 API servers to be replaced, bring online the new systems, and then the other 2 can come offline at any time. We'd do the same in each row if needed. This slows things down slightly, but leaves more capacity online in codfw (if needed). This is a half-way measure between taking down all the servers to be replaced in advance of racking the new ones and leaving them all online until the new ones are ready.
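To make the arithmetic of that half-way approach concrete, here is a rough sketch. The host counts are the ones quoted in this thread; the dictionary layout, the function name, and the assumption that each new server needs exactly one freed slot are illustrative only, not anything from our tooling.

```python
# Counts as quoted in the thread: old hosts being replaced vs. new hosts per role.
# The data layout and the "one freed slot per new host" rule are assumptions.
PLAN = {
    "api (row A)": {"old": 5, "new": 3},
    "jobrunner (row B)": {"old": 6, "new": 4},
    "imagescaler (row B)": {"old": 4, "new": 3},
}


def staged_decom(plan):
    """Per role, split old hosts into those taken down before racking and those left for later."""
    result = {}
    for role, counts in plan.items():
        # Free only as many slots as there are new hosts to rack.
        first = min(counts["new"], counts["old"])
        result[role] = {
            "before_racking": first,
            "after_new_online": counts["old"] - first,
        }
    return result


stages = staged_decom(PLAN)
for role, stage in stages.items():
    print(f"{role}: take down {stage['before_racking']} first, "
          f"{stage['after_new_online']} later")
```

With these numbers, 10 old hosts come down up front (matching the 10 new systems) and 5 stay online until the new ones take load.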

So we have 3 options:

  1. Decommission all the systems that will be replaced; rack new systems in existing system spots.
  2. Decommission minimum number of old systems to fit new systems in rack; decommission remainder at later date.
  3. Leave all existing systems fully online until new systems are fully ready to take load.

Please advise.

@RobH option 1 seems good.

A side note: why are we reusing old hostnames? We never did that in eqiad, and I thought that was a policy.

For the last new mw servers we put in, we didn't reuse old hostnames: we started with mw2215 and the last one is mw2250. I have already put mw2251-mw2260 in Racktables for the 10 new mw servers.
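As a quick sanity check on the naming, a small sketch that expands the new range and confirms it does not collide with the previous batch (mw2215 through mw2250, per the comment above). The helper name is made up for illustration; it is not part of Racktables or any of our tooling.

```python
def expand_hosts(prefix, first, last):
    """Expand an inclusive numeric range into hostnames, e.g. mw2251..mw2260."""
    return [f"{prefix}{n}" for n in range(first, last + 1)]


# Previous batch and new batch, using the ranges quoted in this thread.
previous_batch = expand_hosts("mw", 2215, 2250)
new_hosts = expand_hosts("mw", 2251, 2260)

# Any hostname appearing in both lists would mean a reused name.
reused = sorted(set(previous_batch) & set(new_hosts))

print(len(new_hosts), new_hosts[0], new_hosts[-1], reused)
```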

Indeed, we shouldn't be reusing the old hostnames, and I didn't think we planned to. (Seems that @Papaul is also on the same page!)

I'll go ahead and decommission the existing hosts so they can be pulled/wiped and replaced with the new hosts.

RobH renamed this task from rack/setup/install mw2051-mw2060 to rack/setup/install mw2251-mw2260. Jan 5 2017, 1:10 AM