
rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems
Closed, Resolved · Public

Description

This task will track the racking, setup, and installation of 4 new systems ordered for the Cloud Services Elasticsearch replica cluster.

Hostname considerations: @RobH picked cloudelastic, for the cloud team's replica of the Elasticsearch systems. If another name is preferred, document it on Infrastructure_naming_conventions and remove the cloudelastic entry.

Racking Proposal: spread evenly across all rows.
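The "spread evenly across all rows" rule amounts to a round-robin assignment: no row gets a second host until every row has one. The sketch below assumes four eqiad rows named A through D and uses hypothetical 10G rack names (the real rack list and 10G availability live in Racktables, not in this task).

```python
from itertools import cycle

# Hypothetical 10G-capable racks per row (assumed rows A-D); the real rack
# names and 10G availability would come from Racktables.
RACKS_BY_ROW = {
    "A": ["A-10G-1"],
    "B": ["B-10G-1"],
    "C": ["C-10G-1"],
    "D": ["D-10G-1"],
}

HOSTS = [f"cloudelastic100{n}" for n in range(1, 5)]


def place_round_robin(hosts, racks_by_row):
    """Assign hosts to rows round-robin, so no row receives a second host
    until every row has one; within a row, cycle through its racks."""
    rows = sorted(racks_by_row)
    rack_iters = {row: cycle(racks_by_row[row]) for row in rows}
    placement = {}
    for i, host in enumerate(hosts):
        row = rows[i % len(rows)]
        placement[host] = (row, next(rack_iters[row]))
    return placement


for host, (row, rack) in place_round_robin(HOSTS, RACKS_BY_ROW).items():
    print(host, row, rack)
```

With four hosts and four rows each host lands in its own row; with more hosts than rows, the second pass spreads across different racks within each row, which matches the "at minimum different racks" fallback discussed in the comments.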

cloudelastic1001:

  • receive in system on procurement task T187627
  • rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • bios/drac/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname
  • network port setup (description, enable, vlan)
    • end on-site specific steps
  • production dns entries added
  • operations/puppet update (install_server at minimum, other files if possible)
  • OS installation - stretch
  • puppet accept/initial run
  • handoff for service implementation

cloudelastic1002:

  • receive in system on procurement task T187627
  • rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • bios/drac/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname
  • network port setup (description, enable, vlan)
    • end on-site specific steps
  • production dns entries added
  • operations/puppet update (install_server at minimum, other files if possible)
  • OS installation - stretch
  • puppet accept/initial run
  • handoff for service implementation

cloudelastic1003:

  • receive in system on procurement task T187627
  • rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • bios/drac/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname
  • network port setup (description, enable, vlan)
    • end on-site specific steps
  • production dns entries added
  • operations/puppet update (install_server at minimum, other files if possible)
  • OS installation - stretch
  • puppet accept/initial run
  • handoff for service implementation

cloudelastic1004:

  • receive in system on procurement task T187627
  • rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • bios/drac/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname
  • network port setup (description, enable, vlan)
    • end on-site specific steps
  • production dns entries added
  • operations/puppet update (install_server at minimum, other files if possible)
  • OS installation - stretch
  • puppet accept/initial run
  • handoff for service implementation
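The checklist step "mgmt dns entries added for both asset tag and hostname" means each host gets two management names pointing at the same management interface. The sketch below assumes the names are formed as `<name>.mgmt.eqiad.wmnet`, and the asset tags are hypothetical placeholders; the real tags are recorded on the procurement task and in Racktables.

```python
# Assumed naming convention: both the hostname and the asset tag get a
# record under the site's mgmt subdomain.
MGMT_DOMAIN = "mgmt.eqiad.wmnet"

# Hypothetical placeholder asset tags, not the real ones for these hosts.
ASSET_TAGS = {
    "cloudelastic1001": "wmf0001",
    "cloudelastic1002": "wmf0002",
    "cloudelastic1003": "wmf0003",
    "cloudelastic1004": "wmf0004",
}


def mgmt_names(asset_tags, domain=MGMT_DOMAIN):
    """Return {hostname: (hostname mgmt FQDN, asset-tag mgmt FQDN)}.

    Both names refer to the same management interface, so a duplicated
    asset tag is an error worth catching before the DNS change is uploaded.
    """
    tags = list(asset_tags.values())
    if len(set(tags)) != len(tags):
        raise ValueError("each host needs a distinct asset tag")
    return {
        host: (f"{host}.{domain}", f"{tag}.{domain}")
        for host, tag in sorted(asset_tags.items())
    }


for host, (by_name, by_tag) in mgmt_names(ASSET_TAGS).items():
    print(host, by_name, by_tag)
```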

Details

Related Gerrit Patches:
operations/puppet (production): fixing new partman recipe
operations/puppet (production): updating netboot.cfg for cloudelastic
operations/puppet (production): adding to netboot for cloudelastic systems
operations/puppet (production): setup of cloudelastic100[1-4].wikimedia.org
operations/dns (master): Adding mgmt/production dns cloudelastic1001-4

Event Timeline

RobH triaged this task as Medium priority. May 8 2018, 5:14 PM
RobH created this task.
Restricted Application added a subscriber: Aklapper. May 8 2018, 5:14 PM
RobH added a comment. May 8 2018, 5:20 PM

@bd808 or @chasemp: Before @Cmjohnson racks these, I'd like to confirm the networking requirements.

These have 10Gbit networking, so they will go in 10G racks. Next, we need to know whether the labs-support-vlan is the one we'll be using, or something else, since that can affect which rows these go in. Finally, I assume we want them spread out as much as possible: different rows if we can, and at minimum different racks.

@chasemp noted they are at a conference, so this may have to wait a day or two before we see answers.

@chasemp please let me know network requirements.

Then we need to know whether the labs-support-vlan is the one we'll be using, or something else. (Since that can affect which rows these go in.)

We have been trying not to add new hosts into the labs-support-vlan since the security exposure of that vlan is so confusing. We have been putting things into the public vlan instead as that makes it more obvious that Cloud Services support boxes are functionally exposed to anyone on the internet (after signing up for a developer account and joining a Cloud VPS project). @chasemp can you confirm that public vlan is correct for these hosts?

Finally, I assume we want them spread out as much as possible, so different rows if we can, and at minimum different racks.

+1 to spreading across rows and racks as much as possible.

Then we need to know whether the labs-support-vlan is the one we'll be using, or something else. (Since that can affect which rows these go in.)

We have been trying not to add new hosts into the labs-support-vlan since the security exposure of that vlan is so confusing. We have been putting things into the public vlan instead as that makes it more obvious that Cloud Services support boxes are functionally exposed to anyone on the internet (after signing up for a developer account and joining a Cloud VPS project). @chasemp can you confirm that public vlan is correct for these hosts?

Public is the correct current best practice. Then we firewall it down to the narrowest set of requestors possible.
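"Firewall it down to the narrowest requestors possible" is a default-deny allowlist: traffic is accepted only from explicitly listed networks. The sketch below shows that logic with Python's standard `ipaddress` module; the networks are documentation-only ranges standing in for the real requestors, which this task does not list, and it is not the actual firewall configuration used on these hosts.

```python
import ipaddress

# Hypothetical allowlist: RFC 5737 / RFC 3849 documentation ranges stand in
# for the real requestor networks.
ALLOWED_SOURCES = [
    ipaddress.ip_network("192.0.2.0/28"),
    ipaddress.ip_network("2001:db8::/64"),
]


def is_allowed(source_ip, allowed=ALLOWED_SOURCES):
    """Default deny: accept a source only if it falls inside one of the
    explicitly allowed networks (IPv4 never matches an IPv6 network)."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in allowed)


for ip in ("192.0.2.5", "192.0.2.16", "2001:db8::1", "198.51.100.1"):
    print(ip, "allow" if is_allowed(ip) else "deny")
```

Keeping the list short and explicit is the point: anything not named is rejected, which is what makes placing these hosts in the public vlan acceptable.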

@chasemp please let me know network requirements.

We would like 10G if possible. These hosts will be tracking the production CirrusSearch data feed and also serving responses to Cloud Services users. And as noted above, they should be placed in the public vlan and spread across rows/racks as much as possible for redundancy.

Cmjohnson updated the task description. Jun 28 2018, 2:35 PM
Vvjjkkii renamed this task from rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems to 9cdaaaaaaa. Jul 1 2018, 1:11 AM
Vvjjkkii removed RobH as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from 9cdaaaaaaa to rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems. Jul 2 2018, 6:13 AM
CommunityTechBot assigned this task to RobH.
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.

Change 448062 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt/production dns cloudelastic1001-4

https://gerrit.wikimedia.org/r/448062

Change 448062 merged by Cmjohnson:
[operations/dns@master] Adding mgmt/production dns cloudelastic1001-4

https://gerrit.wikimedia.org/r/448062

Cmjohnson updated the task description. Jul 30 2018, 3:14 PM
Cmjohnson updated the task description.
Cmjohnson moved this task from Racking Tasks to Blocked on the ops-eqiad board. Aug 2 2018, 7:21 PM

These servers are ready for install; assigning to @RobH for help.

RobH added a comment. Aug 2 2018, 7:37 PM
This comment was removed by RobH.
RobH updated the task description.

Change 450092 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] setup of cloudelastic100[1-4].wikimedia.org

https://gerrit.wikimedia.org/r/450092

Change 450092 merged by RobH:
[operations/puppet@production] setup of cloudelastic100[1-4].wikimedia.org

https://gerrit.wikimedia.org/r/450092

Change 450095 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] adding to netboot for cloudelastic systems

https://gerrit.wikimedia.org/r/450095

Change 450096 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] updating netboot.cfg for cloudelastic

https://gerrit.wikimedia.org/r/450096

Change 450096 merged by RobH:
[operations/puppet@production] updating netboot.cfg for cloudelastic

https://gerrit.wikimedia.org/r/450096

Change 450144 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] fixing new partman recipe

https://gerrit.wikimedia.org/r/450144

Change 450144 merged by RobH:
[operations/puppet@production] fixing new partman recipe

https://gerrit.wikimedia.org/r/450144

RobH reassigned this task from RobH to Gehel. Aug 2 2018, 9:47 PM
RobH removed a project: Patch-For-Review.
RobH updated the task description.

@Gehel & @EBernhardson: I'm assigning this to @Gehel as the SRE team member involved with this project, for service implementation.

Dzahn added a subscriber: Dzahn. Aug 31 2018, 7:27 PM

Icinga reports that device sdb on cloudelastic1002 is not healthy according to SMART:

cluster=misc device=sdb instance=cloudelastic1002:9100 job=node site=eqiad

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cloudelastic1002&service=Device+not+healthy+-SMART-
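The alert labels above (`instance=cloudelastic1002:9100 job=node`) look like Prometheus node-exporter labels, so the Icinga check is presumably fed by exported SMART data rather than run directly. A quick manual confirmation on the host would be `smartctl -H` against the named device. The sketch below pulls the device and host out of the label line and interprets the health line; the two health-line formats are assumptions based on typical `smartctl -H` output (ATA drives print an overall-health self-assessment line, SCSI/SAS drives print a "SMART Health Status" line).

```python
import re

# Assumed `smartctl -H` health-line formats and their "good" values.
_HEALTH_PATTERNS = (
    (re.compile(r"self-assessment test result:\s*(\S+)"), "PASSED"),
    (re.compile(r"SMART Health Status:\s*(\S+)"), "OK"),
)


def smart_healthy(smartctl_output):
    """Return True or False from the first recognised health line,
    or None when the output contains no health line at all."""
    for pattern, good_value in _HEALTH_PATTERNS:
        match = pattern.search(smartctl_output)
        if match:
            return match.group(1) == good_value
    return None


def parse_labels(label_line):
    """Split an alert label line such as 'device=sdb job=node' into a dict."""
    return dict(item.split("=", 1) for item in label_line.split())


labels = parse_labels(
    "cluster=misc device=sdb instance=cloudelastic1002:9100 job=node site=eqiad"
)
host = labels["instance"].split(":")[0]
print(f"on {host}, run: smartctl -H /dev/{labels['device']}")
```

A `False` result (for example an ATA drive reporting `FAILED!`) would support replacing the disk; `None` means the output format was not recognised and needs a human look.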

debt closed this task as Resolved. Apr 15 2019, 6:03 PM