
rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems
Closed, Resolved · Public

Description

This task will track the racking/setup/installation of 4 new systems ordered for the Cloud Services Elasticsearch replica cluster.

Hostname considerations: @RobH picked cloudelastic, for the cloud team's replication of the elastic systems. If another name is preferred, document it on Infrastructure_naming_conventions and remove the cloudelastic entry.

Racking Proposal: spread evenly across all rows.
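
As a rough illustration of the racking proposal, the sketch below expands the hostname range from the task title and assigns hosts to rows round-robin. The row names A to D and both helper functions are assumptions made for this example; the task itself does not record which rows or racks were used.

```python
import re
from itertools import cycle


def expand_hosts(pattern):
    """Expand a single-digit bracket range such as 'cloudelastic100[1-4]'."""
    m = re.fullmatch(r"(.*)\[(\d)-(\d)\](.*)", pattern)
    if m is None:
        return [pattern]
    prefix, lo, hi, suffix = m.groups()
    return [f"{prefix}{n}{suffix}" for n in range(int(lo), int(hi) + 1)]


def spread_across_rows(hosts, rows):
    """Assign hosts to rows round-robin, reusing rows only when hosts outnumber them."""
    return dict(zip(hosts, cycle(rows)))


ROWS = ["A", "B", "C", "D"]  # assumption: four rows with 10G racks available

hosts = expand_hosts("cloudelastic100[1-4].eqiad.wmnet")
plan = spread_across_rows(hosts, ROWS)
for host, row in plan.items():
    print(f"{host} -> row {row}")
```

With four hosts and four rows, each host lands in its own row; with fewer rows, the same function wraps around, so different racks within a row would still need to be chosen by hand.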

cloudelastic1001:

  • - receive in system on procurement task T187627
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation - stretch
  • - puppet accept/initial run
  • - handoff for service implementation

cloudelastic1002:

  • - receive in system on procurement task T187627
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation - stretch
  • - puppet accept/initial run
  • - handoff for service implementation

cloudelastic1003:

  • - receive in system on procurement task T187627
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation - stretch
  • - puppet accept/initial run
  • - handoff for service implementation

cloudelastic1004:

  • - receive in system on procurement task T187627
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation - stretch
  • - puppet accept/initial run
  • - handoff for service implementation

Event Timeline

RobH created this task. May 8 2018, 5:14 PM
RobH triaged this task as Normal priority.
Restricted Application added a subscriber: Aklapper. May 8 2018, 5:14 PM
RobH added a comment. May 8 2018, 5:20 PM

@bd808 or @chasemp: Before @Cmjohnson racks these, I'd like to confirm the networking requirements.

These have 10Gbit networking, so they will go in 10G racks. Then we would want to know if the labs-support-vlan is the one we'll be using, or something else. (Since that can affect what rows these go in.) Finally, I assume we want them spread out as much as possible, so different rows if we can, and at minimum different racks.

@chasemp noted they are at a conference, so this may have to wait a day or two before we see answers.

Cmjohnson moved this task from Backlog to Being worked on on the ops-eqiad board. May 15 2018, 3:46 PM

@chasemp please let me know network requirements.

bd808 added a comment. May 29 2018, 3:27 PM

Then we would want to know if the labs-support-vlan is the one we'll be using, or something else. (Since that can affect what rows these go in.)

We have been trying not to add new hosts into the labs-support-vlan since the security exposure of that vlan is so confusing. We have been putting things into the public vlan instead as that makes it more obvious that Cloud Services support boxes are functionally exposed to anyone on the internet (after signing up for a developer account and joining a Cloud VPS project). @chasemp can you confirm that public vlan is correct for these hosts?

Finally, I assume we want them spread out as much as possible, so different rows if we can, and at minimum different racks.

+1 to spreading across rows and racks as much as possible.

Then we would want to know if the labs-support-vlan is the one we'll be using, or something else. (Since that can affect what rows these go in.)

We have been trying not to add new hosts into the labs-support-vlan since the security exposure of that vlan is so confusing. We have been putting things into the public vlan instead as that makes it more obvious that Cloud Services support boxes are functionally exposed to anyone on the internet (after signing up for a developer account and joining a Cloud VPS project). @chasemp can you confirm that public vlan is correct for these hosts?

Public is the correct current best practice. Then we firewall it down to the narrowest requestors possible.

@chasemp please let me know network requirements.

We would like 10G if possible. These hosts will be tracking the production CirrusSearch data feed and also serving responses to Cloud Services users. And as noted above, they should be placed in the public vlan and spread across rows/racks as much as possible for redundancy.

Cmjohnson updated the task description. Jun 28 2018, 2:35 PM
Vvjjkkii renamed this task from rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems to 9cdaaaaaaa. Jul 1 2018, 1:11 AM
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii removed RobH as the assignee of this task.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot assigned this task to RobH.
CommunityTechBot lowered the priority of this task from High to Normal.
CommunityTechBot renamed this task from 9cdaaaaaaa to rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems.
CommunityTechBot added a subscriber: Aklapper.

Change 448062 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt/production dns cloudelastic1001-4

https://gerrit.wikimedia.org/r/448062

Change 448062 merged by Cmjohnson:
[operations/dns@master] Adding mgmt/production dns cloudelastic1001-4

https://gerrit.wikimedia.org/r/448062

Cmjohnson updated the task description. Jul 30 2018, 3:14 PM
Cmjohnson updated the task description.
Cmjohnson moved this task from Racking Tasks to Blocked on the ops-eqiad board. Aug 2 2018, 7:21 PM

These servers are ready for install; assigning to @RobH for help.

RobH added a comment. Aug 2 2018, 7:37 PM
This comment was removed by RobH.
RobH updated the task description.

Change 450092 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] setup of cloudelastic100[1-4].wikimedia.org

https://gerrit.wikimedia.org/r/450092

Change 450092 merged by RobH:
[operations/puppet@production] setup of cloudelastic100[1-4].wikimedia.org

https://gerrit.wikimedia.org/r/450092

Change 450095 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] adding to netboot for cloudelastic systems

https://gerrit.wikimedia.org/r/450095

Change 450096 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] updating netboot.cfg for cloudelastic

https://gerrit.wikimedia.org/r/450096

Change 450096 merged by RobH:
[operations/puppet@production] updating netboot.cfg for cloudelastic

https://gerrit.wikimedia.org/r/450096

Change 450144 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] fixing new partman recipe

https://gerrit.wikimedia.org/r/450144

Change 450144 merged by RobH:
[operations/puppet@production] fixing new partman recipe

https://gerrit.wikimedia.org/r/450144

RobH updated the task description.
RobH reassigned this task from RobH to Gehel.

@Gehel & @EBernhardson: I'm assigning this to @Gehel, as the SRE team member involved with this project, for service implementation.

Dzahn added a subscriber: Dzahn. Aug 31 2018, 7:27 PM

Icinga reports that device sdb on cloudelastic1002 is not healthy per SMART:

cluster=misc device=sdb instance=cloudelastic1002:9100 job=node site=eqiad

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cloudelastic1002&service=Device+not+healthy+-SMART-
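
For anyone triaging alerts like this in bulk, the label line above can be split into fields with a few lines of Python. This is only a minimal sketch: `parse_labels` is a hypothetical helper written for this example, and it assumes the labels are space-separated `key=value` pairs with no spaces inside values, as in the line quoted above.

```python
def parse_labels(line):
    """Split 'key=value key=value ...' into a dict (values must not contain spaces)."""
    return dict(pair.split("=", 1) for pair in line.split())


# Label line copied from the alert above.
alert = "cluster=misc device=sdb instance=cloudelastic1002:9100 job=node site=eqiad"

labels = parse_labels(alert)
# The instance label has the form host:port; split on the last colon.
host, port = labels["instance"].rsplit(":", 1)
print(f"{host}: device {labels['device']} flagged (instance port {port}, site {labels['site']})")
```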

debt closed this task as Resolved.