
rack/setup/install labvirt10(19|20).eqiad.wmnet
Closed, Resolved, Public

Description

This task will track the racking and setup of two new labvirt hosts, labvirt1019 and labvirt1020. These hosts will replace labsdb100[4-7] by running instances for the labsdb workload.

Racking Proposal: These have to go into the labs-hosts1-b-eqiad vlan, so they need to rack in row B. The 10Gbit is for future expansion, so just wire up eth0 1Gbit for now. Don't rack both of these new hosts in the same rack, since they'll both be doing labsdb type instances.
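The two constraints in the racking proposal (row B, and not the same rack) can be checked mechanically. The following is a minimal Python sketch, not tooling used on this task. The function names are hypothetical, and the location strings `b4` and `b7` are taken from the switch-port note later in this task, on the assumption that they encode row letter plus rack number.

```python
import re


def parse_location(loc):
    """Split a location such as 'b4' into (row, rack). The format is an assumption."""
    match = re.fullmatch(r"([a-z])(\d+)", loc.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised rack location: {loc!r}")
    return match.group(1), int(match.group(2))


def check_racking(assignments, required_row="b"):
    """Return a list of problems; an empty list means both constraints hold."""
    problems = []
    seen = {}  # (row, rack) -> first host placed there
    for host, loc in sorted(assignments.items()):
        row, rack = parse_location(loc)
        if row != required_row:
            problems.append(f"{host} is in row {row.upper()}, not row {required_row.upper()}")
        if (row, rack) in seen:
            problems.append(f"{host} shares rack {loc} with {seen[(row, rack)]}")
        else:
            seen[(row, rack)] = host
    return problems


# Rack positions as reported in the on-site comment of this task.
reported = {"labvirt1019": "b4", "labvirt1020": "b7"}
print(check_racking(reported))
```

With the reported positions the list of problems is empty; placing both hosts in `b4` would report a shared rack.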

HW RAID setup: RAID10 of the SSDs.

labvirt1019:

  • - receive in system on procurement task T162486
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, labs-hosts1-b-eqiad vlan)
    • end on-site specific steps
  • - production dns entries added (labs-hosts1-b-eqiad)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (trusty)
  • - puppet accept/initial run
  • - handoff for service implementation

labvirt1020:

  • - receive in system on procurement task T162486
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, labs-hosts1-b-eqiad vlan)
    • end on-site specific steps
  • - production dns entries added (labs-hosts1-b-eqiad)
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation (trusty)
  • - puppet/salt accept/initial run
  • - handoff for service implementation

Event Timeline

RobH created this task. Aug 4 2017, 4:25 PM
RobH edited projects, added ops-eqiad; removed procurement.
RobH updated the task description.
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. Aug 14 2017, 2:31 PM

Change 372390 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding dns entries for labvirt1019-20 T172538

https://gerrit.wikimedia.org/r/372390

Change 372390 merged by Cmjohnson:
[operations/dns@master] Adding dns entries for labvirt1019-20 T172538

https://gerrit.wikimedia.org/r/372390

Cmjohnson updated the task description. Aug 30 2017, 12:50 AM
Cmjohnson updated the task description. Sep 8 2017, 4:18 PM

BIOS is set up and RAID is configured as RAID 10. Switch ports still need to be set up.

1019 -> b4 ge-4/0/33
1020 -> b7 ge-7/0/13

Cmjohnson updated the task description. Sep 11 2017, 6:57 PM

All the on-site work has been completed for labvirt1019-20. @RobH lmk if you want to take it from here.

RobH claimed this task. Sep 11 2017, 6:59 PM
bd808 moved this task from Triage to Database on the Cloud-Services board. Sep 14 2017, 5:15 AM

Change 386266 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] install params for labvirt10[19-20]

https://gerrit.wikimedia.org/r/386266

Change 386266 merged by RobH:
[operations/puppet@production] install params for labvirt10[19-20]

https://gerrit.wikimedia.org/r/386266

RobH added a comment. Edited Oct 24 2017, 9:08 PM

So labvirt1020 is being annoying: it gave me SSL errors when trying to pull up https (negotiation errors, not cert errors) and then finally let me in. Sending reboots to it works, but VSP doesn't work (it times out after working for a short bit).

333-HPE RESTful API Error - Unable to communicate with iLO FW. BIOS
configuration resources may not be up-to-date.
Action: Reset iLO FW and reboot the server. If issue persists, AC power cycle
the server.

also listed:
312-HPE Smart Storage Battery 1 Failure - Communication with the battery

So I need to try uploading new firmware to this system.

labvirt1019 installed the OS, but even though it's set to boot first to disk, it is looping back into the installer.

RobH updated the task description.
RobH removed a subscriber: Pswaby.
RobH reassigned this task from RobH to chasemp. Oct 25 2017, 12:30 AM
RobH updated the task description.
RobH removed a project: ops-eqiad.

Both of these systems are now working and calling into puppet, ready for service implementation by Cloud-Services.

Mentioned in SAL (#wikimedia-operations) [2017-12-19T15:25:54Z] <chasemp> labvirt10[19|20] aptitude install linux-image-4.4.0-81-generic linux-image-extra-4.4.0-81-generic; sudo update-grub; /sbin/reboot T172538

The original request was:

Disks: 8T after RAID1 with a hardware raid controller

in T162486, and I see these with:

/dev/mapper/tank-data xfs 5.1T 34M 5.1T 1% /srv

@Andrew any idea why? @Cmjohnson?
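For reference, a rough Python sketch of how usable capacity follows from RAID level, drive count and drive size. The drive count and size used in the example are hypothetical: the task does not state the actual hardware, so this does not explain the 5.1T figure. It only shows the arithmetic to apply once the real drive layout is known. The unit helper assumes the `T` in the output above is a binary unit (TiB), as in `df -h`.

```python
def usable_gb(level, drives, size_gb):
    """Usable capacity in GB for equal-sized drives (textbook formulas, no overhead)."""
    if level == "raid0":
        return drives * size_gb
    if level == "raid1":
        # Every drive mirrors the same data, so capacity is that of one drive.
        return size_gb
    if level == "raid5":
        if drives < 3:
            raise ValueError("RAID5 needs at least 3 drives")
        return (drives - 1) * size_gb
    if level == "raid6":
        if drives < 4:
            raise ValueError("RAID6 needs at least 4 drives")
        return (drives - 2) * size_gb
    if level == "raid10":
        if drives < 4 or drives % 2:
            raise ValueError("RAID10 needs an even number of drives, at least 4")
        return (drives // 2) * size_gb
    raise ValueError(f"unknown RAID level: {level}")


def tb_to_tib(tb):
    """Convert decimal terabytes (10**12 bytes) to binary tebibytes (2**40 bytes)."""
    return tb * 10**12 / 2**40


# Hypothetical layout, NOT taken from this task: 10 drives of 1600 GB each.
for level in ("raid10", "raid5", "raid6"):
    print(level, usable_gb(level, 10, 1600), "GB")

# A decimal 8 TB volume expressed in binary units.
print(round(tb_to_tib(8), 2), "TiB")
```

If the unit assumption holds, the decimal-versus-binary difference accounts for only part of the gap between 8T and 5.1T, which points at the array layout itself.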

Mentioned in SAL (#wikimedia-operations) [2018-02-14T19:16:40Z] <andrewbogott> rebooting labvirt1019 so I can have a look at the raid setup, for T172538

Drive config on the HPs is annoying. The steps are:

-reboot
-during boot, ESC-9
-select System Configuration->Embedded RAID 1 : Smart Array P440ar Controller->Exit and launch HP Smart Storage Administrator (HPSSA)

This will appear to error out, but if you wait a few minutes you eventually get a prompt. The prompt responds to the commands documented here:

https://kallesplayground.wordpress.com/useful-stuff/hp-smart-array-cli-commands-under-esxi/

(Note that those commands begin with /opt/hp/hpssacli/bin/hpssacli which is unneeded in this context.)
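To illustrate the note about the prefix, here is a small Python sketch that turns a command as documented on that page into the form typed at the HPSSA prompt. The helper name is hypothetical, and the two sample commands are common hpssacli invocations; `slot=0` is an assumption about where the controller sits on these hosts.

```python
PREFIX = "/opt/hp/hpssacli/bin/hpssacli"


def to_prompt_command(documented):
    """Drop the leading hpssacli binary path, if present, and return the rest."""
    parts = documented.strip().split(None, 1)
    if parts and parts[0] == PREFIX:
        return parts[1] if len(parts) > 1 else ""
    return documented.strip()


# Commands as written for ESXi; the slot number is an assumption.
documented = [
    "/opt/hp/hpssacli/bin/hpssacli ctrl all show config",
    "/opt/hp/hpssacli/bin/hpssacli ctrl slot=0 ld all show",
]
for cmd in documented:
    print(to_prompt_command(cmd))
```

At the prompt one would then type, for example, `ctrl all show config` to list the controller, arrays and logical drives.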

It looks like we just need to rebuild the raids on these. That's more-or-less impossible to do remotely so I'll create subtasks for Chris.

chasemp closed this task as Resolved. Apr 27 2018, 6:53 PM

in favor of T193264

bd808 moved this task from Inbox to Done on the cloud-services-team (Kanban) board. May 6 2018, 6:49 PM