rack/setup/install cloudcephmon100[123]
Closed, Resolved · Public · 0 Story Points

Description

This task will track the racking, setup, installation, and deployment of three new servers for ceph monitoring nodes with 10G connections.

Please note that a number of open questions for this task are also pending on the related ceph nodes task T224188. Until decisions are made on T224188, the racking/deployment of these hosts is stalled as well.

Hostname Proposal: cloudcephmon100*. The original name was proposed by @Bstorm on IRC when discussing these systems and was later altered in the WMCS weekly meeting.

Racking Proposal: These are NOT replacing any existing systems, but will need to communicate with the systems being racked on T224188. These will most likely replicate the racking layout of T224188 to maximize redundancy within this service cluster.

cloudcephmon1001:

  • - receive in system on procurement task T222916
  • - add system into netbox while racking plan is being determined. This way it will show on the proper accounting reports.
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes host state from 'staged' to 'active' in netbox

cloudcephmon1002:

  • - receive in system on procurement task T222916
  • - add system into netbox while racking plan is being determined. This way it will show on the proper accounting reports.
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes host state from 'staged' to 'active' in netbox

cloudcephmon1003:

  • - receive in system on procurement task T222916
  • - add system into netbox while racking plan is being determined. This way it will show on the proper accounting reports.
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes host state from 'staged' to 'active' in netbox
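
The Netbox state changes in the checklists above (planned while racking, staged after the initial puppet run, active once the service implementer takes over) can be read as a small one-way state machine. The following is a minimal Python sketch of that reading only; the `HostState` enum and the `advance` helper are hypothetical names for illustration and are not part of Netbox or any WMF tooling.

```python
from enum import Enum


class HostState(Enum):
    # States named in the checklist; the enum itself is hypothetical.
    PLANNED = "planned"
    STAGED = "staged"
    ACTIVE = "active"


# Allowed forward transitions, as implied by the checklist order.
NEXT_STATE = {
    HostState.PLANNED: HostState.STAGED,
    HostState.STAGED: HostState.ACTIVE,
}


def advance(state: HostState) -> HostState:
    """Return the next state, or raise if the host is already active."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.value} has no further state")
    return NEXT_STATE[state]
```

A host would start at `HostState.PLANNED` when it is racked, and `advance` would be called once at the staged step and once at handoff.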

Event Timeline

RobH triaged this task as Medium priority. Jul 15 2019, 8:13 PM
Restricted Application added a project: Operations. Jul 15 2019, 8:14 PM

We already have some servers in a similar namespace: labmon1001 and labmon1002. I find it confusing that we use a similar naming scheme for 2 different types of servers.

I would recommend using something different for the new servers and leave cloudmon for when we refresh labmon servers.
Some ideas (no strong opinions about any):

  • cloudcephmon
  • cloudstoremon
  • cloudstorecephmon

cc @Bstorm

Bstorm added a comment. Edited Jul 16 2019, 11:33 AM

Great point @aborrero! I almost half wanted to name all of these "cloudstore" and figure it out from there, but that's not great. cloudstoremon perhaps just to keep the brand out of the name. The OSDs are literally slated to be cloudosd. We'll see after a bit of discussion :)

Bstorm renamed this task from rack/setup/install cloudmon100[123] to rack/setup/install cloudcephmon100[123]. Jul 16 2019, 7:45 PM
Bstorm updated the task description.

After talking in the weekly meeting, it's now cloudcephmon100*, updating the description.

Cmjohnson moved this task from Backlog to Racking Tasks on the ops-eqiad board. Jul 16 2019, 7:56 PM
Bstorm reassigned this task from Bstorm to RobH. Jul 25 2019, 6:38 PM

The racking proposal is detailed in T224188, so re-assigning.

Cmjohnson reassigned this task from RobH to Jclark-ctr. Aug 14 2019, 3:06 PM
Cmjohnson added subscribers: Jclark-ctr, Cmjohnson.

@Jclark-ctr can you add asset tags and enter these servers into Netbox? (T222916 is the procurement task.) Leave them on the floor and leave the rack information blank in Netbox until we know for sure where they're going. Once done, please re-assign back to Rob.

Jclark-ctr reassigned this task from Jclark-ctr to RobH. Aug 20 2019, 6:32 PM
Jclark-ctr updated the task description.

asset tagged and added to Netbox

Jclark-ctr reassigned this task from Jclark-ctr to RobH. Aug 20 2019, 7:48 PM
Cmjohnson reassigned this task from RobH to Jclark-ctr. Aug 29 2019, 4:41 PM

@Jclark-ctr please rack one each in B2/B4/B7 and update netbox.

host              row  unit
cloudcephmon1001  b7   26
cloudcephmon1002  b4   11
cloudcephmon1003  b2   11
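
Since the racking proposal aims for redundancy, one quick sanity check on the placement above is that no two monitors share a location. A minimal Python sketch, assuming the second column of the table identifies the rack location (the table labels it "row"); the `PLACEMENT` mapping and `locations_are_distinct` helper are hypothetical names, with the values copied from the table.

```python
# Placement copied from the table above: host -> (location, unit).
PLACEMENT = {
    "cloudcephmon1001": ("b7", 26),
    "cloudcephmon1002": ("b4", 11),
    "cloudcephmon1003": ("b2", 11),
}


def locations_are_distinct(placement):
    """Return True if every host sits in a different location."""
    locations = [location for location, _unit in placement.values()]
    return len(locations) == len(set(locations))
```

For the placement in this task the check passes, since b7, b4, and b2 are all different.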

Change 534255 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for cloudcephmon100[1-3]

https://gerrit.wikimedia.org/r/534255

@Jclark-ctr Please set up the idrac and add the mgmt dns. Let me know if you have any issues or questions. I also need the switch ports.

+cloudcephmon1001 1H IN A 10.65.3.125
+cloudcephmon1002 1H IN A 10.65.3.126
+cloudcephmon1003 1H IN A 10.65.3.127
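
The three added lines follow the usual zone-file field order (name, TTL, class, type, address). As a rough illustration, the Python sketch below parses such added lines and validates the addresses with the standard `ipaddress` module; `parse_added_a_records` is a hypothetical helper and not part of the operations/dns repository tooling.

```python
import ipaddress

# The added lines exactly as quoted in the comment above.
DIFF_LINES = """\
+cloudcephmon1001 1H IN A 10.65.3.125
+cloudcephmon1002 1H IN A 10.65.3.126
+cloudcephmon1003 1H IN A 10.65.3.127
"""


def parse_added_a_records(diff_text):
    """Map hostname -> IPv4Address for each added 'IN A' line."""
    records = {}
    for line in diff_text.splitlines():
        if not line.startswith("+"):
            continue  # only consider lines added by the patch
        fields = line[1:].split()
        if len(fields) != 5:
            continue  # not in name/TTL/class/type/value form
        name, _ttl, rclass, rtype, value = fields
        if (rclass, rtype) != ("IN", "A"):
            continue
        # IPv4Address raises ValueError on a malformed address.
        records[name] = ipaddress.IPv4Address(value)
    return records
```

Calling `parse_added_a_records(DIFF_LINES)` yields one entry per monitor host, which makes it easy to spot a duplicated or mistyped address before merging.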

Change 534255 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for cloudcephmon100[1-3]

https://gerrit.wikimedia.org/r/534255

Cmjohnson updated the task description. Sep 6 2019, 5:48 PM
aborrero raised the priority of this task from Medium to High. Oct 4 2019, 8:48 AM

Raising priority of this ticket, since the ceph project is part of our Q2 goals.

@aborrero I need to know the vlan requirements. Same as cephosd? 1 public, 1 private?

RobH removed a subscriber: RobH. Oct 4 2019, 9:55 PM
Cmjohnson updated the task description. Oct 8 2019, 11:48 AM

Change 548878 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] install_server: add cloudcephmon servers

https://gerrit.wikimedia.org/r/548878

Change 549625 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/dns@master] wikimedia.org: add DNS entries for cloudceph mon and osd

https://gerrit.wikimedia.org/r/549625

Change 549625 merged by Jhedden:
[operations/dns@master] wikimedia.org: add DNS entries for cloudceph mon and osd

https://gerrit.wikimedia.org/r/549625

Change 549634 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/dns@master] wikimedia.org: update cloudcephmon and osd hostnames

https://gerrit.wikimedia.org/r/549634

Change 549634 merged by Jhedden:
[operations/dns@master] wikimedia.org: update cloudcephmon and osd hostnames

https://gerrit.wikimedia.org/r/549634

Change 548878 merged by Jhedden:
[operations/puppet@production] install_server: add cloudcephmon servers

https://gerrit.wikimedia.org/r/548878

Change 549664 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] install_server: Update cloudcephmons pxe interface

https://gerrit.wikimedia.org/r/549664

Change 549664 merged by Jhedden:
[operations/puppet@production] install_server: Update cloudcephmons pxe interface

https://gerrit.wikimedia.org/r/549664

Mentioned in SAL (#wikimedia-operations) [2019-11-08T10:34:01Z] <jeh> enable IPMI racadm set iDRAC.IPMILan.Enable 1 on cloudcephmon[1-3] T228102

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephmon1001.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911081559_jeh_227246.log.

Completed auto-reimage of hosts:

['cloudcephmon1001.wikimedia.org']

and were ALL successful.

Change 549889 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] install_server: update partition layout for cloudcephmon100[1-3]

https://gerrit.wikimedia.org/r/549889

Change 549889 merged by Jhedden:
[operations/puppet@production] install_server: update partition layout for cloudcephmon100[1-3]

https://gerrit.wikimedia.org/r/549889

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephmon1001.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911081636_jeh_236894.log.

Completed auto-reimage of hosts:

['cloudcephmon1001.wikimedia.org']

Of which those FAILED:

['cloudcephmon1001.wikimedia.org']
JHedden added a subscriber: JHedden. Fri, Nov 8, 5:28 PM

@Jclark-ctr Could you help me with the cloudcephmon1002 and cloudcephmon1003 servers? I'm unable to power them on through iDRAC SSH, IPMI, or the web interface. I see the event logged in the lifecycle log but they never power on.

Change 549928 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] install_server: add cloudcephosd partman config

https://gerrit.wikimedia.org/r/549928

JHedden updated the task description. Fri, Nov 8, 9:18 PM

Change 550339 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] ceph: add spare::system role to ceph mon and osd

https://gerrit.wikimedia.org/r/550339

Change 549928 merged by Jhedden:
[operations/puppet@production] install_server: add cloudcephosd partman config

https://gerrit.wikimedia.org/r/549928

Change 550339 merged by Jhedden:
[operations/puppet@production] ceph: add spare::system role to ceph mon and osd

https://gerrit.wikimedia.org/r/550339

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephmon1001.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911111721_jeh_9851.log.

Completed auto-reimage of hosts:

['cloudcephmon1001.wikimedia.org']

Of which those FAILED:

['cloudcephmon1001.wikimedia.org']
JHedden added a comment. Edited Tue, Nov 12, 6:06 PM

@Jclark-ctr Could you help me with the cloudcephmon1002 and cloudcephmon1003 servers? I'm unable to power them on through iDRAC SSH, IPMI, or the web interface. I see the event logged in the lifecycle log but they never power on.

I've tried doing a racadm racreset but that didn't help. I think we'll need to have someone hard pull the power from these hosts to fully reset the system.

edit: this is a blocking issue; until we can control the power, we cannot proceed with the OS installation.

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephmon1002.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911122306_jeh_87570.log.

Completed auto-reimage of hosts:

['cloudcephmon1002.wikimedia.org']

and were ALL successful.

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephmon1003.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911122331_jeh_92744.log.

Completed auto-reimage of hosts:

['cloudcephmon1003.wikimedia.org']

and were ALL successful.
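
The reimage log names quoted in this task appear to encode a start time and the launching user. The Python sketch below splits a log path on that assumption; the `YYYYMMDDHHMM_user_suffix` layout is inferred from the file names shown here rather than from wmf-auto-reimage documentation, the meaning of the trailing number is not stated in the task, and `parse_reimage_log` is a hypothetical helper.

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_reimage_log(path):
    """Split a log path into (start time, user, trailing suffix).

    Assumes the file name looks like YYYYMMDDHHMM_user_suffix.log,
    as inferred from the names quoted in this task.
    """
    stem = PurePosixPath(path).stem  # drops the directory and ".log"
    stamp, user, suffix = stem.split("_", 2)
    started = datetime.strptime(stamp, "%Y%m%d%H%M")
    return started, user, suffix
```

For example, `parse_reimage_log("/var/log/wmf-auto-reimage/201911122331_jeh_92744.log")` gives a start time of 2019-11-12 23:31 and the user `jeh`, which lines up with the order of the reimage runs in this timeline.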

The mgmt issue seems to have been resolved. Was this done by @Jclark-ctr? I do not see an update.

Cmjohnson updated the task description. Wed, Nov 13, 3:50 PM
Cmjohnson reassigned this task from Cmjohnson to Bstorm. Wed, Nov 13, 3:52 PM
Cmjohnson removed a project: ops-eqiad.

@Bstorm assigning to you to update netbox once the systems are online. I am removing the ops-eqiad tag. If you have an issue, please add the tag back. Thanks.

JHedden closed this task as Resolved. Wed, Nov 13, 4:02 PM
JHedden updated the task description.

Change 550930 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] install_server: switch cloudcephosd to flat partition layout

https://gerrit.wikimedia.org/r/550930

Change 550930 merged by Jhedden:
[operations/puppet@production] install_server: switch cloudcephosd to flat partition layout

https://gerrit.wikimedia.org/r/550930