rack/setup/install cloudvirtan100[1-5].eqiad.wmnet
Closed, Resolved, Public

Description

This task will track the racking, setup, and installation of cloudvirtan100[1-5].eqiad.wmnet. This hardware is specced as Analytics Hadoop worker nodes and is allocated to the cloud data lake project. The hosts should be in the Cloud VPS networks and will be set up as Cloud Virt nodes in the new cloud-analytics project.

These nodes have 10G NICs, so please make sure they are using 10G switch ports.
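
For reference, the bracket notation in the task title stands for five individual hosts. A minimal sketch of expanding it, assuming a single numeric `[a-b]` range per name (the helper `expand_host_range` is hypothetical and not part of any Wikimedia tooling):

```python
import re
from typing import List


def expand_host_range(pattern: str) -> List[str]:
    """Expand one numeric [a-b] range in a hostname pattern into hostnames."""
    match = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if match is None:
        # No range present: the pattern is already a single hostname.
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    width = len(match.group(1))  # preserve zero padding, if any
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(start, end + 1)]


hosts = expand_host_range("cloudvirtan100[1-5].eqiad.wmnet")
for host in hosts:
    print(host)
```

Each resulting hostname gets its own checklist below.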

cloudvirtan1001:

  • - receive in system on procurement task T204177
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - second 10G interface connected and setup similar to cloudvirt secondary connections
  • - handoff for service implementation
  • - service implementation team should change the netbox status from STAGED to ACTIVE.

cloudvirtan1002:

  • - receive in system on procurement task T204177
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - second 10G interface connected and setup similar to cloudvirt secondary connections
  • - handoff for service implementation
  • - service implementation team should change the netbox status from STAGED to ACTIVE.

cloudvirtan1003:

  • - receive in system on procurement task T204177
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - second 10G interface connected and setup similar to cloudvirt secondary connections
  • - handoff for service implementation
  • - service implementation team should change the netbox status from STAGED to ACTIVE.

cloudvirtan1004:

  • - receive in system on procurement task T204177
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - second 10G interface connected and setup similar to cloudvirt secondary connections
  • - handoff for service implementation
  • - service implementation team should change the netbox status from STAGED to ACTIVE.

cloudvirtan1005:

  • - receive in system on procurement task T204177
  • - rack system with proposed racking plan (see T201352) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - second 10G interface connected and setup similar to cloudvirt secondary connections
  • - handoff for service implementation
  • - service implementation team should change the netbox status from STAGED to ACTIVE.

Event Timeline

RobH triaged this task as Normal priority. Oct 16 2018, 5:04 PM
RobH created this task.
RobH reassigned this task from Cmjohnson to Ottomata. Oct 16 2018, 5:11 PM
RobH added a subscriber: Cmjohnson.

So, we are not sure about what vlan these will be going into. This could affect what row they go into.

@Ottomata: Can you advise what vlan/subnet these will be going into? Cloud Services has a few different ones; I assumed the labs-support vlan, but I'm not sure.

Please advise and assign back to either myself or Chris to determine if our racking proposal (10G racks evenly spaced between racks and rows) will work.

They need to be reachable by the Analytics VLAN, so I would normally propose that one. Since this is a special case, maybe it makes more sense to put them into a Cloud VPS VLAN? Not sure. Maybe @chasemp can advise?

Cmjohnson moved this task from Backlog to Racking Tasks on the ops-eqiad board. Oct 18 2018, 4:13 PM
Cmjohnson updated the task description. Oct 25 2018, 6:13 PM
Ottomata renamed this task from rack/setup/install ca-worker100[1-5].eqiad.wmnet to rack/setup/install cloudvirtan100[1-5].eqiad.wmnet. Nov 13 2018, 4:19 PM
Ottomata updated the task description.

@Ottomata can these go in any row or does it need to be row B?

IIUC it has to be row B for them to be used as Cloud Virts. @Andrew to confirm. If they can go in any row, then they should be spread as evenly as possible across as many rows as possible.

Yep, Row B. @Cmjohnson, these are to be handled like any other cloudvirt server. For network details it might be best to consult with @ayounsi.

Ping on this! I know it is TG week so things might be slow, but I'm checking in anyway :)

RobH updated the task description. Nov 19 2018, 5:41 PM
RobH updated the task description.

Bump! What's the timeline on getting these racked?

Ottomata reassigned this task from Ottomata to RobH. Nov 27 2018, 5:31 PM

(robh reassign as appropriate)

Change 476317 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for cloudvirtan100[1-5}

https://gerrit.wikimedia.org/r/476317

Change 476317 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for cloudvirtan100[1-5}

https://gerrit.wikimedia.org/r/476317

Cmjohnson updated the task description. Nov 28 2018, 9:35 PM

@RobH these are ready for installs. I changed the primary NIC to boot from the 10G NIC. The RAID was set up exactly like an analytics box: RAID 1 on the SSDs and RAID 0 on the other 11 disks.

@Cmjohnson I don't know what the proper disk layout for these is, since they will be Cloud Virt nodes. I doubt RAID 0 is the way... @Andrew can you comment?

Andrew added a comment (edited). Nov 29 2018, 9:06 PM

Having the two ssds in a raid 1 sounds right to me.

For the main storage, I'm not sure that it matters a lot. We use raid 10 on our standard virt boxes; that would be my preference in this case but it'll get you less total storage available.

Hmm, ok, then I think in this case RAID 0 is fine. Since these will have Hadoop, data will be replicated across nodes 3x anyway.

As I understand it, with raid 0, if a single drive dies the whole system (and the VMs it contains) will have to be rebuilt. Also, I expect that we would have to shut the servers off to replace the drive, although I'm not certain about that.

As long as your worker nodes are going to be true cattle, disposable and easy to rebuild, then raid 0 seems fine.

Hm. They are cattle, but it would probably be nice if the whole node doesn't go down when we lose a drive, and we can deal with the loss of storage from RAID 10. We should do RAID 10.

@Cmjohnson sorry, can we redo this with RAID 10 instead?
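
To make the trade-off discussed above concrete, here is a rough sketch comparing the layouts in units of "one disk's worth" of capacity. The disk counts (2 SSDs, 11 data disks) come from this task; the disk size is not stated here, so none is assumed. The function `raid_summary` is hypothetical, and how a real controller treats an odd disk count in RAID 10 is an assumption (the leftover disk is simply not counted).

```python
from typing import Tuple


def raid_summary(level: str, n_disks: int) -> Tuple[int, int]:
    """Return (usable capacity in disk units, guaranteed tolerated disk failures)."""
    if level == "raid0":
        # Striping only: all capacity is usable, any single failure loses the array.
        return n_disks, 0
    if level == "raid1":
        # Every disk holds the same data.
        return 1, n_disks - 1
    if level == "raid10":
        if n_disks < 4:
            raise ValueError("RAID 10 needs at least 4 disks")
        # Striped mirror pairs; an odd leftover disk is ignored (assumption).
        pairs = n_disks // 2
        # Only one failure is guaranteed safe; more survive only if they hit different pairs.
        return pairs, 1
    raise ValueError(f"unsupported RAID level: {level}")


for label, level, disks in [
    ("SSDs", "raid1", 2),
    ("data disks", "raid0", 11),
    ("data disks", "raid10", 11),
]:
    capacity, failures = raid_summary(level, disks)
    print(f"{label}: {level} on {disks} disks -> {capacity} disk(s) usable, "
          f"survives {failures} failure(s)")
```

Under these assumptions RAID 10 gives up more than half of the data-disk capacity in exchange for surviving a single drive failure.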

Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board. Dec 3 2018, 2:29 PM
RobH added a comment (edited). Dec 3 2018, 5:55 PM

So, this isn't 100% clear to me off that task.

@Ottomata Will the first ethernet port on these hosts be in cloud-hosts1-b-eqiad? Will they need a second port attached, and if so, in what vlan?

RobH added a comment. Dec 3 2018, 5:58 PM

Update from IRC chat with Otto: these should have the same network/vlan setup as the cloudvirts. So we'll have to add the secondary port to the switch post OS install.

Change 477317 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] setting production dns for cloudvirtan100[1-5]]

https://gerrit.wikimedia.org/r/477317

Change 477317 merged by RobH:
[operations/dns@master] setting production dns for cloudvirtan100[1-5]]

https://gerrit.wikimedia.org/r/477317

Change 477322 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] adding ipv6 for cloudvirtan hosts

https://gerrit.wikimedia.org/r/477322

Change 477322 merged by RobH:
[operations/dns@master] adding ipv6 for cloudvirtan hosts

https://gerrit.wikimedia.org/r/477322

Change 477327 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] setting cloudvirtan install params

https://gerrit.wikimedia.org/r/477327

Change 477327 merged by RobH:
[operations/puppet@production] setting cloudvirtan install params

https://gerrit.wikimedia.org/r/477327

RobH updated the task description.
This comment was removed by RobH.
RobH updated the task description. Dec 4 2018, 12:28 AM
RobH updated the task description. Dec 4 2018, 4:46 AM
RobH reassigned this task from RobH to Cmjohnson. Dec 4 2018, 4:53 AM

Ok, this has had puppet run on all of the hosts. This is now ready for @Cmjohnson to attach the other 10G interface.

Chris, once the secondary interface is connected (in the same fashion as on the other cloudvirts), this can be assigned to @Ottomata for analytics use.

Cmjohnson reassigned this task from Cmjohnson to RobH. Dec 5 2018, 3:46 PM

@RobH

All hosts have been cabled and the switch ports updated, minus the vlan. Can you please set the vlan to whatever they need?

Interface   Admin  Link  Description
xe-2/0/6    up     up    cloudvirtan1001 eth0
xe-2/0/7    up     up    cloudvirtan1002 eth0
xe-2/0/12   up     up    cloudvirtan1001 eth1
xe-2/0/13   up     up    cloudvirtan1002 eth1
xe-4/0/5    up     up    cloudvirtan1003 eth0
xe-4/0/10   up     up    cloudvirtan1003 eth1
xe-7/0/24   up     up    cloudvirtan1004 eth0
xe-7/0/25   up     up    cloudvirtan1005 eth0
xe-7/0/30   up     up    cloudvirtan1004 eth1
xe-7/0/31   up     up    cloudvirtan1005 eth1

RobH reassigned this task from RobH to Ottomata. Dec 5 2018, 6:23 PM
RobH removed a project: ops-eqiad.

[edit interfaces interface-range cloud-virt-instance-trunk]
     member ge-5/0/11 { ... }
+    member xe-2/0/12;
+    member xe-2/0/13;
+    member xe-4/0/10;
+    member xe-7/0/30;
+    member xe-7/0/31;
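
The added members are exactly the eth1 ports from Chris's port list. As an illustration only, the mapping can be sketched as below; the function name `secondary_port_members` is hypothetical, and the five-column layout (interface, admin state, link state, host, NIC) is assumed from the listing in this task rather than from any switch API.

```python
from typing import List

PORT_LISTING = """\
xe-2/0/6 up up cloudvirtan1001 eth0
xe-2/0/7 up up cloudvirtan1002 eth0
xe-2/0/12 up up cloudvirtan1001 eth1
xe-2/0/13 up up cloudvirtan1002 eth1
xe-4/0/5 up up cloudvirtan1003 eth0
xe-4/0/10 up up cloudvirtan1003 eth1
xe-7/0/24 up up cloudvirtan1004 eth0
xe-7/0/25 up up cloudvirtan1005 eth0
xe-7/0/30 up up cloudvirtan1004 eth1
xe-7/0/31 up up cloudvirtan1005 eth1
"""


def secondary_port_members(listing: str, nic: str = "eth1") -> List[str]:
    """Return 'member <interface>;' lines for every port described as the given NIC."""
    members = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or unexpected lines
        interface, _admin, _link, _host, port_nic = fields
        if port_nic == nic:
            members.append(f"member {interface};")
    return members


members = secondary_port_members(PORT_LISTING)
print("\n".join(members))
```

The eth0 ports are left out here; only the secondary interfaces join the cloud-virt-instance-trunk range in the diff above.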

Ok, this is now ready for the Analytics team to take over. As @Ottomata was the person from that team commenting on this task, I'm assigning this to him.

Analytics team: this can be resolved once you are aware of it and have taken over for service implementation!

RobH updated the task description. Dec 5 2018, 6:36 PM

Change 477830 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Make cloudvirtan100x boxes eqiad1 labvirts

https://gerrit.wikimedia.org/r/477830

Change 477830 merged by Andrew Bogott:
[operations/puppet@production] Make cloudvirtan100x boxes eqiad1 labvirts

https://gerrit.wikimedia.org/r/477830

Andrew closed this task as Resolved. Dec 5 2018, 9:36 PM
Andrew updated the task description.