rack/setup/install (3) new osd ceph nodes
Open, High, Public

Description

This task will track the naming, racking, and setup of three new Ceph/OSD proof of concept nodes. This task has been filed WELL in advance of the hardware arrival, as there are likely network and racking considerations for discussion before racking location is determined.

Racking Location: All hosts in Row B, one host per rack to maintain some redundancy. Each rack should have one of these servers and one of the monitor servers from T228102

Hostname: cloudcephosd1001.wikimedia.org, cloudcephosd1002.wikimedia.org, cloudcephosd1003.wikimedia.org for public interfaces

Networking note: Each host should have one 10G Ethernet connection to the public subnet (wikimedia.org) and one to the private, internal subnet (eqiad.wmnet).

System #1 cloudcephosd1001:

  • - receive in system on procurement task T221698
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan) (ensure from comments where networking/vlan needs to be and if it needs more than one interface connected)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

System #2 cloudcephosd1002:

  • - receive in system on procurement task T221698
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan) (ensure from comments where networking/vlan needs to be and if it needs more than one interface connected)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

System #3 cloudcephosd1003:

  • - receive in system on procurement task T221698
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan) (ensure from comments where networking/vlan needs to be and if it needs more than one interface connected)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

Event Timeline

Worth noting that even though we will be using 10G links, we don't expect them to be fully used in any case in the short term.

If we put all the Ceph servers in row B (for the initial PoC), we could avoid saturating any upstream link or device; only the top-of-rack switches (asw2-b-eqiad) would be involved.
I know 10G in row B is limited right now, but I don't see any other option.
Rate limiting should be possible at both client and server levels (iptables should do the trick, or alternatively a tc setup similar to what we have right now with NFS). It should also be possible at the network hardware level (i.e., switches), I think. Also, what @ayounsi mentioned could be a good starting point: using only 1x10G in each server instead of 2x10G.

So here is my proposal:

  • rack each of the 6 new servers (3x mons, 3x OSDs) in different row B racks, using 1x10G links in each server
  • implement any simple rate limiting in case we want to be extra sure that we don't fully use 10G in every server

I'm just trying to find a compromise between our goals and what the network can handle.

That sounds reasonable for the PoC, depending on rack space. @faidon for the last word.

Note that we don't have visibility in the cross virtual chassis links. Adding it to LibreNMS is possible but would require dev time.

Fair. Are the virtual chassis links DAC cables or fiber? In any case, how much throughput do they have?

Bstorm added a subscriber: JHedden. Edited Jun 24 2019, 2:20 PM

Note that LibreNMS has a 5min granularity. That means a sudden spike of traffic will not get noticed right away.
We also have alerting for when a link reaches 80% utilization, with the same 5min caveat.
A "real time" view exists (eg. https://librenms.wikimedia.org/device/device=2/tab=port/port=139/view=realtime/ ) but it needs to be used carefully to not overwhelm the router's SNMP daemon.
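
To illustrate the 5min caveat, here is a minimal sketch of how a short burst can average out below the 80% alert threshold within one polling window. The burst length and baseline rate are made-up values for illustration, not measurements from our switches.

```python
# Illustrative only: the burst length and baseline rate used below are
# assumptions, not measurements from asw2-b-eqiad.
LINK_GBPS = 10.0         # link capacity
POLL_WINDOW_S = 300      # 5 minute polling granularity
ALERT_THRESHOLD = 0.80   # alert at 80% utilization


def window_average_gbps(burst_gbps, burst_s, baseline_gbps, window_s=POLL_WINDOW_S):
    """Average rate over one polling window that contains a single burst."""
    if not 0 <= burst_s <= window_s:
        raise ValueError("burst must fit inside the polling window")
    total_gbit = burst_gbps * burst_s + baseline_gbps * (window_s - burst_s)
    return total_gbit / window_s


def would_alert(avg_gbps, link_gbps=LINK_GBPS, threshold=ALERT_THRESHOLD):
    """True if the averaged utilization reaches the alert threshold."""
    return avg_gbps / link_gbps >= threshold


# Hypothetical 30 second burst at full line rate on top of a 1 Gb/s baseline.
avg = window_average_gbps(burst_gbps=10.0, burst_s=30, baseline_gbps=1.0)
print(f"5 min average: {avg:.2f} Gb/s, alert: {would_alert(avg)}")
```

In this made-up case the link is saturated for 30 seconds, yet the averaged value stays far below the threshold, so neither the graph nor the alert would show it.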

Cool, thanks.

Jumbo frames are enabled everywhere on the switch side, so make sure the proper MTU is set on the host side if you want to use it as a "natural" rate limiter.
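
A quick way to check the host side could look like the sketch below. It reads the MTU that Linux exposes under /sys/class/net; the expected value of 9000 and the interface names in the usage comment are assumptions and need to be replaced with what the hosts and switches actually use.

```python
from pathlib import Path

JUMBO_MTU = 9000  # assumed target value; confirm what the switch side expects


def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Return the MTU Linux reports for a network interface."""
    return int(Path(sysfs_root, iface, "mtu").read_text().strip())


def mtu_mismatches(ifaces, expected=JUMBO_MTU, sysfs_root="/sys/class/net"):
    """Map each interface whose MTU differs from `expected` to its current MTU."""
    found = {iface: read_mtu(iface, sysfs_root) for iface in ifaces}
    return {iface: mtu for iface, mtu in found.items() if mtu != expected}


# Usage on a host (interface names here are hypothetical):
#   mtu_mismatches(["eno1", "eno2"])
# An empty dict means every listed interface already has the expected MTU.
```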

Oooh! Also, cool. That's one of the strongest recommendations from the ceph community and also our new tech with experience in the area, @JHedden :)

Rate limiting on the network side is usually not advised as it has a bad performance hit. It is better to send the packets slower than to create an artificial bottleneck that the sending host has to detect and work around (TCP scaling, etc.).

I'll have to investigate some of the options outside of tc. There may be interesting blockers on the Openstack side and things like that we find...
Planning (with watermarks about ceph network capacity) is going to be essential.

Only using the public interface could be a way to "naturally" rate limit the cluster (10G total, instead of a theoretical max of 20G per host).

True, but the effect could be large. The backend will respond with the public interfaces, but it also generates a lot of its own traffic. The biggest problem I see there is a build-out phase requiring more hosts than it would otherwise, which goes back to the tradeoff between fewer, bigger nodes and many smaller nodes.

Another option I see would be to keep all the nodes in row B, limiting the impact radius of a misbehaving cluster to that one row. This also removes the cross-row client traffic.

This will likely only work during the PoC phase (because we are going to eat all the ports) and is obviously extending our HA problems down the road. It might be safer for the PoC in case we want to actually try to fill a link and see how hard that is to do (likely very!). I do concur with @aborrero that we are extremely unlikely to exceed the capacity of the links, but the theoretical possibility exists.

I do not like the idea of using only a single 10Gb link on each host if we can possibly avoid it because we will lose visibility into the behavior during our PoC (which eliminates some of our ability to answer these questions in the future) and it expands the ability to inadvertently DoS the cluster (which I do imagine to be possible for our users if all is on one link). We will have more limited information, and it is not best practice. Initial design is the most likely reason the cluster will later have problems, and the PoC should attempt to simulate the rollout where possible.

Had a huddle with @JHedden, actually. He'll add his thoughts soon (with some info from our existing monitoring).

There's a lot of good information in this task. I'm still catching up, but I wanted to note that it's important to consider the replication factor when designing the network architecture for Ceph. By default Ceph uses synchronous replicated pools, which ensures that data is physically copied to multiple OSDs before sending the acknowledgment to the client. This leads to another benefit of segmenting the public and cluster network traffic. For every single write request on the public network, there are 2 replicated writes on the cluster network.

Using average SATA SSD 500MB/s read and 300MB/s write speeds, the theoretical maximum bandwidth available per storage host is 4,000MB/s read and 2,400MB/s write.

If we could achieve these theoretical numbers the maximum network bandwidth per IO type would look like:
Per storage host:

type   theoretical max  public network  cluster network
read   4000MB/s         32Gb/s          0
write  2400MB/s         19Gb/s          38Gb/s

(Max value calculated from drive speed * 8 OSDs)

Aggregated storage cluster bandwidth:

type   theoretical max  public network  cluster network
read   12,000MB/s       96Gb/s          0
write  7,200MB/s        58Gb/s          116Gb/s

(Max value calculated from drive speed * 8 OSDs * 3 nodes)
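
The arithmetic behind these figures can be sketched as follows. This is only a worked example of the numbers above, assuming decimal units (1 MB/s = 8 Mb/s, 1 Gb/s = 1000 Mb/s) and 2 replicated writes on the cluster network per client write; the function names are made up for illustration. The tables round to whole Gb/s, so unrounded results differ slightly (for example 57.6 rather than 58).

```python
OSDS_PER_HOST = 8
HOSTS = 3
REPLICA_WRITES = 2  # replicated writes on the cluster network per client write


def mb_per_s_to_gbit(mb_per_s):
    """Convert MB/s to Gb/s using decimal units."""
    return mb_per_s * 8 / 1000


def host_bandwidth(read_mb_s=500, write_mb_s=300, osds=OSDS_PER_HOST):
    """Theoretical per-host maximums derived from per-drive speeds."""
    read_total = read_mb_s * osds
    write_total = write_mb_s * osds
    return {
        "read": {
            "max_mb_s": read_total,
            "public_gbit": mb_per_s_to_gbit(read_total),
            "cluster_gbit": 0.0,  # reads generate no replication traffic
        },
        "write": {
            "max_mb_s": write_total,
            "public_gbit": mb_per_s_to_gbit(write_total),
            "cluster_gbit": mb_per_s_to_gbit(write_total) * REPLICA_WRITES,
        },
    }


def cluster_bandwidth(hosts=HOSTS, **drive_speeds):
    """Scale the per-host figures to the whole storage cluster."""
    per_host = host_bandwidth(**drive_speeds)
    return {
        io_type: {key: value * hosts for key, value in figures.items()}
        for io_type, figures in per_host.items()
    }


for label, table in (("per host", host_bandwidth()), ("cluster", cluster_bandwidth())):
    for io_type, figures in table.items():
        print(label, io_type, figures)
```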

While theoretical is fun, real world is better. To get an idea of what this would look like today, here's some graphite metrics from the last 24 hours on the OpenStack hypervisors.

Total aggregated IO across all hypervisors

type   peak value  est public network  est cluster network
read   25MB/s      200Mb/s             0Mb/s
write  9MB/s       72Mb/s              144Mb/s
total  34MB/s      272Mb/s             144Mb/s

95th percentile aggregated IO across all hypervisors (we have a few noisy VMs)

type   peak value  est public network  est cluster network
read   10MB/s      80Mb/s              0Mb/s
write  3.5MB/s     28Mb/s              56Mb/s
total  13.5MB/s    108Mb/s             56Mb/s
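
For a sense of scale, the sketch below compares these estimates against a single 10G link. It is deliberately pessimistic: it pretends the aggregated traffic of all hypervisors lands on one link, and the "converged" case adds public and cluster traffic together as if both shared one port. The function names are made up for illustration.

```python
LINK_MBIT = 10_000  # one 10G link, in Mb/s


def link_utilization(mbit, link_mbit=LINK_MBIT):
    """Fraction of one link consumed by the given traffic."""
    return mbit / link_mbit


def summarize(public_mbit, cluster_mbit, link_mbit=LINK_MBIT):
    """Utilization with split networks and with both on a single port."""
    return {
        "public": link_utilization(public_mbit, link_mbit),
        "cluster": link_utilization(cluster_mbit, link_mbit),
        "converged": link_utilization(public_mbit + cluster_mbit, link_mbit),
    }


# Totals taken from the peak and 95th percentile tables above.
peak = summarize(public_mbit=272, cluster_mbit=144)
p95 = summarize(public_mbit=108, cluster_mbit=56)

for label, result in (("peak", peak), ("p95", p95)):
    print(label, {name: f"{value:.2%}" for name, value in result.items()})
```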

Graphite queries used to collect the data:

  • sumSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.write_byte_per_second,1))
  • sumSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.read_byte_per_second,1))
  • percentileOfSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.write_byte_per_second,1),95)
  • percentileOfSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.read_byte_per_second,1),95)
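
To re-run these queries, something like the sketch below builds URLs for Graphite's render endpoint using the `target`, `from`, and `format` parameters. The base URL is a placeholder, not our real Graphite host, and the 24 hour window matches the period mentioned above.

```python
from urllib.parse import urlencode

GRAPHITE_BASE = "https://graphite.example.org"  # placeholder host, replace it

QUERIES = {
    "write_sum": "sumSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.write_byte_per_second,1))",
    "read_sum": "sumSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.read_byte_per_second,1))",
    "write_p95": "percentileOfSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.write_byte_per_second,1),95)",
    "read_p95": "percentileOfSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.read_byte_per_second,1),95)",
}


def render_url(target, base=GRAPHITE_BASE, window="-24h"):
    """Build a render URL that returns the series as JSON."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base}/render?{query}"


for name, target in QUERIES.items():
    print(name, render_url(target))
```
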
Bstorm added a comment. Edited Jun 24 2019, 7:41 PM

I should point out that the PoC will not be capable of doing anywhere near that much IO. That would be what it would look like if we managed to convert the entire cluster to Ceph with a full build-out. We would not handle the full build-out with three OSD hosts because of those numbers above. However, that's the throughput the clients would be jostling for on Row B in a fully built-out condition on the frontend network. The backend network and frontend need to be split out so that things are not going through too few pipes.

@JHedden does make the very good point that backend writes are 2x the volume of frontend writes, which is why I think it is a terrible idea to use only one 10G port for each server (that goes for the mon servers as well). If you use a single port, you are tripling any actual write IO on that link.

Bstorm added a comment. Edited Jun 24 2019, 7:52 PM

Ok, that said, I did write that misreading Mbps for Gbps...but what I said is still true! The PoC won't be anywhere near all that, and our full build out is a trickle compared to theoretical limits--and we might even be able to converge the two networks, but I still dislike the option.

So, figuring from that data that while it may not be impossible to fill the link, it's extremely unlikely that we will (and we would still love to use jumbo frames), can we put this on other rows?

The above question is aimed at @ayounsi and @faidon.

Per what was decided by WMCS in T228102, the hostname proposal is now cloudcephosd100* for the three. Updating the description with that much at least.

Bstorm updated the task description. Jul 16 2019, 7:48 PM

I had a conversation with @faidon today, and I think the best way to move forward with this particular task is to ask if there is rack space and 10G ports available in Row B not just for these three, but also the three systems in T228102 (assuming they are cabled with one 10G port on public and one 10G port on internal networks for each server). This is so the PoC project can serve to determine precisely what the network needs are in the future so we know how best to proceed then with the future full build.

@RobH would you have that information (whether there's enough room now for the three in this task AND the three in the other task)? If so, we might be able to move forward.

Discussed with @RobH on IRC. This is doable as long as it can wait behind some 10G decommissions, which seems fine to me.
Updating the description to capture everything as much as possible.

Bstorm reassigned this task from ayounsi to RobH. Jul 25 2019, 6:37 PM
Bstorm updated the task description.

Note that there are 38 servers using SFP-Ts, which means they are using 1G on a 10G switch.

asw2-b-eqiad> show chassis hardware | match SFP-T | count 
Count: 38 lines

Ideally those should be the first ones to move out.

Cmjohnson reassigned this task from RobH to Jclark-ctr. Aug 14 2019, 3:05 PM
Cmjohnson added a subscriber: Jclark-ctr.

@Jclark-ctr can you add asset tags and enter these servers into Netbox? (T221698 is the procurement task.) Leave them on the floor, with the rack information blank in Netbox, until we know for sure where they're going. Once done, please re-assign back to Rob.

RobH added a comment. Aug 14 2019, 3:07 PM

Please do not assign this to me, it is awaiting installation by DC ops into 10G racks, and not on me.

This should be processed by the on-site engineers in eqiad and racked as soon as 10G ports become available for them.

Change 530246 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for cloudceph10[1-3]

https://gerrit.wikimedia.org/r/530246

Change 530246 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for cloudceph10[1-3]

https://gerrit.wikimedia.org/r/530246

cloudcephosd1001 10.65.2.177
cloudcephosd1002 10.65.2.178
cloudcephosd1003 10.65.2.179

Cmjohnson updated the task description. Aug 15 2019, 12:51 AM

Added asset tags and updated Netbox.

marilerr closed this task as Declined. Aug 24 2019, 3:20 AM
JJMC89 reopened this task as Open. Aug 24 2019, 3:21 AM

@Jclark-ctr please rack 1 each in B2/B4/B7 and update netbox

host              row  unit
cloudcephosd1001  b7   27
cloudcephosd1002  b4   12
cloudcephosd1003  b2   12

host              row  unit  port
cloudcephosd1001  b7   27    39/25
cloudcephosd1002  b4   12    43/42
cloudcephosd1003  b2   12    35/13

Jclark-ctr updated the task description.
aborrero triaged this task as High priority. Oct 4 2019, 8:49 AM

Raising priority of this ticket, since the ceph project is part of our Q2 goals.

Cmjohnson updated the task description. Oct 4 2019, 7:00 PM

The network switch config is done; the main port is on the public vlan and the 2nd port is on the private vlan.

RobH removed a subscriber: RobH. Oct 4 2019, 9:56 PM

The management interface for cloudcephosd1001.mgmt is currently unavailable. Could we get someone to take a look at it, please?

bast1002.wikimedia.org
jeh@bast1002:~$ host cloudcephosd1001.mgmt.eqiad.wmnet
cloudcephosd1001.mgmt.eqiad.wmnet has address 10.65.2.177
jeh@bast1002:~$ ping -c 1 cloudcephosd1001.mgmt.eqiad.wmnet
PING cloudcephosd1001.mgmt.eqiad.wmnet (10.65.2.177) 56(84) bytes of data.

--- cloudcephosd1001.mgmt.eqiad.wmnet ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

(note that the mgmt interfaces of the other cloudcephosds and cloudcephmons are working properly; it's just this host)

@JHedden Found that the host did not have an IP address configured. Re-entered the address and mgmt password.

@Jclark-ctr thanks! confirmed that it's working from my end now.

Change 549625 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/dns@master] wikimedia.org: add DNS entries for cloudceph mon and osd

https://gerrit.wikimedia.org/r/549625

Change 549625 merged by Jhedden:
[operations/dns@master] wikimedia.org: add DNS entries for cloudceph mon and osd

https://gerrit.wikimedia.org/r/549625

Change 549634 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/dns@master] wikimedia.org: update cloudcephmon and osd hostnames

https://gerrit.wikimedia.org/r/549634

Change 549634 merged by Jhedden:
[operations/dns@master] wikimedia.org: update cloudcephmon and osd hostnames

https://gerrit.wikimedia.org/r/549634

Mentioned in SAL (#wikimedia-operations) [2019-11-08T10:33:31Z] <jeh> enable IPMI racadm set iDRAC.IPMILan.Enable 1 on cloudcephosd[1-3] T224188

Change 550339 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] ceph: add spare::system role to ceph mon and osd

https://gerrit.wikimedia.org/r/550339

Change 550339 merged by Jhedden:
[operations/puppet@production] ceph: add spare::system role to ceph mon and osd

https://gerrit.wikimedia.org/r/550339

Cmjohnson reassigned this task from Cmjohnson to JHedden. Wed, Nov 13, 5:54 PM
Cmjohnson removed a project: ops-eqiad.

@JHedden What is the status of these servers? It looks like almost everything is finished, but the checkboxes are not complete. I believe the ops-eqiad work is finished, so I am removing the tag. Please resolve this task when you see fit.

JHedden updated the task description. Wed, Nov 13, 6:57 PM
ayounsi removed a subscriber: ayounsi. Wed, Nov 13, 6:57 PM
JHedden updated the task description. Wed, Nov 13, 7:00 PM

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1001.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911132125_jeh_73756.log.

Completed auto-reimage of hosts:

['cloudcephosd1001.wikimedia.org']

Of which those FAILED:

['cloudcephosd1001.wikimedia.org']

Change 550763 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] install_server: add boot partition to cloudcephosd config

https://gerrit.wikimedia.org/r/550763

Change 550763 merged by Jhedden:
[operations/puppet@production] install_server: add boot partition to cloudcephosd config

https://gerrit.wikimedia.org/r/550763

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1001.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911132244_jeh_88199.log.

Completed auto-reimage of hosts:

['cloudcephosd1001.wikimedia.org']

and were ALL successful.

JHedden updated the task description. Wed, Nov 13, 11:07 PM

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1002.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911141550_jeh_90677.log.

Completed auto-reimage of hosts:

['cloudcephosd1002.wikimedia.org']

Of which those FAILED:

['cloudcephosd1002.wikimedia.org']

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1002.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911141607_jeh_94737.log.

Completed auto-reimage of hosts:

['cloudcephosd1002.wikimedia.org']

Of which those FAILED:

['cloudcephosd1002.wikimedia.org']

Change 551226 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] install_server: update cloudcephosd partman config

https://gerrit.wikimedia.org/r/551226

Change 551226 merged by Jhedden:
[operations/puppet@production] install_server: update cloudcephosd partman config

https://gerrit.wikimedia.org/r/551226

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1002.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911151709_jeh_92596.log.

I'm having an issue on cloudcephosd1002 and 1003.

Using the first two 240GB drives, I created a RAID0 virtual disk and specified that as the boot device on the PERC H730P RAID controller. The OS installation process works as expected, but after a reboot the server is unable to boot from the virtual disk.

After a lot of digging, I found that this process works but it doesn't persist after a reboot:

  1. on the raid card set a non-raid physical drive as the boot device
  2. let the next boot fail, press F2 and return to controller management
  3. on the raid card set the virtual disk as the boot device
  4. the server boots as expected.

Once the server is cold booted, the host fails to boot even though the virtual disk is still set as the boot device.

@Cmjohnson @Jclark-ctr any ideas?

Looks like it's an issue with the virtual disk not getting assigned /dev/sda. Checking to see if I can work around this with our installation process and partman, but I may need to switch the controller to HBA mode and use software RAID for the operating system.

Change 551256 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] install_server: update cloudcephosd root disk

https://gerrit.wikimedia.org/r/551256

Change 551256 merged by Jhedden:
[operations/puppet@production] install_server: update cloudcephosd root disk

https://gerrit.wikimedia.org/r/551256

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1002.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911152041_jeh_132144.log.

Change 551267 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] install_server: cloudcephosd update grub bootdev

https://gerrit.wikimedia.org/r/551267

Change 551267 merged by Jhedden:
[operations/puppet@production] install_server: cloudcephosd update grub bootdev

https://gerrit.wikimedia.org/r/551267

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1002.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911152117_jeh_138328.log.

Completed auto-reimage of hosts:

['cloudcephosd1002.wikimedia.org']

and were ALL successful.

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1003.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911152139_jeh_143586.log.

Completed auto-reimage of hosts:

['cloudcephosd1003.wikimedia.org']

and were ALL successful.

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1001.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911152158_jeh_146929.log.

Completed auto-reimage of hosts:

['cloudcephosd1001.wikimedia.org']

Of which those FAILED:

['cloudcephosd1001.wikimedia.org']
JHedden updated the task description. Fri, Nov 15, 10:55 PM

I've updated the task details with the current status. Should I leave the netbox status as staged or set it to active? These systems will be testing non-production workloads for the near future, and active seems to imply production status.