⚓ T224188 rack/setup/install (3) new osd ceph nodes

Subject	Repo	Branch	Lines +/-
install_server: cloudcephosd update grub bootdev	operations/puppet	production	+1 -0
install_server: update cloudcephosd root disk	operations/puppet	production	+8 -4
install_server: update cloudcephosd partman config	operations/puppet	production	+7 -8
install_server: add boot partition to cloudcephosd config	operations/puppet	production	+17 -7
ceph: add spare::system role to ceph mon and osd	operations/puppet	production	+10 -0
wikimedia.org: update cloudcephmon and osd hostnames	operations/dns	master	+24 -24
wikimedia.org: add DNS entries for cloudceph mon and osd	operations/dns	master	+26 -0
Adding mgmt dns for cloudceph10[1-3]	operations/dns	master	+12 -0

In T224188#5277181, @ayounsi wrote:

Note that LibreNMS has a 5-minute granularity. That means that if a sudden spike of traffic appears, it will not be noticed right away.
We also have alerting for when the link reaches 80% utilization, with the same 5-minute caveat.
A "real time" view exists (eg. https://librenms.wikimedia.org/device/device=2/tab=port/port=139/view=realtime/ ) but it needs to be used carefully to not overwhelm the router's SNMP daemon.

Cool, thanks.
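To make the 5-minute caveat concrete, here is a minimal sketch of how averaging over one polling window can hide a short burst. The baseline rate, burst length and function name are illustrative assumptions, not measurements from this cluster; only the 300-second window, the 10G link and the 80% threshold come from the comment above.

```python
# Sketch: how a 5-minute average can mask a short burst on a 10G link.
# All traffic figures below are hypothetical examples.

LINK_MBPS = 10_000       # 10G link, in Mb/s
ALERT_THRESHOLD = 0.80   # alert fires at 80% utilization


def window_average_mbps(baseline_mbps, burst_mbps, burst_seconds, window_seconds=300):
    """Average rate over one polling window that contains a single burst."""
    if not 0 <= burst_seconds <= window_seconds:
        raise ValueError("burst must fit inside the polling window")
    quiet_seconds = window_seconds - burst_seconds
    total_megabits = baseline_mbps * quiet_seconds + burst_mbps * burst_seconds
    return total_megabits / window_seconds


# Hypothetical case: 500 Mb/s baseline with a 30-second burst at line rate.
avg = window_average_mbps(baseline_mbps=500, burst_mbps=LINK_MBPS, burst_seconds=30)
utilization = avg / LINK_MBPS
alert_fires = utilization >= ALERT_THRESHOLD

print(f"5-minute average: {avg:.0f} Mb/s, utilization {utilization:.1%}, alert: {alert_fires}")
```

In this made-up case the link is saturated for 30 seconds, yet the window average stays far below the alert threshold.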

Jumbo frames are enabled everywhere on the switch side, so make sure the proper MTU is set on the host side if you want to use it as a "natural" rate limiter.

Oooh! Also, cool. That's one of the strongest recommendations from the ceph community and also our new tech with experience in the area, @JHedden :)

Rate limiting on the network side is usually not advised as it has a bad performance hit. It is better to send the packets slower than to create an artificial bottleneck that the sending host has to detect and work around (TCP scaling, etc.).

I'll have to investigate some of the options outside of tc. We may find interesting blockers on the OpenStack side and things like that...
Planning (with watermarks about ceph network capacity) is going to be essential.

Only using the public interface could be a way to "naturally" rate limit the cluster (10G total, instead of a theoretical max of 20G per host).

True, but the effect could be large. The backend will respond with the public interfaces, but it also generates a lot of its own traffic. The biggest problem I see there is a build-out phase requiring more hosts than it would otherwise, which goes back to the tradeoff between fewer, bigger nodes and many, smaller nodes.

Another option I see would be to keep all the nodes in row B, limiting the impact radius of a misbehaving cluster to that one row. This also removes the cross-row client traffic.

This will likely only work during the PoC phase (because we are going to eat all the ports) and is obviously extending our HA problems down the road. It might be safer for the PoC in case we want to actually try to fill a link and see how hard that is to do (likely very!). I do concur with @aborrero that we are extremely unlikely to exceed the capacity of the links, but the theoretical possibility exists.

I do not like the idea of using only a single 10Gb link on each host if we can possibly avoid it because we will lose visibility into the behavior during our PoC (which eliminates some of our ability to answer these questions in the future) and it expands the ability to inadvertently DoS the cluster (which I do imagine to be possible for our users if all is on one link). We will have more limited information, and it is not best practice. Initial design is the most likely reason the cluster will later have problems, and the PoC should attempt to simulate the rollout where possible.

Had a huddle with @JHedden, actually. He'll add his thoughts soon (with some info from our existing monitoring).

There's a lot of good information in this task. I'm still catching up, but I wanted to note that it's important to consider the replication factor when designing the network architecture for Ceph. By default Ceph uses synchronous replicated pools, which ensures that data is physically copied to multiple OSDs before sending the acknowledgment to the client. This leads to another benefit of segmenting the public and cluster network traffic. For every single write request on the public network, there are 2 replicated writes on the cluster network.

Using average SATA SSD 500MB/s read and 300MB/s write speeds, the theoretical maximum bandwidth available per storage host is 4,000MB/s read and 2,400MB/s write.

If we could achieve these theoretical numbers the maximum network bandwidth per IO type would look like:
Per storage host:

type	theoretical max	public network	cluster network
read	4000MB/s	32Gb/s	0
write	2400MB/s	19Gb/s	38Gb/s

(Max value calculated from drive speed * 8 OSDs)

Aggregated storage cluster bandwidth:

type	theoretical max	public network	cluster network
read	12,000MB/s	96Gb/s	0
write	7,200MB/s	58Gb/s	116Gb/s

(Max value calculated from drive speed * 8 OSDs * 3 nodes)
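The arithmetic behind the two theoretical tables can be sketched as follows. It follows the tables' own simplification (client traffic equals raw drive throughput, and every client write produces two extra copies on the cluster network). The replica count of 3 is an assumption inferred from the "2 replicated writes" statement; the function names are illustrative. The exact results (19.2, 38.4, 57.6 and 115.2 Gb/s) differ slightly from the rounded table values.

```python
# Sketch: reproduce the theoretical bandwidth tables from drive speeds.
# Assumes decimal units (1 Gb = 1000 Mb) and a replicated pool of size 3.

OSDS_PER_HOST = 8
HOSTS = 3
READ_MB_S = 500    # per-drive SATA SSD read speed
WRITE_MB_S = 300   # per-drive SATA SSD write speed
REPLICAS = 3       # assumption: 1 primary copy + 2 replicated writes


def mb_s_to_gbit_s(mb_per_s):
    """Convert megabytes per second to gigabits per second."""
    return mb_per_s * 8 / 1000


def bandwidth(drive_mb_s, is_write, hosts=1):
    """Return (disk MB/s, public Gb/s, cluster Gb/s) for the given host count."""
    disk = drive_mb_s * OSDS_PER_HOST * hosts
    public = mb_s_to_gbit_s(disk)
    # Reads are served from one copy; writes fan out to the other replicas.
    cluster = public * (REPLICAS - 1) if is_write else 0.0
    return disk, public, cluster


host_read = bandwidth(READ_MB_S, is_write=False)
host_write = bandwidth(WRITE_MB_S, is_write=True)
cluster_read = bandwidth(READ_MB_S, is_write=False, hosts=HOSTS)
cluster_write = bandwidth(WRITE_MB_S, is_write=True, hosts=HOSTS)

for label, row in [("host read", host_read), ("host write", host_write),
                   ("cluster read", cluster_read), ("cluster write", cluster_write)]:
    disk, public, cluster = row
    print(f"{label}: {disk} MB/s, public {public:.1f} Gb/s, cluster {cluster:.1f} Gb/s")
```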

While theoretical is fun, real world is better. To get an idea of what this would look like today, here's some graphite metrics from the last 24 hours on the OpenStack hypervisors.

Total aggregated IO across all hypervisors

type	peak value	est public network	est cluster network
read	25MB/s	200Mb/s	0Mb/s
write	9MB/s	72Mb/s	144Mb/s
total	34MB/s	272Mb/s	144Mb/s

95th percentile aggregated IO across all hypervisors (we have a few noisy VMs)

type	peak value	est public network	est cluster network
read	10MB/s	80Mb/s	0Mb/s
write	3.5MB/s	28Mb/s	56Mb/s
total	13.5MB/s	108Mb/s	56Mb/s
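As a rough check on the two measured tables, the sketch below converts the observed MB/s figures into estimated network load and expresses them as a share of a single 10G port. The factor of two extra copies per write is taken from the replication comment above; comparing against one 10G port per network is an assumption for illustration.

```python
# Sketch: estimate public/cluster network load from hypervisor IO figures
# and compare against a single 10G port (an illustrative assumption).

LINK_MBIT = 10_000   # one 10G port, in Mb/s
EXTRA_COPIES = 2     # replicated writes per client write


def estimate_network(read_mb_s, write_mb_s):
    """Return (public Mb/s, cluster Mb/s) for the given disk IO in MB/s."""
    public = (read_mb_s + write_mb_s) * 8
    cluster = write_mb_s * 8 * EXTRA_COPIES
    return public, cluster


peak_public, peak_cluster = estimate_network(read_mb_s=25, write_mb_s=9)
p95_public, p95_cluster = estimate_network(read_mb_s=10, write_mb_s=3.5)

for label, public, cluster in [("peak", peak_public, peak_cluster),
                               ("95th percentile", p95_public, p95_cluster)]:
    print(f"{label}: public {public:g} Mb/s ({public / LINK_MBIT:.2%} of 10G), "
          f"cluster {cluster:g} Mb/s ({cluster / LINK_MBIT:.2%} of 10G)")
```

Under these assumptions, even the peak figures amount to only a few percent of one 10G port on either network.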

Graphite queries used to collect the data:

sumSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.write_byte_per_second,1))
sumSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.read_byte_per_second,1))
percentileOfSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.write_byte_per_second,1),95)
percentileOfSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.read_byte_per_second,1),95)
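For anyone who wants to re-run these queries, here is a minimal sketch that builds a Graphite render API URL for the four targets over the last 24 hours. The base URL is a placeholder (the real Graphite host is not given in this task) and the helper name is hypothetical; `target`, `from` and `format` are standard render API parameters.

```python
# Sketch: build a Graphite /render URL for the queries listed above.
# GRAPHITE_BASE is a hypothetical placeholder, not the real host.
from urllib.parse import parse_qs, urlencode, urlsplit

GRAPHITE_BASE = "https://graphite.example.org"

TARGETS = [
    "sumSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.write_byte_per_second,1))",
    "sumSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.read_byte_per_second,1))",
    "percentileOfSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.write_byte_per_second,1),95)",
    "percentileOfSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.read_byte_per_second,1),95)",
]


def render_url(targets, base=GRAPHITE_BASE, since="-24h"):
    """Return a render URL requesting JSON for every target since `since`."""
    params = [("target", t) for t in targets]
    params += [("from", since), ("format", "json")]
    return f"{base}/render?{urlencode(params)}"


url = render_url(TARGETS)
# Round-trip the query string to confirm the targets survive URL encoding.
decoded = parse_qs(urlsplit(url).query)
print(url)
```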

I should point out that the PoC will not be capable of doing anywhere near that much IO. That is what it would look like if we managed to convert the entire cluster to Ceph with a full build-out. We would not handle the full build-out with three OSDs because of those numbers above. However, that's the throughput the clients would be jostling for on Row B in a fully built-out condition on the frontend network. The backend network and frontend need to be split out so that things are not going through too few pipes.

@JHedden does make the very good point that backend writes are 3x the volume of frontend writes, which is why I think it is a terrible idea to use only one 10G port for each server (that goes for the mon servers as well). If you use a single port, you are quadrupling any actual IO.

OK, that said, I did write that after misreading Mbps as Gbps... but what I said is still true! The PoC won't be anywhere near all that, and our full build-out is a trickle compared to theoretical limits. We might even be able to converge the two networks, but I still dislike the option.

So, based on that data: while it may not be impossible to fill the link, it's extremely unlikely that we will (and we would still love to use jumbo frames). Can we put this on other rows?

The above question is aimed at @ayounsi and @faidon.

bd808 moved this task from Inbox to Doing on the cloud-services-team (Kanban) board. Jul 11 2019, 4:28 PM

RobH mentioned this in T228102: rack/setup/install cloudcephmon100[123]. Jul 15 2019, 8:13 PM

Per what was decided by WMCS in T228102, the hostname proposal is now cloudcephosd100* for the three. Updating the description with that much at least.

• Bstorm updated the task description. Jul 16 2019, 7:48 PM

I had a conversation with @faidon today, and I think the best way to move forward with this particular task is to ask if there is rack space and 10G ports available in Row B not just for these three, but also the three systems in T228102 (assuming they are cabled with one 10G port on public and one 10G port on internal networks for each server). This is so the PoC project can serve to determine precisely what the network needs are in the future so we know how best to proceed then with the future full build.

@RobH would you have that information (whether there's enough room now for the three in this task AND the three in the other task)? If so, we might be able to move forward.

Discussed with @RobH on IRC. This is doable as long as it can wait behind some 10G decommissions, which seems fine to me.
Updating the description to capture everything as much as possible.

• Bstorm reassigned this task from ayounsi to RobH. Jul 25 2019, 6:37 PM

• Bstorm updated the task description.

Note that there are 38 servers using SFP-Ts, which means using 1G on a 10G switch.

asw2-b-eqiad> show chassis hardware | match SFP-T | count 
Count: 38 lines

Ideally those should be the first ones to move out.

@Jclark-ctr can you add asset tags and enter these servers into Netbox? (T221698 is the procurement task.) Leave them on the floor and leave the rack information blank in Netbox until we know for sure where they're going. Once done, please re-assign back to Rob.

Please do not assign this to me; it is awaiting installation by DC ops into 10G racks, and is not on me.

This should be processed by the on-site engineers in eqiad and racked as soon as 10G becomes available for them.

Change 530246 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for cloudceph10[1-3]

https://gerrit.wikimedia.org/r/530246

Change 530246 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for cloudceph10[1-3]

https://gerrit.wikimedia.org/r/530246

cloudcephosd1001 10.65.2.177
cloudcephosd1002 10.65.2.178
cloudcephosd1003 10.65.2.179

• Cmjohnson updated the task description. Aug 15 2019, 12:51 AM

Maintenance_bot removed a project: Patch-For-Review. Aug 15 2019, 1:10 AM

Added asset tags and updated Netbox.

• marilerr closed this task as Declined. Aug 24 2019, 3:20 AM

JJMC89 reopened this task as Open. Aug 24 2019, 3:21 AM

@Jclark-ctr please rack one each in B2/B4/B7 and update Netbox.

host	row	unit
cloudcephosd1001	b7	27
cloudcephosd1002	b4	12
cloudcephosd1003	b2	12

host	row	unit	port
cloudcephosd1001	b7	27	39/25
cloudcephosd1002	b4	12	43/42
cloudcephosd1003	b2	12	35/13

Jclark-ctr reassigned this task from Jclark-ctr to • Cmjohnson. Sep 11 2019, 12:38 PM

Jclark-ctr updated the task description.

bd808 moved this task from Doing to Watching on the cloud-services-team (Kanban) board. Sep 12 2019, 9:29 PM

Raising priority of this ticket, since the ceph project is part of our Q2 goals.

The network switch config is done: the main port is on the public VLAN and the second port is on the private VLAN.

RobH unsubscribed. Oct 4 2019, 9:56 PM

The management interface for cloudcephosd1001.mgmt is currently unavailable. Could we get someone to take a look at it, please?

bast1002.wikimedia.org
jeh@bast1002:~$ host cloudcephosd1001.mgmt.eqiad.wmnet
cloudcephosd1001.mgmt.eqiad.wmnet has address 10.65.2.177
jeh@bast1002:~$ ping -c 1 cloudcephosd1001.mgmt.eqiad.wmnet
PING cloudcephosd1001.mgmt.eqiad.wmnet (10.65.2.177) 56(84) bytes of data.

--- cloudcephosd1001.mgmt.eqiad.wmnet ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

(Note: the mgmt interfaces of the other cloudcephosd and cloudcephmon hosts are working properly; it's just this host.)
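When checking several mgmt interfaces at once, the summary line of `ping` can be parsed rather than read by eye. The following is a minimal sketch that extracts the counters from output in the format shown above; the regular expression and function name are illustrative and assume the Linux iputils summary format.

```python
# Sketch: parse the summary line of Linux `ping` output.
# The sample text is copied from the session above.
import re

PING_SUMMARY = re.compile(
    r"(?P<sent>\d+) packets transmitted, (?P<received>\d+) received"
    r".*?(?P<loss>\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(output):
    """Return (sent, received, loss_percent) from ping's statistics block."""
    match = PING_SUMMARY.search(output)
    if match is None:
        raise ValueError("no ping summary line found")
    return int(match["sent"]), int(match["received"]), float(match["loss"])


sample = """--- cloudcephosd1001.mgmt.eqiad.wmnet ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms"""

sent, received, loss = parse_ping_summary(sample)
reachable = received > 0
print(f"sent={sent} received={received} loss={loss}% reachable={reachable}")
```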

@JHedden Found that the host did not have an IP address configured. Re-entered the address and mgmt password.

@Jclark-ctr thanks! confirmed that it's working from my end now.

Change 549625 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/dns@master] wikimedia.org: add DNS entries for cloudceph mon and osd

https://gerrit.wikimedia.org/r/549625

gerritbot added a project: Patch-For-Review. Nov 7 2019, 8:17 PM

Change 549625 merged by Jhedden:
[operations/dns@master] wikimedia.org: add DNS entries for cloudceph mon and osd

https://gerrit.wikimedia.org/r/549625

Change 549634 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/dns@master] wikimedia.org: update cloudcephmon and osd hostnames

https://gerrit.wikimedia.org/r/549634

Change 549634 merged by Jhedden:
[operations/dns@master] wikimedia.org: update cloudcephmon and osd hostnames

https://gerrit.wikimedia.org/r/549634

Maintenance_bot removed a project: Patch-For-Review. Nov 7 2019, 9:10 PM

Mentioned in SAL (#wikimedia-operations) [2019-11-08T10:33:31Z] <jeh> enable IPMI racadm set iDRAC.IPMILan.Enable 1 on cloudcephosd[1-3] T224188

Change 550339 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] ceph: add spare::system role to ceph mon and osd

https://gerrit.wikimedia.org/r/550339

gerritbot added a project: Patch-For-Review. Nov 11 2019, 5:14 PM

Change 550339 merged by Jhedden:
[operations/puppet@production] ceph: add spare::system role to ceph mon and osd

https://gerrit.wikimedia.org/r/550339

Maintenance_bot removed a project: Patch-For-Review. Nov 11 2019, 6:10 PM

@JHedden What is the status of these servers? It looks like almost everything is finished, but the checkboxes are not complete. I believe the ops-eqiad work is finished, so I am removing the tag. Please resolve this task when you see fit.

• JHedden updated the task description. Nov 13 2019, 6:57 PM

ayounsi unsubscribed. Nov 13 2019, 6:57 PM

• JHedden updated the task description. Nov 13 2019, 7:00 PM

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1001.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911132125_jeh_73756.log.

Completed auto-reimage of hosts:

['cloudcephosd1001.wikimedia.org']

Of which those FAILED:

['cloudcephosd1001.wikimedia.org']

Change 550763 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] install_server: add boot partition to cloudcephosd config

https://gerrit.wikimedia.org/r/550763

gerritbot added a project: Patch-For-Review. Nov 13 2019, 10:22 PM

Change 550763 merged by Jhedden:
[operations/puppet@production] install_server: add boot partition to cloudcephosd config

https://gerrit.wikimedia.org/r/550763

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1001.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911132244_jeh_88199.log.

Completed auto-reimage of hosts:

['cloudcephosd1001.wikimedia.org']

and were ALL successful.

• JHedden updated the task description. Nov 13 2019, 11:07 PM

Maintenance_bot removed a project: Patch-For-Review. Nov 13 2019, 11:10 PM

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1002.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911141550_jeh_90677.log.

Completed auto-reimage of hosts:

['cloudcephosd1002.wikimedia.org']

Of which those FAILED:

['cloudcephosd1002.wikimedia.org']

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1002.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911141607_jeh_94737.log.

Completed auto-reimage of hosts:

['cloudcephosd1002.wikimedia.org']

Of which those FAILED:

['cloudcephosd1002.wikimedia.org']

Change 551226 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] install_server: update cloudcephosd partman config

https://gerrit.wikimedia.org/r/551226

gerritbot added a project: Patch-For-Review. Nov 15 2019, 4:54 PM

Change 551226 merged by Jhedden:
[operations/puppet@production] install_server: update cloudcephosd partman config

https://gerrit.wikimedia.org/r/551226

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1002.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911151709_jeh_92596.log.

Maintenance_bot removed a project: Patch-For-Review. Nov 15 2019, 5:10 PM

I'm having an issue on cloudcephosd1002 and 1003.

Using the first 2 240GB drives I created a RAID0 virtual disk, and specified that as the boot device in the PERC H730P raid array. The OS installation process works as expected, but after a reboot the server is unable to boot from the virtual disk.

After a lot of digging, I found that this process works but it doesn't persist after a reboot:

on the raid card set a non-raid physical drive as the boot device
let the next boot fail, press F2 and return to controller management
on the raid card set the virtual disk as the boot device
the server boots as expected.

Once the server is cold booted, even though the virtual drive is still set as the boot device the host fails to boot.

@Cmjohnson @Jclark-ctr any ideas?

Looks like it's an issue with the virtual disk not getting assigned /dev/sda. Checking to see if I can work around this with our installation process and partman, but I may need to switch the controller to HBA mode and use software RAID for the operating system.
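One way to see which kernel name each disk received is to read the block device entries under sysfs. The sketch below is a generic diagnostic, not the method used in this task: it lists each device with its reported model and size. It assumes a Linux host, where `/sys/block/<dev>/size` is expressed in 512-byte sectors; the function name is hypothetical.

```python
# Sketch: list block devices with model and size, to check which device
# (for example the RAID virtual disk) ended up as sda. Linux sysfs only.
from pathlib import Path


def list_block_devices(sys_block=Path("/sys/block")):
    """Return [(name, model, size_gb)] for each entry under sys_block."""
    if not sys_block.is_dir():
        return []
    devices = []
    for entry in sorted(sys_block.iterdir()):
        size_file = entry / "size"
        if not size_file.is_file():
            continue
        # sysfs reports size in 512-byte sectors regardless of block size.
        sectors = int(size_file.read_text().strip())
        model_file = entry / "device" / "model"
        model = model_file.read_text().strip() if model_file.is_file() else "unknown"
        devices.append((entry.name, model, sectors * 512 / 1e9))
    return devices


if __name__ == "__main__":
    for name, model, size_gb in list_block_devices():
        print(f"/dev/{name}\t{model}\t{size_gb:.1f} GB")
```

Running it from the installer environment and again from the installed system would show whether the device ordering differs between the two.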

Change 551256 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] install_server: update cloudcephosd root disk

https://gerrit.wikimedia.org/r/551256

gerritbot added a project: Patch-For-Review. Nov 15 2019, 8:20 PM

Change 551256 merged by Jhedden:
[operations/puppet@production] install_server: update cloudcephosd root disk

https://gerrit.wikimedia.org/r/551256

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1002.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911152041_jeh_132144.log.

Change 551267 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] install_server: cloudcephosd update grub bootdev

https://gerrit.wikimedia.org/r/551267

Change 551267 merged by Jhedden:
[operations/puppet@production] install_server: cloudcephosd update grub bootdev

https://gerrit.wikimedia.org/r/551267

Maintenance_bot removed a project: Patch-For-Review. Nov 15 2019, 9:10 PM

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1002.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911152117_jeh_138328.log.

Completed auto-reimage of hosts:

['cloudcephosd1002.wikimedia.org']

and were ALL successful.

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1003.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911152139_jeh_143586.log.

Completed auto-reimage of hosts:

['cloudcephosd1003.wikimedia.org']

and were ALL successful.

Script wmf-auto-reimage was launched by jeh on cumin1001.eqiad.wmnet for hosts:

['cloudcephosd1001.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201911152158_jeh_146929.log.

Completed auto-reimage of hosts:

['cloudcephosd1001.wikimedia.org']

Of which those FAILED:

['cloudcephosd1001.wikimedia.org']

• JHedden updated the task description. Nov 15 2019, 10:55 PM

I've updated the task details with the current status. Should I leave the Netbox status as staged or set it to active? These systems will be testing non-production workloads for the near future, and active seems to imply production status.

• JHedden closed this task as Resolved. Dec 6 2019, 10:40 PM

• JHedden mentioned this in T240965: Enable private network interface on Ceph OSD and MON hosts. Dec 17 2019, 4:35 PM

bd808 added a parent task: T194334: [Epic] Modern Cloud VPS storage layer. Oct 5 2020, 12:17 AM

rack/setup/install (3) new osd ceph nodes
Closed, Resolved, Public

Related Objects

Status	Assigned	Task
		Unknown Object (Task)
Resolved	Andrew	T216218 Cloud VPS outage on cloudvirt1024 and cloudvirt1018 due to storage failure
Open	None	T220020 Action items and work for retro 20190403
Resolved	• JHedden	T207590 Research CephFS as a replacement for NFS
Resolved	Andrew	T194334 [Epic] Modern Cloud VPS storage layer
Resolved	• JHedden	T225320 Ceph Proof of Concept Build and Testing
Resolved	• JHedden	T90364 Test Ceph for instance storage
		Unknown Object (Task)
Resolved	• JHedden	T224188 rack/setup/install (3) new osd ceph nodes
Resolved	• JHedden	T236290 Deploy a Ceph testing environment using Rook.io on VMs
Resolved	• JHedden	T236819 Identify container images and packages required for rook.io and ceph
Resolved	• JHedden	T239918 Deploy Ceph Nautilus on Buster
Resolved	• JHedden	T239917 Import Buster packages for Ceph Nautilus
Resolved	• JHedden	T240021 Reimage cloudceph mon and osd hosts
Resolved	• JHedden	T240715 Configure prometheus monitoring for Ceph
Resolved	• JHedden	T240718 Perform failover tests on Ceph storage cluster
Resolved	• JHedden	T240722 Fix Icingia disk space check on cloudcephosd100[1-3] servers
Resolved	• JHedden	T240965 Enable private network interface on Ceph OSD and MON hosts
Resolved	• JHedden	T243327 Test virtual machine migrations using Ceph based storage
Resolved	• JHedden	T244868 Allow nova instance extra_specs to be updated on existing virtual machines

	RobH
	May 22 2019, 8:59 PM
