
rack/setup/install (3) new osd ceph nodes
Open, Needs Triage, Public

Description

This task will track the naming, racking, and setup of three new Ceph/OSD proof of concept nodes. This task has been filed WELL in advance of the hardware arrival, as there are likely network and racking considerations for discussion before racking location is determined.

Racking Location: All hosts in Row B, one host per rack to maintain some redundancy. Each rack should have one of these servers and one of the monitor servers from T228102.

Hostnames: cloudcephosd1001.wikimedia.org, cloudcephosd1002.wikimedia.org, cloudcephosd1003.wikimedia.org for the public interfaces

Networking note: Each host should have one 10G Ethernet connection to the public subnet (wikimedia.org) and one to the private, internal subnet (eqiad.wmnet).
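
As a minimal sketch of how that split would eventually be expressed to Ceph (the subnet values below are placeholders for illustration, not the actual eqiad prefixes):

```python
# Hedged sketch: render the [global] network options Ceph uses to split client
# traffic from replication traffic. The subnets below are placeholders
# (documentation-style values), not the real eqiad prefixes.

PUBLIC_NET = "198.51.100.0/26"   # stand-in for the public (wikimedia.org) subnet
CLUSTER_NET = "10.0.20.0/24"     # stand-in for the private (eqiad.wmnet) subnet

ceph_conf_fragment = (
    "[global]\n"
    f"    public network  = {PUBLIC_NET}\n"
    f"    cluster network = {CLUSTER_NET}\n"
)

print(ceph_conf_fragment)
```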

System #1 cloudcephosd1001:

  • - receive in system on procurement task T221698
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan) (ensure from comments where networking/vlan needs to be and if it needs more than one interface connected)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

System #2 cloudcephosd1002:

  • - receive in system on procurement task T221698
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan) (ensure from comments where networking/vlan needs to be and if it needs more than one interface connected)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

System #3 cloudcephosd1003:

  • - receive in system on procurement task T221698
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan) (ensure from comments where networking/vlan needs to be and if it needs more than one interface connected)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

Event Timeline

RobH created this task. May 22 2019, 8:59 PM
RobH mentioned this in Unknown Object (Task).
RobH assigned this task to Andrew. Edited May 22 2019, 9:08 PM
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.

@Andrew or @Bstorm: Since you both were commenting on the hardware specification task, I'm assuming you would also be the ones to ask about the networking requirements/vlans for these systems as well as the redundancy requirements?

  • What will be the hostnames for these systems? Update https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions if it's new.
  • How many interfaces need to be network connected and to what vlans?
    • Ideally we can locate these outside of row B, since row B 10G is VERY crowded due to cloudvirts and other row B restricted hosts.
  • Can any of these systems share a rack or are they horizontally redundant between the three and need different racks?
  • I vote to add cloudosd1xxx to the naming conventions unless my team rebels against that. The related monitor nodes would end up as cloudmon1xxx. They could possibly be multi-purpose, but they'll be the primary monitors for Ceph. Since this is the PoC, we can always revisit that in the future.
  • I'll get back to you soon on the network placement and so forth.
  • They will be horizontally redundant between the three and should have different racks.

I can say, after a bit more research and experimentation, that it would be good to be able to split between a data network that connects the cluster with the cloudvirts and friends, and a "private" management network that routes traffic for things like rebuilds of the storage devices, to prevent such an event from filling the network to the cloudvirts (entirely possible on 10G). The three monitor hosts will need to be connected the same way, however we do it. I'll keep doing more homework on the network, but for now, that's what I've got (however possible or not that is). @Andrew, out of curiosity: to get around the row B restriction, this basically looks like putting the Ceph "public" network on the actual public network, right? Then the backend traffic could be wired to the private network, since that is all intra-cluster. (Ceph calls it the "private" network.)

Bstorm added a comment. Jun 7 2019, 6:24 PM

So for the network: if possible, can we do one port on the public LAN and one port on the private? @RobH
Everything else in my first comment should be right.

Bstorm added a comment. Jun 7 2019, 6:26 PM

Note: the hosts aren't expected to work as routers. They should just have management traffic separated so we can properly run them through their paces. We can probably make things work without that if we have to.

RobH reassigned this task from Andrew to ayounsi. Jun 7 2019, 6:33 PM
RobH added subscribers: ayounsi, MoritzMuehlenhoff.

Ok, I've synced up with @Bstorm via IRC, and we have the following questions to be addressed by our network admin(s) to ensure we aren't breaking any rules:

  • These hosts need to have a public interface for users to hit, so the primary interface needs to be in the public1-<row letter>-eqiad vlan/subnet.
  • The hosts also need to push private CEPH traffic between nodes, and will need their secondary interfaces connected to the private1-<row letter>-eqiad vlan/subnet.
    • The above would allow these 3 hosts to be distributed into rows A/C/D and allow for greatest horizontal redundancy.

I'm not sure if we have any other hosts bridging the private1/public1 subnets within a single host, so I'd like to check with @ayounsi or @MoritzMuehlenhoff to ensure this isn't a concern.

If the above isn't a concern, then the racking plan will be:

  • one host in rows A/C/D
  • primary 10G on public1 subnet/vlan
  • secondary 10G on private1 subnet/vlan

There is no security or technical limitation on having 1 interface in each vlan as long as they don't go across rows. Hosts in one vlan can already reach the hosts in the other.
So the question is more about using extra 10G switch ports, increasing configuration complexity, and load (for the cluster in general).

I went through http://docs.ceph.com/docs/mimic/rados/configuration/network-config-ref/
The two reasons stated for that separation are:

  • Performance (replication traffic impacting client traffic)
  • Security (client DoSing the one interface, disrupting health-check like traffic)

Some questions about the above and overall design/flows:
@Bstorm
Is there more doc on the network aspect of CEPH I should read?
How much traffic are you expecting on both networks? (average/peak) As well as cross row?
Is there a risk of saturating the 40G links between racks/rows?
How much performance difference is expected from using 2 interfaces instead of 1?
Who will be the clients? Cloud only or the public internet as well?
Is there a risk/concern that the clients DoS the cluster?
Can user or replication traffic be rate-limited?

Cmjohnson updated the task description. Jun 19 2019, 4:56 PM
Cmjohnson added a subscriber: Cmjohnson.

@Bstorm @ayounsi I will need very clear instructions on which racks/rows these servers can go in before I physically rack and cable. Once that is figured out please update the task.

@ayounsi Ceph docs are vague at best or tend to ask you to read dissertations eventually. Overall, everything comes back to "test it in your cluster and see". Ceph is capable of saturating 10G links under heavy load (and the private link would be able to saturate during node failures for rebuilds). A 40G link would be harder to saturate, but it is theoretically possible. This is a PoC, so my intent is to break it every which way and put it under test loads. We would certainly want to keep an eye on those links during tests (are you able to point me to where I could do that?).

" How much traffic are you expecting on both networks? (average/peak) As well as cross row?" - basically the answer is "What a good question!" We should learn more from this test. This is meant to help define the design of the build-out of a fully operational cluster, so monitoring what it does is important. This will be a fairly low-latency cluster if we are able to design it well, but it will be limited by its 10G link on the public side. The private side traffic will be dictated by traffic on the public and and traffic for rebuilds and reshuffles of data. I expect that side to be smaller but significant, except all that traffic will be between ceph nodes, not with clients.

I can say that there is a very significant performance difference when using 1 interface instead of 2, because Ceph generates roughly as much chatter among its own nodes as it does with clients (unlike NFS). So both links will be as chatty as the clients make them.

The clients will be cloud only, but the public IP is being used to avoid firewall/vlan issues for that connection. It will be providing block devices to the cloudvirts, VMs, and likely an NFS replacement service as well. This cluster itself is merely a test/PoC that will remain in production after build-out. It will be closer to 10 nodes in its final form (we think...based on the testing that will happen after we build this test cluster). That is important to consider for distributing load in the build-out phase, and perhaps it will help inform your thoughts now as well. That's a lot of 10G links, and it is also mostly aimed at cross-row traffic because it will be talking to both cloud and ceph nodes.

DoS: I am concerned about DoS from the public side because we are doing the main traffic on that end, but Ceph is authenticated and will be firewalled, so hopefully that mitigates it entirely. For the most part, if we trust ferm/iptables, we should be safe there except from Cloud clients. We aren't too worried Cloud clients will do that.

Rate-limiting traffic is likely to collapse the cluster. We can experiment with it mostly on the public side during the PoC. On the private side, it is more likely to cause problems. This is meant to operate like a SAN in some ways for the clients, though, so it would cause problems if rate-limited. We'll be testing whatever knobs we have to keep some quality of service available.

I will add that plenty of people build new networks just for Ceph (partly to get jumbo frames). Our loads are somewhat unpredictable, but I'm hoping to huddle with you a bit during the PoC so we can talk about how our tests and loads affect things.

Does that help? I didn't answer in a very orderly fashion, but I tried to cover everything I know so far.

Ceph is capable of saturating 10G links under heavy load
[...]
Rate-limiting traffic is likely to collapse the cluster.
[...]
I will add that plenty of people build new networks just for Ceph (partly to get jumbo frames).

So, what's problematic here is that both our switch leaf-spine uplinks, as well as our spine-router uplinks (used for cross-row traffic, say from row A to row B) are 40G. With multiple nodes with 10G links, and given the existing link utilization (peaking at 8G on 5m avg), this means that about ~3 of them chattering at full speed (either with each other for replication, and/or with their -now 10G- clients) would be enough to start causing datacenter-wide congestion issues for all services, and thus outage-type scenarios (think LVS-to-varnish, appserver-to-database, etc. etc.)
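
As a rough back-of-envelope sketch of that concern (using the numbers in this comment as assumptions for illustration, not measurements: 40G uplinks, ~8G existing peak, 10G per Ceph host):

```python
# Rough sketch of the uplink-congestion concern described above.
# Assumptions (from this thread, not measurements): 40 Gbit/s row uplinks,
# ~8 Gbit/s existing peak utilization, and Ceph hosts able to push ~10 Gbit/s each.

UPLINK_GBPS = 40
EXISTING_PEAK_GBPS = 8
PER_HOST_GBPS = 10

headroom = UPLINK_GBPS - EXISTING_PEAK_GBPS
hosts_to_saturate = headroom / PER_HOST_GBPS

print(f"Headroom on a 40G uplink: {headroom} Gbit/s")
print(f"Ceph hosts at full 10G needed to exhaust it: ~{hosts_to_saturate:.1f}")
# => roughly 3 hosts chattering at line rate across rows would congest the uplink.
```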

So... that would obviously be a major concern :) It sounds like the intention is to load test with 3-6 10G ports across multiple racks/rows, which... we cannot accommodate with our existing network setup without causing widespread issues :(

For both the PoC and the production setup, we need either guarantees for a (relatively small) amount of traffic this will carry (possibly with rate limits? 1G ports? or limiting to specific racks or rows?), or to rethink the network setup for this. Example options would be to isolate this into an entirely new network, and/or redo our data center design -- both of which would be major and costly endeavours.

(Also, note that no resourcing asks for networking for FY19-20 made the cuts, and a network upgrade from, say, 40G to 100G, requires new router linecards and new spines, and would have a cost in the ballpark of $500-600k and be a 3-4 quarter project, and that's a conservative estimate…)

I don't have bright ideas or easy ways out at this point, and honestly I haven't really thought about all this much. But, this clearly surfaces as an important risk at this point, as it raises "can impact site stability" concerns. These apply for sure to the proposed production setup, but even for the PoC itself given the intention is to put it under (heavy?) test loads. So, apologies for the extra scrutiny and potential setbacks, but I think it's warranted :/

100% agree with you @faidon, and I appreciate the reply. I'm aiming to avoid any sugar-coating in my assessments of risks until I have more data (especially with 40G uplinks that are widely shared), partly to open conversations and make sure we design carefully. After a bit of time to think about this, I have some more thoughts.

I should soften some of my statements a little with the idea that any test that gets close to filling a 10G link would be stopped and noted. The goal is to identify where that could happen and resolve the risk (or plan rollout/scaling accordingly). The network design is certainly a limitation on how much we can scale ceph and how we can place nodes (and I'm learning more about our network design now, which is great!). The general fix for such saturation issues is to add more nodes, but those nodes need to be spread in such a way that they don't endanger uplinks. This is also why fewer, larger nodes can be a problem for the backend/private network because a very large OSD node taking an outage will cause a LOT of backend chatter to rebuild, on a scale that is entirely determined by the size of the node's storage and the speed of the network it has available. Our hope is to understand any conditions that can reach that level and work against them in our setup.

I also think that depending on how we limit the rate of traffic, it could be made workable so that it will not effectively "collapse" the cluster (much like using a different-size link--1G links are not advised for all-SSD clusters, obviously, when used at high speed). By collapse, I mean made unstable in a really bad way, where things become unusable to clients and the chatter on the backend gets far worse. If we rate-limit in the right way (and it calls for experimentation) we should be able to simply extend rebuild times on the back-end network (not just the public side), but the effects on the stability of the cluster could well be unexpected--and we won't know for certain until we try it with a sensible load, because all the blasted docs say "it depends".
I was initially thinking of rate-limiting mostly in terms of how we do it on the NFS hosts (via the servers), and that's really not likely to work at first glance.
It would be really great to discuss what options are available for rate limiting both sides and to test them. If none are available on the network end, I can go digging on the host side (but I'm less positive about that side so far).

This has me thinking a lot about scaling strategies and so forth as well, but I'll save that for another discussion--ditto for the fact that our network interface saturation monitoring is still diamond-based :-p

Once we have a running cluster to beat on and can actually see what simulations of our loads act like (accepting that our technical contributors are rather creative in their ability to make new loads), we may find that the ceph project is more limited than we'd prefer. We might also find out that it's pretty easy to rate limit and monitor it for safety (which would be really nice). We might even find out that without jumbo frames, our performance is really weak and it should be used only for certain limited purposes anyway.

So, if you ask me "is there a risk", I'll definitely say, "yes" because I cannot prove otherwise yet, and people out there in the wild have seen these things happen.
Do I believe we will definitely do that much damage? On the PoC, I would be very unpleasantly surprised if we did, and we will plan to avoid it carefully (including making monitoring the interfaces a hard requirement, following this conversation). On full rollout: I hope to have a better idea of the risk level after the PoC and also hope to have strategies in place to address more possible failure modes including risks to production systems.

@ayounsi Is there a way my team could monitor any potential impact on uplinks during tests (besides the obvious math and drawing pictures of the placement within the network)? If not, I'll want to draw myself a network map of them and follow the way the links are interacting.

Note: there are rate limits that can be set within OpenStack for this as well...but in some versions they don't work right at all (they get ignored in some cases, https://bugzilla.redhat.com/show_bug.cgi?id=1476830). These are also things we want to be testing. That won't help with back-end traffic, etc., either. It's just a note.

our network interface saturation monitoring is still diamond-based

What does that mean?

Is there a way my team could monitor any potential impact on uplinks during tests

The easiest way is to look at LibreNMS, on both cr1 and cr2; ae1 goes to row A, ae2 to row B, ae3 to row C, ae4 to row D.
Note that LibreNMS has 5-minute granularity. That means if a sudden spike of traffic appears, it will not get noticed right away.
We also have alerting for when a link reaches 80% utilization, with the same 5-minute caveat.
A "real time" view exists (e.g. https://librenms.wikimedia.org/device/device=2/tab=port/port=139/view=realtime/ ) but it needs to be used carefully so as not to overwhelm the router's SNMP daemon.

Jumbo frames are enabled everywhere on the switch side, so make sure the proper MTU is set on the host side if you want to use it as a "natural" rate limiter.

Rate limiting on the network side is usually not advised as it has a bad performance hit. It's better to send the packets more slowly than to create an artificial bottleneck that the sending host has to detect and work around (TCP scaling, etc.).

Only using the public interface could be a way to "naturally" rate limit the cluster (10G total, instead of a theoretical max of 20G per host).

The other option I see would be to keep all the nodes in row B, keeping the impact radius of the cluster misbehaving to that one row. This also removes the cross-row client traffic.

our network interface saturation monitoring is still diamond-based

What does that mean?

Diamond is our old tool to collect local system metrics. It has mostly been superseded by Prometheus, but there are still some remaining use cases in WMCS which haven't been migrated yet:
https://phabricator.wikimedia.org/T210993
https://phabricator.wikimedia.org/T210850
https://phabricator.wikimedia.org/T210991

Worth noting that even though we will be using 10G links, we don't expect them to be fully used in any case in the short term.

If we put all the Ceph servers in row B (for the initial PoC) we could avoid saturating any upstream link or device; only the top-of-rack switches (asw2-b-eqiad) would be involved.
I know 10G in row B is limited right now, but I don't see any other option.
Rate limiting should be possible both at the client and server levels (iptables should do the trick, or alternatively a tc setup similar to what we have right now with NFS). It's also possible at the network hardware level (i.e., switches), I think. Also, what @ayounsi mentioned could be a good starting point: using only 1x10G in each server instead of 2x10G.
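
As one hedged illustration of what a host-side limiter could look like (a sketch only; the interface name and rate ceiling are placeholder assumptions, not an agreed configuration):

```python
# Hedged sketch: one possible host-side rate limit using a tc token-bucket
# filter on the storage interface. Interface name and rate are placeholders.
import subprocess

IFACE = "ens3f0"          # placeholder interface name
RATE = "4gbit"            # placeholder ceiling for experimentation

cmd = ["tc", "qdisc", "replace", "dev", IFACE, "root",
       "tbf", "rate", RATE, "burst", "32mbit", "latency", "50ms"]
print("would run:", " ".join(cmd))
# subprocess.run(cmd, check=True)   # uncomment to actually apply the qdisc
```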

So here is my proposal:

  • rack each of the 6 new servers (3x mons, 3x OSDs) in different row B racks, using 1x10G links in each server
  • implement any simple rate limiting in case we want to be extra sure that we don't fully use 10G in every server

I'm just trying to find a compromise between our goals and what the network can handle.

That sounds reasonable for the PoC, depending on rack space. @faidon for the last word.

Note that we don't have visibility in the cross virtual chassis links. Adding it to LibreNMS is possible but would require dev time.

That sounds reasonable for the PoC, depending on rack space. @faidon for the last word.
Note that we don't have visibility in the cross virtual chassis links. Adding it to LibreNMS is possible but would require dev time.

Fair. The virtual chassis links: are they DAC cables or fiber? In any case, how much throughput do they have?

Bstorm added a subscriber: JHedden. Edited Jun 24 2019, 2:20 PM

Note that LibreNMS has 5-minute granularity. That means if a sudden spike of traffic appears, it will not get noticed right away.
We also have alerting for when a link reaches 80% utilization, with the same 5-minute caveat.
A "real time" view exists (e.g. https://librenms.wikimedia.org/device/device=2/tab=port/port=139/view=realtime/ ) but it needs to be used carefully so as not to overwhelm the router's SNMP daemon.

Cool, thanks.

Jumbo frames are enabled everywhere on the switch side, so make sure the proper MTU is set on the host side if you want to use it as a "natural" rate limiter.

Oooh! Also, cool. That's one of the strongest recommendations from the ceph community, and also from our new tech with experience in the area, @JHedden :)

Rate limiting on the network side is usually not advised as it has a bad performance hit. It's better to send the packets more slowly than to create an artificial bottleneck that the sending host has to detect and work around (TCP scaling, etc.).

I'll have to investigate some of the options outside of tc. There may be interesting blockers on the OpenStack side and similar things we'll find...
Planning (with watermarks for Ceph network capacity) is going to be essential.

Only using the public interface could be a way to "naturally" rate limit the cluster (10G total, instead of a theoretical max of 20G per host).

True, but the effect could be large. The backend will respond with the public interfaces, but it also generates a lot of its own traffic. The biggest problem I see there is a build-out phase requiring more hosts than it would otherwise, which goes back to the tradeoff between fewer, bigger nodes and many smaller ones.

The other option I see would be to keep all the nodes in row B, keeping the impact radius of the cluster misbehaving to that one row. This also removes the cross-row client traffic.

This will likely only work during the PoC phase (because we are going to eat all the ports) and is obviously extending our HA problems down the road. It might be safer for the PoC in case we want to actually try to fill a link and see how hard that is to do (likely very!). I do concur with @aborrero that we are extremely unlikely to exceed the capacity of the links, but the theoretical possibility exists.

I do not like the idea of using only a single 10Gb link on each host if we can possibly avoid it. We would lose visibility into the behavior during our PoC (which eliminates some of our ability to answer these questions in the future), and it expands the ability to inadvertently DoS the cluster (which I do imagine is possible for our users if everything is on one link). We would have more limited information, and it is not best practice. Initial design is the most likely reason the cluster will later have problems, and the PoC should attempt to simulate the rollout where possible.

Had a huddle with @JHedden, actually. He'll add his thoughts soon (with some info from our existing monitoring).

There's a lot of good information in this task. I'm still catching up, but I wanted to note that it's important to consider the replication factor when designing the network architecture for Ceph. By default Ceph uses synchronous replicated pools, which ensures that data is physically copied to multiple OSDs before sending the acknowledgment to the client. This leads to another benefit of segmenting the public and cluster network traffic. For every single write request on the public network, there are 2 replicated writes on the cluster network.

Using average SATA SSD 500MB/s read and 300MB/s write speeds, the theoretical maximum bandwidth available per storage host is 4,000MB/s read and 2,400MB/s write.

If we could achieve these theoretical numbers the maximum network bandwidth per IO type would look like:
Per storage host:

type  | theoretical max | public network | cluster network
read  | 4,000 MB/s      | 32 Gb/s        | 0
write | 2,400 MB/s      | 19 Gb/s        | 38 Gb/s

(Max value calculated from drive speed * 8 OSDs)

Aggregated storage cluster bandwidth:

type  | theoretical max | public network | cluster network
read  | 12,000 MB/s     | 96 Gb/s        | 0
write | 7,200 MB/s      | 58 Gb/s        | 116 Gb/s

(Max value calculated from drive speed * 8 OSDs * 3 nodes)
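
A small sketch reproducing the arithmetic behind the tables above (assumed inputs: 500/300 MB/s per drive, 8 OSDs per host, 3 hosts, and 2 extra replica writes on the cluster network per client write):

```python
# Reproduces the theoretical numbers above: per-drive SATA SSD speeds,
# 8 OSDs per host, 3 hosts, replication factor 3 (so 2 replica writes on the
# cluster network for every client write on the public network).

READ_MBPS_PER_DRIVE = 500     # MB/s
WRITE_MBPS_PER_DRIVE = 300    # MB/s
OSDS_PER_HOST = 8
HOSTS = 3
EXTRA_REPLICAS = 2            # writes duplicated onto the cluster network

def to_gbits(mb_per_s: float) -> float:
    return mb_per_s * 8 / 1000

per_host_read = READ_MBPS_PER_DRIVE * OSDS_PER_HOST      # 4000 MB/s
per_host_write = WRITE_MBPS_PER_DRIVE * OSDS_PER_HOST    # 2400 MB/s

print(f"per host read : {per_host_read} MB/s ~ {to_gbits(per_host_read):.0f} Gb/s public")
print(f"per host write: {per_host_write} MB/s ~ {to_gbits(per_host_write):.0f} Gb/s public, "
      f"{to_gbits(per_host_write * EXTRA_REPLICAS):.0f} Gb/s cluster")

print(f"cluster  read : {per_host_read * HOSTS} MB/s ~ {to_gbits(per_host_read * HOSTS):.0f} Gb/s public")
print(f"cluster  write: {per_host_write * HOSTS} MB/s ~ {to_gbits(per_host_write * HOSTS):.0f} Gb/s public, "
      f"{to_gbits(per_host_write * HOSTS * EXTRA_REPLICAS):.0f} Gb/s cluster")
# Note: minor rounding differences vs. the hand-computed table above.
```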

While theoretical is fun, real world is better. To get an idea of what this would look like today, here are some graphite metrics from the last 24 hours on the OpenStack hypervisors.

Total aggregated IO across all hypervisors

type  | peak value | est public network | est cluster network
read  | 25 MB/s    | 200 Mb/s           | 0 Mb/s
write | 9 MB/s     | 72 Mb/s            | 144 Mb/s
total | 34 MB/s    | 272 Mb/s           | 144 Mb/s

95th percentile aggregated IO across all hypervisors (we have a few noisy VMs)

type  | peak value | est public network | est cluster network
read  | 10 MB/s    | 80 Mb/s            | 0 Mb/s
write | 3.5 MB/s   | 28 Mb/s            | 56 Mb/s
total | 13.5 MB/s  | 108 Mb/s           | 56 Mb/s

Graphite queries used to collect the data:

  • sumSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.write_byte_per_second,1))
  • sumSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.read_byte_per_second,1))
  • percentileOfSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.write_byte_per_second,1),95)
  • percentileOfSeries(scaleToSeconds(servers.cloudvirt1*.iostat.*.read_byte_per_second,1),95)
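
And a sketch of how the estimated network columns in the tables above follow from the measured IO (reads touch only the public network; each write crosses the public network once and the cluster network twice):

```python
# Sketch of how the "est public network" / "est cluster network" columns above
# are derived from the measured hypervisor IO: reads hit only the public
# network, writes hit the public network once and the cluster network twice
# (replication factor 3).

EXTRA_REPLICAS = 2

def estimate_network(read_mb_s: float, write_mb_s: float):
    public_mbit = (read_mb_s + write_mb_s) * 8
    cluster_mbit = write_mb_s * EXTRA_REPLICAS * 8
    return public_mbit, cluster_mbit

# Peak aggregated IO across all hypervisors (from the graphite queries above)
pub, clu = estimate_network(read_mb_s=25, write_mb_s=9)
print(f"peak: ~{pub:.0f} Mb/s public, ~{clu:.0f} Mb/s cluster")   # ~272 / ~144

# 95th percentile aggregated IO
pub, clu = estimate_network(read_mb_s=10, write_mb_s=3.5)
print(f"p95 : ~{pub:.0f} Mb/s public, ~{clu:.0f} Mb/s cluster")   # ~108 / ~56
```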
Bstorm added a comment. Edited Jun 24 2019, 7:41 PM

I should point out that the PoC will not be capable of doing anywhere near that much IO. That would be what it would look like if we managed to convert the entire cluster to Ceph with a full build-out. We would not handle the full build-out with only three OSD hosts, because of those numbers above. However, that's the throughput the clients would be jostling for on Row B in a fully built-out condition on the frontend network. The backend and frontend networks need to be split out so that things are not going through too few pipes.

@JHedden does make the very good point that backend writes are 3x the volume of frontend writes, which is why I think it is a terrible idea to use only one 10G port for each server (that goes for the mon servers as well). If you use a single port, you are quadrupling any actual IO.

Bstorm added a comment. Edited Jun 24 2019, 7:52 PM

Ok, that said, I did write that misreading Mbps for Gbps...but what I said is still true! The PoC won't be anywhere near all that, and our full build-out is a trickle compared to theoretical limits--and we might even be able to converge the two networks, but I still dislike the option.

So, figuring from that data that while it may not be impossible to fill the link, it's extremely unlikely that we will (and we would still love to use jumbo frames), can we put this on other rows?

The above question is aimed at @ayounsi and @faidon.

Per what was decided by WMCS in T228102, the hostname proposal is now cloudcephosd100* for the three. Updating the description with that much at least.

Bstorm updated the task description. Jul 16 2019, 7:48 PM

I had a conversation with @faidon today, and I think the best way to move forward with this particular task is to ask if there is rack space and 10G ports available in Row B, not just for these three but also for the three systems in T228102 (assuming they are each cabled with one 10G port on the public network and one 10G port on the internal network). That way the PoC can determine precisely what the network needs will be, so we know how best to proceed with the future full build.

@RobH would you have that information (whether there's enough room now for the three in this task AND the three in the other task)? If so, we might be able to move forward.

Discussed with @RobH on IRC. This is doable as long as it can wait behind some 10G decommissions, which seems fine to me.
Updating the description to capture everything as much as possible.

Bstorm reassigned this task from ayounsi to RobH. Jul 25 2019, 6:37 PM
Bstorm updated the task description.

Note that there are 38 servers using SFP-Ts, which means using 1G ports on a 10G switch.

asw2-b-eqiad> show chassis hardware | match SFP-T | count 
Count: 38 lines

Ideally those should be the first ones to move out.

Cmjohnson reassigned this task from RobH to Jclark-ctr. Aug 14 2019, 3:05 PM
Cmjohnson added a subscriber: Jclark-ctr.

@Jclark-ctr can you add asset tags and enter these servers into Netbox (T221698 is the procurement task)? Leave them on the floor and the rack information blank in Netbox until we know for sure where they're going. Once done, please reassign back to Rob.

RobH added a comment. Aug 14 2019, 3:07 PM

Please do not assign this to me; it is awaiting installation by DC ops into 10G racks and is not on me.

This should be processed by the on-site engineers in eqiad and racked as soon as 10G ports become available for them.

Change 530246 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for cloudceph10[1-3]

https://gerrit.wikimedia.org/r/530246

Change 530246 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for cloudceph10[1-3]

https://gerrit.wikimedia.org/r/530246

cloudcephosd1001 10.65.2.177
cloudcephosd1002 10.65.2.178
cloudcephosd1003 10.65.2.179

Cmjohnson updated the task description. Aug 15 2019, 12:51 AM

Added asset tags and updated Netbox.

marilerr closed this task as Declined. Sat, Aug 24, 3:20 AM
JJMC89 reopened this task as Open. Sat, Aug 24, 3:21 AM

@Jclark-ctr please rack one each in B2/B4/B7 and update Netbox.

host             | rack | unit
cloudcephosd1001 | b7   | 27
cloudcephosd1002 | b4   | 12
cloudcephosd1003 | b2   | 12

host             | rack | unit | port
cloudcephosd1001 | b7   | 27   | 39/25
cloudcephosd1002 | b4   | 12   | 43/42
cloudcephosd1003 | b2   | 12   | 35/13

Jclark-ctr updated the task description.