We want to understand and observe, under controlled conditions, the impact that a network switch going down has on cloud. The results will give us a better idea of how to proceed with {T414835} and show how far we've come in addressing T375204: [cloudceph] Improve downtime when a switch goes down.
The main driving force behind these tests is ceph failure scenarios and resiliency, though considering cloud as a whole is worthwhile. There is of course a spectrum of possibilities for the tests: from simply rebooting the switch and observing the effects, to shutting down progressively more ports, to maybe something else I'm forgetting now (?)
I have reviewed the rack allocation (P88809) and I think a good candidate to start with is C8: there are no cloudvirts and relatively few ceph TB compared to the rest (150), so in theory the impact should be zero/minimal.
Questions I have in mind:
- To what extent does shutting the individual ports differ from the switch rebooting, in terms of what other hosts on the network experience? What I'm getting at here is whether we can realistically and progressively simulate a switch reboot without doing it all at once.
- For non-ceph hosts in C8 (namely control, gw, lb, net, rabbit, services), is automatic failover and/or minimal impact expected on switch reboot?
For the first question I'm cc'ing @ayounsi and @cmooney to help answer, whereas for the second maybe @taavi @Andrew you have ideas/insights?
Specifically for C8, these are the hosts in service, broken down by "failover status":
Manual failover / maint mode
Needs maintenance mode and/or manual failover (e.g. ceph noout)
cloudcephmon1004.eqiad.wmnet
cloudcephosd1016.eqiad.wmnet
cloudcephosd1017.eqiad.wmnet
cloudcephosd1018.eqiad.wmnet
cloudcephosd1021.eqiad.wmnet
cloudcephosd1022.eqiad.wmnet
cloudcephosd1035.eqiad.wmnet
cloudcephosd1042.eqiad.wmnet
cloudcephosd1043.eqiad.wmnet
Automatic failover
Will failover automatically, with some/no user impact
cloudgw1003.eqiad.wmnet
cloudlb1001.eqiad.wmnet
cloudnet1005.eqiad.wmnet
cloudservices1006.eqiad.wmnet
cloudcontrol1011.eqiad.wmnet
cloudrabbit1001.eqiad.wmnet
N/A - no failover / no user impact
No failover required/needed though no immediate user impact either
cloudbackup1003.eqiad.wmnet
Testing plan
We'll be testing a "switch reboot" scenario by progressively shutting interfaces on the C8 switch side and assessing the impact on services.
ceph
Ahead of the work we'll set the noout flag on the ceph cluster (ceph osd set noout) to prevent data rebalancing, then start by shutting the interface of one OSD host and assess impact. If there is no impact we'll continue with more OSDs, then shut the mon too and assess impact again. This is the most important part of the test, as ceph rebalancing has historically been the reason cloud switch reboots are "scary".
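To make "assess impact" a bit more concrete for the ceph step, here is a minimal sketch in Python. It assumes the ceph CLI is available with admin access on the host running it, and that the JSON from ceph status --format json carries a health.checks mapping keyed by health check code. The set of checks treated as "expected" during the test is my assumption, not something agreed in this task, and the helper names are made up for illustration.

```python
import json
import subprocess
from typing import Iterable, List

# Assumption: health checks we expect to see while switch ports are shut and
# noout is set. Anything outside this set is treated as a sign of real impact.
EXPECTED_CHECKS = {"OSDMAP_FLAGS", "OSD_DOWN", "OSD_HOST_DOWN", "PG_DEGRADED", "MON_DOWN"}


def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its stdout (needs cluster admin access)."""
    result = subprocess.run(["ceph", *args], check=True, capture_output=True, text=True)
    return result.stdout


def unexpected_checks(status_json: str, expected: Iterable[str] = EXPECTED_CHECKS) -> List[str]:
    """Return the health check codes in the status output that we did not expect."""
    health = json.loads(status_json).get("health", {})
    return sorted(set(health.get("checks", {})) - set(expected))


def assess() -> List[str]:
    """Query the live cluster and list unexpected health checks (hypothetical helper)."""
    return unexpected_checks(ceph("status", "--format", "json"))
```

The idea would be to call ceph("osd", "set", "noout") before the first port is shut, run assess() after each step, and call ceph("osd", "unset", "noout") once all ports are back up. An empty list from assess() only means no unexpected health check fired, so it complements rather than replaces looking at client-side impact.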
gw/lb/net
These hosts are meant to be stateless by design. We'll shut them one after the other and assess impact.
services/control/rabbit
These hosts are stateful, and at least the rabbit/openstack interaction is known to be less than resilient to failures (T418444). Here too we'll shut one interface after the other and assess impact.
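For the non-ceph hosts, one possible way to quantify impact is to probe a service endpoint at a fixed interval while the ports are shut and then collapse the samples into outage windows. The sketch below assumes a plain TCP connect is a good enough health signal, which may not hold for every service. The endpoint to probe is deliberately left to the caller, and both function names are hypothetical.

```python
import socket
from typing import Iterable, List, Tuple


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse (timestamp, ok) samples into (start, end) windows of failures.

    A window starts at the first failed sample and ends at the next successful
    one, so its length is an upper bound on the real outage. If the samples end
    while still failing, the window ends at the last failed sample.
    """
    windows: List[Tuple[float, float]] = []
    start = None
    last_fail = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last_fail = ts
        elif start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, last_fail))
    return windows
```

Samples would be collected in a loop that records (time.time(), tcp_probe(host, port)) every second or so from a vantage point outside C8. The resulting windows can then be compared against the times each port was shut.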
