
Cloud Ceph outage 2023-02-13
Closed, Resolved · Public · BUG REPORT

Description

On 2023-02-13 the Cloud Ceph cluster (cloudceph*.eqiad.wmnet) suffered two related incidents, described below.

Incident 1

Cause

At around 14:30 UTC @dcaro took down two OSD hosts (cloudcephosd1001/1002) so they could be moved to rack e4. This left some placement groups read-only, causing some VMs to fail writes to disk and many VMs to get stuck and be marked as down.

Resolution

This was fixed about 10 minutes later by allowing the cluster to rebalance (ceph osd unset norebalance + ceph osd unset noout), which started shifting data and re-creating the missing replicas, restoring the placement groups.
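
For reference, the usual flag workflow around this kind of maintenance looks roughly like the sketch below (illustrative only; these are standard Ceph admin commands run from a monitor/admin node, not a transcript of what was run that day):

# illustrative sketch using standard Ceph admin commands
ceph osd set noout          # keep OSDs "in" while their host is down for maintenance
ceph osd set norebalance    # avoid shuffling data while the host is being moved
# ... move/re-rack the hosts ...
ceph osd unset norebalance  # let the cluster rebalance again
ceph osd unset noout        # allow marking OSDs out so missing replicas get re-created
ceph -s                     # watch overall health and placement group recovery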

At the same time, the hosts were moved and were ready to be reimaged, but to do so some extra configuration had to be set on the switch ports.

Incident 2

Cause

The new ports were configured without specifying an MTU and were not yet brought up. That seemed to trigger an issue in the Juniper switch whereby the rest of the ports in the same VLAN would intermittently drop jumbo packets (MTU > 1500); more details here: https://phabricator.wikimedia.org/T329535#8612670.
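
For context, the MTU is normally set explicitly on each switch port; the missing piece was roughly equivalent to the Junos snippet below (illustrative only: the interface name is an assumption for one of the new ports 16-19, and 9212 is the default value the later Homer change introduces):

# illustrative Junos config-mode snippet; interface name is assumed
set interfaces xe-0/0/16 mtu 9212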

A few minutes later, around 16:30 UTC, we had a total outage of the Cloud Ceph cluster (cloudceph*.eqiad.wmnet). The OSD daemons were flagging other OSD hosts as down, and the monitor nodes were forcing them to stop (because health probes with MTU > 1500 were failing).
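
A quick way to see the monitors' view of the cluster during an event like this (generic Ceph commands, shown for illustration rather than as a record of what was run):

ceph -s               # summary: how many OSDs are up/in, PG states, cluster health
ceph osd tree down    # list only the OSDs currently marked down, grouped by host
ceph health detail    # per-check detail, e.g. which PGs are degraded or inactive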

Immediate measures taken

To prevent any data corruption, we shut down all the OpenStack hypervisors (cloudvirt*.eqiad.wmnet), effectively turning off Cloud VPS and its related services (Toolforge, etc.).
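
Conceptually the emergency stop amounts to something like the loop below (a minimal sketch only; the real procedure went through WMCS tooling and runbooks, and the host-list file is hypothetical):

# minimal sketch; cloudvirt_hosts.txt is a hypothetical list of cloudvirt*.eqiad.wmnet
while read -r host; do
    ssh "$host" 'sudo shutdown -h now'
done < cloudvirt_hosts.txt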

Resolution

The fix was to remove the configuration for those new ports (note that they were never up) and to manually start all the OSD daemons in the cluster. That eventually brought the cluster back up and running.
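
On a packaged (non-cephadm) deployment, manually starting the OSD daemons on each host boils down to something like the following (illustrative; assumes the standard ceph-osd systemd units):

# per host: start every OSD daemon via the systemd target
sudo systemctl start ceph-osd.target
# or start a single OSD, e.g. osd.42, if only some daemons were stopped
sudo systemctl start ceph-osd@42
# then verify they rejoined the cluster
ceph osd tree up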

This was followed by powering up all the hypervisors and making sure that the VMs started correctly (see the follow-up tickets for details).

See also the comments below for more details.

Related incidents

Related Objects

(Related follow-up tasks, mostly resolved; task titles are not captured in this export. Assignees include ayounsi, cmooney, dcaro, nskaggs, taavi, aborrero, Lydia_Pintscher, Andrew, Ladsgroup, and Papaul.)

Event Timeline

19:48:06 <andrewbogott> Seems like we were bitten by a bug in the switch software. Ceph is now gradually recovering from the storm but we likely won't restart VMs until it's fully done with rebuilding.  Better downtime than corruption.

Yeah that seems likely. The issues started soon after I added the config for new ports 16-19 on cloudsw1-e4-eqiad. These ports connect cloudcephosd1001 and cloudcephosd1002, which were being moved from other racks. Cross-row moves are slightly out of process so I was working in Netbox manually, and made an error by not setting any MTU on the switch ports when creating them. I then pushed the changes with Homer (diff can be seen here).

Shortly after, at approx 16:26, problems were reported with the Ceph cluster. When we initially checked, most network tests were OK, but eventually we found that some large packets were not making it through. Pings would stop working and then suddenly start again between certain hosts (see example). After some time we got the output below, which pointed squarely at an issue with cloudsw1-e4-eqiad:

cmooney@cloudcephosd1016:~$ ping -4 cloudcephosd1025 -s 8952 -M do
PING cloudcephosd1025.eqiad.wmnet (10.64.148.2) 8952(8980) bytes of data.
From irb-1108.cloudsw1-e4-eqiad.eqiad.wmnet (10.64.147.1) icmp_seq=10 Frag needed and DF set (mtu = 1500)
ping: local error: Message too long, mtu=1500

The first message here is because the switch sent an ICMP packet back to the host saying "fragmentation needed", supposedly because its irb.1108 interface MTU was 1500. But that interface reported the expected MTU of 9178 when we checked (see here). The next line is due to Linux caching the low reported MTU for the remote IP and subsequently being unable to send a large packet to it.
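
The cached value Linux picked up from that ICMP message can be inspected and cleared per destination (standard iproute2 commands, shown here as an illustration):

# show the route for the peer; a stuck low path MTU shows up as "mtu 1500" here
ip route get 10.64.148.2
# drop the cached path-MTU entries so full-size packets are attempted again
sudo ip route flush cache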

Some Juniper documentation (for the MX series, not the QFX switches we have here) says that the IRB interface MTU is automatically set to the lowest of the attached physical interfaces. To an extent that would explain what happened here. But the facts that the newly configured ports (with the low MTU) were hard down, that the issue was not constant (a majority of large pings got through), and that the switch itself reported a high MTU all make it feel more like a bug.
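
For reference, the values can be cross-checked on the switch itself (Junos operational-mode commands; the member port name below is an assumption):

# IRB MTU as reported by the switch (this showed the expected 9178)
show interfaces irb.1108 | match MTU
# MTU of one of the newly added member ports (assumed name)
show interfaces xe-0/0/16 | match MTU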

Eventually, at approx 17:24, I completely removed the config for the 4 new ports. Following this, things seemed to slowly stabilize and the cluster started coming back. In retrospect I should have done that much earlier; I placed too much emphasis on the fact that the ports were hard down, assuming they could not affect things in that state. Had the fault been consistent, or had the IRB interface reported a low MTU, we would have caught it much earlier; the inconsistency made it a tricky one.

I'll follow up tomorrow and log a TAC case with Juniper to try to shed more light on what happened anyway.

In terms of the original work, the required ports have now been re-added with the correct MTU, and all seems stable. I used the server provision script this time, which is good protection against the kind of manual error that kicked this all off.

bd808 changed the subtype of this task from "Task" to "Bug Report". Feb 13 2023, 11:49 PM

Thanks for working hard on this to get it sorted out and sharing the root cause analysis. :)

dcaro lowered the priority of this task from Unbreak Now! to High. Feb 14 2023, 10:06 AM
dcaro updated the task description. (Show Details)
dcaro updated the task description. (Show Details)

Change 824202 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] node_pinger: use jumbo frames

https://gerrit.wikimedia.org/r/824202
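
The idea of the patch is to make the probe itself use jumbo frames, so a path that silently drops large packets gets caught; conceptually the check is along these lines (illustrative, not the actual node_pinger code; the payload size must be the interface MTU minus 28 bytes of IP and ICMP headers):

# illustrative jumbo-frame probe; 8972 assumes a 9000-byte MTU (MTU - 28 header bytes)
ping -c 1 -M do -s 8972 cloudcephosd1025.eqiad.wmnet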

Change 889635 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Default L2 interfaces to MTU 9212 if not set from Netbox

https://gerrit.wikimedia.org/r/889635

Hi. I'm writing Tech News, and I am not sure how to concisely and simply explain the problem that this task covers (per the User-notice tag). Please could you suggest some wording? Thanks! For example:
For problems, we usually write something along the lines of:

Last week, for ~15 minutes, some users were unable to log in or edit pages. This was caused by a problem with session storage. [1]

My best guess for this task is along the lines of:

Last week, services hosted on Cloud VPS and Toolforge were unavailable for X-Y (12-36?) hours. This was caused by a problem with the hardware during an upgrade. All related software was turned off to prevent data corruption. [1]

That one looks nice, maybe a bit more specific like:

Last week, services hosted on Cloud VPS, including Toolforge and PAWS, were unavailable for 2-3 hours. This was caused by unexpected behavior of the network switches that affected the stability of the underlying storage service. All related VMs were turned off to prevent data corruption. An effort has started to prevent and minimize this kind of incident in the future; more details here: <link to this task>.

What do you think?

Generally speaking, there is no reason for that level of detail at the Tech News level. Also, root-causing is difficult enough as it is; pointing out root causes in an announcement isn't always prudent. The most appropriate verbiage I've seen is something along the lines of: "During planned maintenance, unforeseen complications forced the team to <insert action>. The resulting downtime for services A, B and C lasted <start to end>. Actions are already being taken to avoid such incidents in the future; more details at <link to task>."

Yup, the simpler and shorter, the better (for ease of translators, ESL folks, and non-tech folks). I've summarized it as:

  • Last week, during planned maintenance of Cloud Services, unforeseen complications forced the team to turn off all tools for 2–3 hours to prevent data corruption. Work is ongoing to prevent similar problems in the future. [ 1 ]

Let me know (or edit) if any of that is inaccurate. The edition will be frozen for translation in ~3 hours. Thanks!

The problem is when things are too short and without sufficient context, it’s actually more difficult to translate. In this specific case, had I not read this ticket I would not have guessed what “tools” was supposed to mean (it can mean a range of things, usually not this one).

I'll close this for now; it might be related to T348643 (cloudcephosd1021-1034: hard drive sector errors increasing), as that could explain the cluster slowness when having to move data around after a host goes down.

I think there are no more direct tasks to be worked on from this besides the ones already ongoing.

I'll close this for now

@dcaro: But you didn't? :D