On 2023-02-13
Incident 1
Cause
At around 14:30 UTC @dcaro took down two OSD hosts (cloudceph1001/1002) so they could be moved to rack E4. This left some placement groups read-only, causing some VMs to fail writes to disk and many VMs to get stuck and be marked as down.
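A minimal sketch of how this degraded state typically shows up on a Ceph monitor node (not the exact commands run at the time):

    # Overall cluster status; undersized/degraded or inactive PGs show up in the summary
    ceph -s
    # Per-problem detail, e.g. which placement groups are inactive and why
    ceph health detail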
Resolution
This was fixed ~10 minutes later by allowing the cluster to rebalance (ceph osd unset norebalance + ceph osd unset noout), which started shifting data and creating the missing replicas, restoring the placement groups.
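For reference, the usual flag handling around this kind of maintenance looks roughly like the following sketch; the two unset commands are the ones mentioned above, and the set commands are their usual counterpart before taking the hosts down:

    # Before the maintenance: keep the cluster from rebalancing while the hosts are down
    ceph osd set noout
    ceph osd set norebalance
    # To recover: let the cluster backfill and re-create the missing replicas
    ceph osd unset norebalance
    ceph osd unset noout
    # Watch progress until all placement groups are active+clean again
    watch ceph -s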
In the meantime, the hosts had been physically moved and were ready to be reimaged, but this required some extra configuration on their switch ports.
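For illustration only (the interface and VLAN names below are hypothetical, not taken from the actual change), the per-port configuration for such a move on a Juniper switch looks roughly like this; as Incident 2 below describes, the MTU statement was the part missing from the change:

    # Junos configuration mode, hypothetical port/VLAN names
    set interfaces xe-0/0/10 description "cloudceph1001"
    set interfaces xe-0/0/10 mtu 9192
    set interfaces xe-0/0/10 unit 0 family ethernet-switching vlan members cloud-storage1-e4-eqiad
    commit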
Incident 2
Cause
The new ports were configured without specifying an MTU and were not yet brought up. This seemed to trigger an issue with the Juniper switch in which the rest of the ports on the same VLAN would intermittently drop jumbo frames (packets larger than 1500 bytes); more details at https://phabricator.wikimedia.org/T329535#8612670.
A few minutes later, around 16:30 UTC, we had a total outage of the Cloud Ceph cluster (cloudceph*.eqiad.wmnet). The OSD daemons were flagging other OSD hosts as down, and the monitor nodes were forcing them to stop (their health probes use packets larger than 1500 bytes, which were being dropped).
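A quick way to confirm this kind of jumbo-frame drop from one of the affected hosts (assuming the usual 9000-byte MTU on the storage network; the target host here is just an example) is a ping that forbids fragmentation:

    # 8972 bytes of payload + 28 bytes of IP/ICMP headers = a 9000-byte packet
    ping -M do -s 8972 -c 10 cloudceph1003.eqiad.wmnet
    # For comparison, a standard-size ping; if this one succeeds while the jumbo
    # one fails intermittently, packets larger than 1500 bytes are being dropped
    ping -c 10 cloudceph1003.eqiad.wmnet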
Immediate measures taken
To prevent any data corruption, we shut down all the OpenStack hypervisors (cloudvirt*.eqiad.wmnet), effectively turning off Cloud VPS and its related services (Toolforge, etc.).
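A sketch of how such a mass shutdown can be done from a cluster management host with cumin (the actual procedure may have used cookbooks or per-host commands instead; the host query and command below are illustrative):

    # Power off every eqiad hypervisor
    sudo cumin 'cloudvirt*.eqiad.wmnet' 'shutdown -h now'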
Resolution
The fix was to remove the configuration for those new ports (note that they were never brought up) and to manually start all the OSD daemons in the cluster. That eventually brought the cluster back up and running.
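On the OSD hosts the daemons are managed by systemd, so manually starting all the OSD daemons amounts to something like the following on each host (a sketch, not a transcript of the session):

    # Start every OSD daemon on this host (a single one can be started with ceph-osd@<id>)
    sudo systemctl start ceph-osd.target
    # Then watch the OSDs come back up and the placement groups recover
    sudo ceph -s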
This was followed by powering up all the hypervisors and making sure that the VMs were starting correctly (see the follow-up tickets for details).
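A hedged sketch of the kind of check used to spot VMs that did not come back cleanly (the exact verification and fixes are in the follow-up tickets):

    # With admin credentials loaded, list instances that are not running
    openstack server list --all-projects --status ERROR
    openstack server list --all-projects --status SHUTOFF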
See also the comments below for more details.