Regression in RAID10 software RAID with 6.1.135
Closed, Resolved, Public

Description

There is a regression in the handling of discard/TRIM on RAID10 software RAID, which leads to soft lockups: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1104460

These hosts are running 6.1.135 on Bookworm with software RAID10. They need to be rebooted into linux-image-6.1.0-33-amd64 (6.1.133). Once -33- is running, we can uninstall -34-.

  • centrallog2002.codfw.wmnet
  • centrallog1002.eqiad.wmnet
  • vrts2002.codfw.wmnet
  • vrts1003.eqiad.wmnet
  • prometheus2005.codfw.wmnet
  • prometheus2006.codfw.wmnet
  • prometheus2007.codfw.wmnet
  • prometheus2008.codfw.wmnet
  • prometheus1005.eqiad.wmnet
  • prometheus1006.eqiad.wmnet
  • prometheus1007.eqiad.wmnet
  • prometheus1008.eqiad.wmnet

Once a fixed kernel is out, these hosts can be moved back to the latest Bookworm kernel.

These Bookworm hosts use software RAID10 but have not yet rebooted into 6.1.135. On these, we only need to uninstall the 6.1.135 kernel.

  • cloudnet2005-dev.codfw.wmnet
  • cloudnet2006-dev.codfw.wmnet
  • cloudnet2007-dev.codfw.wmnet
  • cloudnet2008-dev.codfw.wmnet
  • cloudnet1005.eqiad.wmnet
  • cloudnet1006.eqiad.wmnet
  • cloudrabbit1001.eqiad.wmnet
  • cloudrabbit1002.eqiad.wmnet
  • cloudrabbit1003.eqiad.wmnet
  • cloudservices2004-dev.codfw.wmnet
  • cloudservices2005-dev.codfw.wmnet
  • cloudservices1005.eqiad.wmnet
  • cloudservices1006.eqiad.wmnet
  • puppetserver2001.codfw.wmnet
  • puppetserver2002.codfw.wmnet
  • puppetserver2003.codfw.wmnet
  • puppetserver2004.codfw.wmnet
  • puppetserver1001.eqiad.wmnet
  • puppetserver1002.eqiad.wmnet
  • puppetserver1003.eqiad.wmnet
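
The two host groups above differ only in which kernel is currently running. As a rough illustration, the per-host decision could be sketched as follows. The function names and the step strings are hypothetical, and the sketch assumes that `-34-` is the 6.1.135 build (as the description implies) and that the `-33-` image is still installed on the host.

```python
import re
from typing import List, Optional

# Assumption: ABI 34 is the 6.1.135 build; ABI 33 is 6.1.133 (per the description).
AFFECTED_ABI = 34
KNOWN_GOOD_ABI = 33

# Matches kernel release strings such as "6.1.0-34-amd64".
_RELEASE_RE = re.compile(r"^6\.1\.0-(\d+)-amd64$")


def abi_of(release: str) -> Optional[int]:
    """Return the ABI number of a Bookworm 6.1 amd64 release string, else None."""
    match = _RELEASE_RE.match(release.strip())
    return int(match.group(1)) if match else None


def remediation(running_release: str, installed_releases: List[str]) -> List[str]:
    """List the steps a host needs, in order (hypothetical step wording)."""
    steps = []
    installed = {abi_of(r) for r in installed_releases}
    # Hosts already running the affected kernel must reboot into the old one first.
    if abi_of(running_release) == AFFECTED_ABI:
        steps.append(f"reboot into 6.1.0-{KNOWN_GOOD_ABI}-amd64")
    # Every host with the affected image installed should have it removed.
    if AFFECTED_ABI in installed:
        steps.append(f"uninstall linux-image-6.1.0-{AFFECTED_ABI}-amd64")
    return steps
```

A host from the first list would get both steps, while a host from the second list would only get the uninstall step.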

Event Timeline

Host rebooted by jelto@cumin1002 with reason: revert kernel

Host rebooted by jelto@cumin1002 with reason: revert kernel

Host rebooted by jelto@cumin1002 with reason: revert kernel

Host rebooted by jelto@cumin1002 with reason: revert kernel

MoritzMuehlenhoff updated the task description.
MoritzMuehlenhoff claimed this task.

All RAID10 servers that had been upgraded to 6.1.135 have been reverted to 6.1.133. In addition, linux-image-amd64 has been downgraded to 6.1.133 on all RAID10 hosts and the 6.1.135 kernel has been purged. I've also updated the underlying Bookworm reboot task (https://phabricator.wikimedia.org/T392804) so that these servers will be rebooted into a fixed 6.1 kernel once the regression is fixed upstream.

As such, I'm marking this task, which provided the immediate remediation, as fixed.

Change #1141905 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] benthos/mw-accesslog-metrics: set start_offset to latest

https://gerrit.wikimedia.org/r/1141905

Change #1141905 merged by Kamila Součková:

[operations/puppet@production] benthos/mw-accesslog-metrics: start_from_oldest: false

https://gerrit.wikimedia.org/r/1141905

Change #1142576 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] benthos/mw-accesslog-metrics: undo consumer group rename

https://gerrit.wikimedia.org/r/1142576

Change #1142576 merged by Kamila Součková:

[operations/puppet@production] benthos/mw-accesslog-metrics: undo consumer group rename

https://gerrit.wikimedia.org/r/1142576

The cause of the regression has now been identified: the backport to 6.1 missed a patch that it depends on: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1104460#76

Once that additional patch has landed in the next 6.1.x update (likely the Bookworm 12.11 update, which will be released next week), we can reboot the affected systems into the fixed kernel.

A fixed package is now in bookworm-proposed-updates and will be part of the Bookworm 12.11 point release, which will be released in nine days:
https://tracker.debian.org/news/1644528/accepted-linux-61137-1-source-into-proposed-updates/
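
To decide whether a given kernel package version still needs the workaround, a simplified version check along these lines may help. It is only a sketch: the helper names are hypothetical, it compares the upstream part of the version only (not full Debian version ordering), and it assumes that the fixed upload is 6.1.137 (inferred from the tracker link above) and that everything from 6.1.135 up to that version is affected.

```python
from typing import Tuple


def upstream_tuple(version: str) -> Tuple[int, ...]:
    """Turn '6.1.135-1' into (6, 1, 135), ignoring the Debian revision."""
    upstream = version.split("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


# Assumptions: the regression starts at 6.1.135 and is fixed in 6.1.137.
REGRESSED = upstream_tuple("6.1.135")
FIXED = upstream_tuple("6.1.137")


def has_raid10_discard_regression(version: str) -> bool:
    """True if the version falls in the assumed affected range."""
    return REGRESSED <= upstream_tuple(version) < FIXED
```

Under these assumptions, `6.1.133-1` and `6.1.137-1` are reported as unaffected and `6.1.135-1` as affected.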