Page MenuHomePhabricator

Something is wrong with installer root disk stuff
Closed, ResolvedPublic

Description

I've had the same new problem twice now, on initial boot post-install:

Loading Linux 4.4.0-2-amd64 ...
Loading initial ramdisk ...
Loading, please wait...
mdadm: No devices listed in conf file were found.
Gave up waiting for root device.  Common problems:
 - Boot args (cat /proc/cmdline)
   - Check rootdelay= (did the system wait long enough?)
   - Check root= (did the system wait for the right device?)
 - Missing modules (cat /proc/modules; ls /dev)
ALERT!  /dev/disk/by-uuid/49cc7fa6-a12e-4aa9-a433-26b6aba9b06a does not exist.  Dropping to a shell!
modprobe: module ehci-orion not found in modules.dep


BusyBox v1.22.1 (Debian 1:1.22.0-9+deb8u1) built-in shell (ash)
Enter 'help' for a list of built-in commands.

/bin/sh: can't access tty; job control turned off
(initramfs) cat /etc/mdadm/mdadm.conf 
HOMEHOST <system>
ARRAY /dev/md/0  metadata=1.2 UUID=09d24517:e5d90d78:98e26e42:6a3ecb7d name=cp1055:0
(initramfs) mdadm --assemble /dev/md/0
mdadm: /dev/md/0 has been started with 2 drives.
(initramfs) exit
/dev/md0: clean, 42804/610800 files, 375335/2439168 blocks
[   92.138317] kvm: disabled by bios
[   92.206867] kvm: disabled by bios
[   93.037036] bnx2x 0000:01:00.0 eth0: Warning: Unqualified SFP+ module detected, Port 0 from FINISAR CORP.    part number FTLX1471D3BCL   

Debian GNU/Linux 8 cp1055 ttyS1

cp1055 login:

At a glance it looks similar to our md device assembly timeout issues, but it's not that. It's that the UUID for the root device is wrong, so it doesn't even try to assemble the correct device. Anyone have a clue about recent changes that could've precipitated this behavior?

Event Timeline

BBlack created this task.Nov 2 2016, 9:00 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 2 2016, 9:00 PM

I've ran into the same thing while installing restbase1016, assembling the array manually the first time worked and subsequent reboots were fine. Oddly enough the same didn't happen on restbase101[78] where installation and the first reboot completed fine (array assembled etc)

Gehel added a subscriber: Gehel.Feb 15 2017, 10:49 AM

I ran into the same issue when migrating elasticsearch servers to jessie. I was told to manually add "rootdelay" to the grub command line. It looks like we do have a rootdelay configured for precise and trusty installers. It might make sense to add the same for jessie as well (really not sure, my understanding of the issue is limited). In any case: https://gerrit.wikimedia.org/r/337804

Also happened on a range of mw servers when reimaging them to jessie (https://phabricator.wikimedia.org/T144911). Maybe let one of the local DC ops look at the initial system output to get a sense of what's causing the delays?

RobH added a subscriber: RobH.Aug 7 2017, 11:17 PM

I'm working on Ops Clinic Duty this week, and as part of that work, I've done a search for unowned, high priority tasks in Operations. This task came up as a high priority Operations task without an owner. So I wanted to address it as part of clinic duty.

So the (non)detection of disks and drop to busybox has been a long standing issue, so much so that our earlier ubuntu (precise) installs included rootdelay=90. Many, many, many SATA disks take a bit too long in the initial detection of systems. We've had it come up a few times since. On many, but not all, a single rootdelay=5 for the first initial OS boot with raid build is enough, and future reboots (both warm and cold) will work. I know if I manually include rootdelay=5 in a post-install boot, I test this by shutting the machine down remotely, and booting it again once the initial OS boot has been completed and raid rebuild is done. If it then fails, I'd bring up patching in root delay.

This in fact did happen again on some new installs, but then we switched from jessie to strech, and the issue resolved itself. Resolution was not via adding a root delay, but in stretch simply performing a second re-scan of ALL disk devices when the first scan times out on slower SATA disk spin ups.

Can we simply move all systems with this issue to stretch?

I'm running into this issue again when reimaging restbase systems as part of T184100: Reprovision legacy Cassandra nodes into new cluster. From the comments above it seems that setting rootdelay= fixes the issue but we're not applying the fix for jessie. I'll test it on the next reimage.

Can we simply move all systems with this issue to stretch?

IMO no, moving to stretch is usually a non trivial amount of work while reimages with jessie can happen for unrelated reasons (e.g. the example above)

So I tried to investigate this on restbase1013 which reliably failed to reboot cleanly after d-i finished:

Loading Linux 4.9.0-0.bpo.5-amd64 ...
Loading initial ramdisk ...          
[    0.150531] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
Loading, please wait...
mdadm: No devices listed in conf file were found.
Gave up waiting for root device.  Common problems:
 - Boot args (cat /proc/cmdline)
   - Check rootdelay= (did the system wait long enough?)
   - Check root= (did the system wait for the right device?)
 - Missing modules (cat /proc/modules; ls /dev)
ALERT!  /dev/disk/by-uuid/b7c29ec8-d92f-464b-8061-34de70a3a594 does not exist.  Dropping to a shell!
modprobe: module ehci-orion not found in modules.dep


BusyBox v1.22.1 (Debian 1:1.22.0-9+deb8u1) built-in shell (ash)
Enter 'help' for a list of built-in commands.

/bin/sh: can't access tty; job control turned off

It looks like the correct config is there, but the array isn't assembled (issuing mdadm --scan --assemble will do the right thing)

(initramfs) mdadm --detail --scan
(initramfs) cat /etc/mdadm/mdadm.conf 
HOMEHOST <system>
ARRAY /dev/md/0  metadata=1.2 UUID=cc49411e:305ae2bf:ec49670d:c0937b46 name=restbase1013:0
ARRAY /dev/md/1  metadata=1.2 UUID=a9f5cbe4:a177a81e:7833681e:fb69709a name=restbase1013:1
ARRAY /dev/md/2  metadata=1.2 UUID=c78279e6:dd6c8bc4:f7141d3a:7008cf5d name=restbase1013:2
(initramfs)

From bugs.debian.org it seems people are reporting a similar problem in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=814036 (fixed in mdadm 3.4-2) which in turn is a regression introduced in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=784070 wrt incremental assembly of arrays.

I can reproduce this at will on jessie new installs, I'll debug further with not passing quiet upon first reboot at next reinstall. I tried (re)booting restbase1013 after the debug above without quiet and all arrays were assembled.

Booting both restbase1013 and restbase1015 without quiet it looks like a race condition: on 1015 assembly worked as intended but on 1013 it failed in the usual way we've experienced:

restbase1015

[    6.508724] scsi 0:0:5:0: Direct-Access     ATA      Samsung SSD 850  2B6Q PQ: 0 ANSI: 6
[    6.582567] scsi 0:0:6:0: Enclosure         HP       P440ar           6.06 PQ: 0 ANSI: 5
[    6.659315] scsi 0:0:0:0: Attached scsi generic sg0 type 12
[    6.661398] sd 0:0:1:0: [sda] 2000409264 512-byte logical blocks: (1.02 TB/954 GiB)
[    6.661564] sd 0:0:2:0: [sdb] 2000409264 512-byte logical blocks: (1.02 TB/954 GiB)
[    6.661714] sd 0:0:3:0: [sdc] 2000409264 512-byte logical blocks: (1.02 TB/954 GiB)
[    6.661807] sd 0:0:4:0: [sdd] 2000409264 512-byte logical blocks: (1.02 TB/954 GiB)
[    6.661814] sd 0:0:5:0: [sde] 2000409264 512-byte logical blocks: (1.02 TB/954 GiB)
[    6.662461] sd 0:0:1:0: [sda] Write Protect is off
[    6.662626] sd 0:0:2:0: [sdb] Write Protect is off
[    6.662845] sd 0:0:3:0: [sdc] Write Protect is off
[    6.662864] sd 0:0:4:0: [sdd] Write Protect is off
[    6.663186] sd 0:0:5:0: [sde] Write Protect is off
[    6.663221] sd 0:0:2:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    6.663222] sd 0:0:1:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    6.663420] sd 0:0:3:0: [sdc] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    6.663441] sd 0:0:4:0: [sdd] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    6.663761] sd 0:0:5:0: [sde] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    6.665626]  sdb: sdb1 sdb2 sdb3 sdb4
[    6.665901]  sdd: sdd1 sdd2 sdd3 sdd4
[    6.665902]  sdc: sdc1 sdc2 sdc3 sdc4
[    6.665982]  sda: sda1 sda2 sda3 sda4
[    6.666238]  sde: sde1 sde2 sde3 sde4
[    6.668001] sd 0:0:2:0: [sdb] Attached SCSI disk
[    6.668913] sd 0:0:3:0: [sdc] Attached SCSI disk
[    6.668962] sd 0:0:4:0: [sdd] Attached SCSI disk
[    6.668982] sd 0:0:5:0: [sde] Attached SCSI disk
[    6.669183] sd 0:0:1:0: [sda] Attached SCSI disk
[    7.383604] sd 0:0:1:0: Attached scsi generic sg1 type 0
[    7.408795] sd 0:0:2:0: Attached scsi generic sg2 type 0
[    7.434153] sd 0:0:3:0: Attached scsi generic sg3 type 0
[    7.459453] sd 0:0:4:0: Attached scsi generic sg4 type 0
[    7.485112] sd 0:0:5:0: Attached scsi generic sg5 type 0
[    7.510459] scsi 0:0:6:0: Attached scsi generic sg6 type 13
[    7.557236] random: fast init done
[    7.659387] ses 0:0:6:0: Attached Enclosure device
Begin: Loading essential drivers ... done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... Begin: Assembling all MD arrays ... [    7.714312] md: md0 stopped.
[    7.729774] md: bind<sdb1>
[    7.743022] md: bind<sdc1>
[    7.756099] md: bind<sdd1>
[    7.769337] md: bind<sde1>
[    7.782669] md: bind<sda1>
[    7.797537] md: raid1 personality registered for level 1
[    7.823435] md/raid1:md0: not clean -- starting background reconstruction
[    7.855883] md/raid1:md0: active with 5 out of 5 mirrors
[    7.881268] md0: detected capacity change from 0 to 29981933568
mdadm: /dev/md/0 has been started with 5 drives.
[    7.941461] md: md1 stopped.
[    7.956920] md: bind<sdb2>
[    7.970100] md: bind<sdc2>
[    7.983478] md: bind<sdd2>
[    7.996764] md: bind<sde2>
[    8.010075] md: bind<sda2>
[    8.025202] md/raid1:md1: not clean -- starting background reconstruction
[    8.057351] md/raid1:md1: active with 5 out of 5 mirrors
[    8.082787] md1: detected capacity change from 0 to 999751680
mdadm: /dev/md/1 has been started with 5 drives.
[    8.138681] md: md2 stopped.
[    8.154035] md: bind<sdb3>
[    8.167247] md: bind<sdc3>
[    8.180571] md: bind<sdd3>
[    8.193847] md: bind<sde3>
[    8.207171] md: bind<sda3>
[    8.221952] md: raid10 personality registered for level 10
[    8.248708] md/raid10:md2: not clean -- starting background reconstruction
[    8.281452] md/raid10:md2: active with 5 out of 5 devices
[    8.307080] md2: detected capacity change from 0 to 49956782080
mdadm: /dev/md/2 has been started with 5 drives.
Success: assembled all arrays.
done.
done.
Begin: Running /scripts/local-premount ... [    8.357878] PM: Starting manual resume from disk
done.
Begin: Will now check root file system ... fsck from util-linux 2.25.2
[/sbin/fsck.ext4 (1) -- /dev/md0] fsck.ext4 -a -C0 /dev/md0 
/dev/md0: clean, 42199/1831424 files, 459485/7319808 blocks

restbase1013

[    4.853202] hub 3-0:1.0: USB hub found
[    4.853233] hub 3-0:1.0: 15 ports detected
[    4.854002] xhci_hcd 0000:00:14.0: xHCI Host Controller
[    4.854008] xhci_hcd 0000:00:14.0: new USB bus registered, assigned bus number 4
[    4.854069] usb usb4: New USB device found, idVendor=1d6b, idProduct=0003
[    4.854071] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    4.854072] usb usb4: Product: xHCI Host Controller
[    4.854074] usb usb4: Manufacturer: Linux 4.9.0-0.bpo.5-amd64 xhci-hcd
[    4.854075] usb usb4: SerialNumber: 0000:00:14.0
[    4.854293] hub 4-0:1.0: USB hub found
[    4.854313] hub 4-0:1.0: 6 ports detected
[    4.895962] usb 2-1: New USB device found, idVendor=8087, idProduct=8002
[    4.895965] usb 2-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    4.896263] hub 2-1:1.0: USB hub found
[    4.896333] hub 2-1:1.0: 8 ports detected
[    5.003968] usb 1-1: New USB device found, idVendor=8087, idProduct=800a
[    5.003970] usb 1-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    5.004267] hub 1-1:1.0: USB hub found
[    5.004336] hub 1-1:1.0: 6 ports detected
[    5.179572] usb 3-3: new high-speed USB device number 2 using xhci_hcd
[    5.319693] usb 3-3: New USB device found, idVendor=0424, idProduct=2660
[    5.319696] usb 3-3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    5.320011] hub 3-3:1.0: USB hub found
[    5.320033] hub 3-3:1.0: 2 ports detected
[    5.888358] uhci_hcd 0000:01:00.4: UHCI Host Controller
[    5.913332] uhci_hcd 0000:01:00.4: new USB bus registered, assigned bus number 5
[    5.949043] uhci_hcd 0000:01:00.4: detected 8 ports
[    5.972255] uhci_hcd 0000:01:00.4: port count misdetected? forcing to 2 ports
[    6.006235] uhci_hcd 0000:01:00.4: irq 19, io base 0x00001300
[    6.033956] usb usb5: New USB device found, idVendor=1d6b, idProduct=0001
[    6.066198] usb usb5: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    6.100989] usb usb5: Product: UHCI Host Controller
[    6.103913] scsi 0:0:1:0: Direct-Access     ATA      Samsung SSD 850  2B6Q PQ: 0 ANSI: 6
[    6.162513] usb usb5: Manufacturer: Linux 4.9.0-0.bpo.5-amd64 uhci_hcd
[    6.193743] usb usb5: SerialNumber: 0000:01:00.4
[    6.216445] hub 5-0:1.0: USB hub found
[    6.234635] hub 5-0:1.0: 2 ports detected
Begin: Loading essential drivers ... done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... Begin: Assembling all MD arrays ... mdadm: No devices listed in conf file were found.
Failure: failed to assemble all arrays.
done.
done.
Begin: Running /scripts/local-premount ... done.
Begin: Waiting for root file system ... [    6.286291] scsi 0:0:2:0: Direct-Access     ATA      Samsung SSD 850  2B6Q PQ: 0 ANSI: 6
[    6.390311] scsi 0:0:3:0: Direct-Access     ATA      Samsung SSD 850  2B6Q PQ: 0 ANSI: 6
[    6.490386] scsi 0:0:4:0: Direct-Access     ATA      Samsung SSD 850  2B6Q PQ: 0 ANSI: 6
[    6.566416] scsi 0:0:5:0: Direct-Access     ATA      Samsung SSD 850  2B6Q PQ: 0 ANSI: 6
[    6.648374] scsi 0:0:6:0: Enclosure         HP       P440ar           6.06 PQ: 0 ANSI: 5
[    6.729305] scsi 0:0:0:0: Attached scsi generic sg0 type 12
[    6.755493] scsi 0:0:1:0: Attached scsi generic sg1 type 0
[    6.757865] sd 0:0:1:0: [sda] 2000409264 512-byte logical blocks: (1.02 TB/954 GiB)
[    6.757965] sd 0:0:2:0: [sdb] 2000409264 512-byte logical blocks: (1.02 TB/954 GiB)
[    6.758168] sd 0:0:3:0: [sdc] 2000409264 512-byte logical blocks: (1.02 TB/954 GiB)
[    6.758206] sd 0:0:5:0: [sde] 2000409264 512-byte logical blocks: (1.02 TB/954 GiB)
[    6.758250] sd 0:0:4:0: [sdd] 2000409264 512-byte logical blocks: (1.02 TB/954 GiB)
[    6.759207] sd 0:0:2:0: [sdb] Write Protect is off
[    6.759244] sd 0:0:3:0: [sdc] Write Protect is off
[    6.759269] sd 0:0:1:0: [sda] Write Protect is off
[    6.759446] sd 0:0:5:0: [sde] Write Protect is off
[    6.759481] sd 0:0:4:0: [sdd] Write Protect is off
[    6.759850] sd 0:0:2:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    6.759864] sd 0:0:3:0: [sdc] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    6.759905] sd 0:0:1:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    6.760022] sd 0:0:5:0: [sde] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    6.760052] sd 0:0:4:0: [sdd] Write cache: disabled, read cache: enabled, supports DPO and FUA
[    6.761085] random: fast init done
[    6.762548]  sde: sde1 sde2 sde3 sde4
[    6.762559]  sdb: sdb1 sdb2 sdb3 sdb4
[    6.762650]  sda: sda1 sda2 sda3 sda4
[    6.762694]  sdd: sdd1 sdd2 sdd3 sdd4
[    6.762733]  sdc: sdc1 sdc2 sdc3 sdc4
[    6.765709] sd 0:0:2:0: [sdb] Attached SCSI disk
[    6.765859] sd 0:0:4:0: [sdd] Attached SCSI disk
Begin: Running /[    6.765880] sd 0:0:5:0: [sde] Attached SCSI disk
[    6.766026] sd 0:0:1:0: [sda] Attached SCSI disk
[    6.766083] sd 0:0:3:0: [sdc] Attached SCSI disk
scripts/local-block ... done.
[    7.495690] sd 0:0:2:0: Attached scsi generic sg2 type 0
[    7.521068] sd 0:0:3:0: Attached scsi generic sg3 type 0
[    7.546459] sd 0:0:4:0: Attached scsi generic sg4 type 0
[    7.571935] sd 0:0:5:0: Attached scsi generic sg5 type 0
[    7.597458] scsi 0:0:6:0: Attached scsi generic sg6 type 13
[    7.746264] ses 0:0:6:0: Attached Enclosure device
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
[   33.872484] hpsa 0000:03:00.0: Acknowledging event: 0x80000000 (HP SSD Smart Path configuration change)
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
Begin: Running /scripts/local-block ... done.
done.
Gave up waiting for root device.  Common problems:
 - Boot args (cat /proc/cmdline)
   - Check rootdelay= (did the system wait long enough?)
   - Check root= (did the system wait for the right device?)
 - Missing modules (cat /proc/modules; ls /dev)
ALERT!  /dev/disk/by-uuid/87bb00f6-1f3b-4549-8d9e-7da589679f68 does not exist.  Dropping to a shell!
modprobe: module ehci-orion not found in modules.dep
[   36.463156] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[   36.495572] hidraw: raw HID events driver (C) Jiri Kosina
[   36.522272] usbcore: registered new interface driver usbhid
[   36.548699] usbhid: USB HID core driver


BusyBox v1.22.1 (Debian 1:1.22.0-9+deb8u1) built-in shell (ash)
Enter 'help' for a list of built-in commands.

/bin/sh: can't access tty; job control turned off
(initramfs)

I tried manually setting rootdelay from within grub on the first boot of restbase2007 and 2008 with 5 and 3 respectively, both worked as in the raid arrays were assembled

I think I am suffering this, but not on first boot, but on installer (both jessie and stretch). This doesn't happen with regular dbs, but these have 20TB, which may affect it being extra-slow. I will try rootdelayon the installer and see if that helps.

My issue could be an installer one, so ignore my latest comment.

RobH added a comment.May 4 2018, 8:59 PM

This seems fixed by adding the rootdelay for jessie and older, and stretch has it go away.

If it happens on jessie, adding a rootdelay=15 to the initial boot post install fixes it.

If we aren't going to go after the actual cause (since they eliminated it in stretch) we could resolve this?

MoritzMuehlenhoff closed this task as Resolved.May 8 2018, 2:09 PM
MoritzMuehlenhoff claimed this task.

This seems fixed by adding the rootdelay for jessie and older, and stretch has it go away.
If it happens on jessie, adding a rootdelay=15 to the initial boot post install fixes it.
If we aren't going to go after the actual cause (since they eliminated it in stretch) we could resolve this?

Ack, given that this is fixed in stretch and has a workaround in jessie, I'm being bold and closing it.

Dzahn added a subscriber: Dzahn.EditedMay 29 2019, 8:22 PM

I just ran into this when reinstalling phab2001 from jessie to stretch.

After installer was done and it rebooted, It fell back to busybox with "mdadm: No devices listed in conf file were found". and a powercycle did not change it. Eventually i was able to fix it with the

mdadm --assemble /dev/md/0

as originally reported in this ticket.

edit: it was actually still using jessie installer when that happened, not stretch