
cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs
Open, Medium · Public

Description

I have reimaged cp5012 as part of T227432, and the procedure seems to have gone just fine. However, the host now fails to unpack the initramfs and thus cannot mount the root filesystem at boot. Relevant kernel output:

[    2.917721] Unpacking initramfs...
[    2.922806] Initramfs unpacking failed: junk in compressed archive
[...]
[    4.507431] List of all partitions:
[    4.511327] No filesystem could mount root, tried: [    4.516580] 
[    4.518247] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[    4.527473] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.9.0-11-amd64 #1 Debian 4.9.189-3+deb9u1
[    4.537182] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.5.4 08/17/2017
[    4.545532]  0000000000000000 ffffffffb59353d4 ffff9bea23cad000 ffffa876401afea0
[    4.553820]  ffffffffb5781aec 0000000000000010 ffffa876401afeb0 ffffa876401afe48
[    4.562110]  75dc92e53e96f7fb ffffa876401afe58 ffffa876401afeb8 0000000000000012
[    4.570400] Call Trace:
[    4.573134]  [<ffffffffb59353d4>] ? dump_stack+0x5c/0x78
[    4.579064]  [<ffffffffb5781aec>] ? panic+0xe4/0x242
[    4.584600]  [<ffffffffb633f4ef>] ? mount_block_root+0x281/0x2bd
[    4.591303]  [<ffffffffb633e87e>] ? set_debug_rodata+0xc/0xc
[    4.597617]  [<ffffffffb633f6b8>] ? prepare_namespace+0x12b/0x161
[    4.604417]  [<ffffffffb633f15a>] ? kernel_init_freeable+0x1dc/0x1ec
[    4.611511]  [<ffffffffb5c0ee40>] ? rest_init+0x80/0x80
[    4.617342]  [<ffffffffb5c0ee4a>] ? kernel_init+0xa/0x100
[    4.623366]  [<ffffffffb5c1c577>] ? ret_from_fork+0x57/0x70
[    4.629610] Kernel Offset: 0x34600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0)
[    4.647227] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

Note that grub can access the disk contents just fine. Here is what /boot looks like on cp5012 (host with issues):

grub> ls -l /boot
186598       20190920110345 config-4.9.0-11-amd64
4249376      20190920110345 vmlinuz-4.9.0-11-amd64
DIR          20191104144602 grub/
3203475      20190920110345 System.map-4.9.0-11-amd64
41058058     20191104144839 initrd.img-4.9.0-11-amd64

On cp5011, recently reimaged and booting fine, /boot looks like this:

$ ls -l /boot
total 28740
-rw-r--r-- 1 root root   186598 Sep 20 11:03 config-4.9.0-11-amd64
drwxr-xr-x 5 root root     4096 Nov  1 15:37 grub
-rw-r--r-- 1 root root 21775867 Nov  1 15:48 initrd.img-4.9.0-11-amd64
-rw-r--r-- 1 root root  3203475 Sep 20 11:03 System.map-4.9.0-11-amd64
-rw-r--r-- 1 root root  4249376 Sep 20 11:03 vmlinuz-4.9.0-11-amd64

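The kernel, config and System.map files have identical sizes on both hosts, but initrd.img-4.9.0-11-amd64 is 41058058 bytes on cp5012 versus 21775867 bytes on cp5011, about 1.9 times as large. This suggests that something went wrong when the initramfs was created.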
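One way to look for trailing or corrupt data in a copy of the image is sketched below. This is only an illustration, not the check the kernel performs: it assumes the initramfs is a plain gzip stream (no uncompressed microcode cpio prepended, no other compressor), and the function name check_gzip_stream is made up for this example.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def check_gzip_stream(data):
    """Return (ok, detail) for a buffer expected to hold only gzip members.

    Assumption: the buffer is one or more gzip members, optionally separated
    by zero padding. Anything else is reported as unexpected data.
    """
    offset = 0
    members = 0
    while offset < len(data):
        # Skip zero padding between members.
        if data[offset] == 0:
            offset += 1
            continue
        if data[offset:offset + 2] != GZIP_MAGIC:
            return False, f"unexpected bytes at offset {offset}"
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        try:
            decomp.decompress(data[offset:])
        except zlib.error as exc:
            return False, f"decompression failed at offset {offset}: {exc}"
        if not decomp.eof:
            return False, f"truncated gzip member at offset {offset}"
        # unused_data holds whatever followed the end of this member.
        offset = len(data) - len(decomp.unused_data)
        members += 1
    return True, f"{members} gzip member(s), no trailing junk"
```

It could be run against a copy of the file, for example `check_gzip_stream(open("initrd.img-4.9.0-11-amd64", "rb").read())`. A failure here would only indicate that the file is not a clean gzip stream under the assumptions above.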

Event Timeline

ema created this task. · Nov 5 2019, 8:22 AM
Restricted Application added a project: Operations. · Nov 5 2019, 8:22 AM
Restricted Application added a subscriber: Aklapper.
ema updated the task description. · Nov 5 2019, 8:24 AM

Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts:

['cp5012.eqsin.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201911050832_ema_80975.log.

Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts:

['cp5012.eqsin.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201911050904_ema_86783.log.

Completed auto-reimage of hosts:

['cp5012.eqsin.wmnet']

Of which those FAILED:

['cp5012.eqsin.wmnet']

Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts:

['cp5012.eqsin.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201911050905_ema_87780.log.

ema triaged this task as Medium priority. · Nov 5 2019, 9:24 AM
ema added a comment. · Nov 5 2019, 9:43 AM

As an update, cp5012 is currently reimaging (Started first puppet run phase). The initramfs looks like this right now:

root@cp5012:~# ls -l /boot/initrd.img-4.9.0-11-amd64 
-rw-r--r-- 1 root root 21829203 Nov  5 09:37 /boot/initrd.img-4.9.0-11-amd64

That's about 49 kB larger than on any other cp5* host:

cp5001.eqsin.wmnet: -rw-r--r-- 1 root root 21778052 Oct 11 14:09 /boot/initrd.img-4.9.0-11-amd64
cp5002.eqsin.wmnet: -rw-r--r-- 1 root root 21771076 Oct 11 14:09 /boot/initrd.img-4.9.0-11-amd64
cp5003.eqsin.wmnet: -rw-r--r-- 1 root root 21778514 Oct 11 14:09 /boot/initrd.img-4.9.0-11-amd64
cp5004.eqsin.wmnet: -rw-r--r-- 1 root root 21773333 Oct 11 14:09 /boot/initrd.img-4.9.0-11-amd64
cp5005.eqsin.wmnet: -rw-r--r-- 1 root root 21779739 Oct 11 14:09 /boot/initrd.img-4.9.0-11-amd64
cp5006.eqsin.wmnet: -rw-r--r-- 1 root root 21780525 Oct 11 14:09 /boot/initrd.img-4.9.0-11-amd64
cp5007.eqsin.wmnet: -rw-r--r-- 1 root root 21773520 Oct 28 13:20 /boot/initrd.img-4.9.0-11-amd64
cp5008.eqsin.wmnet: -rw-r--r-- 1 root root 21772950 Oct 30 14:28 /boot/initrd.img-4.9.0-11-amd64
cp5009.eqsin.wmnet: -rw-r--r-- 1 root root 21776603 Oct 31 09:56 /boot/initrd.img-4.9.0-11-amd64
cp5010.eqsin.wmnet: -rw-r--r-- 1 root root 21778121 Nov  1 12:09 /boot/initrd.img-4.9.0-11-amd64
cp5011.eqsin.wmnet: -rw-r--r-- 1 root root 21775867 Nov  1 15:48 /boot/initrd.img-4.9.0-11-amd64
cp5012.eqsin.wmnet: -rw-r--r-- 1 root root 21829203 Nov  5 09:37 /boot/initrd.img-4.9.0-11-amd64
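The comparison above can be automated with a short script. The following is a minimal sketch that parses `host: ls -l` lines in the format shown and flags hosts whose initramfs size deviates from the median by more than a relative tolerance. The function names and the 0.1% default tolerance are assumptions chosen for illustration, not part of any existing tooling.

```python
import re
from statistics import median

# Matches "host: perms links owner group size ..." as printed in the listing.
LINE_RE = re.compile(
    r"^(?P<host>\S+):\s+\S+\s+\d+\s+\S+\s+\S+\s+(?P<size>\d+)\s"
)


def parse_sizes(listing):
    """Map each host name to the file size found on its `ls -l` line."""
    sizes = {}
    for line in listing.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            sizes[match.group("host")] = int(match.group("size"))
    return sizes


def size_outliers(sizes, rel_tolerance=0.001):
    """Return hosts whose size differs from the median by more than
    rel_tolerance (a fraction; 0.001 is an arbitrary illustrative value)."""
    mid = median(sizes.values())
    return {
        host: size
        for host, size in sizes.items()
        if abs(size - mid) / mid > rel_tolerance
    }
```

With the twelve sizes listed above and the 0.1% tolerance, only cp5012 is flagged. The 41058058-byte image from the failed boot would be flagged as well.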

Completed auto-reimage of hosts:

['cp5012.eqsin.wmnet']

and were ALL successful.

ema added a comment. · Nov 5 2019, 9:56 AM

This time the host booted properly after the reimage. The initramfs size is now also in line with that of the other cp5* hosts:

cp5010.eqsin.wmnet: -rw-r--r-- 1 root root 21778121 Nov  1 12:09 /boot/initrd.img-4.9.0-11-amd64
cp5011.eqsin.wmnet: -rw-r--r-- 1 root root 21775867 Nov  1 15:48 /boot/initrd.img-4.9.0-11-amd64
cp5012.eqsin.wmnet: -rw-r--r-- 1 root root 21779243 Nov  5 09:47 /boot/initrd.img-4.9.0-11-amd64

ema moved this task from Triage to Caching on the Traffic board. · Nov 5 2019, 3:35 PM
ema added a comment. · Wed, Nov 20, 4:49 PM

This just happened on cp2023 too.

Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts:

['cp2023.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201911201650_ema_5874.log.

Completed auto-reimage of hosts:

['cp2023.codfw.wmnet']

and were ALL successful.