
Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image
Open, NormalPublic

Description

The new Debian Jessie netboot image seems to end up in a kernel panic while loading (just before d-i).

As part of T181518 I tried to reimage kafka1023 while @Marostegui was reimaging db1111. We both got:

https://phabricator.wikimedia.org/P6451

Event Timeline

elukey triaged this task as Normal priority. Dec 12 2017, 6:30 PM
elukey created this task.
Restricted Application added a subscriber: Aklapper. Dec 12 2017, 6:30 PM
elukey renamed this task from "Debian Jessie reimage/install end up in kernel panic with 8.9 netboot image" to "Debian Jessie reimage/install ends up in kernel panic with 8.9 netboot image". Dec 12 2017, 6:31 PM
elukey updated the task description.
MoritzMuehlenhoff renamed this task from "Debian Jessie reimage/install ends up in kernel panic with 8.9 netboot image" to "Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image". Dec 12 2017, 6:32 PM

@Marostegui , @elukey : Can you try passing numa=off to the kernel in d-i? That should work around it. There'll be a revised 3.16 package (and a refreshed d-i image) via jessie-updates, that should then fix it for good.

Live-hacked install1002's /srv/tftpboot/jessie-installer/pxelinux.cfg/ttyS1-115200 and didn't get the kernel panic! (credits to @fgiunchedi for the technical help :)
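
For anyone wanting to script the same workaround instead of editing the file by hand, here is a minimal sketch of appending numa=off to the append lines of a pxelinux-style config. The sample config below is hypothetical (the real contents of ttyS1-115200 are not shown in this task), and the function name is made up for illustration.

```python
def add_kernel_param(config_text: str, param: str = "numa=off") -> str:
    """Append `param` to every `append` line of a pxelinux-style config, if missing."""
    out = []
    for line in config_text.splitlines():
        stripped = line.strip()
        # pxelinux keywords are case-insensitive; only APPEND lines carry kernel args.
        if stripped.lower().startswith("append ") and param not in stripped.split():
            line = line.rstrip() + " " + param
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical sample; the real ttyS1-115200 file is not shown in this task.
SAMPLE = """\
default install
label install
    kernel debian-installer/amd64/linux
    append initrd=debian-installer/amd64/initrd.gz console=ttyS1,115200n8
"""

patched = add_kernel_param(SAMPLE)
print(patched)
```

One caveat: if the append line contains a `---` separator, d-i normally carries parameters placed after it over to the installed system, so appending at the very end may not be what you want in that case.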

Trying db1111 again - will report back!

Just got the following (I did a manual pxe boot, not using wmf-auto-reimage):

  ┌──────────────────┤ [!!] Finish the installation ├───────────────────┐
┌─│                                                                     │ ┐
│ │                   Failed to run preseeded command                   │ │
│ │ Execution of preseeded command "wget -O /tmp/late_command           │ │
│ │ http://apt.wikimedia.org/autoinstall/scripts/late_command.sh && sh  │ │
│ │ /tmp/late_command" failed with exit code 100.                       │ │
│ │                                                                     │ │
└─│                             <Continue>

I tried db1111 again and got a kernel panic (used wmf-auto-reimage):

Loading Linux 3.16.0-4-amd64 ...
Loading initial ramdisk ...
[    0.613068] general protection fault: 0000 [#1] SMP
[    0.618632] Modules linked in:
[    0.622050] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W     3.16.0-4-amd64 #1 Debian 3.16.51-2
[    0.632255] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.5.5 08/16/2017
[    0.640616] task: ffff8840662e12d0 ti: ffff8840662e4000 task.ti: ffff8840662e4000
[    0.648976] RIP: 0010:[<ffffffff8109be3d>]  [<ffffffff8109be3d>] build_sched_domains+0x72d/0xcf0
[    0.658802] RSP: 0000:ffff8840662e7df8  EFLAGS: 00010202
[    0.664734] RAX: 0000ffff00000100 RBX: 0000000000000000 RCX: 000000000000000f
[    0.672704] RDX: 0000000000016e48 RSI: 0000000000000000 RDI: 0000000000000200
[    0.680676] RBP: ffff88406504c198 R08: ffff88406504ff60 R09: 00000000000002a4
[    0.688647] R10: 0000000000000000 R11: ffff8840662e7b06 R12: ffff88406504ff40
[    0.696619] R13: 0000000000000200 R14: ffff8840657ee2c0 R15: 0000000000000200
[    0.704591] FS:  0000000000000000(0000) GS:ffff88407f200000(0000) knlGS:0000000000000000
[    0.713631] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.720048] CR2: ffff88807ffff000 CR3: 0000000001813000 CR4: 00000000003407f0
[    0.728020] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.735992] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.743963] Stack:
[    0.746206]  ffff884000000000 ffff88406504ff58 ffff88406504c100 ffff8840657ee2c0
[    0.754498]  0000000000000000 0000000000000000 0000000000000000 ffff888064dae180
[    0.762793]  0000000000000000 000000000000f1c8 ffffffff00000000 0000000000000000
[    0.771091] Call Trace:
[    0.773819]  [<ffffffff8192332c>] ? sched_init_smp+0x398/0x452
[    0.780335]  [<ffffffff815244de>] ? mutex_lock+0xe/0x2a
[    0.786171]  [<ffffffff81069b23>] ? put_online_cpus+0x23/0x80
[    0.792589]  [<ffffffff810f284c>] ? stop_machine+0x2c/0x40
[    0.798716]  [<ffffffff8190415e>] ? kernel_init_freeable+0xdd/0x1e1
[    0.805716]  [<ffffffff81512c00>] ? rest_init+0x80/0x80
[    0.811552]  [<ffffffff81512c0a>] ? kernel_init+0xa/0xf0
[    0.817484]  [<ffffffff81525c58>] ? ret_from_fork+0x58/0x90
[    0.823707]  [<ffffffff81512c00>] ? rest_init+0x80/0x80
[    0.829543] Code: c0 0f 85 46 05 00 00 48 8b 74 24 08 48 c7 c2 00 dd a6 81 bf ff ff ff ff e8 91 78 21 00 48 98 49 8b 56 10 48 8b 04 c5 a0 1e 8e 81 <48> 8b 14 10 b8 01 00 00 00 49 89 54 24 10 f0 0f c1 02 85 c0 75
[    0.851139] RIP  [<ffffffff8109be3d>] build_sched_domains+0x72d/0xcf0
[    0.858343]  RSP <ffff8840662e7df8>
[    0.862239] ---[ end trace ad3821170e4c8fd4 ]---
[    0.867406] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    0.867406]
[    0.877616] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    0.877616]

Got the same "Failed to run preseeded command" error (late_command failing with exit code 100) when PXE booting the installation manually.

Tried to connect with install-console while the late_command failure message was displayed, and this is the output:

~ # sh /tmp/late_command
+ mkdir /target/root/.ssh
+ wget -O /target/root/.ssh/authorized_keys http://apt.wikimedia.org/autoinstall/ssh/authorized_keys
Connecting to apt.wikimedia.org ([2620:0:861:1:208:80:154:22]:80)
authorized_keys      100% |*******************************************************************************************************************************************************************************************************************************|   730   0:00:00 ETA
+ chmod go-rwx /target/root/.ssh/authorized_keys
+ apt-install openssh-server puppet lldpd

After the apt-install I see a ton of blank lines, and echo $? returns 100. So I tried each package separately, and only apt-install puppet returns 100.
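
The one-package-at-a-time check can be sketched roughly as below. This only illustrates the logic: the d-i shell itself is busybox and has no Python, so the runner is injectable and the usage example uses a fake runner that mirrors the result reported above (only puppet returning 100). The function names are hypothetical.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def run_exit_code(argv: List[str]) -> int:
    """Run a command, discard its output, and return its exit status."""
    done = subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return done.returncode


def failing_packages(
    packages: Iterable[str],
    run: Callable[[List[str]], int] = run_exit_code,
) -> Dict[str, int]:
    """Install each package separately and collect the non-zero exit codes."""
    failures = {}
    for pkg in packages:
        code = run(["apt-install", pkg])
        if code != 0:
            failures[pkg] = code
    return failures


def fake_run(argv: List[str]) -> int:
    """Stand-in runner reproducing what was observed on db1111."""
    return 100 if argv[-1] == "puppet" else 0


result = failing_packages(["openssh-server", "puppet", "lldpd"], run=fake_run)
print(result)
```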

I think I found the problem:

With the switch to puppet 4, puppet has gained several new deps:

Package: puppet
Version: 4.8.2-5~bpo8+1
Depends: init-system-helpers (>= 1.18~), adduser, facter, ruby-augeas, hiera, lsb-base, ruby | ruby-interpreter, ruby-deep-merge, ruby-rgen, ruby-safe-yaml, ruby-shadow

Package: puppet
Version: 3.8.5-2~bpo8+2
Depends: init-system-helpers (>= 1.18~), puppet-common (= 3.8.5-2~bpo8+2), ruby | ruby-interpreter

One of those (ruby-deep-merge) is not present in stock jessie, only in jessie-backports, and jessie-backports is not yet enabled at this stage of the installation process. So we need to import it to jessie-wikimedia, then it should work again.
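
As a quick way to see which direct dependencies are new, the two Depends fields quoted above can be diffed as in the sketch below. It only compares the direct Depends lines from this comment; some of these packages may previously have been pulled in indirectly through puppet-common, which this does not check. The helper name is made up for illustration.

```python
from typing import Set

PUPPET_4 = (
    "init-system-helpers (>= 1.18~), adduser, facter, ruby-augeas, hiera, "
    "lsb-base, ruby | ruby-interpreter, ruby-deep-merge, ruby-rgen, "
    "ruby-safe-yaml, ruby-shadow"
)
PUPPET_3 = (
    "init-system-helpers (>= 1.18~), puppet-common (= 3.8.5-2~bpo8+2), "
    "ruby | ruby-interpreter"
)


def dep_names(depends: str) -> Set[str]:
    """Return the package names in a Depends field, ignoring version constraints."""
    names = set()
    for clause in depends.split(","):
        # "a | b" lists alternatives; record every alternative's name.
        for alternative in clause.split("|"):
            name = alternative.split("(")[0].strip()
            if name:
                names.add(name)
    return names


new_deps = dep_names(PUPPET_4) - dep_names(PUPPET_3)
print(sorted(new_deps))
```

Each name in the result would then need to be checked against what is installable at that stage of d-i (stock jessie plus jessie-wikimedia); per the comment above, ruby-deep-merge is the one that was missing.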

herron added a subscriber: herron. Dec 13 2017, 4:49 PM

ruby-deep-merge has been imported to jessie-wikimedia and that appears to have solved this problem. db1111 no longer errors out from missing deps when attempting to apt-get install puppet.

My last d-i went fine! I kept the live hack on install1002 in place; let's puppetize it if the upstream fix does not come soon.

My last reimage failed with:

08:11:20 | db1111.eqiad.wmnet | Unable to run wmf-auto-reimage-host: could not convert string to float: Warning: Setting configtimeout is deprecated.
   (at /usr/lib/ruby/vendor_ruby/puppet/settings.rb:1146:in `issue_deprecation_warning')
/usr/local/share/bash/puppet-common.sh: line 68: /var
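
The "could not convert string to float" wording matches what Python's float() raises on non-numeric input, so one plausible reading is that a numeric value was being parsed from command output that now starts with Puppet's deprecation warning. That is an assumption, not something confirmed in this task. A minimal sketch of a more tolerant parse follows, with a hypothetical helper name and a made-up numeric value in the sample:

```python
from typing import Optional


def last_float(output: str) -> Optional[float]:
    """Return the last line of `output` that parses as a float, or None."""
    for line in reversed(output.splitlines()):
        try:
            return float(line.strip())
        except ValueError:
            # Warnings, blank lines and other noise are skipped.
            continue
    return None


# Hypothetical output: the warning text is from the log above, the number is made up.
SAMPLE_OUTPUT = (
    "Warning: Setting configtimeout is deprecated.\n"
    "   (at /usr/lib/ruby/vendor_ruby/puppet/settings.rb:1146:in `issue_deprecation_warning')\n"
    "1513152680.0\n"
)

value = last_float(SAMPLE_OUTPUT)
print(value)
```

Silencing or filtering the warning at the source would be the cleaner fix; this only shows how the parsing side could avoid failing on it.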

db1111 got installed fine. But @Volans and I noticed that we are no longer generating the puppet cert on the server, so the installation gets stuck until someone manually connects and issues a puppet run to generate the cert.

Change 398244 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] base: fix dependency relationship

https://gerrit.wikimedia.org/r/398244

Change 398279 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] wmf-auto-reimage: generate Puppet cert if needed

https://gerrit.wikimedia.org/r/398279

Change 398244 merged by Volans:
[operations/puppet@production] base: fix dependency relationship

https://gerrit.wikimedia.org/r/398244

Change 398303 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] base: fix dependency relationship

https://gerrit.wikimedia.org/r/398303

So the reimages now work, but the above code changes need to be merged and tested for the first puppet run and wmf-auto-reimage to work properly after d-i.

A revised kernel has been released at https://lists.debian.org/debian-stable-announce/2017/12/msg00002.html

But the netinst image hasn't been regenerated, so at this point we still need the workaround. I'll try to find out whether that is planned.

Change 398303 merged by Volans:
[operations/puppet@production] base: fix dependency relationship

https://gerrit.wikimedia.org/r/398303

Change 398279 merged by Volans:
[operations/puppet@production] wmf-auto-reimage: generate Puppet cert if needed

https://gerrit.wikimedia.org/r/398279

Change 398803 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] wmf-auto-reimage: ignore exit code for cert gen

https://gerrit.wikimedia.org/r/398803

Change 398803 merged by Volans:
[operations/puppet@production] wmf-auto-reimage: ignore exit code for cert gen

https://gerrit.wikimedia.org/r/398803

Change 398807 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set numa=off to tftpboot jessie's ttyS1-115200 config

https://gerrit.wikimedia.org/r/398807

Change 398807 merged by Elukey:
[operations/puppet@production] Set numa=off to tftpboot jessie's ttyS1-115200 config

https://gerrit.wikimedia.org/r/398807

The reimage scripts should be back on track and work as expected; this was tested today with a couple of reimages. I cannot rule out that we'll find other corner cases with less commonly used OS versions and with Puppet 4 clients, but from my side this could be resolved.

We still have to resolve the workaround on install1002; it is still in place as far as I remember.

Change 399161 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] wmf-auto-reimage: improve resume capabilities

https://gerrit.wikimedia.org/r/399161

Unfortunately there won't be rebuilt netinst images until the next point release:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=883938#336

Marostegui changed the status of subtask T180788: Rack and setup db1111 and db1112 from Stalled to Open. Dec 19 2017, 8:44 PM

Change 399161 merged by Volans:
[operations/puppet@production] wmf-auto-reimage: improve resume capabilities

https://gerrit.wikimedia.org/r/399161

Change 404439 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] wmf-auto-reimage: fix host validation logic

https://gerrit.wikimedia.org/r/404439

Change 404439 merged by Volans:
[operations/puppet@production] wmf-auto-reimage: fix host validation logic

https://gerrit.wikimedia.org/r/404439