
jessie installer fails after partitioning stage - same recipe works on trusty and it worked a few weeks ago
Closed, ResolvedPublic

Description

┌───────────────────────┤ [!!] Partition disks ├───────────────────────┐
│                                                                      │
│ The attempt to mount a file system with type ext3 in SCSI1 (2,0,0),  │
│ partition #1 (sda) at / failed.                                      │
│                                                                      │
│ You may resume partitioning from the partitioning menu.              │
│                                                                      │
│ Do you want to resume partitioning?                                  │
│                                                                      │
│     <Go Back>                                      <Yes>    <No>     │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Both Yes and No end up in an infinite loop, and Go Back doesn't allow me to fix anything in the menu.

~ # mount
rootfs on / type rootfs (rw,size=66097196k,nr_inodes=8257582)
none on /run type tmpfs (rw,nosuid,relatime,size=6609720k,mode=755)
none on /proc type proc (rw,relatime)
none on /sys type sysfs (rw,relatime)
devtmpfs on /dev type devtmpfs (rw,relatime,size=33030336k,nr_inodes=8257584,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)

~ # fdisk /dev/sda

Welcome to fdisk (util-linux 2.25.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): p
Disk /dev/sda: 1.6 TiB, 1796638507008 bytes, 3509059584 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x0224a576

Device     Boot    Start        End    Sectors  Size Id Type
/dev/sda1  *        2048   78125055   78123008 37.3G 83 Linux
/dev/sda2       78127102 3509057535 3430930434  1.6T  5 Extended
/dev/sda5       78127104   93749247   15622144  7.5G 82 Linux swap / Solaris
/dev/sda6       93751296 3509057535 3415306240  1.6T 8e Linux LVM

~ # mount /dev/sda1 /target
mount: mounting /dev/sda1 on /target failed: Invalid argument
~ # mount /dev/tank/data /mnt
(ok)

Event Timeline

jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description.
jcrespo added a project: SRE.
jcrespo subscribed.

And yes, I tried formatting it manually, too.

Change 267681 had a related patch set uploaded (by Jcrespo):
Testing db jessie installer problems on db2030

https://gerrit.wikimedia.org/r/267681

RobH subscribed.

Please note I had this exact same (frustrating) issue today for my installation on oresrdb1001. I also tried manual formatting, formatting with a single disk, and multiple recipes, including mw.cfg, raid1-lvm-ext4-srv, and a few others. It failed with every partitioning method, including manual. All of this is listed on T125562 as well.

Switched the same system to trusty from jessie and it installed flawlessly.

Please note that, like db2030 (which @jcrespo was testing), oresrdb1001 is a Dell system, though one is an R510 and the other an R420.

jcrespo renamed this task from jessie installer fails when using db hosts- same recipe works on trusty and on other hosts/a few weeks ago to jessie installer fails after partitioning stage - same recipe works on trusty and it worked a few weeks ago. Feb 3 2016, 9:17 PM
jcrespo set Security to None.

CCing @MoritzMuehlenhoff as the "package expert". I may be wrong, but this smells like an upstream bug in the jessie installer, if it was recently updated, or a breaking change in our particular installer.

RobH triaged this task as Unbreak Now! priority. Feb 3 2016, 9:36 PM

I'm assigning directly to @MoritzMuehlenhoff, since it is easy to miss a CC addition to a task but harder to overlook a direct assignment.

@MoritzMuehlenhoff: If this shouldn't be assigned to you, but you know who would work on it, please reassign.

I'm also raising the priority to unbreak now, since this particular issue will start blocking Jessie installs.

RobH lowered the priority of this task from Unbreak Now! to High. Feb 3 2016, 9:38 PM

Moritz chatted with us about this in IRC and plans to work on it tomorrow, so lowering to High from UBN.

attaching log files saved earlier from oresrdb1001

I hope this will help troubleshoot the problem. Last night, while I was working on my lab, I had the same problem, so I reproduced the error and am sharing it here.

filesys eror.PNG (347×618 px, 26 KB)

After long hours of troubleshooting, I went to http://ftp.debian.org/debian/dists/jessie/main/installer-amd64/current/images/ to download the new netboot. As you can see below, I renamed the jessie-installer folder to jessie-installerold, created another jessie-installer folder, copied the new netboot into it, and restarted my install; the problem was fixed.
drwxr-xr-x 5 root root 4096 Feb 4 00:23 jessie-installer/
drwxr-xr-x 4 root root 4096 Feb 4 16:55 jessie-installerold/
drwxr-xr-x 5 root root 4096 Jan 3 20:59 trusty-installer/
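
The directory swap described above can be sketched as a small shell helper. This is only an illustration: the function name swap_installer, the /srv/tftpboot path in the usage comment, and the netboot.tar.gz location under the images/ directory are assumptions, not something taken from the actual setup.

```shell
#!/bin/sh
# Hypothetical helper: move a freshly unpacked netboot tree into place and
# keep the previous tree as "<name>old" (the naming used in the listing above).
#   $1 = tftp root directory
#   $2 = directory holding the new netboot files
#   $3 = installer directory name (default: jessie-installer)
swap_installer() {
    tftp_root=$1
    new_tree=$2
    name=${3:-jessie-installer}

    # The new tree must exist before anything is renamed.
    [ -d "$new_tree" ] || { echo "no such directory: $new_tree" >&2; return 1; }
    # Refuse to clobber an earlier backup.
    [ ! -e "$tftp_root/${name}old" ] || { echo "backup already exists" >&2; return 1; }

    if [ -d "$tftp_root/$name" ]; then
        mv "$tftp_root/$name" "$tftp_root/${name}old" || return 1
    fi
    mv "$new_tree" "$tftp_root/$name"
}

# Possible use (not run here; URL layout and paths are assumptions):
#   wget http://ftp.debian.org/debian/dists/jessie/main/installer-amd64/current/images/netboot/netboot.tar.gz
#   mkdir new-netboot && tar -xzf netboot.tar.gz -C new-netboot
#   swap_installer /srv/tftpboot new-netboot
```

Keeping the old tree around makes it easy to roll back if the new image turns out to be worse.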

carbon:/srv/mirrors/debian/dists/jessie/main/installer-amd64

drwxr-sr-x  3 mirror mirror 4096 Apr 22  2015 20150422
drwxr-sr-x  3 mirror mirror 4096 Jan 18 16:05 20150422+deb8u3
lrwxrwxrwx  1 mirror mirror   15 Jan 23 10:32 current -> 20150422+deb8u3

So I don't see any puppetization or central management in place for the tftp boot firmware files. The last time we had updates and I downloaded them, we just pushed them into place in the directory structure.

Is that still how we prefer to maintain these firmware/install images? (I don't see a different method or git files in place.)

It is: they live in the volatile dir on the puppetmaster. We have to rebuild that, then run puppet. Too late for me today, good night and good luck.

So Jaime pointed out palladium has them and indeed the tftp_server.pp includes the call:

class install_server::tftp_server {
   file { '/srv/tftpboot':
      # config files in the puppet repository,
      # larger files like binary images in volatile
      source       => [
          'puppet:///modules/install_server/tftpboot',
          # lint:ignore:puppet_url_without_modules
          'puppet:///volatile/tftpboot'
          # lint:endignore

Next up, how do those files come to live on palladium?

I would bet a beer (or an ice cream) that the copy in volatile is put there manually. You might ask anyone who set up a new installer (e.g. the jessie installer, I think faidon might have done the work on that).

I don't really understand the comment at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=767682#117. It claims that this is fixed since +deb8u3, while we're seeing the opposite: the "current" symlink on carbon has pointed to this image since the 15th of January, and we've seen installs breaking after that.

That we're seeing these right now might in fact be simply coincidental. The suggested fix from #767682 (passing -F to mkfs.ext[34] in src:partman-ext3) is not present in any Debian release (not even in current git, see http://anonscm.debian.org/viewvc/d-i/trunk/).

It is part of an Ubuntu-specific patch, though:
https://patches.ubuntu.com/p/partman-ext3/partman-ext3_84ubuntu2.patch

That would explain why using trusty works.

while we're seeing the opposite, the "current" symlink on carbon points to this image since the 15th of January

We boot from a custom netboot image; I do not know how or by whom it was created, but it is from October (while modules are loaded from the new installer). I think it is our fault.

I do not think #767682 is related (it is very old). However, that last comment, together with papaul's report, gives 2 independent confirmations that the new installer *netboot* works (and the custom one we are using doesn't), and our current installer hasn't been touched since October. It is more likely something like https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=809831 or a kernel mismatch between the loaded modules and the booted kernel (see in the logs how ext4/ext3 fails to load and mount).

See how the kernel complains in the syslogs when it loads the filesystem modules.
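
A rough way to check for the suspected mismatch from the installer shell is sketched below. The helper names has_modules_for and fs_registered are hypothetical, and the check is only a first approximation: a module tree with the right name can still contain modules built for a different kernel build.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of debian-installer.

# Succeeds when a module tree exists for the given kernel release.
#   $1 = kernel release (for example the output of `uname -r`)
#   $2 = modules root (default: /lib/modules)
has_modules_for() {
    [ -d "${2:-/lib/modules}/$1" ]
}

# Succeeds when the kernel has registered the given filesystem type.
#   $1 = filesystem name
#   $2 = file to read (default: /proc/filesystems)
fs_registered() {
    grep -qw -- "$1" "${2:-/proc/filesystems}"
}

# Possible use from the installer shell (not run here):
#   has_modules_for "$(uname -r)" || echo "no module tree for $(uname -r)"
#   fs_registered ext3 || echo "ext3 is not registered with the kernel"
#   dmesg | grep -i ext
```

If the filesystem type is missing from /proc/filesystems after the installer tried to load it, the kernel messages usually show why the module was rejected.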

I asked the original creator of the image, and he ran palladium:~faidon/update-netboot.sh to update it. About to test it.

Change 267681 merged by Jcrespo:
Testing db jessie installer problems on db2030

https://gerrit.wikimedia.org/r/267681

db2030 installed jessie successfully with the new image; resolved.