
debian-installer: partman doesn't allow lvm LVs to be reused when reimaging
Closed, Resolved, Public

Description

The d-i partitioner, partman, wipes all lvm VGs/LVs as part of initialization. You can tell it not to, with d-i partman-lvm/device_remove_lvm boolean false, but then auto-partitioning doesn't continue. This makes it impossible to re-use an existing lvm LV when doing automated partitioning.
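For reference, the preseed knob in question looks like this (a minimal illustrative fragment; the partman-auto line is a placeholder recipe, not our actual config):

```
# Tell partman-lvm not to wipe existing VGs/LVs during initialization...
d-i     partman-lvm/device_remove_lvm   boolean false
# ...but with this set, ordinary auto-partitioning (e.g. below) stalls
# instead of proceeding.
d-i     partman-auto/method             string lvm
```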

Event Timeline

colewhite triaged this task as Medium priority. May 6 2020, 4:48 PM

It is (probably) possible to work around this by not setting partman-auto/method and using partman/early_command to prepopulate the metadata that partman uses. It's not clear how much work is involved, though, beyond "some".

I think it would merit a PoC at least, so we can evaluate if it is 100% possible, so we can decide how to proceed further.
We need to involve the Infra Foundations team on this though cc @MoritzMuehlenhoff

After a lot more investigation, this is bordering on infeasible. It would require re-implementing a lot of partman in order to make it work.

Some notes:

  • partman is a series of shell scripts, roughly 11.5k lines across 130+ files.
  • It spawns a parted server, and interacts with it via RPC over 2 FIFOs.
  • The code is undocumented, and full of unspecified interdependencies.
  • partman/early_command runs before any real setup is done. Trying to manually trigger the setup parts we need without invoking parts we don't want is very complex. You need a fairly complete understanding of how everything fits together to make this work.
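To make the FIFO-RPC point concrete, here's a toy, self-contained sketch of the pattern partman uses. The command name and replies here are invented for illustration; the real protocol is implemented by parted_server and wrapped by open_dialog/read_line/close_dialog in /lib/partman/lib/base.sh.

```shell
#!/bin/sh
# Toy model of partman's IPC: a background "server" reads a command from
# one FIFO and answers on another; the "client" reads until a terminator.
dir=$(mktemp -d)
mkfifo "$dir/infifo" "$dir/outfifo"

# Mock server: answer one command, then send an "OK" terminator line.
(
  read -r cmd < "$dir/infifo"
  { echo "reply-to-$cmd"; echo "OK"; } > "$dir/outfifo"
) &

# Mock client: send a command, then collect reply lines up to the terminator.
echo "PARTITIONS" > "$dir/infifo"
reply=$(while read -r line; do
  [ "$line" = "OK" ] && break
  echo "$line"
done < "$dir/outfifo")

wait
rm -rf "$dir"
echo "$reply"
```

The real thing is considerably hairier: dialogs are nested, replies are multi-field lines parsed by read_line, and state is kept in a directory tree under /var/lib/partman.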

Here's as far as I've managed to get:

d-i     partman/early_command   string \
        . /lib/partman/lib/base.sh; \
        . /lib/partman/lib/recipes.sh; \
        . /lib/partman/lib/auto-shared.sh; \
        set -ex; \
        mkdir -p /var/lib/partman; \
        for i in /lib/partman/init.d/*; do echo $i | grep -q 'early_command$' || $i; done; \
        touch /var/lib/partman/initial_auto; \
        perform_recipe $(dev_to_partman $(debconf-get partman-auto/disk)) "" es

It causes d-i to hang at "Starting up the partitioner".

Sad news :-/
If that is the case, I don't think there is much point in going forward with this, especially if we'd need to patch partman in many places - and then maintain those patches.

I have a solution: a script that reuses the existing / (reformatted) and /srv (retained as-is) partitions by hooking into the partman internals.

#!/bin/sh

[ -e /tmp/reuse ] && exit 0
touch /tmp/reuse

#export DEBIAN_FRONTEND=noninteractive
. /lib/partman/lib/base.sh
. /lib/partman/lib/recipes.sh

log() {
    logger -t reuse "$@"
}

[ "$(debconf-get partman-auto/method)" = "reuse" ] || { log "Skipping, partman-auto/method != reuse"; exit 0; }

disk=$(debconf-get partman-auto/disk)
log "Disk: $disk"
dev="$DEVICES/$(echo $disk | sed 's:/:=:g')"
cd $dev || { log "ERROR: $disk doesn't exist"; exit 1; }

open_dialog PARTITIONS
while { read_line num id size type fs path name; [ "$id" ]; }; do
    case "$num" in
    1)
        log "====> Reuse / partition (with format)"
        [ "$fs" = "ext4" ] || { log "ERROR: expected fs ext4 for partition $num, got $fs instead"; break; }
        echo format > $id/method
        touch $id/format
        touch $id/use_filesystem
        echo "$fs" > $id/filesystem
        mkdir -p $id/options
        echo / > $id/mountpoint
        touch $id/existing
        touch $id/formatable
        ;;
    esac
done
close_dialog

cd "$DEVICES"/=dev=mapper=* || { log "ERROR: wrong number of lvm logical volumes found: $(echo "$DEVICES"/=dev=mapper=*)"; exit 1; }
log "====> Reuse /srv lvm partition (retain data)"
open_dialog PARTITIONS
read_line num id size type fs path name
[ "$fs" = "xfs" ] || { log "ERROR: expected fs xfs for partition $num, got $fs instead"; exit 1; }
echo keep > $id/method
touch $id/existing
echo "$fs" > $id/filesystem
echo /srv > $id/mountpoint
mkdir -p $id/options
touch $id/use_filesystem
close_dialog

log "Running update_all"
update_all
log "update_all result: $?"

Thanks for working on this! I'd need to read up on some of the Partman guts for a full review, but the general approach seems totally workable.

As for the question of how to include this script in the install environment: wget-ing it in partman/early_command is actually the sanest approach for now. With a PXE boot, most of d-i gets downloaded via udebs from the Debian mirrors, so patching partman packages locally isn't an alternative here. We do rebuild the images currently, but that only adds firmware needed by some NICs; it doesn't rebuild debian-installer itself.
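Concretely, the wget approach could look something like this in the preseed file (the URL, filename, and display.d priority number are placeholders for illustration, not our actual config):

```
d-i     partman/early_command   string \
        wget -O /tmp/reuse-parts.sh http://INSTALL-SERVER-PLACEHOLDER/reuse-parts.sh; \
        cp /tmp/reuse-parts.sh /lib/partman/display.d/35reuse-parts; \
        chmod +x /lib/partman/display.d/35reuse-parts
```

Installing it under /lib/partman/display.d/ means partman itself runs the hook at the right point, and the script's /tmp/reuse guard keeps it from running twice.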

That said, we're definitely not the only ones who need to reimage a Debian server while retaining /srv. One other option (in addition to starting with partman/early_command) would be to make the script a little more generic (e.g. adding a preseed option like partman/retain_partition=/srv that enables it), install it into /lib/partman/display.d/, and submit it as a bug report and merge request on salsa.debian.org. Once this lands in the next Debian release we can simply switch to it, and others can also use and enhance it.
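A rough sketch of what that generalization might look like (the option name partman/retain_partition is only the suggestion from above, and debconf-get exists only inside d-i, so this falls back to an environment variable for illustration):

```shell
#!/bin/sh
# Sketch: drive the reuse script from a preseed value instead of
# hard-coding the mountpoints to retain.

get_retained() {
  # Inside d-i, read the (hypothetical) preseed option; elsewhere, fall
  # back to an env var so the sketch can be exercised standalone.
  if command -v debconf-get >/dev/null 2>&1; then
    debconf-get partman/retain_partition
  else
    echo "${RETAIN_PARTITION:-}"
  fi
}

main() {
  retain=$(get_retained)
  [ -n "$retain" ] || { echo "no partitions to retain; normal wipe"; return 0; }
  for mp in $retain; do
    # Here the real script would locate $mp's partition via the partman
    # FIFO dialog and mark it with method=keep, as in the script above.
    echo "would retain $mp"
  done
}
main
```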

Change 601761 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] install_server: Allow reuse of partitions during reimage. [WIP]

https://gerrit.wikimedia.org/r/601761

Change 601761 merged by Kormat:
[operations/puppet@production] install_server: Allow reuse of partitions during reimage.

https://gerrit.wikimedia.org/r/601761

Change 603961 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] install_server: Test reuse-parts on metal.

https://gerrit.wikimedia.org/r/603961

Change 603961 merged by Kormat:
[operations/puppet@production] install_server: Test reuse-parts on metal.

https://gerrit.wikimedia.org/r/603961

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['sretest1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006091223_kormat_33763.log.

Completed auto-reimage of hosts:

['sretest1002.eqiad.wmnet']

Of which those FAILED:

['sretest1002.eqiad.wmnet']

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['sretest1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006091348_kormat_101863.log.

Completed auto-reimage of hosts:

['sretest1002.eqiad.wmnet']

Of which those FAILED:

['sretest1002.eqiad.wmnet']

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['sretest1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006091351_kormat_104069.log.

Completed auto-reimage of hosts:

['sretest1002.eqiad.wmnet']

and were ALL successful.

Change 604020 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] install_server: Fix issue with reuse-parts.cfg on multi-disk machines.

https://gerrit.wikimedia.org/r/604020

Change 604020 merged by Kormat:
[operations/puppet@production] install_server: Fix issue with reuse-parts.cfg on multi-disk machines.

https://gerrit.wikimedia.org/r/604020

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['sretest1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006091427_kormat_134282.log.

Completed auto-reimage of hosts:

['sretest1002.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['db1077.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006100751_kormat_18957.log.

Completed auto-reimage of hosts:

['db1077.eqiad.wmnet']

Of which those FAILED:

['db1077.eqiad.wmnet']

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['db1077.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006100812_kormat_35212.log.

Completed auto-reimage of hosts:

['db1077.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['db1077.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006100856_kormat_69412.log.

Completed auto-reimage of hosts:

['db1077.eqiad.wmnet']

and were ALL successful.

Change 604315 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] install_server: Fix reuse-db.cfg recipe

https://gerrit.wikimedia.org/r/604315

Change 604315 merged by Kormat:
[operations/puppet@production] install_server: Fix reuse-db.cfg recipe

https://gerrit.wikimedia.org/r/604315

Kormat closed this task as Resolved. Edited Jun 10 2020, 9:37 AM
Kormat claimed this task.

I'm pronouncing this Resolved \o/

I've successfully reimaged sretest1002 and db1077 using reuse-parts. We're now going to start using it more generally. I've opened T254982: reuse-parts.sh: provide feedback to user when something fails to follow up on, but this task can now be closed.

Anyone who wants to use reuse-parts: I'm very happy to provide guidance/debugging if something fails, etc. Just poke me on IRC, or file a task and assign it to me.

  • kormat (the partman guy now, apparently)

Change 604636 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] install_server: Revert d-i-test to original partition scheme.

https://gerrit.wikimedia.org/r/604636

Change 604636 merged by Kormat:
[operations/puppet@production] install_server: Revert d-i-test to original partition scheme.

https://gerrit.wikimedia.org/r/604636

Change 618718 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install: Prevent full wipe of dbprov2003 data by changing its recipe

https://gerrit.wikimedia.org/r/618718

Change 618718 merged by Jcrespo:
[operations/puppet@production] install: Prevent full wipe of dbprov2003 data by changing its recipe

https://gerrit.wikimedia.org/r/618718