LVM recipes broken for jessie, set up all remaining LVM space as swap
Closed, Resolved, Public

Description

copper:~$ free -m
             total       used       free     shared    buffers     cached
Mem:          7993       7442        550        160        810       5717
-/+ buffers/cache:        914       7078
Swap:       429123          0     429123
copper:~$ cat /proc/partitions 
major minor  #blocks  name

   8        0  488386584 sda
   8        1   48827392 sda1
   8        2  439557120 sda2
   8       16  488386584 sdb
   8       17   48827392 sdb1
   8       18  439557120 sdb2
   9        0   48794624 md0
   9        1  439426048 md1
 253        0  439422976 dm-0
copper:~$

Event Timeline

fgiunchedi set the priority of this task to Needs Triage.
fgiunchedi updated the task description. (Show Details)
fgiunchedi added a project: acl*sre-team.
fgiunchedi subscribed.
faidon renamed this task from wrong partitioning scheme for copper (500GB of swap) to raid1-lvm recipe broken for jessie, sets up available LVM space as swap. May 28 2015, 5:53 PM
faidon triaged this task as Unbreak Now! priority.
faidon set Security to None.
faidon subscribed.

This happened on a reinstall I did today as well, for baham. There was a swap LV, occupying all of the available space on the VG (420G).

This needs to be fixed ASAP and we also need to audit existing installs.

The following servers use the raid1-lvm partman recipe:

acamar|achernar|baham|cobalt|lead|lithium|polonium|rhodium
argon|bast4001|copper|neon|ruthenium|subra|suhail|titanium|ytterbium|zirconium
lvs[34]00*
labmon1001
wtp[1-2]0[0-2][0-9]|hafnium

size of swap in megabytes

achernar: 951
acamar: 951
baham: 0
cobalt: n/a - doesn't exist except mgmt
lead: 951
lithium: 951
polonium: 951
rhodium: 951
argon: 951
bast4001: 7628
copper: 429123 (!)
neon: 8119
ruthenium: 951
subra: 951
suhail: 951
titanium: 951
ytterbium: 951
zirconium: 951
lvs3001-3004: all 951
lvs4001-4001: all 951
labmon1001: 2047
wtp*: all 951, except wtp1002: 7627
hafnium: 951

So "copper" is affected. wtp1002, bast4001 and neon differ from the others, but only by having about 8G of swap.
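A check along these lines could repeat this kind of audit on each host. It is only a sketch: the function names and the 16384 MiB threshold are assumptions made up for illustration, not something from the recipes or from how the numbers above were gathered.

```shell
# Print SwapTotal in MiB, read from a meminfo-style file
# (defaults to /proc/meminfo, which reports the value in kB).
swap_mib() {
    awk '/^SwapTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed (exit 0) when the given swap size in MiB exceeds the limit.
# The 16384 MiB default is an arbitrary, hypothetical threshold.
swap_oversized() {
    [ "$1" -gt "${2:-16384}" ]
}
```

Run on a host as `swap_oversized "$(swap_mib)" && echo "$(hostname): swap looks too large"`. With that threshold, a host like copper would be flagged while the 951M and ~8G hosts would not.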

It's a bit strange. For example, I installed "subra" and "suhail" in codfw, so I know it's not that long ago, and they are OK. A git log on "raid1-lvm.cfg" shows the last change was in October 2013. So what else would have changed?

My suspicion is that this is jessie-related, as evident from the bug title. subra & suhail are both trusty. Do you have other cases with jessie hosts where this recipe (or some other LVM recipe) worked properly?

Ubuntu's partman-auto-lvm changelog for trusty mentions:

partman-auto-lvm (51ubuntu1) trusty; urgency=low

  * Resynchronise with Debian. Remaining changes:
    - Accept autopartitioning automatically.
    - Change fallback VG name to Ubuntu.
    - Ask how much of the VG should be used for logical volumes, rather than
      unconditionally using it all.
    - Set locale to "C" when calling vgs for available free space.

Note the "unconditionally using it all".

The cross-diff between 51 and 51ubuntu1 seems to include this:

-       if [ "$last" = yes ]; then
+       if [ "$last" = yes ] && [ "$use_all" ]; then
                vg_get_info "$VG_name"
                lv_create $VG_name "$lvname" $FREEPE || autopartitioning_failed
        else

$use_all is set by a check against $guided_size, whose value in turn depends on a couple of newly introduced settings for a minimum/maximum guided size.

This does not exist in Debian (I'm not sure if it was ever pushed upstream). Debian's code seems to expand the "last" partition to the full extent of the LVM VG, confirming our suspicions.
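To make the difference concrete, here is a toy model of the two conditions from the diff. It is not partman code: pick_lv_size and its arguments are hypothetical, and the "else" branch (fall back to the size the recipe asked for) is an assumption, since the diff above does not show it.

```shell
# Toy model only, not partman code.
# pick_lv_size FLAVOUR LAST USE_ALL FREE REQUESTED
#   FLAVOUR    debian | ubuntu (which condition from the diff applies)
#   LAST       "yes" if this is the last LV in the recipe
#   USE_ALL    non-empty if the whole VG should be used (ubuntu only)
#   FREE       space still free in the VG
#   REQUESTED  size the recipe asked for (same unit as FREE)
pick_lv_size() {
    flavour=$1; last=$2; use_all=$3; free=$4; requested=$5
    if [ "$flavour" = ubuntu ]; then
        # 51ubuntu1: the last LV only grows when use_all is set
        if [ "$last" = yes ] && [ "$use_all" ]; then
            echo "$free"
        else
            echo "$requested"   # assumed fallback, not shown in the diff
        fi
    else
        # Debian 51: the last LV always takes everything that is free
        if [ "$last" = yes ]; then
            echo "$free"
        else
            echo "$requested"   # assumed fallback, not shown in the diff
        fi
    fi
}
```

For example, `pick_lv_size debian yes "" 429123 951` prints 429123, while `pick_lv_size ubuntu yes "" 429123 951` prints 951. The numbers are borrowed from the swap sizes earlier in this task purely for illustration.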

We probably need to work around this by creating an empty stub LV at the end of our VGs. Note that this is not specific to raid1-lvm; it affects all LVM recipes.
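If the stub LV ends up absorbing the remaining space, that space presumably needs to be handed back once the installer is done. A rough sketch of that cleanup, assuming the stub is called "placeholder" (a made-up name) and that dropping it after install is actually what we want:

```shell
# Hypothetical name for the stub LV; override via the environment.
PLACEHOLDER_LV=${PLACEHOLDER_LV:-placeholder}

# Read "lv_name vg_name" pairs on stdin, in the shape printed by
# `lvs --noheadings -o lv_name,vg_name`, and print one lvremove
# command per stub LV. Nothing is executed here, only printed.
placeholder_cleanup_cmds() {
    awk -v name="$PLACEHOLDER_LV" \
        '$1 == name { print "lvremove -f " $2 "/" $1 }'
}
```

Usage would be `lvs --noheadings -o lv_name,vg_name | placeholder_cleanup_cmds`, reviewing the printed commands before piping them to sh.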

faidon renamed this task from raid1-lvm recipe broken for jessie, sets up available LVM space as swap to LVM recipes broken for jessie, set up all remaining LVM space as swap. May 29 2015, 10:45 AM

On the Debian side, similar issues are reported as #517935 or #385219, and the specific bug above is #515607.

Change 214608 had a related patch set uploaded (by Filippo Giunchedi):
install-server: add WMF5842 back as d-i-test

https://gerrit.wikimedia.org/r/214608

Picking this up; the solution will likely involve a fake LV.

Change 215643 had a related patch set uploaded (by Filippo Giunchedi):
move d-i-test to row C

https://gerrit.wikimedia.org/r/215643

Change 215643 merged by Filippo Giunchedi:
move d-i-test to row C

https://gerrit.wikimedia.org/r/215643

Change 215802 had a related patch set uploaded (by Filippo Giunchedi):
install-server: provision d-i-test as ganeti VM

https://gerrit.wikimedia.org/r/215802

Change 214608 abandoned by Filippo Giunchedi:
install-server: add WMF5842 back as d-i-test

Reason:
superseded by https://gerrit.wikimedia.org/r/#/c/215802/

https://gerrit.wikimedia.org/r/214608

Change 215802 merged by Filippo Giunchedi:
install-server: provision d-i-test as ganeti VM

https://gerrit.wikimedia.org/r/215802

Change 215806 had a related patch set uploaded (by Filippo Giunchedi):
install-server: create placeholder LV to work around partman-lvm bug

https://gerrit.wikimedia.org/r/215806

Change 215806 merged by Filippo Giunchedi:
install-server: create placeholder LV to work around partman-lvm bug

https://gerrit.wikimedia.org/r/215806

Tentatively resolved: I've tested the new recipes with jessie and trusty and didn't see any regressions on trusty.