Make enabling reimaging for db hosts more humane
Closed, Resolved (Public)

Description

Currently there's a case in modules/install_server/files/autoinstall/netboot.cfg which looks like this:

db[01][0-9][0-9]|dbstore100[3-5]|db2[01][0-9][0-9]|es101[1-9]|es201[1-9]|pc[12]00[7-9]|pc[12]010|labsdb1009|labsdb1010|labsdb1011|labsdb1012|dbprov[12]00[12]) echo partman/custom/no-srv-format.cfg ;; \

This matches ~all db hosts and makes reimaging fail by selecting a deliberately invalid partman recipe, which prevents /srv from being reformatted and causing data loss.

Reimaging a db server therefore requires updating that glob pattern to exclude the specific host. E.g. changing the above to exclude db2087:

db[01][0-9][0-9]|dbstore100[3-5]|db20[0-7][0-9]|db208[0-6]|db208[8-9]|db209[0-9]|db21[0-9][0-9]|es101[1-9]|es201[1-9]|pc[12]00[7-9]|pc[12]010|labsdb1009|labsdb1010|labsdb1011|labsdb1012|dbprov[12]00[12]) echo partman/custom/no-srv-format.cfg ;; \

This is very error-prone, as the resulting pattern is hard to write and hard to review.
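To illustrate how much checking such a split pattern needs, here is a minimal sketch that compares the original db2xxx alternative against the hand-split one for every host number from 2000 to 2199. The function names (matches_original, matches_split, diff_hosts) are made up for this example and are not part of netboot.cfg; only the glob patterns are taken from the lines above.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; only the patterns come from the task.

# Succeeds if the host matches the original db2xxx alternative.
matches_original() {
    case "$1" in
        db2[01][0-9][0-9]) return 0 ;;
    esac
    return 1
}

# Succeeds if the host matches the hand-split alternatives meant to skip db2087.
matches_split() {
    case "$1" in
        db20[0-7][0-9]|db208[0-6]|db208[8-9]|db209[0-9]|db21[0-9][0-9]) return 0 ;;
    esac
    return 1
}

# Prints every host in db2000..db2199 on which the two patterns disagree.
diff_hosts() {
    n=2000
    while [ "$n" -le 2199 ]; do
        host="db$n"
        if matches_original "$host" && ! matches_split "$host"; then
            echo "$host"
        elif ! matches_original "$host" && matches_split "$host"; then
            echo "$host"
        fi
        n=$((n + 1))
    done
}

diff_hosts
```

If the split was done correctly, the only line printed is db2087. Any additional host in the output would be one that silently lost its protection.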

In a better world, it would be possible to specify explicitly what host is allowed to be reimaged, and have all db hosts matched with no-srv-format.cfg as a fallback. As the file appears to be a large bash case statement, in theory it should be possible to just add a simple case above the all-db-host case like this:

db2087) ;; \
db[01][0-9][0-9]|dbstore100[3-5]|db2[01][0-9][0-9]|es101[1-9]|es201[1-9]|pc[12]00[7-9]|pc[12]010|labsdb1009|labsdb1010|labsdb1011|labsdb1012|dbprov[12]00[12]) echo partman/custom/no-srv-format.cfg ;; \

That should then match only the desired host, and make it very easy to see what's happening (and to revert it later by hand if necessary).
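The idea relies on a shell case statement running only the first arm whose pattern matches. A minimal sketch of that behaviour follows, assuming the logic can be exercised in a standalone function. The name recipe_for is hypothetical, the pattern list is trimmed to the db2xxx alternative for brevity, and the trailing backslashes from netboot.cfg are omitted because the sketch is not embedded in a larger quoted command.

```shell
#!/bin/sh
# Hypothetical wrapper for illustration; not how netboot.cfg is structured.
# Prints the partman recipe selected for the hostname given as $1.
recipe_for() {
    case "$1" in
        # Explicit exception: matches first and selects no recipe here.
        db2087) ;;
        # Fallback for the remaining db2xxx hosts: the deliberately invalid recipe.
        db2[01][0-9][0-9]) echo partman/custom/no-srv-format.cfg ;;
    esac
}

recipe_for db2087   # prints nothing
recipe_for db2088   # prints partman/custom/no-srv-format.cfg
```

Because the db2087 arm comes first, the broader pattern below it is never evaluated for that host. Reverting is a matter of deleting that one line.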

This needs testing to ensure that it works. @Marostegui : can you suggest a test host i can try this with?

Event Timeline

If you accept some input, making partman/custom/no-srv-format.cfg a recipe that works but doesn't touch the /srv lvm partition would solve most of our problems (combined with the dynamic flag at T251416). That way, it wouldn't need to change except on first reimage/first setup.

You can test partman easily on a ganeti vm.

@jcrespo : i'm happy to work on that, but i'd like to do the proposed change in this task first. partman is voodoo anytime i've touched it, so it will take some time and some care to change the partman recipe, and in the meantime it'll at least make the process a little less of a minefield.

I see now, sorry, I didn't understand the proposed scope of work the first time I read it. +10000 from me.

Mentioned in SAL (#wikimedia-operations) [2020-04-30T09:47:17Z] <kormat> reimaging db1077 for testing purposes T251392

Change 593471 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] install_server: Allow reimage of db1077

https://gerrit.wikimedia.org/r/593471

Change 593471 merged by Kormat:
[operations/puppet@production] install_server: Allow reimage of db1077

https://gerrit.wikimedia.org/r/593471

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['db1077.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202004301143_kormat_239693.log.

Completed auto-reimage of hosts:

['db1077.eqiad.wmnet']

and were ALL successful.

Success: using db1077) ;; \ in netboot.cfg achieved the (short-term) goal of allowing us to use manual partitioning.

Change 593490 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] Revert "install_server: Allow reimage of db1077"

https://gerrit.wikimedia.org/r/593490

Change 593490 merged by Kormat:
[operations/puppet@production] Revert "install_server: Allow reimage of db1077"

https://gerrit.wikimedia.org/r/593490

Kormat claimed this task.

Closing this, and opened T251768 to cover fixing the partman recipe.