
Upgrade AQS to Debian Stretch
Closed, Resolved | Public | 8 Estimated Story Points

Description

Most of the groundwork has been done in https://phabricator.wikimedia.org/T195741, but the idea is:

  • test the new Cassandra package (either upstream 2.2.12/13 or our own patched 2.2.6)
  • reimage one host at a time to stretch, ensuring that the /srv/ partitions are not touched (possibly by configuring the d-i partitioner manually).

At the moment Restbase hosts are still on Jessie, and only maps-test is testing Cassandra on stretch.

Event Timeline

Update after a chat with Eric: in theory, if we keep the current RAID 10 configuration on each node and format only the root partition, we should be able to simply start Cassandra once the host boots, with nothing else to do.

If the partitions are wiped by accident, the following procedure needs to be followed:

http://cassandra.apache.org/doc/latest/operating/topo_changes.html#replacing-a-dead-node

The note about hints in the link should not be a concern, since we bulk load once every hour: if nothing is written during that period, we should be extra safe. Disabling the load jobs from Hue before starting would therefore be a good, if paranoid, precaution.
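
The decision described above (start Cassandra directly if the data partitions survived the reimage, otherwise fall back to the replace-a-dead-node procedure) can be sketched as a small shell check. This is only an illustration under stated assumptions: the function names `srv_data_intact` and `next_step` are hypothetical, and the directory to check is passed in by the caller rather than taken from the actual AQS layout.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if the given directory exists and
# contains at least one entry, i.e. the data partition was not wiped.
srv_data_intact() {
    local dir="$1"
    [ -d "$dir" ] || return 1
    # ls -A lists everything except . and ..; empty output means an empty dir.
    [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Hypothetical wrapper: prints which of the two paths to follow.
next_step() {
    if srv_data_intact "$1"; then
        echo "start-cassandra"
    else
        echo "replace-dead-node"
    fi
}
```

For example, `next_step /srv` would print `start-cassandra` only when /srv still holds data after the reimage. A non-empty directory does not prove the data is complete, so this is a sanity check rather than a guarantee.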

Vvjjkkii renamed this task from Upgrade AQS to Debian Stretch to 1ubaaaaaaa. Jul 1 2018, 1:06 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.

Change 443421 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] netboot.cfg: temp remove aqs hosts to allow manual work during d-i

https://gerrit.wikimedia.org/r/443421

Change 443421 merged by Elukey:
[operations/puppet@production] netboot.cfg: temp remove aqs hosts to allow manual work during d-i

https://gerrit.wikimedia.org/r/443421

Change 443422 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set aqs* PXE boot to Debian Stretch

https://gerrit.wikimedia.org/r/443422

Change 443422 merged by Elukey:
[operations/puppet@production] Set aqs* PXE boot to Debian Stretch

https://gerrit.wikimedia.org/r/443422

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['aqs1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201807021342_elukey_19697.log.

The first reimage returned this warning:

14:06:48 | aqs1004.eqiad.wmnet | WARNING: unable to verify that BIOS boot parameters are back to normal, got:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0004000000
 Boot Flags :
   - Boot Flag Invalid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : Force PXE
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

The important line is Boot Device Selector : Force PXE. For some reason, the PXE override can be left set after a reimage. I used the following commands to restore sane defaults and then check the result:

elukey@neodymium:~$ sudo ipmitool -I lanplus -H "aqs1004.mgmt.eqiad.wmnet" -U root -E chassis bootdev none
[..]
Set Boot Device to none


elukey@neodymium:~$ sudo ipmitool -I lanplus -H "aqs1004.mgmt.eqiad.wmnet" -U root -E chassis bootparam get 5
[..]
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 8000000000
 Boot Flags :
   - Boot Flag Valid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : No override
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST
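
The manual check above can be scripted by parsing the `chassis bootparam get 5` output. The sketch below assumes the output format shown in this task; the function names `boot_device_selector` and `boot_override_cleared` are hypothetical, and both read the text from standard input so they can be tried without access to a management interface.

```shell
#!/bin/bash
# Hypothetical helper: reads `ipmitool ... chassis bootparam get 5` output
# on stdin and prints the value of the "Boot Device Selector" flag.
boot_device_selector() {
    sed -n 's/^ *- Boot Device Selector : //p'
}

# Hypothetical helper: succeeds only if no boot device override is set.
boot_override_cleared() {
    [ "$(boot_device_selector)" = "No override" ]
}
```

The output of the second ipmitool command above could be piped into `boot_override_cleared` to confirm that the Force PXE override is gone before moving on to the next host.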

Completed auto-reimage of hosts:

['aqs1004.eqiad.wmnet']

and were ALL successful.

CommunityTechBot lowered the priority of this task from High to Medium. Jul 3 2018, 3:25 AM

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['aqs1005.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201807030740_elukey_17200.log.

Completed auto-reimage of hosts:

['aqs1005.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['aqs1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201807030910_elukey_3193.log.

Completed auto-reimage of hosts:

['aqs1006.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['aqs1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201807031139_elukey_11937.log.

Completed auto-reimage of hosts:

['aqs1007.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['aqs1008.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201807040618_elukey_21239.log.

Completed auto-reimage of hosts:

['aqs1008.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['aqs1009.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201807040917_elukey_7569.log.

Completed auto-reimage of hosts:

['aqs1009.eqiad.wmnet']

and were ALL successful.

elukey set the point value for this task to 8.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.