
Replace db2044 with db2063
Closed, Declined · Public

Description

db2044 is currently the m2 codfw master. This host has a broken disk and has had many disk failures in the past, so it will be decommissioned.
Let's replace it with db2063.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Wed, Aug 14, 7:02 AM
Marostegui triaged this task as Normal priority. · Wed, Aug 14, 7:03 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Change 530034 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw,db-eqiad.php: Remove db2063 from config

https://gerrit.wikimedia.org/r/530034

Change 530034 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw,db-eqiad.php: Remove db2063 from config

https://gerrit.wikimedia.org/r/530034

Mentioned in SAL (#wikimedia-operations) [2019-08-14T07:08:03Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove db2063 from config T230459 (duration: 00m 48s)

Mentioned in SAL (#wikimedia-operations) [2019-08-14T07:09:05Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Remove db2063 from config T230459 (duration: 00m 47s)

Change 530035 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Move db2063 from s2 to m2

https://gerrit.wikimedia.org/r/530035

Change 530035 merged by Marostegui:
[operations/puppet@production] mariadb: Move db2063 from s2 to m2

https://gerrit.wikimedia.org/r/530035

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db2063.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201908140717_marostegui_75740.log.

Completed auto-reimage of hosts:

['db2063.codfw.wmnet']

Of which those FAILED:

['db2063.codfw.wmnet']
Marostegui added a subscriber: Papaul. · Edited · Wed, Aug 14, 8:49 AM

I have been trying to PXE boot this host, but it has been impossible.
Even though I have manually set PXE as the boot device with ipmitool locally, it is still not working:

root@db2063:~# ipmitool  chassis bootparam get 5
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0004000000
 Boot Flags :
   - Boot Flag Invalid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : Force PXE
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

After a reboot it keeps booting from disk despite the above options.
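
A hedged note on the output above: as far as the IPMI boot options parameter 5 layout goes, the first data byte carries the "boot flags valid" bit (bit 7), the persistent bit (bit 6) and the EFI bit (bit 5), and bits 5:2 of the second byte select the boot device (1 meaning Force PXE). The sketch below decodes the `Boot parameter data` string under that assumption. `decode_boot_flags` is a hypothetical helper written for illustration, not part of ipmitool.

```shell
# Hypothetical helper: decode the first two bytes of the "Boot parameter data"
# hex string printed by `ipmitool chassis bootparam get 5`.
# Assumed layout (IPMI boot flags, parameter 5):
#   byte 1: bit 7 = flags valid, bit 6 = persistent, bit 5 = EFI boot
#   byte 2: bits 5:2 = boot device selector (1 = Force PXE)
decode_boot_flags() (
    data=$1
    # Take the first and second hex byte and convert them to integers.
    b1=$(( 0x$(printf '%s' "$data" | cut -c1-2) ))
    b2=$(( 0x$(printf '%s' "$data" | cut -c3-4) ))
    valid=$(( (b1 >> 7) & 1 ))
    persistent=$(( (b1 >> 6) & 1 ))
    efi=$(( (b1 >> 5) & 1 ))
    device=$(( (b2 >> 2) & 15 ))
    echo "valid=${valid} persistent=${persistent} efi=${efi} device=${device}"
)

decode_boot_flags 0004000000
# prints: valid=0 persistent=0 efi=0 device=1
```

For the value shown above, the decoded fields match the "Boot Flag Invalid" and "Force PXE" lines. If this reading of the layout is right, the cleared valid bit may be related to the PXE request not being honoured, although that is only a guess from the flag values.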

ipmitool also fails when run remotely, with:

Error: Unable to establish IPMI v2 / RMCP+ session

I have followed all the steps at https://wikitech.wikimedia.org/wiki/Management_Interfaces, including re-seating the card from the mgmt interface, without any luck.
I also tried to jump into the boot menu while the host boots, but it doesn't get into it and continues to boot from disk.

@Papaul could you manually reset the idrac by powering the host down and doing a power drain (https://wikitech.wikimedia.org/wiki/Management_Interfaces#Power_drain_the_host), and then upgrade the idrac's firmware, to see if I can manage to install the host afterwards?

Thanks

Forgot to mention that this host is not in use and is downtimed, so this onsite maintenance can be done anytime without a heads-up to the DBAs.

ssh issue

papaul@papaulpc:~$ ssh root@db2063.mgmt.codfw.wmnet
Unable to negotiate with UNKNOWN port 65535: no matching cipher found. Their offer: aes256-cbc,aes128-cbc,3des-cbc

I had to run the ssh command with -c aes256-cbc to access mgmt:

papaul@papaulpc:~$ ssh -c aes256-cbc root@db2063.mgmt.codfw.wmnet
root@db2063.mgmt.codfw.wmnet's password:
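
As a sketch of the workaround above: the error message lists the ciphers the mgmt interface offers, and in reasonably recent OpenSSH clients a leading `+` in the `Ciphers` option appends to the client's default list instead of replacing it. The helper below (`cipher_option_from_error`, a hypothetical name) turns such an error line into an option that could be passed to ssh. It only rewrites text and does not contact any host.

```shell
# Hypothetical helper: read ssh's "no matching cipher found" error on stdin
# and print an ssh option that appends the offered ciphers to the defaults.
# Lines that do not contain the error produce no output.
cipher_option_from_error() {
    sed -n 's/.*no matching cipher found\. Their offer: \(.*\)$/-o Ciphers=+\1/p'
}

# Example with the error line from this task:
echo 'Unable to negotiate with UNKNOWN port 65535: no matching cipher found. Their offer: aes256-cbc,aes128-cbc,3des-cbc' \
    | cipher_option_from_error
# prints: -o Ciphers=+aes256-cbc,aes128-cbc,3des-cbc
```

Passing a single cipher with `-c aes256-cbc`, as done above, is the narrower choice. Appending the whole offer is just one alternative and re-enables older CBC ciphers for that connection.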

After the upgrade and resetting the ILO I was able to access the mgmt without the -c aes256-cbc.

I ran the same test on 6 other db servers of the same generation (db206[124567]) and I am getting the same ssh error on them.
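
A minimal sketch of how that check across hosts could be scripted. The mgmt hostnames are an assumption, built by analogy with db2063.mgmt.codfw.wmnet, and both function names are hypothetical. The ssh attempt is expected to fail either way (no credentials are supplied); only the negotiation error is inspected.

```shell
# Hypothetical helper: expand the db206[124567] shorthand into mgmt hostnames.
# The naming pattern is assumed from db2063.mgmt.codfw.wmnet.
mgmt_hosts() {
    for n in 1 2 4 5 6 7; do
        printf 'db206%s.mgmt.codfw.wmnet\n' "$n"
    done
}

# Report hosts whose mgmt interface rejects the client's default ciphers.
# BatchMode avoids password prompts; ConnectTimeout bounds each attempt.
check_cipher_errors() {
    for host in $(mgmt_hosts); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@${host}" true 2>&1 \
            | grep -q 'no matching cipher found'; then
            echo "${host}: cipher mismatch"
        fi
    done
}
```

Running `check_cipher_errors` from a machine that can reach the mgmt network would list the affected hosts; hosts that negotiate successfully or time out are not printed.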

Not sure what the status of this is, considering T228258 exists. db2063 mysql is down, but I ain't touching it, just to avoid breaking something.

Marostegui closed this task as Declined. · Mon, Aug 19, 5:28 AM

This host is still failing: the idrac still does not work.
I think I will just decommission this one and pick another one; no need to waste more time on it.