
BBU problems dbstore2002
Closed, Duplicate · Public

Description

On dbstore2002 we have a failing battery in the RAID controller, which leads to degraded performance:

Cache Board Present: True
   Cache Status: Permanently Disabled
   Cache Status Details: Cache disabled; battery/capacitor failed to charge to an acceptable level
   Cache Ratio: 10% Read / 90% Write
   Drive Write Cache: Disabled

The host's support has expired, so we need to take a replacement battery from a decommissioned host. db2064 would do, but I need @Papaul to confirm that the cache batteries are compatible.
We also have another failing battery in db2042 (T202051#4541285), so we should discuss what to do with one possible spare battery and two failing hosts.
Maybe we have other compatible decommissioned hardware, not from db hosts?

Event Timeline

Banyek created this task. Sep 24 2018, 9:45 AM
Restricted Application added a subscriber: Aklapper. Sep 24 2018, 9:45 AM
Marostegui renamed this task from BBU problems dbstore2002 & db2042 to BBU problems dbstore2002. Sep 24 2018, 9:46 AM

If the BBUs are compatible, maybe we can:

  • Use db2064's BBU for dbstore2002.
  • Have the new x1 host that has been ordered (T199501#4603837) replace db2033 or db2069 in x1.
  • Move either db2033 or db2069 to become the new m3 master.
  • Decommission db2042.
Banyek moved this task from Triage to In progress on the DBA board. Sep 24 2018, 9:51 AM

Both dbstore2002 and db2064 have an HP Smart Array P420i controller.

Thanks @Papaul!
So my proposal is to do: T205257#4610104
@jcrespo @Banyek thoughts on that?

Banyek added a comment. Oct 1 2018, 8:07 AM

I like this.

@Papaul I'd like to coordinate the BBU change from db2064 with you.
The host is an active backup host, so it wouldn't be a good idea to work on it during a backup. If you pick a time window for the BBU change, we can either disable the backup or I can tell you when it is safe to start.
It's up to you.

Papaul added a comment. Oct 1 2018, 2:52 PM

You can power the server off tomorrow at 10:00 am CDT

Banyek added a comment. Oct 1 2018, 3:05 PM

@Papaul That works, thank you.
The backups normally start at the same time, but I'll delay the next run by 2 hours, so you'll have plenty of time for this.

Banyek moved this task from Backlog to In progress on the User-Banyek board. Oct 1 2018, 9:38 PM

Mentioned in SAL (#wikimedia-operations) [2018-10-02T14:11:46Z] <banyek> powering off dbstore2002.codfw.wmnet for BBU change (T205257)

Banyek added a comment. Oct 2 2018, 3:01 PM

@Papaul I stopped the machine, so you can work on it.

By the way, I was not able to access the idrac console, so maybe you could take a look at that too:

banyek ~  $  ssh dbstore2002.mgmt.codfw.wmnet -lroot
Unable to negotiate with UNKNOWN port 65535: no matching cipher found. Their offer: aes256-cbc,aes128-cbc,3des-cbc
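
The error above says the client and the management interface share no cipher: the interface offers only legacy CBC ciphers that recent OpenSSH clients leave disabled by default. As a rough sketch (not something used in this task), the offer can be read out of the error line and turned into an `ssh` invocation that re-enables one of those ciphers through the `Ciphers=+…` client option. The helper names `offered_ciphers` and `ssh_command` are hypothetical, and this assumes the local OpenSSH build still ships the chosen cipher.

```python
import re
import shlex


def offered_ciphers(error_line: str) -> list[str]:
    """Return the ciphers listed after 'Their offer:' in an OpenSSH error line."""
    match = re.search(r"Their offer:\s*(\S+)", error_line)
    return match.group(1).split(",") if match else []


def ssh_command(host: str, user: str, error_line: str) -> list[str]:
    """Build an ssh argv that adds the first offered cipher to the client's defaults."""
    ciphers = offered_ciphers(error_line)
    if not ciphers:
        raise ValueError("no cipher offer found in error line")
    # Assumption: the first offered cipher is acceptable for this one connection.
    return ["ssh", "-o", f"Ciphers=+{ciphers[0]}", "-l", user, host]


error = (
    "Unable to negotiate with UNKNOWN port 65535: no matching cipher found. "
    "Their offer: aes256-cbc,aes128-cbc,3des-cbc"
)
print(shlex.join(ssh_command("dbstore2002.mgmt.codfw.wmnet", "root", error)))
```

Passing the option on the command line keeps the legacy cipher scoped to this one connection rather than enabling it for every host.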
Papaul added a comment. Oct 2 2018, 3:03 PM

@Banyek I'm having a problem with my irssi server, so I cannot connect to IRC. I'm rebooting the server now and will ping you when I get on IRC.

@Papaul I just saw your IRC ping, but I'm not next to my computer. Please talk to @Banyek, who is coordinating this.

Thanks!

Papaul added a comment. Oct 2 2018, 5:09 PM

Replacing the server's BBU with the one from db2064 didn't fix the problem. I had to put the original BBU back in the server, and after doing that the error went away. This doesn't make sense to me. Can we please leave this task open for the rest of the week?

I also upgraded all the firmware on the server.

Mentioned in SAL (#wikimedia-operations) [2018-10-02T18:24:54Z] <jynus> restarting ferm on dbstore2002 T205257

This is not the first time I have seen a BBU behave like this: after a reboot or a power drain, the error clears for a while before the battery fails again.
Sometimes that lasts only a few hours, other times a few weeks.

The host is back in replication, and the backups have been re-enabled.

@Papaul if you want to test if the spare BBU works on another host, we can test it on db2042 (T202051) to see if the server boots up fine or has the same issue as dbstore2002.
If you want to do that test, let me know.

Marostegui triaged this task as Normal priority. Oct 3 2018, 5:41 AM
Banyek moved this task from In progress to Done on the DBA board. Oct 3 2018, 12:25 PM
Papaul added a comment. Oct 3 2018, 3:09 PM

@Marostegui that's okay with me

@Papaul great - when do you want to schedule that test?

Papaul added a comment. Oct 3 2018, 4:10 PM

@Marostegui Tomorrow 10am CDT

Better to schedule it for some other day. Tomorrow we have to support the network maintenance, which starts one hour later, and we will certainly need to do pre-maintenance work :-(
Let me know which other day would work for you!
Keep in mind that we have the failover on the 10th, so maybe next Thursday at the same time?

Banyek moved this task from In progress to Blocked on the User-Banyek board. Oct 3 2018, 7:58 PM
Papaul added a comment. Oct 4 2018, 2:07 PM

@Marostegui yes next Thursday works for me.

@Papaul let's move this to some other day. Thursday the 11th is right after the failover, and we might have some cleanup to do. Moreover, the following day is a public holiday here, so if something goes wrong I wouldn't be able to fix it on Friday.
As we are not in a rush with this, let's schedule it at some other time.

Banyek claimed this task. Oct 9 2018, 8:32 AM

@Papaul or you can coordinate with me, I'll be here all week.

Marostegui moved this task from Done to In progress on the DBA board. Oct 9 2018, 1:01 PM
Marostegui closed this task as Resolved. Oct 15 2018, 5:28 AM
Marostegui reassigned this task from Banyek to Papaul.

This is no longer about dbstore2002 but about db2042, so let's follow up on that task: T202051
dbstore2002 is good for now, so let's close this and re-open if necessary:

root@dbstore2002:~# hpssacli ctrl all show status

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK
jcrespo reopened this task as Open. Jan 10 2019, 11:56 AM

Failing again, acking on icinga, reopening to not forget about it.

Marostegui removed Papaul as the assignee of this task. Jan 14 2019, 5:01 PM

Nothing for Papaul to do here for now.

root@dbstore2002:~# hpssacli ctrl all show status

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Recharging
Cache Board Present: True
Cache Status: Temporarily Disabled
Cache Status Details: Cache disabled; battery/capacitor charge is low.
Cache Ratio: 10% Read / 90% Write
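
Since this failure comes and goes, comparing the status output by eye is easy to get wrong. A minimal sketch of how the `Key: Value` lines pasted in this task could be checked automatically is below. It assumes the output keeps the layout shown here; `parse_status` and `cache_problems` are hypothetical helper names, and this is not the actual Icinga check.

```python
def parse_status(output: str) -> dict[str, str]:
    """Collect 'Key: Value' lines from hpssacli-style output into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def cache_problems(output: str) -> list[str]:
    """Return every status field that is present but not reported as OK."""
    fields = parse_status(output)
    problems = []
    for key in ("Controller Status", "Cache Status", "Battery/Capacitor Status"):
        value = fields.get(key)
        if value is not None and value != "OK":
            problems.append(f"{key}: {value}")
    return problems


# Sample inputs copied from the two status outputs in this task.
healthy = """Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK"""

degraded = """Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Recharging"""

print(cache_problems(healthy))
print(cache_problems(degraded))
```

Fields that are missing from the output are skipped rather than reported, so the sketch only flags states the controller actually printed.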

Hopefully this host will be decommissioned in favour of the new dbprov2001 and dbprov2002 (T218336: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts).