BBU problems dbstore2002
Closed, Duplicate | Public

Description

On dbstore2002 we have a failing battery in the RAID controller, which leads to degraded performance:

Cache Board Present: True
   Cache Status: Permanently Disabled
   Cache Status Details: Cache disabled; battery/capacitor failed to charge to an acceptable level
   Cache Ratio: 10% Read / 90% Write
   Drive Write Cache: Disabled

The host's support has expired, so we need to take a replacement battery from a decommissioned host. db2064 would do, but I need @Papaul to confirm that the cache batteries are compatible.
We also have another failing battery in db2042 (T202051#4541285), so we should discuss what to do with one possible spare battery and two failing hosts.
Maybe we have other compatible decommissioned hardware that is not a db host?

Event Timeline

Marostegui renamed this task from BBU problems dbstore2002 & db2042 to BBU problems dbstore2002. Sep 24 2018, 9:46 AM

If the BBUs are compatible, maybe we can:

  • Use db2064's BBU for dbstore2002.
  • Let the new x1 host that has been ordered (T199501#4603837) replace db2033 or db2069 in x1.
  • Move either db2033 or db2069 to become the new m3 master.
  • Decommission db2042.

Both dbstore2002 and db2064 have an HP Smart Array P420i controller.

@Papaul I'd like to coordinate the BBU change from db2064 with you.
The host is an active backup host, so it is not a good idea to work on it during a backup. If you pick a good time window for the BBU change, we can either disable the backup or I can tell you when it is safe to proceed.
It's up to you.

You can power the server off tomorrow at 10:00 am CDT

@Papaul That works, thank you.
The backups normally start at the same time, but I'll push the next run back by 2 hours, so you'll have plenty of time for this.

Mentioned in SAL (#wikimedia-operations) [2018-10-02T14:11:46Z] <banyek> powering off dbstore2002.codfw.wmnet for BBU change (T205257)

@Papaul I stopped the machine; you can work on it.

Btw, I was not able to access the idrac console, so maybe you could take a look at that too:

banyek ~  $  ssh dbstore2002.mgmt.codfw.wmnet -lroot
Unable to negotiate with UNKNOWN port 65535: no matching cipher found. Their offer: aes256-cbc,aes128-cbc,3des-cbc
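The error above means the management interface only offers CBC ciphers, which recent OpenSSH clients no longer propose by default. A minimal sketch of a possible workaround, assuming the local OpenSSH client still ships those ciphers: pull the offered list out of the error line and re-enable one of them for this single connection via the `Ciphers=+...` client option. The helper names `offered_ciphers` and `ssh_command` are hypothetical and only for illustration.

```python
import re
import shlex


def offered_ciphers(error_line):
    """Return the ciphers listed after 'Their offer:' in an OpenSSH error line."""
    match = re.search(r"no matching cipher found\. Their offer: (\S+)", error_line)
    if match is None:
        return []
    return match.group(1).split(",")


def ssh_command(host, user, ciphers):
    """Build an ssh argument list that additionally allows the first offered cipher."""
    if not ciphers:
        raise ValueError("no ciphers offered by the remote side")
    # 'Ciphers=+name' appends the cipher to the client's default set.
    return ["ssh", "-o", "Ciphers=+" + ciphers[0], "-l", user, host]


error = (
    "Unable to negotiate with UNKNOWN port 65535: no matching cipher found. "
    "Their offer: aes256-cbc,aes128-cbc,3des-cbc"
)
command = ssh_command("dbstore2002.mgmt.codfw.wmnet", "root", offered_ciphers(error))
print(shlex.join(command))
```

This only builds and prints the command; whether the connection then succeeds depends on the local client build and on the management firmware.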

@Banyek I'm having a problem with my irssi server, so I cannot connect to IRC. I'm rebooting my server now and will ping you when I get on IRC.

@Papaul I just saw your irc ping but I'm not next to my computer. Please talk to @Banyek who is coordinating this.

Thanks!

Replacing the server's BBU with the one from db2064 didn't fix the problem. I had to put the original BBU back in the server, and after doing that the error went away. This doesn't make sense to me. Can we please leave this task open for the rest of the week?

I also upgraded all the firmware on the server.

Mentioned in SAL (#wikimedia-operations) [2018-10-02T18:24:54Z] <jynus> restarting ferm on dbstore2002 T205257

This is not the first time I have seen a BBU behave like this: after a reboot or a power drain, the error clears for a while before the BBU fails again.
Sometimes that lasts only a few hours, other times a few weeks.

The host is back in replication, and the backups have been re-enabled.

@Papaul if you want to test if the spare BBU works on another host, we can test it on db2042 (T202051) to see if the server boots up fine or has the same issue as dbstore2002.
If you want to do that test, let me know.

@Papaul great - when do you want to schedule that test?

Better to schedule it for some other day. Tomorrow we have to support the network maintenance, which will start one hour later, but we will certainly need to do pre-maintenance work :-(
Let me know which other day would work for you!
Keep in mind that we have the failover on the 10th, so maybe next Thursday at the same time?

@Marostegui yes next Thursday works for me.

@Papaul let's move this to some other day. Thursday the 11th is right after the failover and we might have some cleanup to do. Moreover, the following day is a public holiday here, so if something goes wrong I wouldn't be able to fix it on Friday.
As we are not in a rush with this, let's schedule it at some other time.

@Papaul or you can coordinate with me, I'll be here all week

Marostegui reassigned this task from Banyek to Papaul.

This is no longer about dbstore2002 but about db2042, so let's follow up on that task: T202051
dbstore2002 is good for now, so let's close this and reopen if necessary:

root@dbstore2002:~# hpssacli ctrl all show status

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK

Failing again. Acking on Icinga and reopening so we don't forget about it.

Nothing for Papaul to do here for now.

root@dbstore2002:~# hpssacli ctrl all show status

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Recharging
Cache Board Present: True
Cache Status: Temporarily Disabled
Cache Status Details: Cache disabled; battery/capacitor charge is low.
Cache Ratio: 10% Read / 90% Write
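Since the BBU state flips between OK and failing over days or weeks, it can help to check the status output mechanically rather than by eye. A minimal sketch, assuming the `hpssacli ctrl all show status` output has been captured as text in the format shown in this task; the function names `parse_status` and `unhealthy` are hypothetical, and with more than one controller the later entries would overwrite the earlier ones.

```python
SAMPLE = """\
Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Recharging
"""

# Fields that are expected to read "OK" on a healthy controller.
CHECKED_FIELDS = ("Controller Status", "Cache Status", "Battery/Capacitor Status")


def parse_status(output):
    """Turn 'Key: Value' lines into a dict; lines without ': ' are ignored."""
    status = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:
            status[key] = value
    return status


def unhealthy(status, fields=CHECKED_FIELDS):
    """Return the checked fields whose value is anything other than 'OK'."""
    return {name: status[name] for name in fields if name in status and status[name] != "OK"}


problems = unhealthy(parse_status(SAMPLE))
for name, value in problems.items():
    print(name + ": " + value)
```

On the sample above this reports the cache and battery fields and leaves out the controller status, which is OK.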

Hopefully this host will be decommissioned in favour of the new dbprov2001 and dbprov2002 (T218336: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts).