
BBU Fail on dbstore2002
Closed, Declined · Public

Description

It was too early to celebrate in T205257, as the BBU is failing again.
AFAIK this host will be decommissioned, but that's months away, so I suggest we order a new battery, as this is an important backup host. After the host gets decommissioned we can remove the replacement battery and keep it as a spare. @RobH, do you have any thoughts about this?

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Permanently Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)

Related Objects

Event Timeline

Marostegui added a subscriber: Papaul.

I guess this is the reason why dbstore2002:3313 is lagging so much behind? (It could be something else, I am just catching up on emails.)
Should we ease the replication options a bit to make it catch up?
It is 6 days behind the master now.
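For reference, the lag is usually checked directly on the lagging instance; a minimal sketch, assuming the s3 instance listens on port 3313 as the instance name suggests:

   -- connect to the s3 instance on dbstore2002 (the port is an assumption based on the name)
   SHOW SLAVE STATUS\G
   -- a Seconds_Behind_Master value around 518400 would match the ~6 days mentioned above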

Removing Papaul as assignee, as there is nothing for him to do here for now.

Just for the record:
I don't think it is the BBU as only s3 is lagging - none of the other shards are lagging behind.

s3 is the only section there which is not compressed.
Btw, we can check whether the BBU causes it: if we enable write caching we will see the results.
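One rough way to gauge whether the disabled controller cache (and therefore raw fsync latency) is the bottleneck is to watch the InnoDB fsync counters on the lagging instance; the interpretation below is an assumption, not something confirmed in this task:

   -- sample these twice, a minute or so apart, on the s3 instance
   SHOW GLOBAL STATUS LIKE 'Innodb_os_log_fsyncs';
   SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs';
   -- with the write cache permanently disabled every fsync hits the disks directly,
   -- so a high fsync rate on the replication stream would point at the BBU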

I have eased the replication consistency flags and it is now catching up.
What do you mean by "it is not compressed"? That you are running the ALTER TABLEs to compress it on the codfw master now?

ArielGlenn triaged this task as Medium priority. · Nov 13 2018, 9:43 AM

Mentioned in SAL (#wikimedia-operations) [2018-11-13T15:18:12Z] <marostegui> Restore replication consistency options on dbstore2002:3313 as it has caught up - T208320

I just restored the original flags to sync_binlog=1 and trx_commit=1 as s3 caught up.
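For completeness, restoring the durable defaults on the s3 instance presumably amounts to something like the following sketch (the exact statements are an assumption):

   -- back to fully durable settings
   SET GLOBAL sync_binlog = 1;
   SET GLOBAL innodb_flush_log_at_trx_commit = 1;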

I have eased the replication consistency flags and it is now catching up.
What do you mean by "it is not compressed"? That you are running the ALTER TABLEs to compress it on the codfw master now?

the tables on dbstore2002 are compressed in every section except s3.
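Compressing a section means rebuilding its tables with InnoDB compression; purely as an illustration (the table name is hypothetical, and in practice this goes through the usual schema-change process):

   -- rebuild one table compressed; requires innodb_file_per_table and the Barracuda file format
   ALTER TABLE revision ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;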

Mentioned in SAL (#wikimedia-operations) [2018-11-20T12:53:53Z] <banyek> setting innodb_flush_log_at_trx_commit to 2 on dbstore2002 (T208320)

Mentioned in SAL (#wikimedia-operations) [2018-11-20T12:55:08Z] <banyek> setting innodb_flush_log_at_trx_commit to 2 on dbstore2002 (s3 instance only!) (T208320)

As the replication lag was 69663 seconds, we agreed to set

innodb_flush_log_at_trx_commit=2

on the host. Now the replication is catching up.
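The setting is dynamic, so applying and verifying it is a one-liner on the s3 instance; a sketch, not the exact session used:

   -- with a value of 2 the redo log is written at commit but only fsynced about once per second
   SET GLOBAL innodb_flush_log_at_trx_commit = 2;
   SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit';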

I will prepare a patch to remove the s2 instance and give its resources to s3 to see how it works.
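Presumably the main resource to hand over is the s2 instance's buffer pool memory. A hypothetical sketch, assuming a MariaDB version (10.2+) where the buffer pool can be resized online; otherwise it would be a my.cnf change plus a restart, and the 100 GB figure is purely illustrative:

   -- grow the s3 buffer pool with the memory freed by removing s2
   SET GLOBAL innodb_buffer_pool_size = 100 * 1024 * 1024 * 1024;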

Change 475089 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] mariadb: remove section s2 from dbstore2002

https://gerrit.wikimedia.org/r/475089

Change 475089 merged by Jcrespo:
[operations/puppet@production] mariadb: remove section s2 from dbstore2002

https://gerrit.wikimedia.org/r/475089

What are we doing with this host in the end? It still has the BBU error, and the host will be decommissioned, but until then I don't see any reason to keep this open if we don't order a battery.
Shall we?

@Marostegui I think we should close this task, as the replication is good, and I doubt we'll replace that BBU before decommissioning.

Technically the alerts went away after the restart; let's decline it because we know the BBU is not in a good state and the error is likely to reappear, but I agree with your assessment.

Agreed with all you guys said.
Furthermore, not only should we not invest in old hardware, but these hosts should go away once we've got the final backup hosts in place.