
BBU Fail on dbstore2002
Closed, Declined · Public

Description

It was too early to celebrate in T205257, as the BBU is failing again.
AFAIK this host will be decommissioned, but that's months away, so I suggest we order a new battery, as this is an important backup host. After the host gets decommissioned we can remove the replacement battery and keep it as a spare. @RobH, do you have any thoughts about this?

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Permanently Disabled
   Battery/Capacitor Status: Failed (Replace Batteries/Capacitors)

Related Objects

Event Timeline

Marostegui added a subscriber: Papaul.

I guess this is the reason why dbstore2002:3313 is lagging so much behind? (It could be something else, I am just catching up on emails.)
Should we ease the replication options a bit to make it catch up?
It is 6 days behind the master now.
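For reference, the lag is usually checked directly on the lagging instance; a minimal sketch, assuming the s3 instance listens on port 3313 as the instance name suggests:

   -- connect to the s3 instance on dbstore2002 (the port is an assumption based on the name)
   SHOW SLAVE STATUS\G
   -- a Seconds_Behind_Master value around 518400 would match the ~6 days mentioned above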

Removing Papaul as assignee, as there is nothing for him to do here for now.

Just for the record:
I don't think it is the BBU as only s3 is lagging - none of the other shards are lagging behind.

s3 is the only section there which is not compressed.
Btw, we can check whether the BBU causes it: if we enable write caching we will see the results.
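One rough way to gauge whether the disabled controller cache (and therefore raw fsync latency) is the bottleneck is to watch the InnoDB fsync counters on the lagging instance; the interpretation below is an assumption, not something confirmed in this task:

   -- sample these twice, a minute or so apart, on the s3 instance
   SHOW GLOBAL STATUS LIKE 'Innodb_os_log_fsyncs';
   SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs';
   -- with the write cache permanently disabled every fsync hits the disks directly,
   -- so a high fsync rate on the replication stream would point at the BBU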

I have eased the replication consistency flags and it is now catching up.
What do you mean by "it is not compressed"? That you are running the ALTER TABLEs to compress it on the codfw master now?

ArielGlenn triaged this task as Medium priority. · Nov 13 2018, 9:43 AM

Mentioned in SAL (#wikimedia-operations) [2018-11-13T15:18:12Z] <marostegui> Restore replication consistency options on dbstore2002:3313 as it has caught up - T208320

I just restored the original flags to sync_binlog=1 and trx_commit=1 as s3 caught up.
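For completeness, restoring the durable defaults on the s3 instance presumably amounts to something like the following sketch (the exact statements are an assumption):

   -- back to fully durable settings
   SET GLOBAL sync_binlog = 1;
   SET GLOBAL innodb_flush_log_at_trx_commit = 1;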

I have eased the replication consistency flags and it is now catching up.
What do you mean by "it is not compressed"? That you are running the ALTER TABLEs to compress it on the codfw master now?

the tables on dbstore2002 are compressed in every section except s3.
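Compressing a section means rebuilding its tables with InnoDB compression; purely as an illustration (the table name is hypothetical, and in practice this goes through the usual schema-change process):

   -- rebuild one table compressed; requires innodb_file_per_table and the Barracuda file format
   ALTER TABLE revision ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;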

Mentioned in SAL (#wikimedia-operations) [2018-11-20T12:53:53Z] <banyek> setting innodb_flush_log_at_trx_commit to 2 on dbstore2002 (T208320)

Mentioned in SAL (#wikimedia-operations) [2018-11-20T12:55:08Z] <banyek> setting innodb_flush_log_at_trx_commit to 2 on dbstore2002 (s3 instance only!) (T208320)

As the replication lag was 69663 seconds, we agreed to set

innodb_flush_log_at_trx_commit=2

on the host. Now the replication is catching up.
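The setting is dynamic, so applying and verifying it is a one-liner on the s3 instance; a sketch, not the exact session used:

   -- with a value of 2 the redo log is written at commit but only fsynced about once per second
   SET GLOBAL innodb_flush_log_at_trx_commit = 2;
   SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit';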

I will prepare a patch to remove the s2 instance and give its resources to s3 to see how it works.
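Presumably the main resource to hand over is the s2 instance's buffer pool memory. A hypothetical sketch, assuming a MariaDB version (10.2+) where the buffer pool can be resized online; otherwise it would be a my.cnf change plus a restart, and the 100 GB figure is purely illustrative:

   -- grow the s3 buffer pool with the memory freed by removing s2
   SET GLOBAL innodb_buffer_pool_size = 100 * 1024 * 1024 * 1024;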

Change 475089 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] mariadb: remove section s2 from dbstore2002

https://gerrit.wikimedia.org/r/475089

Change 475089 merged by Jcrespo:
[operations/puppet@production] mariadb: remove section s2 from dbstore2002

https://gerrit.wikimedia.org/r/475089

What are we doing with this host in the end? It still has the BBU error, and the host will be decommissioned, but until then I don't see any reason to keep this open if we don't order a battery.
Shall we?

@Marostegui I think we should close this task, as the replication is good, and I doubt we'll replace that BBU before decommissioning.

Technically the alerts went away after the restart; let's decline it because we know the BBU is not in a good state and the error is likely to reappear, but I agree with your assessment.

Agreed with all you guys said.
Furthermore, not only should we not invest in old hardware, but these hosts should go away once we've got the final backup hosts in place.