
Test and fix db1047 BBU
Closed, Resolved (Public)

Description

db1047 is lagging regularly due to a failing BBU (battery backup unit):

Battery Replacement required            : Yes
Remaining Capacity Low                  : Yes

Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
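
The output above can be checked mechanically. With "No Write Cache if Bad BBU" set, a controller with a bad battery falls back from WriteBack to WriteThrough, which slows writes and would be consistent with the lag. The following is a minimal sketch, assuming the output always uses the "Key : Value" layout pasted above; the function names `parse_fields` and `bbu_problems` are hypothetical and not part of any existing tool.

```python
def parse_fields(text):
    """Parse 'Key : Value' lines into a dict, splitting on the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def bbu_problems(text):
    """Return a list of human-readable problems found in the pasted output."""
    fields = parse_fields(text)
    problems = []
    if fields.get("Battery Replacement required") == "Yes":
        problems.append("battery replacement required")
    if fields.get("Remaining Capacity Low") == "Yes":
        problems.append("remaining capacity low")
    # The write mode is the first comma-separated item of each policy line.
    default_mode = fields.get("Default Cache Policy", "").split(",")[0].strip()
    current_mode = fields.get("Current Cache Policy", "").split(",")[0].strip()
    if default_mode == "WriteBack" and current_mode == "WriteThrough":
        problems.append("cache degraded from WriteBack to WriteThrough")
    return problems


# Sample input copied from the task description (assumed format).
SAMPLE = """\
Battery Replacement required            : Yes
Remaining Capacity Low                  : Yes

Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
"""

if __name__ == "__main__":
    for problem in bbu_problems(SAMPLE):
        print(problem)
```

Comparing the default and current write modes, rather than only reading the battery flags, catches the case that actually affects performance: the controller has stopped caching writes.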

To investigate the issue, research users would need to be failed over to dbstore1002.

Not involving ops-eqiad yet.

Event Timeline

jcrespo claimed this task.
jcrespo set the priority of this task to Needs Triage.
jcrespo updated the task description.
jcrespo added projects: Research, DBA, acl*sre-team.
jcrespo added subscribers: jcrespo, Springle, Halfak.

@jcrespo Could you please expand on what you need here from the Research team? Thanks!

@ggellerman While the technical parts of this ticket will be handled by SRE, there are 2 things:

  • Acknowledgement of a problem on one of the servers of which I understand your team is one of the biggest users (correct me if I am wrong), which may or may not mean slower queries or stale data.
  • At some point, the server may be either 1) temporarily unavailable for repairs or 2) permanently decommissioned, and that may affect your normal operations. In both cases we may need to coordinate to provide the service on another node (and we always aim for minimal impact on your normal work).

I will keep you updated with future developments on this if you are interested.

Low priority because for now the host works as intended.

I believe this server is also affected by the faulty TokuDB replication I mentioned in T100408#1323731, because restarting replication helps with the lag.

I want to reboot and upgrade it. CC @Halfak

Low priority because for now the host works as intended.

I may be mistaken here, but lagging regularly is not "works as intended".

@Peachey88, please elaborate on the problems that you are currently experiencing due to db1047.

Lag problems have been solved, although hardware renewal is still needed.