
Possible bad mem chip or slot on dbproxy1004
Closed, Resolved · Public

Description

I'm troubleshooting some (custom) replication lag from dbproxy1004 (AKA m4-master) to dbstore1002 (AKA analytics-store). I just looked at syslog, and for days, this error has been repeating:

Jan 13 19:39:07 dbproxy1004 kernel: [29034022.584596] EDAC MC0: 1 CE error on CPU#0Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)

It doesn't seem to be causing any functional problems (CE == correctable error(?)), but it is probably something that needs to be fixed, ja? Perhaps this is causing slowdowns in the SELECTs done on this MySQL instance for the custom replication?
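To get a sense of how often the error repeats, the syslog lines can be tallied per DIMM label. This is only a minimal sketch: the regular expression is written against the single line quoted above, so other kernel versions may word the message differently, and the function names `parse_edac_line` and `count_correctable` are hypothetical helpers, not part of any existing tool.

```python
import re
from collections import Counter

# Assumption: matches the EDAC message format of the one line quoted in this task.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) error on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+)"
)


def parse_edac_line(line):
    """Return the EDAC fields of a syslog line as a dict, or None if it is not an EDAC line."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    return {
        "mc": int(m["mc"]),
        "count": int(m["count"]),
        "kind": m["kind"],  # "CE" = correctable, "UE" = uncorrectable
        "label": m["label"],
        "channel": int(m["channel"]),
        "slot": int(m["slot"]),
    }


def count_correctable(lines):
    """Sum the correctable-error counts per DIMM label across syslog lines."""
    totals = Counter()
    for line in lines:
        event = parse_edac_line(line)
        if event and event["kind"] == "CE":
            totals[event["label"]] += event["count"]
    return totals
```

Feeding it the lines of a syslog file (for example `count_correctable(open(path))`, where `path` is wherever the syslog lives on the host) gives a per-label total that can be compared day to day.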

Event Timeline

Ottomata assigned this task to Cmjohnson.
Ottomata raised the priority of this task from to High.
Ottomata updated the task description.
Ottomata added a project: ops-eqiad.
Ottomata added subscribers: Ottomata, Nuria, Milimetric, jcrespo.
Restricted Application added a subscriber: Aklapper.

The server is out of warranty, but I have several spare DIMMs for the R610s on site. It appears that DIMM A3 is bad and needs to be replaced. I will need about 5 minutes of downtime to replace the DIMM.

Please coordinate best times to make this happen.

Record: 10
Date/Time: 06/12/2015 22:03:14
Source: system
Severity: Non-Critical
Description: Mem ECC Warning: Memory sensor, transition to non-critical from OK ( DIMM_A3 ) was asserted

Record: 11
Date/Time: 06/12/2015 22:03:14
Source: system
Severity: Critical
Description: Mem ECC Warning: Memory sensor, transition to critical from less severe ( DIMM_A3 ) was asserted
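For reference, the affected slot can be pulled out of event-log records like the two above with a few lines of Python. This is a rough sketch that assumes the `Key: value` layout exactly as pasted here; `parse_sel` and `flagged_dimms` are hypothetical helper names, and the `( DIMM_xx )` pattern is inferred from these two records only.

```python
import re

# Assumption: the DIMM name appears in parentheses in the Description field.
DIMM_RE = re.compile(r"\(\s*(DIMM_\w+)\s*\)")


def parse_sel(text):
    """Split pasted event-log text into a list of dicts, one per 'Record:' block."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # skip blank lines and anything without a 'Key: value' shape
        if key == "Record" and current:
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


def flagged_dimms(records):
    """Return the sorted, de-duplicated DIMM names mentioned in record descriptions."""
    names = set()
    for record in records:
        m = DIMM_RE.search(record.get("Description", ""))
        if m:
            names.add(m.group(1))
    return sorted(names)
```

Run against the two records above, `flagged_dimms(parse_sel(text))` should name only `DIMM_A3`, which is the slot identified for replacement.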

@Ottomata: can you coordinate a 5-minute outage today?

Eeee, I'm not so sure. Are we sure eventlogging is the only user of m4-master?

@Ottomata @Nuria let's coordinate a time that we can get this done.

I think we need to coordinate with @jcrespo. This box is more than just the eventlogging db proxy.

No, I think dbproxy1004 only serves m4/eventlogging. But we can fail over to another machine without needing downtime; I just need time to set up another proxy temporarily.

@Cmjohnson: the update should only take a few minutes, right?

If so let's do it today/tomorrow if possible.

@Cmjohnson: @madhuvishy is on ops duty this week and she can help coordinate this small maintenance window.

We just need to:

  1. communicate to list
  2. stop el, log to SAL
  3. start el, log to SAL
  4. communicate

@jcrespo @Cmjohnson: EL can handle downtime. We will just stop the EL MySQL consumers and restart them after the maintenance window, and the data should get reconsumed without loss. Let me know when to kill the consumers, and I can do that.

Since we are about to have an EL downtime anyway, can we fit this in as well?

@Cmjohnson and I will do this Jan 21 16:00 UTC (11am EST).

Should be a very short and unnoticeable downtime.

Replaced the bad DIMM in slot A3.