Memory error on restbase1016
Closed, Resolved (Public)

Assigned To

Authored By

	• MoritzMuehlenhoff
	Dec 20 2018, 12:17 PM

Description

When rebooting restbase1016, it stopped the boot complaining about a broken memory DIMM:

UEFI0107: One or more memory errors have occurred on memory slot: A2. Remove 
input power to the system, reseat the DIMM module and restart the system. If the 
issues persist, replace the faulty memory module identified in the message.

UEFI0081: Memory configuration has changed from the last time the system was
started.
If the change is expected, no action is necessary. Otherwise, check the DIMM
population inside the system and memory settings in System Setup.

UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory
Module (DIMM) is not functioning.
Check the System Event Log (SEL) to identify the non-functioning DIMM, and then
replace it.

Details

	Subject	Repo	Branch	Lines +/-
	install_server: reimage restbase1016 with stretch	operations/puppet	production	+0 -2


Related Objects

Mentioned In: T214166: Improve cassandra JBOD integration post-reimage
T213859: eqiad: rack a3 pdu swap / failure / replacement
T212424: restbase cassandra driver excessive logging when cassandra hosts are down
Mentioned Here: T209136: python3-etcd needs python3-dnspython

Event Timeline

• MoritzMuehlenhoff created this task. Dec 20 2018, 12:17 PM

fgiunchedi mentioned this in T212424: restbase cassandra driver excessive logging when cassandra hosts are down. Dec 20 2018, 1:30 PM

• mobrovac added projects: RESTBase, RESTBase-Cassandra, Services (watching), Platform Team Legacy (Watching / External). Dec 27 2018, 5:33 PM

Change 481495 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Also increase temporarily the delay because of T212418

https://gerrit.wikimedia.org/r/481495

Change 481495 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Also increase temporarily the delay because of T212418

https://gerrit.wikimedia.org/r/481495

Is there any status update, or ETA on this?

ping @Cmjohnson

Mentioned in SAL (#wikimedia-operations) [2019-01-08T16:56:50Z] <urandom> forcing removal of restbase1016-a (host down way too long to salvage) -- T212418

We're currently in the process of force-removing these instances. We'll need to coordinate when the host comes back up, as we'll have to re-bootstrap all 3 instances.

Mentioned in SAL (#wikimedia-operations) [2019-01-08T22:12:54Z] <urandom> forcing removal of restbase1016-b (host down way too long to salvage) -- T212418

Mentioned in SAL (#wikimedia-operations) [2019-01-09T13:32:04Z] <urandom> forcing removal of restbase1016-c (host down way too long to salvage) -- T212418

Record: 4
Date/Time: 11/17/2017 19:18:35
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

Record: 5
Date/Time: 11/17/2017 19:22:08
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

Record: 6
Date/Time: 02/13/2018 22:08:17
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A2.

Record: 7
Date/Time: 02/14/2018 12:26:34
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A2.

Record: 8
Date/Time: 12/20/2018 12:12:05
Source: system
Severity: Ok

Description: A problem was detected in Memory Reference Code (MRC).

Record: 9
Date/Time: 12/20/2018 12:12:05
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A2.
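The records above show DIMM_A2 going from correctable-error warnings in February 2018 to multi-bit errors at the December 2018 reboot. As a rough illustration only, here is a minimal Python sketch that groups such records by DIMM slot. The function names `parse_sel` and `events_by_dimm` are hypothetical, and the parser assumes the exact "Key: value" layout pasted in this task, not any official SEL export format.

```python
import re
from collections import defaultdict

# Records 4-9 as pasted in this task.
SEL_TEXT = """\
Record: 4
Date/Time: 11/17/2017 19:18:35
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

Record: 5
Date/Time: 11/17/2017 19:22:08
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

Record: 6
Date/Time: 02/13/2018 22:08:17
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A2.

Record: 7
Date/Time: 02/14/2018 12:26:34
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A2.

Record: 8
Date/Time: 12/20/2018 12:12:05
Source: system
Severity: Ok

Description: A problem was detected in Memory Reference Code (MRC).

Record: 9
Date/Time: 12/20/2018 12:12:05
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A2.
"""


def parse_sel(text):
    """Turn 'Key: value' lines into one dict per record (assumed layout)."""
    records = []
    for line in text.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a key
        key, value = key.strip(), value.strip()
        if key == "Record":
            records.append({})
        if records:
            records[-1][key] = value
    return records


def events_by_dimm(records):
    """Map each DIMM slot named in a description to its (time, severity) events."""
    by_dimm = defaultdict(list)
    for rec in records:
        for dimm in re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")):
            by_dimm[dimm].append((rec["Date/Time"], rec["Severity"]))
    return dict(by_dimm)


if __name__ == "__main__":
    for dimm, events in sorted(events_by_dimm(parse_sel(SEL_TEXT)).items()):
        print(dimm)
        for when, severity in events:
            print("   ", when, severity)
```

Grouping by slot makes it easier to see that the slot named in the UEFI0107 message (A2) is the same one that had already logged warnings months earlier.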

I need to move the DIMMs around and do standard troubleshooting. Can this server be powered off and marked as down in Icinga?

In T212418#4869959, @Cmjohnson wrote:

I need to move the DIMMs around and do standard troubleshooting. Can this server be powered off and marked as down in Icinga?

It's completely unusable on our end; we had no choice but to force-remove the Cassandra nodes over the last few days. You can take it down to do whatever you need. We should consider reimaging it before trying to bring it back online anyway.

@Eevans I am going to have to power it back on and let it run for a few days to see if the error returns. Will that present an issue for you?

While the server was offline, I took the opportunity to update the BIOS and iDRAC firmware.

In T212418#4870100, @Cmjohnson wrote:

@Eevans I am going to have to power it back on and let it run for a few days to see if the error returns. Will that present an issue for you?

It should not, but if you could give me a heads up when you power it back on, that would be appreciated.

I ended up leaving the production cables disconnected.

The log remains clear and no errors have returned. I will give it another 24 hours, and if nothing changes it can go back into service.

@Eevans The error has not returned. I cannot say with 100% certainty that it will not come back, but for now please take the server back and do what you need. All the cables are plugged back in and the server is powered off. I will leave this task open for a few days; let me know if the error returns.

FYI the host is currently down due to a partial power issue in that rack.

Mentioned in SAL (#wikimedia-operations) [2019-01-15T23:54:08Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@a04ebdd]: Restart RESTBase to pick up the fact that restbase1016 is not there - T212418

Mentioned in SAL (#wikimedia-operations) [2019-01-16T00:15:42Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@a04ebdd]: Restart RESTBase to pick up the fact that restbase1016 is not there - T212418 (duration: 21m 34s)

• Cmjohnson moved this task from Backlog to High Priority Task on the ops-eqiad board. Jan 16 2019, 2:52 PM

RobH mentioned this in T213859: eqiad: rack a3 pdu swap / failure / replacement. Jan 17 2019, 8:55 PM

So now we should be able to get restbase1016 back into the cluster. Since we need to re-bootstrap the instances in, we can either:

1. carefully remove all cassandra data and instance dirs and restart the process; or
2. reimage the server and start from scratch.

Personally, I think option 2 is better, as it both gets us one more server onto stretch and ensures a clean slate. @fgiunchedi what do you think? Do you have any preferences?

In T212418#4890534, @mobrovac wrote:

So now we should be able to get restbase1016 back into the cluster. Since we need to re-bootstrap the instances in, we can either:

1. carefully remove all cassandra data and instance dirs and restart the process; or
2. reimage the server and start from scratch.

Personally, I think option 2 is better, as it both gets us one more server onto stretch and ensures a clean slate. @fgiunchedi what do you think? Do you have any preferences?

I agree we might as well reimage the host with stretch while we're at it. I'll get on the reimage part today so that we're ready for cassandra bootstrap at least.

Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts:

restbase1016.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201901180926_filippo_161638_restbase1016_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['restbase1016.eqiad.wmnet']

Of which those FAILED:

['restbase1016.eqiad.wmnet']

Change 485165 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: reimage restbase1016 with stretch

https://gerrit.wikimedia.org/r/485165

Change 485165 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: reimage restbase1016 with stretch

https://gerrit.wikimedia.org/r/485165

Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts:

restbase1016.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201901180935_filippo_164457_restbase1016_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['restbase1016.eqiad.wmnet']

Of which those FAILED:

['restbase1016.eqiad.wmnet']

@mobrovac @Eevans the host has been reimaged and is ready for the cassandra bootstraps. I'll start with the bootstraps later today.

Mentioned in SAL (#wikimedia-operations) [2019-01-18T13:18:42Z] <godog> start cassandra-a on restbase1016 - T212418

The bootstrap failed for now; I'll try again with replace_address:

ERROR [main] 2019-01-18 13:18:28,813 CassandraDaemon.java:708 - Exception encountered during startup
java.lang.RuntimeException: A node with address /10.64.0.32 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
        at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:558) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:807) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:667) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:613) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:379) [apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:602) [apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:691) [apache-cassandra-3.11.2.jar:3.11.2]
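The exception names the `cassandra.replace_address` property as the way past this endpoint collision. As a small sketch, not the procedure actually used here, the Python below pulls the conflicting address out of the error line and formats it as a JVM system property. Passing the property as `-Dcassandra.replace_address=<ip>` is an assumption about how it would be supplied. The helper name `replace_address_option` is hypothetical, and the thread does not show how the instance configuration on this host sets JVM options.

```python
import re

# Error line copied from the startup log above.
ERROR_LINE = (
    "java.lang.RuntimeException: A node with address /10.64.0.32 already exists, "
    "cancelling join. Use cassandra.replace_address if you want to replace this node."
)


def replace_address_option(error_line):
    """Return the JVM property for the colliding address, or None if no match."""
    match = re.search(
        r"A node with address /(\d{1,3}(?:\.\d{1,3}){3}) already exists",
        error_line,
    )
    if match is None:
        return None
    # Assumption: the property is handed to the JVM as a -D system property.
    return "-Dcassandra.replace_address=" + match.group(1)


if __name__ == "__main__":
    print(replace_address_option(ERROR_LINE))
```

The option would normally be needed only for the first start of the replacement instance and removed afterwards.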

fgiunchedi mentioned this in T214166: Improve cassandra JBOD integration post-reimage. Jan 18 2019, 1:50 PM

Mentioned in SAL (#wikimedia-operations) [2019-01-18T17:24:16Z] <godog> bootstrap cassandra-b on restbase1016 - T212418

Mentioned in SAL (#wikimedia-operations) [2019-01-18T20:47:34Z] <mobrovac> restbase/cassandra bootstrap restbase1016-c - T212418

Mentioned in SAL (#wikimedia-operations) [2019-01-18T23:53:14Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@f24d681]: Deploy latest version to restbase1016 (was out of rotation) - T212418

Mentioned in SAL (#wikimedia-operations) [2019-01-18T23:53:48Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@f24d681]: Deploy latest version to restbase1016 (was out of rotation) - T212418 (duration: 00m 34s)

All of the instances have joined the ring (thnx @fgiunchedi!) and the latest version of RESTBase is in place, so we are good. There is one problem, now, though: I can't seem to be able to pool the node back. Let's try and see what this is about before resolving the ticket.

Eevans awarded a token. Jan 20 2019, 7:06 PM

In T212418#4893779, @mobrovac wrote:

All of the instances have joined the ring (thnx @fgiunchedi!) and the latest version of RESTBase is in place, so we are good. There is one problem, now, though: I can't seem to be able to pool the node back. Let's try and see what this is about before resolving the ticket.

Indeed neither can I:

restbase1016:~$ pool-restbase 
restbase1016:~$ echo $?
2

Though I'm not going to have time to investigate further this week, any help is welcome

In T212418#4895809, @fgiunchedi wrote:

Indeed neither can I:

restbase1016:~$ pool-restbase 
restbase1016:~$ echo $?
2

Though I'm not going to have time to investigate further this week, any help is welcome

@Joe could you take a look ^ and enlighten us?

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 10:56 PM

Removing ops-eqiad since this no longer needs work in the data center. Assigned to @Joe.

In T212418#4904109, @mobrovac wrote:

@Joe could you take a look ^ and enlighten us?

I ran "sudo -i confctl --quiet select dc=eqiad,cluster=restbase,service=restbase,name=restbase1016.eqiad.wmnet set/pooled=yes" on puppetmaster1001 and that worked fine.

This is the result of a defect in the python3-etcd packaging (so, blame me!):

oblivian@restbase1016:~$ sudo -i confctl --quiet select dc=eqiad,cluster=restbase,service=restbase,name=restbase1016.eqiad.wmnet get
CRITICAL:conftool:Could not load driver etcd: No module named 'dns'
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/conftool/backend.py", line 15, in __init__
    exec(compile(open(driver_file).read(), driver_file, 'exec'), ctx)
  File "/usr/lib/python3/dist-packages/conftool/drivers/etcd.py", line 4, in <module>
    import etcd
  File "/usr/lib/python3/dist-packages/etcd/__init__.py", line 2, in <module>
    from .client import Client
  File "/usr/lib/python3/dist-packages/etcd/client.py", line 21, in <module>
    import dns.resolver
ImportError: No module named 'dns'

This is a duplicate of T209136, which turned out not to be fixed on stretch after all.

oblivian@restbase1016:~$ sudo -i pool-restbase 
oblivian@restbase1016:~$ echo $?
0
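The root cause in the traceback above is that `etcd/client.py` imports `dns.resolver`, but the `dns` module was not installed on the freshly reimaged host (T209136: python3-etcd needs python3-dnspython). As a hedged sketch, the check below reports which top-level modules cannot be found by the running interpreter. The helper name `missing_modules` is hypothetical, and the module list is taken from the traceback, not from conftool's declared dependencies.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the interpreter cannot locate."""
    # Only top-level names are checked: find_spec() on a dotted name such as
    # "dns.resolver" would import the parent package and raise if it is absent.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Top-level modules that appear in the traceback's import chain.
    required = ["etcd", "dns"]
    missing = missing_modules(required)
    if missing:
        print("missing modules:", ", ".join(missing))
    else:
        print("all required modules are importable")
```

Run with the same `python3` that confctl uses, this should list `dns` on a host affected by the packaging defect.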
