Memory error on restbase1016
Closed, Resolved (Public)

Description

When restbase1016 was rebooted, the boot halted with errors about a faulty DIMM:

UEFI0107: One or more memory errors have occurred on memory slot: A2. Remove 
input power to the system, reseat the DIMM module and restart the system. If the 
issues persist, replace the faulty memory module identified in the message.

UEFI0081: Memory configuration has changed from the last time the system was
started.
If the change is expected, no action is necessary. Otherwise, check the DIMM
population inside the system and memory settings in System Setup.

UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory
Module (DIMM) is not functioning.
Check the System Event Log (SEL) to identify the non-functioning DIMM, and then
replace it.

Event Timeline

Change 481495 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Also increase temporarily the delay because of T212418

https://gerrit.wikimedia.org/r/481495

Change 481495 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Also increase temporarily the delay because of T212418

https://gerrit.wikimedia.org/r/481495

Is there any status update, or ETA on this?

Mentioned in SAL (#wikimedia-operations) [2019-01-08T16:56:50Z] <urandom> forcing removal of restbase1016-a (host down way too long to salvage) -- T212418

We're currently in the process of force-removing these instances. We'll need to coordinate when the host comes back up, as we'll have to re-bootstrap all 3 instances.

Mentioned in SAL (#wikimedia-operations) [2019-01-08T22:12:54Z] <urandom> forcing removal of restbase1016-b (host down way too long to salvage) -- T212418

Mentioned in SAL (#wikimedia-operations) [2019-01-09T13:32:04Z] <urandom> forcing removal of restbase1016-c (host down way too long to salvage) -- T212418

Record: 4
Date/Time: 11/17/2017 19:18:35
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

Record: 5
Date/Time: 11/17/2017 19:22:08
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

Record: 6
Date/Time: 02/13/2018 22:08:17
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A2.

Record: 7
Date/Time: 02/14/2018 12:26:34
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A2.

Record: 8
Date/Time: 12/20/2018 12:12:05
Source: system
Severity: Ok

Description: A problem was detected in Memory Reference Code (MRC).

Record: 9
Date/Time: 12/20/2018 12:12:05
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A2.
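
The SEL excerpt above can be summarised per DIMM with a few lines of Python. This is only a sketch: it assumes the records are available as plain text in the layout shown above, and the function name `dimm_events` is hypothetical rather than part of any existing tool.

```python
import re
from collections import defaultdict

# Matches DIMM locations such as "DIMM_A1" or "DIMM_A2" in a description line.
DIMM_RE = re.compile(r"DIMM_[A-Z]\d+")


def dimm_events(sel_text):
    """Group (date, severity) pairs by DIMM from SEL text in the layout above."""
    events = defaultdict(list)
    date = severity = None
    for line in sel_text.splitlines():
        line = line.strip()
        if line.startswith("Date/Time:"):
            date = line.split(":", 1)[1].strip()
        elif line.startswith("Severity:"):
            severity = line.split(":", 1)[1].strip()
        elif line.startswith("Description:"):
            # A record is attributed to every DIMM its description names.
            for dimm in DIMM_RE.findall(line):
                events[dimm].append((date, severity))
    return dict(events)
```

Applied to the records above, this would attribute records 4 and 5 to DIMM_A1 and records 6, 7 and 9 to DIMM_A2, which matches the slot (A2) named in the boot error.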

I need to move the DIMMs around and do standard troubleshooting. Can this server be powered off and marked down in Icinga?

It's completely unusable on our end; we had no choice but to forcibly remove the Cassandra nodes over the last few days. You can take it down to do whatever you need. We should consider re-imaging it before trying to bring it back online anyway.

@Eevans I am going to have to power it back on and let it run for a few days to see if the error returns. Will that present an issue for you?

While the server was offline, I took the opportunity to update the firmware on the BIOS and iDRAC.

It should not, but if you could give me a heads up when you power it back on, that would be appreciated.

I ended up leaving the production cables disconnected.

The log remains clear and no errors have returned. I will give it another 24 hours; if nothing changes, it can go back into service.

@Eevans The error has not returned. I cannot say with 100% certainty that it will not return, but for now please take the server back and do what you need. All the cables are plugged back in and the server is off. I will leave this open for a few days; let me know if the error returns.

FYI the host is currently down due to a partial power issue in that rack.

Mentioned in SAL (#wikimedia-operations) [2019-01-15T23:54:08Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@a04ebdd]: Restart RESTBase to pick up the fact that restbase1016 is not there - T212418

Mentioned in SAL (#wikimedia-operations) [2019-01-16T00:15:42Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@a04ebdd]: Restart RESTBase to pick up the fact that restbase1016 is not there - T212418 (duration: 21m 34s)

So now we should be able to get restbase1016 back into the cluster. Since we need to re-bootstrap the instances in, we can either:

  1. carefully remove all cassandra data and instance dirs and restart the process; or
  2. reimage the server and start from scratch.

Personally, I think option 2 is better, as it both gets us one more server onto stretch and ensures a clean slate. @fgiunchedi what do you think? Do you have any preferences?

I agree we might as well reimage the host with stretch while we're at it. I'll get on the reimage part today so that we're ready for cassandra bootstrap at least.

Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts:

restbase1016.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201901180926_filippo_161638_restbase1016_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['restbase1016.eqiad.wmnet']

Of which those FAILED:

['restbase1016.eqiad.wmnet']

Change 485165 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: reimage restbase1016 with stretch

https://gerrit.wikimedia.org/r/485165

Change 485165 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: reimage restbase1016 with stretch

https://gerrit.wikimedia.org/r/485165

Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts:

restbase1016.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201901180935_filippo_164457_restbase1016_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['restbase1016.eqiad.wmnet']

Of which those FAILED:

['restbase1016.eqiad.wmnet']

@mobrovac @Eevans the host has been reimaged and is ready for cassandra bootstraps. I'll start with the bootstraps later today.

Mentioned in SAL (#wikimedia-operations) [2019-01-18T13:18:42Z] <godog> start cassandra-a on restbase1016 - T212418

The bootstrap failed for now; I'll try again with replace_address:

ERROR [main] 2019-01-18 13:18:28,813 CassandraDaemon.java:708 - Exception encountered during startup
java.lang.RuntimeException: A node with address /10.64.0.32 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
        at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:558) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:807) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:667) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:613) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:379) [apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:602) [apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:691) [apache-cassandra-3.11.2.jar:3.11.2]
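
The error above names `cassandra.replace_address` and the colliding address. As a rough sketch, the address can be pulled out of the log line and turned into the corresponding JVM system property. The function name `replace_address_option` is hypothetical, and the ticket does not show how the option was actually passed to the instance, so treat where the resulting string goes as an open assumption.

```python
import re

# Pattern for the endpoint-collision message shown in the log above.
COLLISION_RE = re.compile(
    r"A node with address /(\d{1,3}(?:\.\d{1,3}){3}) already exists"
)


def replace_address_option(log_text):
    """Return the -D system property suggested by the error, or None if absent."""
    match = COLLISION_RE.search(log_text)
    if match is None:
        return None
    return "-Dcassandra.replace_address=" + match.group(1)
```

For the log line above, this yields a property naming 10.64.0.32, the address Cassandra reports as already present in the ring.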

Mentioned in SAL (#wikimedia-operations) [2019-01-18T17:24:16Z] <godog> bootstrap cassandra-b on restbase1016 - T212418

Mentioned in SAL (#wikimedia-operations) [2019-01-18T20:47:34Z] <mobrovac> restbase/cassandra bootstrap restbase1016-c - T212418

Mentioned in SAL (#wikimedia-operations) [2019-01-18T23:53:14Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@f24d681]: Deploy latest version to restbase1016 (was out of rotation) - T212418

Mentioned in SAL (#wikimedia-operations) [2019-01-18T23:53:48Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@f24d681]: Deploy latest version to restbase1016 (was out of rotation) - T212418 (duration: 00m 34s)

All of the instances have joined the ring (thnx @fgiunchedi!) and the latest version of RESTBase is in place, so we are good. There is one remaining problem, though: I can't pool the node back. Let's find out what this is about before resolving the ticket.

fgiunchedi added a subscriber: Cmjohnson.

Indeed neither can I:

restbase1016:~$ pool-restbase 
restbase1016:~$ echo $?
2

Though I'm not going to have time to investigate further this week, any help is welcome

@Joe could you take a look ^ and enlighten us?

Cmjohnson removed a project: ops-eqiad.

Removing ops-eqiad since this no longer needs data center work. Assigned to @Joe.

I ran "sudo -i confctl --quiet select dc=eqiad,cluster=restbase,service=restbase,name=restbase1016.eqiad.wmnet set/pooled=yes" on puppetmaster1001 and that worked fine.

This is the result of a defect in the python3-etcd packaging (so, blame me!):

oblivian@restbase1016:~$ sudo -i confctl --quiet select dc=eqiad,cluster=restbase,service=restbase,name=restbase1016.eqiad.wmnet get
CRITICAL:conftool:Could not load driver etcd: No module named 'dns'
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/conftool/backend.py", line 15, in __init__
    exec(compile(open(driver_file).read(), driver_file, 'exec'), ctx)
  File "/usr/lib/python3/dist-packages/conftool/drivers/etcd.py", line 4, in <module>
    import etcd
  File "/usr/lib/python3/dist-packages/etcd/__init__.py", line 2, in <module>
    from .client import Client
  File "/usr/lib/python3/dist-packages/etcd/client.py", line 21, in <module>
    import dns.resolver
ImportError: No module named 'dns'
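
The traceback shows that `etcd/client.py` imports `dns.resolver`, so the `etcd` package is installed without the `dns` module it needs (that module is normally provided by dnspython). A quick way to check for this kind of gap on a host is sketched below; the helper name `missing_modules` is hypothetical, and it only handles top-level module names.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    # find_spec() locates a top-level module without importing it and
    # returns None when no finder can locate it.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Going by the traceback, `missing_modules(["etcd", "dns"])` run with the system Python 3 on the affected host would be expected to report `dns` as missing.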

This is a duplicate of T209136, which turns out not to be fixed on stretch after all.

oblivian@restbase1016:~$ sudo -i pool-restbase 
oblivian@restbase1016:~$ echo $?
0