
wtp2019 - hardware (RAM) check
Closed, Resolved (Public)

Description

wtp2019 went down on 09-13 and showed "Uncorrectable Memory Error" at bootup.

After powercycling it came back up as normal.

It then ran OK for a couple of days and went down again on 09-18/19, this time without showing an error, just being down.

Again it came back after a powercycle and didn't show the error, but since this has now happened more than once,
let's check the RAM / board.

Event Timeline

In syslog, logging just stops in the middle of normal operation and then starts again when the host was powered up:

Sep 19 04:26:38 wtp2019 puppet-agent[39540]: Retrieving plugin
Sep 19 04:26:38 wtp2019 puppet-agent[39540]: Loading facts
Sep 19 04:26:45 wtp2019 puppet-agent[39540]: Caching catalog for wtp2019.codfw.wmnet
Sep 20 00:44:48 wtp2019 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="1043" x-info="http://www.rsyslog.com"] start
Sep 20 00:44:48 wtp2019 systemd[1]: Started udev Coldplug all Devices.
Sep 20 00:44:48 wtp2019 systemd[1]: Started Apply Kernel Variables.
Sep 20 00:44:48 wtp2019 systemd[1]: Starting udev Wait for Complete Device Initialization...
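For anyone wanting to locate this kind of silent gap without reading the log by eye, here is a minimal sketch that finds the largest jump between consecutive syslog timestamps. The sample lines are copied from the excerpt above; the function names and the default year of 2016 are assumptions made for illustration (classic syslog timestamps carry no year).

```python
import re
from datetime import datetime

# Sample lines copied from the syslog excerpt in this task.
LINES = [
    "Sep 19 04:26:38 wtp2019 puppet-agent[39540]: Loading facts",
    "Sep 19 04:26:45 wtp2019 puppet-agent[39540]: Caching catalog for wtp2019.codfw.wmnet",
    "Sep 20 00:44:48 wtp2019 systemd[1]: Started udev Coldplug all Devices.",
]

# Matches "Sep 19 04:26:38" anywhere in the line, so a leading
# paste line number (e.g. "523 ") is tolerated.
STAMP = re.compile(r"([A-Z][a-z]{2}) +(\d{1,2}) (\d{2}:\d{2}:\d{2})")


def parse_stamp(line, year=2016):
    """Return the line's timestamp as a datetime, or None if none is found.

    The year is an assumption supplied by the caller, since the log omits it.
    """
    match = STAMP.search(line)
    if match is None:
        return None
    month, day, clock = match.groups()
    text = f"{year} {month} {int(day):02d} {clock}"
    return datetime.strptime(text, "%Y %b %d %H:%M:%S")


def largest_gap(lines, year=2016):
    """Return (before, after, delta) for the biggest jump between entries."""
    stamps = [s for s in (parse_stamp(l, year) for l in lines) if s is not None]
    best = None
    for before, after in zip(stamps, stamps[1:]):
        delta = after - before
        if best is None or delta > best[2]:
            best = (before, after, delta)
    return best


if __name__ == "__main__":
    before, after, delta = largest_gap(LINES)
    print(f"log silent from {before} to {after} ({delta})")
```

On the excerpt above this points at the jump from the last puppet-agent line on Sep 19 to the rsyslogd start on Sep 20, i.e. the window in which the host was down.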

Dzahn triaged this task as Medium priority. Sep 22 2016, 2:40 AM

So this system is back online and working, but seems to have had this memory error noted twice.

It should have some downtime scheduled and have a memtest run on the host while it is still in warranty.

The directions on https://wikitech.wikimedia.org/wiki/Parsoid#Misc_stuff show how to pool a host, and I would assume depooling is simply done in the reverse. Are there any undocumented steps to depooling a wtp host for a 24 hour memtest?

RobH added subscribers: Papaul, mobrovac, ssastry.

Confirmed with @mobrovac about this:

Steps to depool:

  • sync up with @ssastry when we're offlining the host so services is aware
  • depool the host with confctl
  • remove from scap list for syncs (not required for a fast reboot, but this host will be offline for 12-24 hours while the test runs overnight)

@Papaul: let us know a good day to run this next week, and then @ssastry can just confirm it's OK.

Change 363689 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/parsoid/deploy@master] Temporarily remove wtp2019 from the target for a memtest

https://gerrit.wikimedia.org/r/363689

The above patch removes wtp2019 from the list of deployment target nodes. Before putting the node down, merge the patch and revert it after it is back online and pooled.

Change 363689 merged by jenkins-bot:
[mediawiki/services/parsoid/deploy@master] Temporarily remove wtp2019 from the target for a memtest

https://gerrit.wikimedia.org/r/363689

About to deploy but that patch is merged and wtp2019 is still pooled,

{"wtp2019.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}

I'm going to depool it.

I also noticed,

{"wtp1001.eqiad.wmnet": {"pooled": "no", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}

Not sure why that is ... but I'll put it back.
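A quick way to catch surprises like these before a deploy is to scan the pool state for every host. The sketch below only parses output of the shape pasted above (one JSON object per line, with the hostname as one key and "tags" as another); it does not call confctl itself, and the function names are hypothetical.

```python
import json

# The two records pasted in this task, one JSON object per line.
OUTPUT = """\
{"wtp2019.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
{"wtp1001.eqiad.wmnet": {"pooled": "no", "weight": 15}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"}
"""


def pool_states(text):
    """Map each hostname to its "pooled" value.

    Assumes every non-empty line is a JSON object whose only non-"tags"
    key is the hostname, as in the output shown in this task.
    """
    states = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for key, value in record.items():
            if key == "tags":
                continue
            states[key] = value["pooled"]
    return states


def hosts_with_state(text, wanted):
    """Return the sorted hostnames whose pooled value equals `wanted`."""
    return sorted(h for h, s in pool_states(text).items() if s == wanted)


if __name__ == "__main__":
    print("pooled:  ", hosts_with_state(OUTPUT, "yes"))
    print("depooled:", hosts_with_state(OUTPUT, "no"))
```

Comparing the depooled list against the hosts that are expected to be out (here, only wtp2019 once it is depooled) would flag an unexpected entry such as wtp1001.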

Today's deploy is done. Note that before repooling, you'll need to update Parsoid to the latest commit on wtp2019. Thanks!

This is now scheduled to take place on 2017-07-17.

@Papaul: This was already depooled, and is now shut down and in maint-mode in icinga for the next 24 hours.

Please run the RAM test today/overnight, and check its status tomorrow. If it all passes, the system can be powered back to the OS, and then assign this task to me for followup in repooling services. If it fails, please sync up with me to extend the icinga downtime so we can get it repaired/replaced.

Thanks!

RobH raised the priority of this task from Medium to High. Jul 17 2017, 6:10 PM

First test complete. Running the second test.

Test complete with no errors.

I'll deploy and repool the node now.

Change 366009 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/parsoid/deploy@master] Revert "Temporarily remove wtp2019 from the target for a memtest"

https://gerrit.wikimedia.org/r/366009

Change 366009 merged by Mobrovac:
[mediawiki/services/parsoid/deploy@master] Revert "Temporarily remove wtp2019 from the target for a memtest"

https://gerrit.wikimedia.org/r/366009

Mentioned in SAL (#wikimedia-operations) [2017-07-18T17:20:37Z] <mobrovac@tin> Started deploy [parsoid/deploy@1eaa07e]: Bring wtp2019 up to date and repool it - T146113

Mentioned in SAL (#wikimedia-operations) [2017-07-18T17:21:39Z] <mobrovac@tin> Finished deploy [parsoid/deploy@1eaa07e]: Bring wtp2019 up to date and repool it - T146113 (duration: 01m 02s)

wtp2019 is now up to date with the latest code and is back in the pool. Resolving.