mw2383 is misbehaving
Open, Medium, Public

Description

mw2383 showed high load and high CPU usage when we switched over from eqiad to codfw (29 Jun 2021). At the time we depooled the server and restarted php-fpm. Today we noticed that it is exhibiting the same behaviour.

Looking at kernel messages, the following are logged more frequently than on other mw* servers:

[Sat Jul 10 12:56:04 2021] traps: php-fpm7.2[15807] general protection ip:7f5cde938013 sp:7ffc2b8afa98 error:0 in libmemcached.so.11.0.0[7f5cde927000+30000]
[Sat Jul 10 14:26:42 2021] php-fpm7.2[14795]: segfault at 60000000e ip 00007f5cde93a2f9 sp 00007ffc2b8af880 error 4 in libmemcached.so.11.0.0[7f5cde927000+30000]

[Sun Jul 11 20:21:33 2021] Code: 00 31 db 4c 8d ac 24 e0 00 00 00 eb 07 0f 1f 40 00 83 c3 01 4c 89 ff e8 85 16 ff ff 39 d8 76 51 89 de 4c 89 ff e8 47 3f 00 00 <44> 8b 58 0c 49 89 c4 45 85 db 74 db 41 f6 47 01 10 0f 85 c8 02 00

[Mon Jul 12 07:17:38 2021] php-fpm7.2[12304]: segfault at 746e65746e00 ip 0000746e65746e00 sp 00007ffc2b8af288 error 14
[Mon Jul 12 07:17:38 2021] Code: Bad RIP value.
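To compare how often these messages appear across hosts, something like the following could be run over saved kernel log output. This is only a rough sketch: the function names (`decode_pf_error`, `summarize`) are made up for illustration, the regular expressions are written against the lines quoted above, and it assumes the kernel prints the `error` field of a segfault line in hexadecimal, decoded here with the x86 page-fault error code bits.

```python
import re
from collections import Counter

# Patterns written against the log lines quoted in this task (an assumption:
# other kernels or architectures may format these messages differently).
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+)(?: in (?P<obj>[^\[\s]+))?"
)
GP_RE = re.compile(r"general protection .*? in (?P<obj>[^\[\s]+)")

# x86 page-fault error code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write", "read"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error(code):
    """Return a list of human-readable flags for a page-fault error code."""
    flags = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            flags.append(if_set)
        elif if_clear is not None:
            flags.append(if_clear)
    return flags


def summarize(lines):
    """Count segfault / general protection messages per shared object."""
    counts = Counter()
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            counts[("segfault", match.group("obj") or "unknown")] += 1
            continue
        match = GP_RE.search(line)
        if match:
            counts[("general protection", match.group("obj"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): dmesg -T | python3 summarize_traps.py
    for (kind, obj), n in sorted(summarize(sys.stdin).items()):
        print(f"{n:6d}  {kind:20s} {obj}")
```

Read this way, `error 4` would be a user-mode read of an unmapped page, and `error 14` (0x14) a user-mode instruction fetch from an unmapped page, which fits the "Bad RIP value" line that follows it.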

Looking at CPU frequency, it appears that something is triggering throttling, which prevents the CPU frequency from scaling up to handle the load.

mw2383

[image.png: CPU frequency graph]

mw2384

[image.png: CPU frequency graph]

Lastly, intel-microcode is up to date, and nothing interesting stood out when checking the management console log.

DC-Ops, can we check if the firmware is up to date? The server is depooled as it was increasing our overall latency.

Event Timeline

jijiki triaged this task as Medium priority. Mon, Jul 12, 1:29 PM
jijiki updated the task description.
jijiki added subscribers: serviceops, DC-Ops.
RobH edited subscribers, added: RobH; removed: DC-Ops.

So I just happened to notice this, but in the future, please file requests using the form, as it outlines what has to happen. One of those things is assigning the proper site, which I've now corrected by adding ops-codfw.

Basically, I only found this because I happen to monitor DC-Ops subscriptions, which is neither requested nor required. In the future it's best to use the form and file it for the site in question directly, thanks!

I'll pull down the firmware and try to flash it shortly.

RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
RobH edited projects, added ops-codfw; removed ops-eqiad.
RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.

Effie,

I updated the firmware on this to the latest version. It hadn't been updated since its purchase and was a couple of revisions out of date. It is now ready to be placed back into service. I've not resolved this task since it's not back in service yet.

I depooled mw2383 as it was behaving as before. @RobH is it possible to run hardware tests on the host? I suspect it might be a hardware issue. Thank you!

Definitely, I'll go ahead and push this into the Dell ePSA testing suite to see if it finds the issue.

This is now running Dell's hardware test suite.

@jijiki can we depool mw2383 from scap if it's going to be down for an extended amount of time?

@Legoktm I will keep it in mind to mark it as inactive, thanks

<logmsgbot> !log ariel@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2383.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-07-14T14:47:01Z] <effie> set mw2384 as inactive to investigate mw2383 issue - T286463

Summary of troubleshooting so far to see why this was throttling the CPU:

  • updated iDRAC and BIOS firmware to the latest revisions
    • this did not fix the issue when the host was returned to service after the update
  • compared BIOS settings to the working system mw2384: all settings are identical
  • ran Dell's ePSA test suite to see if any hardware reports a failure: no errors

At this point, I'm not sure what else to try to reproduce this CPU throttling. Do the logs of the throttling denote which CPU is being throttled, or is it both of them?

If it is a single CPU, we can try swapping it out under support contract and see if the issue gets resolved.

I ACKed the Icinga alerts about the mismatching MediaWiki version and the host not being in DSH groups. Just make sure to run 'scap pull' before repooling.

There are no logs that indicate throttling; I am going by what I see in the graphs. It appears that the CPU does not scale up higher than ~1GHz. Also, the message [Mon Jul 12 07:17:38 2021] Code: Bad RIP value. suggests that there might be a hardware error. I will try to find out if it is a single CPU being throttled.
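One way to check whether a single socket is affected would be to group the per-core frequency and thermal throttle counters from sysfs by physical package. The sketch below is untested on this host and rests on a few assumptions: the host exposes the usual Linux paths under /sys/devices/system/cpu (`topology/physical_package_id`, `cpufreq/scaling_cur_freq`, `thermal_throttle/core_throttle_count`), and the function name `per_package_stats` is made up for illustration. The throttle counters only cover thermal events, so a socket held down by a power limit might still show zero there.

```python
from collections import defaultdict
from pathlib import Path


def read_int(path):
    """Read an integer from a sysfs file, or None if missing/unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def per_package_stats(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Summarize current frequency and thermal throttle events per socket."""
    raw = defaultdict(lambda: {"cur_khz": [], "throttle_events": 0})
    for cpu_dir in sorted(Path(sysfs_cpu_root).glob("cpu[0-9]*")):
        package = read_int(cpu_dir / "topology" / "physical_package_id")
        if package is None:
            continue  # offline CPU or topology not exposed
        cur_khz = read_int(cpu_dir / "cpufreq" / "scaling_cur_freq")
        if cur_khz is not None:
            raw[package]["cur_khz"].append(cur_khz)
        events = read_int(cpu_dir / "thermal_throttle" / "core_throttle_count")
        if events is not None:
            raw[package]["throttle_events"] += events

    summary = {}
    for package, data in raw.items():
        freqs = data["cur_khz"]
        summary[package] = {
            "cpus": len(freqs),
            "min_mhz": min(freqs) / 1000 if freqs else None,
            "max_mhz": max(freqs) / 1000 if freqs else None,
            "throttle_events": data["throttle_events"],
        }
    return summary


if __name__ == "__main__":
    for package, stats in sorted(per_package_stats().items()):
        print(f"package {package}: {stats}")
```

Running it on mw2383 and mw2384 under comparable load and comparing the per-package maximums should show whether one socket or both are stuck around 1GHz.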