Maniphest T196252

Labservices1001 crashing, probable overheating
Closed, Resolved · Public

Assigned To

Authored By

	Andrew
	Jun 3 2018, 2:26 AM

Description

Just now, at 8:44 PM CDT, Labservices1001 froze up and stopped responding.

I'm trained to ignore temperature warnings in the syslog, but here's the syslog from just before the freeze:

Jun  3 01:36:36 labservices1001 pdns[1750]: gmysql Connection successful. Connected to database 'pdns' on 'localhost'.
Jun  3 01:36:36 labservices1001 pdns[1750]: AXFR done for 'eqiad.wmflabs', zone committed with serial number 1527989790
Jun  3 01:36:36 labservices1001 pdns[1750]: AXFR started for 'eqiad.wmflabs'
Jun  3 01:36:36 labservices1001 pdns[1750]: Transaction started for 'eqiad.wmflabs'
Jun  3 01:36:37 labservices1001 pdns[1750]: AXFR done for 'eqiad.wmflabs', zone committed with serial number 1527989790
Jun  3 01:37:01 labservices1001 CRON[48470]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958866] CPU1: Package temperature above threshold, cpu clock throttled (total events = 367)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958872] CPU0: Package temperature above threshold, cpu clock throttled (total events = 367)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958877] CPU4: Package temperature above threshold, cpu clock throttled (total events = 367)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958882] CPU2: Package temperature above threshold, cpu clock throttled (total events = 366)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958890] CPU5: Package temperature above threshold, cpu clock throttled (total events = 367)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958896] CPU3: Package temperature above threshold, cpu clock throttled (total events = 367)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958901] CPU6: Package temperature above threshold, cpu clock throttled (total events = 367)
Jun  3 01:37:54 labservices1001 kernel: [4358342.958906] CPU7: Package temperature above threshold, cpu clock throttled (total events = 367)
Jun  3 01:37:54 labservices1001 kernel: [4358342.959793] CPU0: Package temperature/speed normal
Jun  3 01:37:54 labservices1001 kernel: [4358342.959795] CPU4: Package temperature/speed normal
Jun  3 01:37:54 labservices1001 kernel: [4358342.959811] CPU2: Package temperature/speed normal
Jun  3 01:37:54 labservices1001 kernel: [4358342.959826] CPU3: Package temperature/speed normal
Jun  3 01:37:54 labservices1001 kernel: [4358342.959830] CPU5: Package temperature/speed normal
Jun  3 01:37:54 labservices1001 kernel: [4358342.959835] CPU6: Package temperature/speed normal
Jun  3 01:37:54 labservices1001 kernel: [4358342.959839] CPU7: Package temperature/speed normal
Jun  3 01:37:54 labservices1001 kernel: [4358343.038185] CPU1: Package temperature/speed normal
Jun  3 01:38:01 labservices1001 CRON[48620]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jun  3 01:39:01 labservices1001 CRON[48730]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jun  3 01:40:01 labservices1001 CRON[48878]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jun  3 01:41:01 labservices1001 CRON[49009]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jun  3 01:42:01 labservices1001 CRON[49137]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)

(then silence until I rebooted it)
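
For anyone who wants to pull the throttle events out of a syslog like the one above rather than eyeballing it, here is a minimal Python sketch. The regex is modelled only on the kernel lines quoted in this task, and `summarize_throttling` is a hypothetical helper name, not an existing tool on the host.

```python
import re

# Pattern modelled on the kernel lines quoted in this task (an assumption;
# other kernel versions may word these messages differently).
THROTTLE_RE = re.compile(
    r"kernel: \[\s*[\d.]+\] CPU(?P<cpu>\d+): (?P<scope>Package|Core) temperature "
    r"above threshold, cpu clock throttled \(total events = (?P<total>\d+)\)"
)


def summarize_throttling(lines):
    """Return {cpu_number: highest 'total events' counter seen} for throttle lines."""
    totals = {}
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match:
            cpu = int(match.group("cpu"))
            totals[cpu] = max(totals.get(cpu, 0), int(match.group("total")))
    return totals


sample = [
    "Jun  3 01:37:54 labservices1001 kernel: [4358342.958866] CPU1: Package temperature above threshold, cpu clock throttled (total events = 367)",
    "Jun  3 01:37:54 labservices1001 kernel: [4358342.958882] CPU2: Package temperature above threshold, cpu clock throttled (total events = 366)",
    "Jun  3 01:37:54 labservices1001 kernel: [4358342.959793] CPU0: Package temperature/speed normal",
]
print(summarize_throttling(sample))  # {1: 367, 2: 366}
```

The "temperature/speed normal" lines are deliberately ignored, so the result only reflects how often each CPU has been throttled since boot.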

Details

	Subject	Repo	Branch	Lines +/-
	Openstack: page wmcs team for host failures	operations/puppet	production	+15 -0


Related Objects

Mentioned In: T163402: Ensure we can survive a loss of labservices1001

Event Timeline

Andrew created this task. Jun 3 2018, 2:26 AM

Restricted Application added a subscriber: Aklapper. Jun 3 2018, 2:26 AM

Krenair subscribed. Jun 3 2018, 2:53 AM

Paladox subscribed. Jun 3 2018, 7:36 AM

This server is out of warranty. In the past I've reapplied thermal paste. Let me know if you would like to schedule a time to do that.

• Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. Jun 4 2018, 3:34 PM

In the past I've reapplied thermal paste. Let me know if you would like to schedule a time to do that.

You've done that on this exact server? If so, is it likely that a second attempt would make a difference?

Oh, then the issue is probably not related to the thermal paste. Was something else going on that would stress the CPU at that time?

Sorry, @Cmjohnson, to be clear I was asking if you've already replaced the paste on this server, not saying that I think you have.

A note on not-paging for our weekly meeting: https://phabricator.wikimedia.org/T152368#2849231

@Andrew sorry, I did misunderstand. I need to purchase more thermal paste. Can this wait a few days?

Can this wait a few days?

Sure.

• Cmjohnson moved this task from Up next to High Priority Task on the ops-eqiad board. Jun 5 2018, 6:17 PM

Joe triaged this task as Medium priority. Jun 18 2018, 12:42 PM

• Cmjohnson moved this task from High Priority Task to Cloud Tasks on the ops-eqiad board. Jun 26 2018, 3:51 PM

@Andrew We need thermal paste. I have created a procurement task https://phabricator.wikimedia.org/T198326. Once it arrives I will ping you regarding a good day/time to power off.

@Cmjohnson Sounds good, thanks for the update. That server seems to be holding steady for now.

• Vvjjkkii renamed this task from Labservices1001 crashed to vrbaaaaaaa. Jul 1 2018, 1:06 AM

• Vvjjkkii raised the priority of this task from Medium to High.

• Vvjjkkii added projects: CheckUser, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), Tamil-Sites, Gamepress, Hashtags, Jade, KartoEditor, Language-2018-Apr-June, New-Editor-Experiences, Mail, TCB-Team (now WMDE-TechWish).

• Vvjjkkii updated the task description.

• Vvjjkkii removed a subscriber: Aklapper.

This just happened again:

Jul  1 02:32:51 labservices1001 pdns[2129]: Domain 'deployment-prep.wmflabs.org' is fresh (not presigned, no RRSIG check)
Jul  1 02:32:51 labservices1001 pdns[2129]: Domain 'design.wmflabs.org' is fresh (not presigned, no RRSIG check)
Jul  1 02:32:51 labservices1001 pdns[2129]: Domain 'discovery-stats.wmflabs.org' is fresh (not presigned, no RRSIG check)
Jul  1 02:33:01 labservices1001 CRON[45175]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jul  1 02:34:01 labservices1001 CRON[45298]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ [long run of NUL bytes trimmed] ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Jul  1 02:47:07 labservices1001 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="935" x-info="http://www.rsyslog.com"] start
Jul  1 02:47:07 labservices1001 rsyslogd: rsyslogd's groupid changed to 104
Jul  1 02:47:07 labservices1001 rsyslogd: rsyslogd's userid changed to 101
Jul  1 02:47:07 labservices1001 kernel: [    0.000000] Initializing cgroup subsys cpuset
Jul  1 02:47:07 labservices1001 kernel: [    0.000000] Initializing cgroup subsys cpu
Jul  1 02:47:07 labservices1001 kernel: [    0.000000] Initializing cgroup subsys cpuacct

Change 443448 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Openstack: page wmcs team for any host failure

https://gerrit.wikimedia.org/r/443448

gerritbot added a project: Patch-For-Review. Jul 2 2018, 3:42 PM

Change 443448 merged by Andrew Bogott:
[operations/puppet@production] Openstack: page wmcs team for host failures

https://gerrit.wikimedia.org/r/443448

Krenair mentioned this in T163402: Ensure we can survive a loss of labservices1001. Jul 13 2018, 8:36 PM

This just happened again -- thanks to better paging I caught it sooner :) There's nothing of interest in the syslog, just a sudden stop:

Aug  5 02:32:06 labservices1001 pdns[22726]: While checking domain freshness: Query to '208.80.155.117:5354' for SOA of 'wikivoyage.wmflabs.org' produced no results (error code: Refused)
Aug  5 02:32:06 labservices1001 pdns[22726]: While checking domain freshness: Query to '208.80.155.117:5354' for SOA of 'hashtags.wmflabs.org' produced no results (error code: Refused)
Aug  5 02:32:09 labservices1001 pdns[22726]: Received serial number updates for 0 zones, had 24 timeouts
Aug  5 02:33:01 labservices1001 CRON[30537]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Aug  5 02:43:28 labservices1001 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="928" x-info="http://www.rsyslog.com"] start
Aug  5 02:43:28 labservices1001 rsyslogd: rsyslogd's groupid changed to 104
Aug  5 02:43:28 labservices1001 rsyslogd: rsyslogd's userid changed to 101
Aug  5 02:43:28 labservices1001 kernel: [    0.000000] Initializing cgroup subsys cpuset
Aug  5 02:43:28 labservices1001 kernel: [    0.000000] Initializing cgroup subsys cpu
Aug  5 02:43:28 labservices1001 kernel: [    0.000000] Initializing cgroup subsys cpuacct
Aug  5 02:43:28 labservices1001 kernel: [    0.000000] Linux version 3.13.0-143-generic (buildd@lcy01-amd64-010) (gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.4) ) #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 (Ubuntu 3.13.0-143.192-generic 3.13.11-ckt39)

Ah, there were some temp warnings a few minutes earlier:

Aug  5 02:29:02 labservices1001 kernel: [3025868.972351] CPU3: Core temperature above threshold, cpu clock throttled (total events = 1981)
Aug  5 02:29:02 labservices1001 kernel: [3025868.982073] CPU3: Package temperature above threshold, cpu clock throttled (total events = 2491)

etc.
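
Both crashes leave the same signature in the syslog: a run of NUL bytes (shown as `^@` by most viewers) immediately followed by an `rsyslogd ... start` line. A rough Python sketch for finding those spots and measuring how long the host was silent follows. It assumes the classic syslog timestamp format seen in this task; because that format omits the year, the caller has to supply it. `find_unclean_restarts` and `parse_ts` are hypothetical names used for illustration.

```python
from datetime import datetime


def parse_ts(line, year):
    """Parse a leading 'Mon D HH:MM:SS' syslog timestamp; the year is supplied by the caller."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def find_unclean_restarts(lines, year):
    """Return (last_entry_before_gap, rsyslogd_start) pairs for NUL-padded restarts."""
    restarts = []
    last_seen = None
    saw_nuls = False
    for raw in lines:
        if "\x00" in raw:
            saw_nuls = True
        line = raw.replace("\x00", "").strip()
        if not line:
            continue  # line held only NUL bytes
        try:
            ts = parse_ts(line, year)
        except ValueError:
            continue  # not a syslog-style line; skip it
        if saw_nuls and "rsyslogd:" in line and line.endswith("start"):
            if last_seen is not None:
                restarts.append((last_seen, ts))
        saw_nuls = False
        last_seen = ts
    return restarts


sample = [
    "Aug  5 02:33:01 labservices1001 CRON[30537]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)",
    "\x00" * 16
    + 'Aug  5 02:43:28 labservices1001 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="928" x-info="http://www.rsyslog.com"] start',
]
for before, after in find_unclean_restarts(sample, 2018):
    print(before, "->", after, "silent for", after - before)
```

The reported gap is only an upper bound on the downtime, since the last entry before the NULs is simply the last message that reached the disk before the freeze.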

Andrew renamed this task from Labservices1001 crashed to Labservices1001 crashing, probable overheating. Aug 5 2018, 2:49 AM

Andrew raised the priority of this task from Medium to High. Aug 5 2018, 3:02 AM

So, this thread is mildly confusing. From what I can see, labservices1001 (warranty expired 2017-04) had its thermal paste replaced at a previous time:

In T196252#4254133, @Andrew wrote:

In the past I've reapplied thermal paste. Let me know if you would like to schedule a time to do that.

You've done that on this exact server? If so, is it likely that a second attempt would make a difference?

However, it seems worth doing again, as requested by @Andrew and the cloud team via the SRE weekly meeting.

IRC Sync: @Andrew will schedule downtime with @Cmjohnson. I'm assigning this to @Andrew until the time is scheduled.

Mentioned in SAL (#wikimedia-operations) [2018-08-06T16:53:08Z] <andrewbogott> power down labservices1001 for thermal paste fix, T196252

Added thermal paste.

Hopefully resolved; we'll see if it overheats again. Thanks @Cmjohnson
