thumbor1003 behaves differently than other thumbor hosts
Closed, Resolved, Public

Description

It gets far more process restarts than the other hosts, with what looks like twice the load and CPU usage, plus spiky IOPS. I wonder if something is up with its hardware.

Gilles created this task. Sep 5 2017, 8:00 AM
Restricted Application added a subscriber: Aklapper. Sep 5 2017, 8:00 AM
Gilles moved this task from Inbox to Radar on the Performance-Team board. Sep 6 2017, 7:28 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.

For some reason, the MemoryLimit=15% change from https://gerrit.wikimedia.org/r/#/c/367373/ doesn't seem to have been applied on thumbor1003, and that causes I/O spikes and additional latency.
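One way to confirm whether a limit is actually in effect is to look at what systemd reports for the unit, e.g. the output of `systemctl show -p MemoryLimit <unit>`. The sketch below only parses that KEY=VALUE text. It assumes that an unset limit shows up as UINT64_MAX (18446744073709551615) on older systemd releases or as `infinity` on newer ones; the function names are hypothetical, and this is not necessarily how the problem was spotted here.

```python
from typing import Dict

# Assumption: an unset MemoryLimit is reported as UINT64_MAX by older
# systemd releases and as "infinity" by newer ones.
UNLIMITED = {"infinity", str(2**64 - 1)}


def parse_show_output(text: str) -> Dict[str, str]:
    """Parse `systemctl show` KEY=VALUE lines into a dict."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first "=" only; values may themselves contain "=".
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def memory_limit_applied(show_output: str) -> bool:
    """Return True if the unit reports a finite MemoryLimit."""
    value = parse_show_output(show_output).get("MemoryLimit", "")
    return value != "" and value not in UNLIMITED
```

For instance, feeding it the output captured from each thumbor unit on every host would make a host where the setting never took effect stand out as reporting an unlimited value.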

I suspected it was something like that :) Should be an easy fix, then!

Change 377264 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: use memorysize_mb fact for unit MemoryLimit

https://gerrit.wikimedia.org/r/377264
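The patch title suggests replacing the percentage with an absolute size derived from Facter's `memorysize_mb` fact, presumably because jessie's systemd did not honour `MemoryLimit=15%`. The arithmetic is sketched below under that assumption. The function name is hypothetical, and the real puppet template may round or format the value differently.

```python
import math
from typing import Union


def memory_limit_directive(memorysize_mb: Union[str, float], percent: float) -> str:
    """Render a MemoryLimit= line as an absolute size in megabytes.

    memorysize_mb mirrors the Facter fact, which is reported as a string
    such as "64420.30" (MiB); systemd's "M" suffix is also base 1024.
    """
    total_mb = float(memorysize_mb)
    if total_mb <= 0 or not 0 < percent <= 100:
        raise ValueError("need a positive memory size and 0 < percent <= 100")
    # Round down so the limit never exceeds the requested share of RAM.
    limit_mb = math.floor(total_mb * percent / 100)
    return f"MemoryLimit={limit_mb}M"
```

On a hypothetical host reporting `memorysize_mb` as "64420.30", a 15% share renders as `MemoryLimit=9663M`, which any systemd version that accepts size suffixes can parse.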

Change 377264 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: use memorysize_mb fact for unit MemoryLimit

https://gerrit.wikimedia.org/r/377264

fgiunchedi closed this task as Resolved. Sep 11 2017, 3:03 PM

Indeed, latency is now the same across all hosts, and I've deployed a fix so that MemoryLimit actually does the right thing with jessie's systemd. Resolving.