
Depooling single text caching server in esams had a disproportionate performance impact
Open, Medium, Public

Description

@ema depooled and repooled cp3050 on 2019-11-11, and during the interim period we saw a huge regression in end-user performance in Europe:


https://grafana.wikimedia.org/d/000000143/navigation-timing?orgId=1&from=1573451674225&to=1573523748836&var-source=navtiming2&var-metric=responseStart&var-percentile=p50

https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?panelId=54&fullscreen&orgId=1&from=1573466170251&to=1573497052164

There are 7 servers in that cluster; turning off a single one shouldn't cause an almost 2x increase in response time.
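
For a rough sense of scale, here is a minimal sketch of the load shift one would expect from depooling one of 7 servers. It assumes traffic is spread evenly across the pooled servers, which this task does not confirm, and it ignores cache hit rate changes entirely. The function name `per_server_load_factor` is made up for illustration.

```python
def per_server_load_factor(total_servers: int, depooled: int = 1) -> float:
    """Relative load on each remaining server after depooling.

    Assumption (not from the task): requests are distributed evenly
    across all pooled servers, before and after the depool.
    """
    remaining = total_servers - depooled
    if remaining <= 0:
        raise ValueError("at least one server must remain pooled")
    return total_servers / remaining


factor = per_server_load_factor(7)      # 7 servers, 1 depooled
increase_pct = (factor - 1) * 100
print(f"{factor:.3f}x per-server load, +{increase_pct:.1f}%")
# prints "1.167x per-server load, +16.7%"
```

Under that assumption each remaining server takes on about 17% more requests, far short of anything that would by itself explain a near doubling of responseStart.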

Event Timeline

Gilles created this task. Tue, Nov 12, 1:24 PM
Restricted Application added a project: Operations. Tue, Nov 12, 1:24 PM
Restricted Application added a subscriber: Aklapper.
ema moved this task from Triage to Caching on the Traffic board. Tue, Nov 12, 3:19 PM
ema updated the task description. Tue, Nov 12, 3:34 PM

Mentioned in SAL (#wikimedia-operations) [2019-11-12T16:09:45Z] <ema> depool cp3052 and observe performance impact T238085 before reimaging as text_ats T227432

ema triaged this task as Medium priority. Tue, Nov 12, 4:11 PM

Mentioned in SAL (#wikimedia-operations) [2019-11-12T17:03:41Z] <ema> pool cp3052 with ATS backend T238085

Gilles moved this task from Inbox to Radar on the Performance-Team board. Tue, Nov 12, 8:57 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.
CDanis added a subscriber: CDanis. Wed, Nov 20, 1:55 PM