Page MenuHomePhabricator

(euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers
Closed, ResolvedPublic

Description

Since august 21st (after the train was deployed to group1, but not necessarily because of that), we've started seeing occasions in which the two full-php7 api servers with high traffic would:

  • Spike up in cpu
  • Occupy all the fpm workers as busy
  • raise the response times.

The weirdest part of it is the slowdowns appear to happen in tandem between the two servers, at the same times.

This is also probably the cause of the slowdowns noticed by @Tarrow in T230976

We've identified most slow requests seem to come from parsoid, but I will be more precise when the investigation happens.

T231011.png (1×3 px, 348 KB)

It appears that this is related when templates change on euwiki (just a theory). Even if this is due to templates changes, it shouldn't be slowing down our API as much as it does

Event Timeline

Joe triaged this task as High priority.Aug 22 2019, 2:20 PM

Mentioned in SAL (#wikimedia-operations) [2019-08-22T14:37:19Z] <_joe_> restarting php-fpm on mw1348 to observe the effect on the slowdown, T231011

First smoking gun is in all the intervals I controlled the offender was parsoid-batch with quite large requests. I'm trying to gathering more cases to create a better statistics.

Mentioned in SAL (#wikimedia-operations) [2019-08-29T08:11:21Z] <_joe_> disabling zend GC on mw1347, testing an hypothesis for T231011

Looks like this is back since yesterday:

gets.png (1×3 px, 170 KB)

apiload.png (943×3 px, 1 MB)

Logs show many requests that take a long time to finish are of type action=parsoid-batch on euwiki version 1.34.0-wmf.23

jijiki renamed this task from Mysterious, coordinated slowdowns every ~ 25 minutes on mw1347,mw1348 (php7 api servers) to Mysterious, coordinated slowdowns every ~ 25 minutes on API servers.Oct 29 2019, 1:29 PM

@Ladsgroup might be on the correct track here. The spikes neatly coincide with the WD bot updates.

@Ladsgroup might be on the correct track here. The spikes neatly coincide with the WD bot updates.

The patch that actually smooths this just got merged. Planning to backport it now.

Mentioned in SAL (#wikimedia-operations) [2019-11-14T16:04:20Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikibase: [[gerrit:550828|Put a layer of APC cache on top of reading wb_terms in SqlEntityInfoBuilder]] (T231011 T229407 T236681), Try II (duration: 00m 56s)

@Theklan might be interested in watching this task too.

jijiki renamed this task from Mysterious, coordinated slowdowns every ~ 25 minutes on API servers to (euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers.Nov 15 2019, 10:51 AM
jijiki updated the task description. (Show Details)

Last days I have been heavily merging some templates, and it may be related.

@Theklan We suspect it is something on the production side since we have noticed this behaviour in the past. Moreover, this is not the only wiki where templates are changed, but it manages to standout in our investigations, which shouldn't happen. Nevertheless, thanks for letting us know :)

Ok! Now there's nothing special happening at euwiki and there are not new merges happening, so it should be another thing.

Anybody knows if this is still an issue nowadays?

Joe claimed this task.

Yes, this is yet another WTF resolved by killing parsoid-js.