Page MenuHomePhabricator

(euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers
Open, HighPublic

Description

Since august 21st (after the train was deployed to group1, but not necessarily because of that), we've started seeing occasions in which the two full-php7 api servers with high traffic would:

  • Spike up in cpu
  • Occupy all the fpm workers as busy
  • raise the response times.

The weirdest part of it is the slowdowns appear to happen in tandem between the two servers, at the same times.

This is also probably the cause of the slowdowns noticed by @Tarrow in T230976

We've identified most slow requests seem to come from parsoid, but I will be more precise when the investigation happens.

It appears that this is related when templates change on euwiki (just a theory). Even if this is due to templates changes, it shouldn't be slowing down our API as much as it does

Event Timeline

Joe created this task.Aug 22 2019, 2:20 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 22 2019, 2:20 PM
Joe triaged this task as High priority.Aug 22 2019, 2:20 PM
jijiki updated the task description. (Show Details)Aug 22 2019, 2:23 PM
jijiki added a subscriber: jijiki.

Mentioned in SAL (#wikimedia-operations) [2019-08-22T14:37:19Z] <_joe_> restarting php-fpm on mw1348 to observe the effect on the slowdown, T231011

Joe added a comment.Aug 23 2019, 1:35 PM

First smoking gun is in all the intervals I controlled the offender was parsoid-batch with quite large requests. I'm trying to gathering more cases to create a better statistics.

ema added a subscriber: ema.Aug 26 2019, 4:38 PM
CDanis added a subscriber: CDanis.Aug 27 2019, 1:41 PM

Mentioned in SAL (#wikimedia-operations) [2019-08-29T08:11:21Z] <_joe_> disabling zend GC on mw1347, testing an hypothesis for T231011

Krinkle moved this task from Untriaged to Other on the PHP 7.2 support board.Sep 6 2019, 12:24 AM
jijiki moved this task from Backlog to Next up on the serviceops board.Sep 13 2019, 4:34 AM

Looks like this is back since yesterday:


Logs show many requests that take a long time to finish are of type action=parsoid-batch on euwiki version 1.34.0-wmf.23

jijiki renamed this task from Mysterious, coordinated slowdowns every ~ 25 minutes on mw1347,mw1348 (php7 api servers) to Mysterious, coordinated slowdowns every ~ 25 minutes on API servers.Oct 29 2019, 1:29 PM

@Ladsgroup might be on the correct track here. The spikes neatly coincide with the WD bot updates.

@Ladsgroup might be on the correct track here. The spikes neatly coincide with the WD bot updates.

The patch that actually smooths this just got merged. Planning to backport it now.

Mentioned in SAL (#wikimedia-operations) [2019-11-14T16:04:20Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikibase: [[gerrit:550828|Put a layer of APC cache on top of reading wb_terms in SqlEntityInfoBuilder]] (T231011 T229407 T236681), Try II (duration: 00m 56s)

Joe added a subscriber: Theklan.Thu, Nov 14, 9:57 PM

@Theklan might be interested in watching this task too.

jijiki renamed this task from Mysterious, coordinated slowdowns every ~ 25 minutes on API servers to (euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers.Fri, Nov 15, 10:51 AM
jijiki updated the task description. (Show Details)

Last days I have been heavily merging some templates, and it may be related.

@Theklan We suspect it is something on the production side since we have noticed this behaviour in the past. Moreover, this is not the only wiki where templates are changed, but it manages to standout in our investigations, which shouldn't happen. Nevertheless, thanks for letting us know :)

Ok! Now there's nothing special happening at euwiki and there are not new merges happening, so it should be another thing.