
s3 database resource usage and contention increased 2-10x
Closed, Resolved, Public

Description

Happening since 20:51-20:57 2016-12-13 UTC. Reverting 1.29.0-wmf.6 -> wmf.5 made no difference:

Event Timeline

jcrespo created this task. Dec 14 2016, 10:07 AM
Restricted Application added a subscriber: Aklapper. Dec 14 2016, 10:07 AM
jcrespo updated the task description. Dec 14 2016, 10:23 AM

After a first look, it is not 100% clear that this is due to the train deployment: the issue seems to continue, but there is no appreciable load on group0 wikis. The load on s3 is very timezone-sensitive, so there are no firm conclusions yet. @Legoktm and I commented that it would be a huge coincidence if the issue had started at the same time as the scap.

Are there any async jobs running along with group0?

This may be related:
After deploying the train, I noticed many occurrences of this error in logstash: pl proc line: 2959: warning: points must have either 4 or 2 (and I filed T153159)

mmodell triaged this task as High priority. Dec 14 2016, 7:50 PM

@jcrespo: I can revert and see what happens?

Mentioned in SAL (#wikimedia-operations) [2016-12-14T20:13:16Z] <twentyafterfour@tin> rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.5 refs T153184

I just reverted; waiting a while to see if it makes any difference in the graphs linked above.

It doesn't appear to be helping.

demon added a subscriber: demon. Dec 14 2016, 8:19 PM

are there any async jobs running along with group0?
This may be related:
After deploying the train, I noticed a lot of this error in logstash: pl proc line: 2959: warning: points must have either 4 or 2 ( and I filed T153159 )

It's not related; that's been going on for months now. It has something to do with ploticus (for the Timeline extension). I'm pretty sure we see it during deploys because some cache gets invalidated.

OK, so the rollback of the train hasn't helped. Any other ideas about what could be causing this?

hashar added a subscriber: hashar. Dec 14 2016, 8:41 PM

On s3 we had the same bump in traffic over the weekend, from Dec 10 08:55 UTC until Dec 11 20:55 UTC.

Removing from deployment blockers because this is apparently unrelated to the new branch.

jcrespo claimed this task. Dec 14 2016, 9:23 PM
jcrespo updated the task description.
jcrespo removed a project: Release-Engineering-Team.

Thanks for checking. I really mean it. I will continue investigating to find the source of this overhead.

@jcrespo: Glad to help, that's what I'm here for :)

This maintenance (T152761#2874723) would explain the extra resource usage.

kaldari closed this task as Resolved. Dec 15 2016, 6:12 PM
kaldari added a subscriber: kaldari.

The script is finally finished.