
Rename user creates lag on enwiki
Closed, Resolved · Public

Description

This rename of a user with 200K contributions created between 0 and 4 minutes of lag on all enwiki slaves.

Identify why and avoid it in the future.

Event Timeline

jcrespo created this task. Oct 23 2015, 7:32 PM
jcrespo raised the priority of this task to Needs Triage.
jcrespo updated the task description.
jcrespo added projects: Performance, DBA.
jcrespo added a subscriber: jcrespo.
Restricted Application added a subscriber: Aklapper. Oct 23 2015, 7:32 PM

@Legoktm told me that RenameUser.php does not wait for the slaves; fix that in the future.
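For context, the usual mitigation in MediaWiki (a rough sketch assuming the 1.26-era database API, not the actual RenameUser.php code) is to apply the rename in small batches and block on replication after every batch, e.g. with wfWaitForSlaves():

// Sketch only: table and field names are illustrative.
$dbw = wfGetDB( DB_MASTER );
do {
	$ids = $dbw->selectFieldValues(
		'revision', 'rev_id',
		[ 'rev_user_text' => $oldName ],
		__METHOD__,
		[ 'LIMIT' => 500 ]
	);
	if ( $ids ) {
		$dbw->update(
			'revision',
			[ 'rev_user_text' => $newName ],
			[ 'rev_id' => $ids ],
			__METHOD__
		);
		// Without this, the batches pile up on the master and the slaves
		// fall minutes behind, which is the lag reported above.
		wfWaitForSlaves();
	}
} while ( count( $ids ) );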

Legoktm set Security to None. Oct 23 2015, 7:39 PM
Legoktm added a subscriber: aaron.

Change 248486 had a related patch set uploaded (by Aaron Schulz):
Make jobs wait for slaves after each batch

https://gerrit.wikimedia.org/r/248486

Change 248486 merged by jenkins-bot:
Make jobs wait for slaves after each batch

https://gerrit.wikimedia.org/r/248486

Change 248583 had a related patch set uploaded (by Aaron Schulz):
Batch the updates done for page delete/restore handling

https://gerrit.wikimedia.org/r/248583

Change 248583 merged by jenkins-bot:
Batch the updates done for page delete/restore handling

https://gerrit.wikimedia.org/r/248583

aaron closed this task as Resolved. Nov 12 2015, 8:35 PM
aaron claimed this task.
Restricted Application added a subscriber: StudiesWorld. Nov 12 2015, 8:35 PM
jcrespo reopened this task as Open. Aug 1 2016, 4:17 PM

Reopened and merging T141340 here. It is not urgent, but needs investigation.

jcrespo triaged this task as Unbreak Now! priority. Aug 4 2016, 5:08 PM

For the second time in a week, this caused lag on db1036, affecting watchlists, contributions and recent changes for all s2 users. I think this merits an Unbreak Now.

(First one was T141340).

As a way to try to mitigate this (though it will not fully solve it), I will try to load-balance this host with another to give it double the resources (but this may not work).

Restricted Application added subscribers: Luke081515, TerraCodes. Aug 4 2016, 5:08 PM

Here is the list of queries (I didn't copy it here to avoid exposing account info). They are not individually that large; maybe a wait for slaves was missing (or does it not wait for slaves that have 0 weight but only a specific role?)

https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=22a0efeb9f2f362b3d3a04ebc9e34fdd&host=db1018&user=&schema=&hours=1

Change 302952 had a related patch set uploaded (by Jcrespo):
Pool db1074 as backup rc node in case db1036 lags

https://gerrit.wikimedia.org/r/302952

Change 302952 merged by Jcrespo:
Pool db1074 as backup rc node in case db1036 lags

https://gerrit.wikimedia.org/r/302952

aaron added a comment. Aug 4 2016, 7:51 PM

LoadBalancer::waitForAll() does not check 0-load slaves, so if such a slave can't keep up with the other DBs it would end up in trouble.

OTOH, we might not care about 'vslow'/'dump' so much, so checking them all might not be good either. Some middle ground would be best.
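To illustrate the failure mode described above (a simplified sketch, not the real LoadBalancer code): if the wait only loops over replicas that carry a non-zero generic load, a special-purpose host pooled at weight 0 (such as the rc/watchlist slave) is never waited on and can fall arbitrarily far behind.

// Simplified illustration only; waitForReplicaToCatchUp() is a hypothetical helper.
function waitForAllLoadedReplicas( array $serverLoads, $masterPos ) {
	foreach ( $serverLoads as $host => $load ) {
		if ( $load == 0 ) {
			// Skipped: lag on this host is invisible to the wait, even though
			// 'recentchanges'/'watchlist' group queries still hit it.
			continue;
		}
		waitForReplicaToCatchUp( $host, $masterPos );
	}
}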

jcrespo lowered the priority of this task from Unbreak Now! to High.Aug 4 2016, 8:26 PM

Knowing that, I can give the ones we care about a load of 1, fixing the issue by configuration (roughly as sketched below).

Going back to High because of the previous deployment.
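A configuration change of roughly this shape would do the "load of 1" part (hostnames, weights and exact array keys are illustrative, not the real operations/mediawiki-config db-eqiad.php values): give the special-purpose slave a small non-zero weight in the main load array so the wait covers it, while keeping the heavy query groups pinned to it.

// Illustrative sketch only, not the actual production configuration.
$wgLBFactoryConf['sectionLoads']['s2'] = [
	'db1018' => 0,   // master
	'db1036' => 1,   // rc/watchlist slave: non-zero so it is waited on
	'db1074' => 200, // general read traffic
];
$wgLBFactoryConf['groupLoadsBySection']['s2'] = [
	'recentchanges' => [ 'db1036' => 1 ],
	'watchlist'     => [ 'db1036' => 1 ],
];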

aaron closed this task as Resolved. Aug 4 2016, 8:40 PM

I opened T142135 instead, since this isn't specific to rename user.

A little while ago there was slave lag on Meta-Wiki (s7) that caused writes to be locked for more than three minutes. I suspect (but am not completely sure) that it was related to a global rename done at about the same time: https://meta.wikimedia.org/wiki/Special:CentralAuth/Derakhshan has 7000+ edits on fawiki, which is also on s7. Probably worth more investigation.

It's happening again now on s7.

aaron added a comment (edited). Aug 7 2016, 10:41 AM

I don't think a 7k-row rename matters. I do, however, see a huge spike in write query time warnings in the DBPerformance log on huwiki (also s7) at the moment. They come from recentchanges updates (rc_patrolled updates) in ChangeNotification jobs.

aaron added a comment. Aug 7 2016, 11:14 AM

I made https://gerrit.wikimedia.org/r/303417 for the above problem (the jobs triggered a hook that triggered the patrol flag callback).

I can confirm that the long-running writes on the master were from queries such as:

UPDATE /* RevisionReviewForm::updateRecentChanges */ 
`recentchanges` SET rc_patrolled = '1' WHERE rc_cur_id = '148' AND rc_type = '1' 
AND (rc_timestamp <= '20160724202301')

https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=cd173668b8769c5edcb653db93cd207c&host=db1041&user=&schema=&hours=6
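For reference, the usual shape of a fix for a statement like that (a sketch only; whether RevisionReviewForm was actually changed this way is not confirmed here) is to chunk the open-ended timestamp-range UPDATE by primary key, so each write stays short and replication can be waited on between chunks. The literal values are taken from the query above.

$dbw = wfGetDB( DB_MASTER );
do {
	$rcIds = $dbw->selectFieldValues(
		'recentchanges', 'rc_id',
		[
			'rc_cur_id' => 148,
			'rc_type' => 1,
			'rc_patrolled' => 0,
			'rc_timestamp <= ' . $dbw->addQuotes( '20160724202301' ),
		],
		__METHOD__,
		[ 'LIMIT' => 100 ]
	);
	if ( $rcIds ) {
		// Each UPDATE now touches at most 100 rows by primary key
		// instead of every unpatrolled row for the page at once.
		$dbw->update(
			'recentchanges',
			[ 'rc_patrolled' => 1 ],
			[ 'rc_id' => $rcIds ],
			__METHOD__
		);
		wfWaitForSlaves();
	}
} while ( count( $rcIds ) >= 100 );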