
Rename user creates lag on enwiki
Closed, ResolvedPublic

Description

This rename of a user with 200K contributions created between 0 and 4 minutes of lag on all enwiki slaves.

Identify why and avoid it in the future.

Event Timeline

jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description. (Show Details)
jcrespo added projects: Performance Issue, DBA.
jcrespo added a subscriber: jcrespo.

@Legoktm told me that RenameUser.php does not wait for the slaves; fix that in the future.

Legoktm added a subscriber: aaron.

Change 248486 had a related patch set uploaded (by Aaron Schulz):
Make jobs wait for slaves after each batch

https://gerrit.wikimedia.org/r/248486
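The pattern in that patch can be sketched roughly as follows (an illustrative Python sketch, not the actual MediaWiki PHP code; all class and function names here are hypothetical): commit each small batch, then block until every replica has replayed it before starting the next one, so lag stays bounded instead of accumulating over a 200K-row rename.

```python
import time

class Replica:
    """Toy replica that catches up a little each time it is polled."""
    def __init__(self, name):
        self.name = name
        self.pos = 0

    def applied_pos(self):
        self.pos += 1  # simulate replaying one more batch per poll
        return self.pos

def wait_for_replicas(replicas, master_pos, timeout=5.0):
    """Block until every replica reports master_pos applied (hypothetical helper)."""
    deadline = time.monotonic() + timeout
    for r in replicas:
        while r.applied_pos() < master_pos:
            if time.monotonic() > deadline:
                raise TimeoutError(f"replica {r.name} is lagging")
            time.sleep(0.01)

def run_batched_update(row_ids, apply_batch, replicas, batch_size=500):
    """Apply a large write in batches, waiting for replicas after each one."""
    master_pos = 0
    for start in range(0, len(row_ids), batch_size):
        apply_batch(row_ids[start:start + batch_size])  # one small transaction
        master_pos += 1
        # The fix: pause until all replicas replay this batch, so lag
        # stays bounded instead of growing for the whole job.
        wait_for_replicas(replicas, master_pos)
```

Without the wait, a job can push batches at the master faster than the slowest replica can replay them, which is exactly the multi-minute lag seen on enwiki.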

Change 248486 merged by jenkins-bot:
Make jobs wait for slaves after each batch

https://gerrit.wikimedia.org/r/248486

Change 248583 had a related patch set uploaded (by Aaron Schulz):
Batch the updates done for page delete/restore handling

https://gerrit.wikimedia.org/r/248583

Change 248583 merged by jenkins-bot:
Batch the updates done for page delete/restore handling

https://gerrit.wikimedia.org/r/248583

aaron claimed this task.

Reopened and merging T141340 here. It is not urgent, but needs investigation.

jcrespo triaged this task as Unbreak Now! priority.Aug 4 2016, 5:08 PM

For the second time in a week, this caused lag on db1036, affecting all s2 users' watchlists, contributions and recent changes. I think this merits an unbreak now.

(First one was T141340).

As a way to mitigate this (though not a full fix), I will try to load-balance this host with another to give it double the resources (but this may not work).

Here is the list of queries (I didn't copy it here to avoid exposing account info). They are not individually that large; maybe a wait for slaves was missing (or it does not wait for slaves with 0 weight but a specific role?).

https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=22a0efeb9f2f362b3d3a04ebc9e34fdd&host=db1018&user=&schema=&hours=1

Change 302952 had a related patch set uploaded (by Jcrespo):
Pool db1074 as backup rc node in case db1036 lags

https://gerrit.wikimedia.org/r/302952

Change 302952 merged by Jcrespo:
Pool db1074 as backup rc node in case db1036 lags

https://gerrit.wikimedia.org/r/302952

LoadBalancer::waitForAll() does not check 0-load slaves, so if one of those can't keep up with the other DBs it ends up in trouble.

OTOH, we might not care about 'vslow'/'dump' so much, so checking them all might not be good either. Some middle ground would be best.
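That middle ground could be sketched like this (illustrative Python, not LoadBalancer's real API; the server-entry shape loosely mimics MediaWiki's $wgLBFactoryConf and is an assumption): wait on every replica that carries normal load, plus any 0-weight replica that serves a role we care about, while skipping hosts whose only roles are ignorable ones like 'vslow'/'dump'.

```python
IGNORED_GROUPS = {"vslow", "dump"}

def replicas_to_wait_for(servers):
    """Pick which replicas a waitForAll-style call should check.

    `servers` is a list of dicts (shape assumed for illustration):
    each has 'host', 'load', and an optional 'groupLoads' mapping
    of role name -> weight.
    """
    chosen = []
    for s in servers:
        groups = set(s.get("groupLoads", {}))
        if s["load"] > 0:
            chosen.append(s["host"])  # serves normal traffic: always wait
        elif groups - IGNORED_GROUPS:
            # 0-weight host that still serves a role we care about
            # (e.g. a dedicated recentchanges/watchlist node like db1036).
            chosen.append(s["host"])
        # else: vslow/dump-only host; lag there is tolerable
    return chosen
```

This would have covered db1036, which had 0 main load but served the recentchanges-style groups that lagged here.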

jcrespo lowered the priority of this task from Unbreak Now! to High.Aug 4 2016, 8:26 PM

Knowing that, I can give the hosts we care about a load of 1, fixing the issue via configuration.

Going back to high because of the previous deployment.

I opened T142135 instead, since this isn't specific to rename user.

There was slave lag on Meta-Wiki (s7) causing writes to be locked for more than three minutes a little while ago. I suspect (but am not completely sure) that it was also related to a global rename done at about the same time. https://meta.wikimedia.org/wiki/Special:CentralAuth/Derakhshan has 7000+ edits on fawiki, which is also s7. Probably worth more investigation.

It's happening again now on s7.

I don't think a 7k row rename matters. I do however see a huge spike in write query time warnings in DBPerformance log on huwiki (also s7) at the moment. They come from recentchanges updates (rc_patrolled updates) in ChangeNotification jobs.

I made https://gerrit.wikimedia.org/r/303417 for the above problem (the jobs triggered a hook that triggered the patrol flag callback).

I can confirm the long running writes on the master were from queries such as:

UPDATE /* RevisionReviewForm::updateRecentChanges */ 
`recentchanges` SET rc_patrolled = '1' WHERE rc_cur_id = '148' AND rc_type = '1' 
AND (rc_timestamp <= '20160724202301')

https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=cd173668b8769c5edcb653db93cd207c&host=db1041&user=&schema=&hours=6
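One way such a long-running UPDATE can be broken up (a sketch only, not necessarily what the Gerrit change above does; the helper name and pre-selection of rc_id values are assumptions) is to first SELECT the matching primary keys with a cheap read, then update them in small IN-list batches so no single write transaction holds row locks for minutes:

```python
def batched_patrol_updates(rc_ids, batch_size=100):
    """Split one big rc_patrolled UPDATE into many small ones by rc_id.

    Instead of a single statement locking every row matching the
    rc_cur_id/rc_type/rc_timestamp condition, each emitted statement
    touches at most `batch_size` rows. `rc_ids` is assumed to come
    from a prior SELECT using the original WHERE clause.
    """
    for start in range(0, len(rc_ids), batch_size):
        ids = rc_ids[start:start + batch_size]
        placeholders = ",".join(str(i) for i in ids)
        yield (f"UPDATE recentchanges SET rc_patrolled = 1 "
               f"WHERE rc_id IN ({placeholders})")
```

Combined with a wait-for-slaves call between batches, this keeps both master lock time and replica lag small.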