Rename user creates lag on enwiki
Closed, ResolvedPublic

Description

This rename of a user with 200K contributions created between 0 and 4 minutes of lag on all enwiki slaves.

Identify why and avoid it in the future.

Event Timeline

jcrespo raised the priority of this task to Needs Triage.
jcrespo updated the task description.
jcrespo added projects: Performance Issue, DBA.
jcrespo subscribed.

@Legoktm told me that RenameUser.php does not wait for the slaves; we should fix that in the future.

Legoktm added a subscriber: aaron.

Change 248486 had a related patch set uploaded (by Aaron Schulz):
Make jobs wait for slaves after each batch

https://gerrit.wikimedia.org/r/248486

Change 248486 merged by jenkins-bot:
Make jobs wait for slaves after each batch

https://gerrit.wikimedia.org/r/248486
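
For context, here is a minimal sketch of the pattern this patch is about, assuming a job that rewrites rows in chunks. The table, columns, $oldName/$newName and batch size are hypothetical illustrations, not the contents of the actual change:

// Illustrative sketch only, not the code of change 248486.
$batchSize = 500; // hypothetical chunk size
$dbw = wfGetDB( DB_MASTER );
do {
	// Fetch the next chunk of rows to rewrite (hypothetical table/conditions).
	$ids = $dbw->selectFieldValues(
		'revision',
		'rev_id',
		[ 'rev_user_text' => $oldName ],
		__METHOD__,
		[ 'LIMIT' => $batchSize ]
	);
	if ( $ids ) {
		$dbw->update(
			'revision',
			[ 'rev_user_text' => $newName ],
			[ 'rev_id' => $ids ],
			__METHOD__
		);
		// The point of the patch: pause until replicas catch up after each
		// batch, instead of letting the master run far ahead of its slaves.
		wfWaitForSlaves();
	}
} while ( count( $ids ) === $batchSize );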

Change 248583 had a related patch set uploaded (by Aaron Schulz):
Batch the updates done for page delete/restore handling

https://gerrit.wikimedia.org/r/248583

Change 248583 merged by jenkins-bot:
Batch the updates done for page delete/restore handling

https://gerrit.wikimedia.org/r/248583

aaron claimed this task.

Reopened and merging T141340 here. It is not urgent, but needs investigation.

jcrespo triaged this task as Unbreak Now! priority. Aug 4 2016, 5:08 PM

For the second time in a week, this caused lag on db1036, affecting watchlists, contributions and recent changes for all s2 users. I think this merits Unbreak Now.

(First one was T141340).

As a way to mitigate this (though not fully solve it), I will try to load-balance this host with another one to give it double the resources, but this may not work.

Here is the list of queries (I didn't copy it here to avoid exposing account info). They are not individually that large; maybe a wait for slaves was missing (or does it not wait for slaves that have 0 weight but only a specific role?):

https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=22a0efeb9f2f362b3d3a04ebc9e34fdd&host=db1018&user=&schema=&hours=1

Change 302952 had a related patch set uploaded (by Jcrespo):
Pool db1074 as backup rc node in case db1036 lags

https://gerrit.wikimedia.org/r/302952

Change 302952 merged by Jcrespo:
Pool db1074 as backup rc node in case db1036 lags

https://gerrit.wikimedia.org/r/302952

LoadBalancer::waitForAll() does not check 0-load slaves, so if such a slave can't keep up with the other DBs it ends up in trouble.

OTOH, we might not care about 'vslow'/'dump' so much, so checking them all might not be good either. Some middle ground would be best.
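
To make the suggestion concrete, here is a rough sketch of that middle ground using made-up data structures rather than the real LoadBalancer internals (the $replicas shape, group names and the lag callback are all assumptions for illustration):

// Hypothetical illustration, not actual LoadBalancer code: wait on every
// replica, including ones with 0 general load, unless the host only serves
// groups whose lag we tolerate (e.g. 'vslow'/'dump').
function waitForRelevantReplicas( array $replicas, array $ignoredGroups, callable $getLagSeconds, $maxLag = 5 ) {
	foreach ( $replicas as $replica ) {
		// A host with 0 general load but a 'recentchanges' or 'watchlist'
		// group still matters to users, so it should not be skipped.
		$relevantGroups = array_diff( $replica['groups'], $ignoredGroups );
		if ( $replica['load'] == 0 && !$relevantGroups ) {
			continue; // pure vslow/dump host: lag here is acceptable
		}
		while ( $getLagSeconds( $replica['host'] ) > $maxLag ) {
			sleep( 1 );
		}
	}
}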

jcrespo lowered the priority of this task from Unbreak Now! to High. Aug 4 2016, 8:26 PM

Knowing that, I can put the slaves we care about at load 1, fixing the issue by configuration (see the sketch after this comment).

Going back to high because of the previous deployment.
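
For illustration, the configuration-only fix could look roughly like this in LBFactoryMulti-style settings; the load values here are made up and only db1036/db1074 are taken from this task:

// Illustrative excerpt, not the deployed wmf-config values.
// Giving the recentchanges hosts a small non-zero general load means
// wait-for-slaves checks no longer skip them.
$wgLBFactoryConf['sectionLoads']['s2'] = [
	// master and the other replicas omitted
	'db1036' => 1, // rc host: was 0, now counted when waiting for slaves
	'db1074' => 1, // backup rc host (change 302952)
];
$wgLBFactoryConf['groupLoadsBySection']['s2'] = [
	'recentchanges' => [ 'db1036' => 1, 'db1074' => 1 ],
	'watchlist' => [ 'db1036' => 1, 'db1074' => 1 ],
];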

I opened T142135 instead, since this isn't specific to rename user.

There was slave lag on Meta-Wiki (s7) a little while ago, causing writes to be locked for more than three minutes. I suspect (though I am not completely sure) that it was also related to a global rename done at about the same time: https://meta.wikimedia.org/wiki/Special:CentralAuth/Derakhshan shows 7000+ edits on fawiki, which is also on s7. Probably worth more investigation.

It's happening again now on s7.

I don't think a 7k-row rename matters. I do, however, see a huge spike in write query time warnings in the DBPerformance log on huwiki (also s7) at the moment. They come from recentchanges updates (rc_patrolled updates) in ChangeNotification jobs.

I made https://gerrit.wikimedia.org/r/303417 for the above problem (the jobs triggered a hook that triggered the patrol flag callback).

I can confirm the long-running writes on the master were from queries such as:

UPDATE /* RevisionReviewForm::updateRecentChanges */ 
`recentchanges` SET rc_patrolled = '1' WHERE rc_cur_id = '148' AND rc_type = '1' 
AND (rc_timestamp <= '20160724202301')

https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=cd173668b8769c5edcb653db93cd207c&host=db1041&user=&schema=&hours=6