|Declined||None||T3268 Database replication lag issues (tracking)|
|Duplicate||None||T108551 Database locked error while publishing article using CX|
|Resolved||aaron||T95501 Fix causes of replica lag and get it to under 5 seconds at peak|
|Resolved||aaron||T116425 Rename user creates lag on enwiki|
For the second time in a week, this caused lag on db1036, affecting watchlists, contributions and recent changes for all s2 users. I think this merits an unbreak now.
(First one was T141340).
To try to mitigate this (though it will not fully solve it), I will try to load balance this host with another one to give it double the resources, but this may not work.
Here is the list of queries (I didn't copy it here to avoid exposing account info). They are not individually that large; maybe a wait for slaves was missing (or it does not wait for slaves with 0 weight but a specific role?).
LoadBalancer::waitForAll() does not check 0-load slaves, so if such a slave can't keep up with the other DBs, it ends up in trouble.
OTOH, we might not care about 'vslow'/'dump' so much, so checking them all might not be good either. Some middle ground would be best.
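To make the "middle ground" idea concrete, here is a minimal sketch in Python (not MediaWiki code). It assumes each replica has a general load weight and a set of query groups. It waits on every replica with non-zero load, plus zero-load replicas that serve at least one group outside a lag-tolerant set. The `Replica` class, the `LAG_TOLERANT_GROUPS` set, the host names and the `replicas_to_wait_for` function are all hypothetical names used for illustration only.

```python
from dataclasses import dataclass

# Assumption: groups where some lag is acceptable, so writers need not wait on them.
LAG_TOLERANT_GROUPS = frozenset({"vslow", "dump"})


@dataclass
class Replica:
    """Hypothetical model of a replica; not MediaWiki's actual data structure."""
    name: str
    load: int                          # general read weight; 0 = no general traffic
    groups: frozenset = frozenset()    # query groups this host serves


def replicas_to_wait_for(replicas):
    """Return the names of the replicas a writer should wait on.

    A replica is included if it takes general traffic (load > 0), or if it
    has zero load but serves at least one group that is not lag-tolerant.
    Zero-load replicas with no groups, or only lag-tolerant groups, are skipped.
    """
    selected = []
    for replica in replicas:
        serves_sensitive_group = bool(replica.groups - LAG_TOLERANT_GROUPS)
        if replica.load > 0 or serves_sensitive_group:
            selected.append(replica.name)
    return selected


if __name__ == "__main__":
    example = [
        Replica("db-a", load=100),
        Replica("db-b", load=0, groups=frozenset({"watchlist", "recentchanges"})),
        Replica("db-c", load=0, groups=frozenset({"vslow", "dump"})),
    ]
    print(replicas_to_wait_for(example))
```

With a rule like this, a zero-load host dedicated to watchlist or recent-changes queries would be waited on, while a 'vslow'/'dump' host would not hold up writers. Which groups count as lag-tolerant is a policy choice and is only assumed here.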
There was slave lag on Meta-Wiki (s7) causing writes to be locked for more than three minutes a little while ago. I suspect (but am not completely sure) that it was also related to a global rename that was done at about the same time. https://meta.wikimedia.org/wiki/Special:CentralAuth/Derakhshan contains 7000+ edits on fawiki, which is also on s7. Probably worth more investigation.
I don't think a 7k row rename matters. I do, however, see a huge spike in write query time warnings in the DBPerformance log on huwiki (also s7) at the moment. They come from recentchanges updates (rc_patrolled updates) in ChangeNotification jobs.
I can confirm the long running writes on the master were from queries such as:
UPDATE /* RevisionReviewForm::updateRecentChanges */ `recentchanges` SET rc_patrolled = '1' WHERE rc_cur_id = '148' AND rc_type = '1' AND (rc_timestamp <= '20160724202301')