Page MenuHomePhabricator

DatabaseMysql.php should have debug output for getLag()
Closed, ResolvedPublic


Author: nmaul

This function does a 'SHOW PROCESSLIST' on the current connection. It then looks for all lines for the user 'system user', and throws away a selection of them that don't match the intended criteria. Once all extraneous rows are eliminated, it returns the time on the first remaining row (probably only one).

The problem with this is that the current connection may or may not have the PROCESS permission on the database server. This can lead to no matching rows, and ultimately the function returns 'false'. That in itself is the proper behavior (it should return *something* indicating a failure, and false seems as good as anything). However, it logs no debugging information as to what happened.

This percolates up to LoadMonitor.php (and ultimately LoadBalancer.php), which detects whether a slave is lagged or not. 'False' values will result in the wiki going into read-only mode and never recovering, with zero information as to why this is. Even wgDebugLogFile gives no information as to why Mediawiki thinks the slaves are all lagged.

The proper fix IMO is to add logging in 2 places, reword logs in a 3rd place, and update the overall setup documentation:

DatabaseMysql.php: should call wfDebug if no suitable 'system user' row was found, possibly hinting about the PROCESS permission flag.

LoadMonitor.php: should indicate (wfDebug) that this db server is unsuitable because it failed outright (as opposed to actually being lagged). This should probably be a loop through the $times array, somewhere near the end of the getLagTimes function. it could be at the end of the existing $serverIndexes foreach() loop to avoid adding an extra loop.

LoadBalancer.php: should be changed from 'Server #$i is not replicating" to "Server #$i replication status undeterminable, removing from pool" or similar. That is, it shouldn't jump to conclusions, and it should explicitly state that the server is being removed from the list.

LoadBalancer.php: might be worth considering renaming the 'wgReadOnly' status set by the getReaderIndex() function, in the big do..while loop. Possibly something to point administrators at wgDebugLogFile for more info? Not sure how much we want to expose to end users. This only makes sense if some of the above changes are incorporated... otherwise the debug log file isn't all that helpful.

Finally, the install/update documentation should be updated to reflect that the PROCESS privilege is needed on multi-database systems for detection of slave lag.

I don't know when or why this broke, but I can see that a good bit of things were shuffled around in 1.16->1.17, and our 1.16.5 install on MySQL 5.0.77 works properly.

1.16 on MySQL 5.1 works. Strangely, as far as I can tell it appears to do the same basic thing, but the 'Server X is not replicating' error is never logged, the wiki doesn't go into read-only mode, and the slave server get queries normally.

Version: 1.17.x
Severity: normal


Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 11:55 PM
bzimport added a project: Wikimedia-Rdbms.
bzimport set Reference to bz30257.
bzimport added a subscriber: Unknown Object (MLST).

Note in 1.18 this was changed to be able to use SHOW SLAVE STATUS for newer versions of MySQL

But was commented out due to it causing errors for other people

nmaul wrote:

Neither is a particular guaranteed method for measuring lag. "Seconds behind master" is generally okay, but it actually measure the time between the IO thread and the SQL thread on the slave. If the IO thread is behind the master, this measurement won't get notice. That's kinda rare, but more common in high latency or lossy networks.

The most guaranteed method I know of is to make a change on the master and time how long it takes to appear on the slave. Maatkit has a good general test of this that works with all types of replication (even offline mysqldump-restore style "replication"):

The basic strategy is to update a row on the master, with the current timestamp. Then on the slave, check the current system time and compare with the latest timestamp in the row. The difference is the replication lag. This relies on having synchronized clocks, obviously.

A similar mechanism could probably be adopted. Besides the obvious win of truly accurate timing, it also requires no special permissions, as both processlist and 'show slave status' would need to have.

Please feel free to supply patches :)

nmaul wrote:

(In reply to comment #3)

Please feel free to supply patches :)

Will do. Personal life getting in the way lately. :)

Please don't reset priorities to "unprioritized". I've changed it to "low" priority.

Change 264015 had a related patch set uploaded (by Aaron Schulz):
Add better error log for DB getLag() calls

Change 264015 merged by Aaron Schulz:
Add better error logging for DB getLag() calls