Author: nmaul
Description:
This function does a 'SHOW PROCESSLIST' on the current connection. It then looks for all lines for the user 'system user', and throws away a selection of them that don't match the intended criteria. Once all extraneous rows are eliminated, it returns the time on the first remaining row (probably only one).
The problem with this is that the current connection may or may not have the PROCESS permission on the database server. This can lead to no matching rows, and ultimately the function returns 'false'. That in itself is the proper behavior (it should return *something* indicating a failure, and false seems as good as anything). However, it logs no debugging information as to what happened.
This percolates up to LoadMonitor.php (and ultimately LoadBalancer.php), which detects whether a slave is lagged or not. 'False' values will result in the wiki going into read-only mode and never recovering, with zero information as to why this is. Even wgDebugLogFile gives no information as to why Mediawiki thinks the slaves are all lagged.
The proper fix IMO is to add logging in 2 places, reword logs in a 3rd place, and update the overall setup documentation:
DatabaseMysql.php: should call wfDebug if no suitable 'system user' row was found, possibly hinting about the PROCESS permission flag.
LoadMonitor.php: should indicate (wfDebug) that this db server is unsuitable because it failed outright (as opposed to actually being lagged). This should probably be a loop through the $times array, somewhere near the end of the getLagTimes function. it could be at the end of the existing $serverIndexes foreach() loop to avoid adding an extra loop.
LoadBalancer.php: should be changed from 'Server #$i is not replicating" to "Server #$i replication status undeterminable, removing from pool" or similar. That is, it shouldn't jump to conclusions, and it should explicitly state that the server is being removed from the list.
LoadBalancer.php: might be worth considering renaming the 'wgReadOnly' status set by the getReaderIndex() function, in the big do..while loop. Possibly something to point administrators at wgDebugLogFile for more info? Not sure how much we want to expose to end users. This only makes sense if some of the above changes are incorporated... otherwise the debug log file isn't all that helpful.
Finally, the install/update documentation should be updated to reflect that the PROCESS privilege is needed on multi-database systems for detection of slave lag.
I don't know when or why this broke, but I can see that a good bit of things were shuffled around in 1.16->1.17, and our 1.16.5 install on MySQL 5.0.77 works properly.
1.16 on MySQL 5.1 works. Strangely, as far as I can tell it appears to do the same basic thing, but the 'Server X is not replicating' error is never logged, the wiki doesn't go into read-only mode, and the slave server get queries normally.
Version: 1.17.x
Severity: normal