
No user-visible lag reports when database slave server has stopped replication slave thread
Closed, Resolved (Public)


Sometimes a slave server stops replicating, for instance due to some transitory funky error:

   Slave_IO_Running: Yes
  Slave_SQL_Running: No
         Last_errno: 1205
         Last_error: Error 'Lock wait timeout exceeded; Try restarting transaction' on query. Default database: 'enwiki'. Query: 'UPDATE /* HTMLCacheUpdate::invalidateIDs This flag once ... */  `page` SET page_touched = '20090127180707' WHERE (page_id IN ('14890591'))'

In this case, there's no end-user-visible report of lag, but weird things happen such as a failure to show updated information on Special:Contributions.

After restarting the slave thread, we get a nice big warning like this:

Due to high database server lag, changes newer than 2146 seconds might not be shown in this list.

which is neat. It would be nice to have a similar warning if we're pulling from a server that's outright not replicating... it may be difficult to tell how far behind it is in this case, but even a "we're broken" warning would be nice.
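A minimal sketch of how such a check could look, assuming the caller already has one row of `SHOW SLAVE STATUS` as a mapping. The function name `describe_replication` and the wording of the "stopped" message are hypothetical; this is not MediaWiki's code. It relies on `Seconds_Behind_Master` being NULL while the SQL thread is not running, so "unknown" can be told apart from "0":

```python
from typing import Mapping, Optional


def describe_replication(status: Mapping[str, Optional[str]]) -> str:
    """Return a user-facing note for one row of SHOW SLAVE STATUS.

    Hypothetical helper: an empty string means "nothing to report".
    """
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    lag = status.get("Seconds_Behind_Master")  # None stands in for SQL NULL

    # A stopped thread or a NULL lag means the lag is unknown, not zero.
    if not (io_ok and sql_ok) or lag is None:
        return ("Replication is stopped on this database server; "
                "recent changes might not be shown in this list.")

    seconds = int(lag)
    if seconds > 0:
        return (f"Due to high database server lag, changes newer than "
                f"{seconds} seconds might not be shown in this list.")
    return ""


# The state from this report: IO thread running, SQL thread stopped.
broken = {
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "No",
    "Seconds_Behind_Master": None,
}
print(describe_replication(broken))
```

The point of the sketch is only that the "stopped" case gets its own branch instead of being folded into "no lag".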

Note that the lag report in the API shows "" instead of, say, "0" for this case:

whereas the 'lagtop' script reports a 0. Lagtop perhaps should be updated to show a visible warning as well if this is detectable.

Version: unspecified
Severity: enhancement

Event Timeline

bzimport raised the priority of this task to Low. (Nov 21 2014, 10:28 PM)
bzimport added a project: Wikimedia-Rdbms.
bzimport set Reference to bz17179.
bzimport added a subscriber: Unknown Object (MLST).

Change 241133 had a related patch set uploaded (by Aaron Schulz):
Added pt-heartbeat support to DatabaseMysqlBase

Change 241133 merged by jenkins-bot:
Added pt-heartbeat support to DatabaseMysqlBase

jcrespo assigned this task to aaron.
jcrespo added a subscriber: jcrespo.

I would consider this resolved thanks to the performance team's patches regarding pt-heartbeat (shown above).
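For context, a rough sketch of why heartbeat-based lag covers the stopped-thread case. pt-heartbeat periodically writes a timestamp row on the master, and the replica's lag is the age of the newest row that has replicated. If the SQL thread stops, the row stops advancing and the computed lag keeps growing instead of reading as empty or 0. The function name `heartbeat_lag_seconds` is hypothetical and the timestamp is assumed to be stored in UTC; this is not the code from Change 241133:

```python
from datetime import datetime, timezone


def heartbeat_lag_seconds(heartbeat_ts: str, now: datetime) -> float:
    """Age of the last replicated heartbeat row, in seconds.

    Assumes heartbeat_ts is an ISO-8601 string in UTC (an assumption about
    how the heartbeat writer is configured) and that now is timezone-aware.
    """
    beat = datetime.fromisoformat(heartbeat_ts).replace(tzinfo=timezone.utc)
    # Clamp at zero so small clock skew never yields a negative lag.
    return max(0.0, (now - beat).total_seconds())


# The replica last applied a heartbeat at 10:00:00 and it is now 10:35:46.
now = datetime(2015, 9, 25, 10, 35, 46, tzinfo=timezone.utc)
lag = heartbeat_lag_seconds("2015-09-25T10:00:00.000000", now)
print(f"lag: {lag:.0f} seconds")
```

Because the value depends only on the replicated row and the current time, it needs no special case for a stopped replication thread.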