Page MenuHomePhabricator

WaitConditionLoop callers need to log on timeout
Open, LowPublic

Description

Pretty much every WaitConditionLoop caller needs to check the result for CONDITION_TIMED_OUT, and if it is found, log an informative message.

This is an action item from the API cluster overload observed today. The only visibility we had into the cause of the outage was Xenon (which has poor time resolution) and perf (which is tedious). If something on the production cluster is reaching a timeout after several seconds of waiting, we need to know that.

WaitConditionLoop itself is not capable of logging because it is a separate library with no PSR-3 logger passed to its constructor.

Details

Related Gerrit Patches:

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 11 2018, 9:59 PM
Imarlier moved this task from Inbox to Radar on the Performance-Team board.Jul 17 2018, 8:08 PM
Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.
tstarling triaged this task as Low priority.Jul 18 2018, 6:28 AM

Change 445427 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/445427

Change 445427 merged by jenkins-bot:
[mediawiki/core@master] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/445427

Change 448172 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.14] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/448172

Change 448172 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.14] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/448172

Are the above patches enough, or do we need to modify the WaitConditionLoop library to support logging?

The above patches only address BagOStuff. There is also:

  • LockManager. For example RevDelFileItem and RevDelList acquire file locks with a 10 second timeout, apparently with no log message if the timeout is reached.
  • DatabasePostgres should probably log an info level message to the query logger like DatabaseMysqlBase on lock timeout
  • EtcdConfig and ChronologyProtector give vague or misleading log messages, they don't make it clear that there was a timeout.

In general, I think the log message should include the timeout value, like Aaron did for BagOStuff.

Given the small number of callers, I don't think it's necessary to modify the library. The callers can probably do a better job of constructing a meaningful log message.