Page MenuHomePhabricator

WaitConditionLoop callers need to log on timeout
Open, LowPublic

Description

Pretty much every WaitConditionLoop caller needs to check the result for CONDITION_TIMED_OUT, and if it is found, log an informative message.

This is an action item from the API cluster overload observed today. The only visibility we had into the cause of the outage was Xenon (which has poor time resolution) and perf (which is tedious). If something on the production cluster is reaching a timeout after several seconds of waiting, we need to know that.

WaitConditionLoop itself is not capable of logging because it is a separate library with no PSR-3 logger passed to its constructor.

Event Timeline

Change 445427 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/445427

Change 445427 merged by jenkins-bot:
[mediawiki/core@master] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/445427

Change 448172 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.14] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/448172

Change 448172 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.14] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/448172

Are the above patches enough, or do we need to modify the WaitConditionLoop library to support logging?

The above patches only address BagOStuff. There is also:

  • LockManager. For example RevDelFileItem and RevDelList acquire file locks with a 10 second timeout, apparently with no log message if the timeout is reached.
  • DatabasePostgres should probably log an info level message to the query logger like DatabaseMysqlBase on lock timeout
  • EtcdConfig and ChronologyProtector give vague or misleading log messages, they don't make it clear that there was a timeout.

In general, I think the log message should include the timeout value, like Aaron did for BagOStuff.

Given the small number of callers, I don't think it's necessary to modify the library. The callers can probably do a better job of constructing a meaningful log message.