Maniphest T199383

WaitConditionLoop callers need to log on timeout
Open, Low, Public
Assigned To

None

Authored By

	tstarling
	Jul 11 2018, 9:59 PM

Description

Pretty much every WaitConditionLoop caller needs to check the result for CONDITION_TIMED_OUT, and if it is found, log an informative message.

This is an action item from the API cluster overload observed today. The only visibility we had into the cause of the outage was Xenon (which has poor time resolution) and perf (which is tedious). If something on the production cluster is reaching a timeout after several seconds of waiting, we need to know that.

WaitConditionLoop itself is not capable of logging because it is a separate library with no PSR-3 logger passed to its constructor.
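The caller-side check described above could be sketched roughly as follows. This is an illustrative Python analogue, not the PHP library's API: `wait_for_condition`, `acquire_lock_with_logging`, `CONDITION_REACHED`, and the constant values are hypothetical stand-ins, and only the name CONDITION_TIMED_OUT comes from this task. The point is that the caller, which owns the logger and knows what it was waiting for, logs the timeout value when the loop gives up.

```python
import logging
import time

# Hypothetical stand-ins for the library's result constants; the values are
# placeholders, not the real ones.
CONDITION_REACHED = "reached"
CONDITION_TIMED_OUT = "timed_out"


def wait_for_condition(condition, timeout, interval=0.01):
    """Poll condition() until it returns True or `timeout` seconds elapse.

    Like the library described in the task, this loop has no logger of its
    own; it only reports what happened through its return value.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return CONDITION_REACHED
        if time.monotonic() >= deadline:
            return CONDITION_TIMED_OUT
        time.sleep(interval)


def acquire_lock_with_logging(try_lock, key, timeout, logger):
    """Caller-side wrapper: check the result and log an informative message."""
    started = time.monotonic()
    result = wait_for_condition(try_lock, timeout)
    if result == CONDITION_TIMED_OUT:
        elapsed = time.monotonic() - started
        # Include the configured timeout so the log entry is self-explanatory.
        logger.warning(
            "Timed out after %.2fs (limit %.2fs) waiting for lock '%s'",
            elapsed, timeout, key,
        )
        return False
    return True
```

A caller would pass its own logger, for example `acquire_lock_with_logging(try_lock, "some-key", 10.0, logging.getLogger("locks"))`, where `try_lock` is whatever callable attempts the acquisition.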

Details

	Subject	Repo	Branch	Lines +/-
	objectcache: improve logging and error handling in BagOStuff	mediawiki/core	wmf/1.32.0-wmf.14	+56 -11
	objectcache: improve logging and error handling in BagOStuff	mediawiki/core	master	+56 -11

Event Timeline

tstarling created this task. Jul 11 2018, 9:59 PM

Restricted Application added a subscriber: Aklapper. Jul 11 2018, 9:59 PM

Krinkle added a project: MediaWiki-libs-WaitConditionLoop. Jul 11 2018, 10:00 PM

Krinkle edited projects, added Wikimedia-Incident; removed MediaWiki-libs-WaitConditionLoop. Jul 12 2018, 2:41 AM

Krinkle moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board.

tstarling moved this task from Inbox to Backlog on the Core-Platform-Team-Old board. Jul 17 2018, 1:59 AM

Imarlier moved this task from Inbox, needs triage to Radar on the Performance-Team board. Jul 17 2018, 8:08 PM

Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.

Krinkle added a project: MediaWiki-General. Jul 17 2018, 8:09 PM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board. Jul 18 2018, 12:14 AM

tstarling triaged this task as Low priority. Jul 18 2018, 6:28 AM

Change 445427 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/445427

gerritbot added a project: Patch-For-Review. Jul 24 2018, 7:11 PM

Change 445427 merged by jenkins-bot:
[mediawiki/core@master] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/445427

ReleaseTaggerBot added a project: MW-1.32-notes (WMF-deploy-2018-07-31 (1.32.0-wmf.15)). Jul 25 2018, 5:00 AM

Change 448172 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.14] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/448172

Change 448172 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.14] objectcache: improve logging and error handling in BagOStuff

https://gerrit.wikimedia.org/r/448172

ReleaseTaggerBot edited projects, added MW-1.32-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)); removed MW-1.32-notes (WMF-deploy-2018-07-31 (1.32.0-wmf.15)). Jul 27 2018, 2:00 AM

Are the above patches enough, or do we need to modify the WaitConditionLoop library to support logging?

The above patches only address BagOStuff. There is also:

- LockManager: for example, RevDelFileItem and RevDelList acquire file locks with a 10-second timeout, apparently with no log message if the timeout is reached.
- DatabasePostgres should probably log an info-level message to the query logger on lock timeout, as DatabaseMysqlBase does.
- EtcdConfig and ChronologyProtector give vague or misleading log messages; they don't make it clear that there was a timeout.

In general, I think the log message should include the timeout value, like Aaron did for BagOStuff.

Given the small number of callers, I don't think it's necessary to modify the library. The callers can probably do a better job of constructing a meaningful log message.

CCicalese_WMF added a project: Platform Team Legacy (Later). Oct 1 2018, 11:31 PM

CCicalese_WMF edited projects, added Core Platform Team Goals (MCR: Uncategorized); removed Core-Platform-Team-Old. Oct 2 2018, 12:27 AM

CCicalese_WMF edited projects, added Platform Engineering; removed Core Platform Team Goals (MCR: Uncategorized). Oct 2 2018, 2:29 AM

CCicalese_WMF moved this task from Inbox to Triage Meeting Inbox on the Platform Engineering board. Oct 2 2018, 2:40 AM

CCicalese_WMF moved this task from Triage Meeting Inbox to Needs Cleaning - Security, stability, performance, and scalability (TEC1) on the Platform Engineering board. Oct 3 2018, 3:55 PM

CCicalese_WMF edited projects, added Platform Engineering (Needs Cleaning - Security, stability, performance, and scalability (TEC1)); removed Platform Engineering.

Ladsgroup removed a project: Patch-For-Review. May 28 2019, 3:37 PM

WDoranWMF moved this task from Later to Mop Column on the Platform Team Legacy board. Jul 5 2019, 3:46 PM

WDoranWMF removed a project: Platform Team Legacy (Later).

CCicalese_WMF edited projects, added Platform Team Workboards (Clinic Duty Team); removed Platform Engineering (Needs Cleaning - Security, stability, performance, and scalability (TEC1)). Nov 20 2019, 7:16 PM

WDoranWMF moved this task from Inbox to Backlog on the Platform Team Workboards (Clinic Duty Team) board. Mar 12 2020, 1:35 PM

AMooney moved this task from Backlog to Teleport on the Platform Team Workboards (Clinic Duty Team) board. Mar 24 2020, 1:47 PM