wmf.14 Blocker - Post Mortem - Cannot flush pre-lock snapshot because writes are pending
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Jrbranaa
	Aug 16 2017, 11:52 PM

Description

Post mortem meeting to discuss the events and learnings related to the LinksUpdate::acquirePageLock: train blocking incident.

Incident Report:

Status	Subtype	Assigned	Task
Resolved	Release	thcipriani	T170632 1.30.0-wmf.14 deployment blockers
Resolved		aaron	T175897 Audit and improve JobQueue stability and performance (2017)
Resolved	PRODUCTION ERROR	aaron	T173462 LinksUpdate::acquirePageLock: Cannot flush pre-lock snapshot because writes are pending
Resolved		Jrbranaa	T173477 wmf.14 Blocker - Post Mortem - Cannot flush pre-lock snapshot because writes are pending

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Jrbranaa triaged this task as Medium priority.Sep 1 2017, 3:42 AM

Hello @aaron,

Per our chat. I'd like to get some more info from you if possible. We could do it async or in a quick meeting if you'd prefer.

I'm essentially looking for the following info:

Brief summary of the problem (not the symptoms)
How the problem was introduced (was it a specific change set, combination of changes, etc...)
How could problem have been avoided?
In the incident report under conclusions, there's note of some additional testing that could be done and also the potential for automation. Are either of these active tasks or recommendations (if tasks, can you provided Phab task IDs please)?

Again, I'm fine meeting and talking through these things if you'd rather, just let me know and I'll set something up.

Thanks,

PROBLEM: in LinksUpdate, runForTitle() starting off with acquirePageLock(), then calling doUpdate() for the secondary update list, and returning without committing. This meant that any caller using this method inside a loop had to call commitMasterChanges() itself somehow, otherwise, the acquirePageLock() call would fail. The multi-title case of RefreshLinksJob had a for-loop that did not do this. Note that acquirePageLock() uses getScopedLockAndFlush() which is intended for "critical sections" (https://en.wikipedia.org/wiki/Critical_section) involving read/writing to the database. Since it makes to sense to acquire a lock and then read a stale snapshot (from REPEATABLE-READ) from *before* lock acquisition, Database demands that any transaction be cleared. It will do so automatically if there are no writes, but otherwise it fails since committing prematurely may break atomicity.
INTRODUCTION: This was broken since 63a3911a67507731695bad3188f486219a563b7d but nothing used multi-title refreshlinks jobs. 0df49eeaf49dcd84cee5afc678de43ebd6c984c5 introduced a use case for this and made the bug manifest itself.
AVOIDANCE: since this would seem to happen for any multi-tutle job run, I'm not sure how this got past testing unless there were (a) no links updated and the test jobs were triggered by null or non-link changed edits or (b) the edited test entities only had one backlink. Future backlink change propagation testing should cover these cases.

There are not tasks about testing, just recommendations.