Page MenuHomePhabricator

Enforce database transaction rollback conventions to protect against certain try/catch patterns
Closed, ResolvedPublic

Description

I think "must rollback after error" behavior would be nice to have enforced more so in MW. Right now there are just conventions:

If an error occured, then either that statement *or* everything in the transaction must be rolled back (depending on RDBMs, its server config, and the query). Without knowing what, if anything, can be salvaged, the best thing is to rollback everything, included other "peer" transactions active (e.g. LBFactory::rollbackMasterChanges).

To do

Example:

                BEGIN
[PHP func X]	write A_X
[PHP func Y]	try {
                write B_Y (FAILED, transaction reset to A_X, BEGIN, or gone?)
                } catch ( ... ) {
                <calls rollbackMasterChanges()>
                ROLLBACK
                <rethrows error>
                }
                <any higher try/catch call rollbackMasterChanges() too>
onTransaction(PreCommitOr)Idle:
[PHP func X]	<cancelled>
AtomicSectionUpdate/AutoCommitUpdate:
[PHP func Z]	<cancelled>

Writes ultimately applied: [] (SAFE)

The ability to try/catch and keep writing to mysql always scared me,
since it could horribly botch things depending on the case...

Case I: failing query aborts only itself (e.g. duplicate key)

                BEGIN
[PHP func X]	write A_X
[PHP func Y]	try {
                write B_Y (FAILED, server rolls back to A_X)
                } catch ( ... ) {
                write C_Y
                }
[PHP func Z]	write D_Z
                COMMIT
onTransaction(PreCommitOr)Idle:
[PHP func X]	write E_X [AUTO_COMMIT]
AtomicSectionUpdate/AutoCommitUpdate:
[PHP func Z]	BEGIN
                write F_Z
                COMMIT

Writes ultimately applied: [A_X, C_Y, D_Z, E_X, F_Z] (SAFE)

Case II: failing query aborts transaction content (e.g. deadlock)

                BEGIN
[PHP func X]	write A_X
[PHP func Y]	try {
                write B_Y (FAILED, server rolls back to BEGIN)**
                } catch ( ... ) {
                write C_Y
                }
[PHP func Z]	write D_Z
                COMMIT
onTransaction(PreCommitOr)Idle:
[PHP func X]	<cancelled>
AtomicSectionUpdate:
[PHP func Z]	<cancelled>

Writes ultimately applied: [C_Y, D_Z] (UNSAFE; what if A <=> D?)

** I see a bug where trxLevel is set to 0 by a wasDeadlock() check,
   but the BEGIN survives server side; the code should issue ROLLBACK.
   The above assumes this is fixed.

Case III: failing query aborts entire transaction (e.g. server has gone away)

                BEGIN
[PHP func X]	write A_X
[PHP func Y]	try {
                write B_Y (FAILED, server already rolled back transaction on disconnect)
                <reconnect> [DB layer]
                BEGIN [DB layer]
                <no retry of write due to pending A_X write at the time>
                } catch ( ... ) {
                write C_Y
                }
[PHP func Z]	write D_Z
                COMMIT
onTransaction(PreCommitOr)Idle:
[PHP func X]	<cancelled>
AtomicSectionUpdate:
[PHP func Z]	<cancelled>

Writes ultimately applied: [C_Y, D_Z] (UNSAFE; what if A <=> D?)

Some mechanisms to prevent this bugs should be added. Once an error happens, the IDatabase handle should be in an error state, disallowing most queries.

To get out of the error state *using* ROLLBACK, perhaps:

  • The owner of the transaction is explicit and issues rollback()
  • A rollback() is issued with FLUSHING_INTERNAL (e.g. from LB/LBFactory)

To get it out of an error state other than ROLLBACK, perhaps:

  • For SQL errors known to cause the server to rollback only the failed query:
    • An atomic section was active and the owner issues resumeAtomic() or endAtomic()
    • An atomic section with a savepoint was active (https://gerrit.wikimedia.org/r/#/c/420219) and the savepoint owner issues cancelAtomic()
    • The owner of the transaction issues commit()
    • A commit() is issued with FLUSHING_INTERNAL (e.g. from LB/LBFactory)
  • For SQL errors known to cause the server to rollback everything
    • (no way to continue aside from rollback)
  • For SQL that might or might not cause the server to rollback everything
    • (no way to continue aside from rollback)

We sort-of enforce who can rollback in that a rollback() triggered by some other code between a functions's startAtomic('x') and endAtomic('x') calls will result in an error ('Invalid atomic section') on endAtomic()...if that bubbles up it will cause rollback. Unexpected commit()s like that also error out. However, not everything is wrapped in atomic sections, or related writes might be in three separate atomic sections. There is also the check in rollback() against DBO_TRX for multi-DB transaction rounds.

Event Timeline

aaron created this task.Mar 19 2018, 1:34 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 19 2018, 1:34 AM
aaron updated the task description. (Show Details)Mar 19 2018, 2:25 AM
Anomie added a subscriber: Anomie.Mar 19 2018, 4:18 PM

I wouldn't have a problem with requiring some explicit action by the caller to continue using a transaction after error in case-I. But you'd have to find all the existing code relying on the case-I behavior to fix it. Including any code that's using DBO_IGNORE or the like (which personally I wish we'd deprecate and remove, it'd make a lot of other code easier to reason about).

I note that PostgreSQL by default has this behavior at the database level, with only ROLLBACK and ROLLBACK TO SAVEPOINT clearing it, which is why I submitted https://gerrit.wikimedia.org/r/420220 to make it behave like MySQL. If we go with a matching "only ->rollback() or ->cancelAtomic() clear it", we could abandon/revert that patch. If you allow other ways to clear the state, though, we'd still need to keep that PG patch. Personally, I don't know how much sense it makes for a commit() to commit errors when the point of this task is that such a commit is potentially unsafe.

A good first step to avoiding case-II and case-III would be to have them throw an exception that isn't going to be caught by a catch block looking for case-I exceptions.

DBO_IGNORE can only be enabled through the config or Database::factory directly. This flag is now largely irrelevant and could probably be finished off with a deprecation.

Change 421694 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: ignore DBO_IGNORE if passed into Database::__construct()

https://gerrit.wikimedia.org/r/421694

Krinkle updated the task description. (Show Details)Mar 26 2018, 9:42 PM
Krinkle moved this task from Untriaged to Rdbms library on the Wikimedia-Rdbms board.

Change 421694 merged by jenkins-bot:
[mediawiki/core@master] rdbms: ignore DBO_IGNORE if passed into Database::__construct()

https://gerrit.wikimedia.org/r/421694

Krinkle updated the task description. (Show Details)Mar 26 2018, 9:53 PM

Change 421496 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: make Database query error handling more strict

https://gerrit.wikimedia.org/r/421496

Change 421496 merged by jenkins-bot:
[mediawiki/core@master] rdbms: make Database query error handling more strict

https://gerrit.wikimedia.org/r/421496

Change 424375 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] rdbms: Roll back empty implicit transaction on error

https://gerrit.wikimedia.org/r/424375

Change 424377 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] rdbms: Issue a deprecation warning if errors are ignored

https://gerrit.wikimedia.org/r/424377

Change 424375 merged by jenkins-bot:
[mediawiki/core@master] rdbms: Roll back empty implicit transaction on error

https://gerrit.wikimedia.org/r/424375

Change 424377 merged by jenkins-bot:
[mediawiki/core@master] rdbms: Issue a deprecation warning if errors are ignored

https://gerrit.wikimedia.org/r/424377

@Krinkle found several errors in production where this was being enforced properly, which suggests that this should be ready to be resolved. When @aaron gets back from vacation we'll confirm that.

Imarlier assigned this task to aaron.Oct 1 2018, 8:22 PM
Imarlier moved this task from Inbox to Doing on the Performance-Team board.
aaron closed this task as Resolved.Oct 2 2018, 8:59 PM
aaron updated the task description. (Show Details)