
Automatically trigger waitForReplication after a sufficiently high number of rows has been written
Closed, Resolved · Public

Description

T57420: Remove local wiki password hash when CentralAuth has attached account caused a small outage due to a missing waitForReplication call. It might be worth investigating ways to automatically prevent such errors. The database handle fetches the number of changed rows after write queries. It could keep a running count and automatically trigger a replication wait when a certain threshold is hit. This can only be done when there is no transaction open, but maintenance scripts generally don't use explicit transactions, so that would work out.
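A minimal sketch of what this could look like inside the write path of a database handle (the hook, threshold, and load-balancer accessor below are all hypothetical, not existing Rdbms code):

```php
// Hypothetical hook inside the Rdbms Database class; none of these
// names exist in the actual code base.
private $rowsWrittenSinceWait = 0;
private const AUTO_WAIT_THRESHOLD = 1000; // illustrative

protected function onWriteQueryDone( int $affectedRows ): void {
	$this->rowsWrittenSinceWait += $affectedRows;
	// Only safe when no explicit transaction is open, which is the
	// normal state for maintenance scripts running in autocommit mode
	if ( !$this->trxLevel() && $this->rowsWrittenSinceWait >= self::AUTO_WAIT_THRESHOLD ) {
		$this->getLBFactory()->waitForReplication(); // getLBFactory() is hypothetical here
		$this->rowsWrittenSinceWait = 0;
	}
}
```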

Event Timeline

Tgr created this task. · Oct 1 2018, 4:58 PM
Restricted Application added a subscriber: Aklapper. · Oct 1 2018, 4:58 PM
Tgr added a comment. · Oct 1 2018, 4:58 PM

I don't know the DB layer well enough to tell if this is a sane idea... just putting it out there.

aaron added a subscriber: aaron. · Oct 1 2018, 7:35 PM

Sometimes callers might want non-RDBMS I/O (Swift/Elasticsearch/Blazegraph) near DB transactions, so even if we use setTransactionListener() (like Maintenance does) and listen for points where no transaction is active anywhere (somewhat like DeferredUpdates), we'd want to be careful about waiting too long for lag or erroring out. Then again, code that mixes I/O sources should generally follow the guidelines (https://www.mediawiki.org/wiki/Database_transactions#Updating_secondary_non-RDBMS_stores) and use patterns like doing the key/value writes first and then committing, or using commit hooks/deferred updates. So maybe a callback registered via setTransactionListener() could be given the affected row count, and when the recent running count across DBs is high, it could add a MergeableUpdate to DeferredUpdates (using pass-by-reference listener variables for the last wait time and the running count, or something similar). That update could wait for replication, and would do so after any related I/O updates tied to the DB writes.
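A rough sketch of that idea; ReplicationWaitUpdate, the threshold, and the listener details are hypothetical, not existing MediaWiki code:

```php
use MediaWiki\MediaWikiServices;

// Hypothetical update class; DeferredUpdates merges same-class
// MergeableUpdate instances into one.
class ReplicationWaitUpdate implements MergeableUpdate {
	/** @var int Running count of rows written across recent commits */
	private $rowCount;

	public function __construct( int $rowCount ) {
		$this->rowCount = $rowCount;
	}

	public function merge( MergeableUpdate $update ) {
		// Combine running row counts from multiple commits
		$this->rowCount += $update->rowCount;
	}

	public function doUpdate() {
		// Runs after other deferred updates related to the same DB writes
		MediaWikiServices::getInstance()->getDBLoadBalancerFactory()
			->waitForReplication();
	}
}

$dbw->setTransactionListener( 'replication-wait', static function () use ( $dbw ) {
	$affected = $dbw->affectedRows(); // rows from the last write (simplified)
	if ( $affected > 500 ) { // illustrative threshold
		DeferredUpdates::addUpdate( new ReplicationWaitUpdate( $affected ) );
	}
} );
```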

Have there been similar incidents? It would also be nice to use the Maintenance methods for committing more often in scripts.

Tgr added a comment. · Oct 1 2018, 8:48 PM

> It would also be nice to use the Maintenance methods for committing more often in scripts.

Maintenance scripts normally run in autocommit mode, right? In the DeleteLocalPasswords case that was fine as it uses a single multi-row UPDATE for each batch.

But in general, yeah, using beginTransaction / commitTransaction is probably nicer than calling waitForReplication manually.
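For illustration, the batched pattern with those Maintenance methods might look like this (the table/field names and the getBatches() helper are hypothetical; commitTransaction() is what waits for replication):

```php
public function execute() {
	$dbw = $this->getDB( DB_PRIMARY );
	foreach ( $this->getBatches() as $userIds ) { // getBatches() is hypothetical
		$this->beginTransaction( $dbw, __METHOD__ );
		$dbw->update(
			'user',
			[ 'user_password' => '' ],
			[ 'user_id' => $userIds ],
			__METHOD__
		);
		// Commits and, crucially, waits for replicas to catch up
		$this->commitTransaction( $dbw, __METHOD__ );
	}
}
```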

Krinkle added a subscriber: Krinkle. · Oct 1 2018, 9:56 PM

We have TransactionProfiler to detect, and sometimes disallow/abort, reads and writes above a certain threshold. Rather than automatic waits, perhaps we can throw if a certain amount of database work has been performed without waiting for replication to catch up.
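TransactionProfiler expectations are configurable per entry point via $wgTrxProfilerLimits; a sketch with illustrative values (not the actual defaults):

```php
// Illustrative values only; see the shipped defaults for real thresholds
$wgTrxProfilerLimits['Maintenance'] = [
	'writeQueryTime' => 5, // seconds per write query before a warning
	'maxAffected' => 1000, // rows changed per query before a warning
];
```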

Alternatively (I see the latest comments now), given that beginTransaction/commitTransaction in Maintenance.php automatically wait, perhaps we can encourage use of those by emitting log warnings when maintenance scripts interact with the database directly (to be turned into log errors once we've cleaned up production, and into exceptions after the next release?).

Krinkle closed this task as Resolved. · Jul 23 2019, 6:45 PM
Krinkle claimed this task.

This is done automatically in Maintenance scripts via commitTransaction().

In Jobs, we recommend using transaction tickets, which enable the use of commitAndWaitForReplication(); this couples committing with waiting for replication.
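A sketch of that pattern in a job's run() method (the getBatches() helper and batch body are hypothetical):

```php
public function run() {
	$lbFactory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
	$ticket = $lbFactory->getEmptyTransactionTicket( __METHOD__ );
	foreach ( $this->getBatches() as $batch ) { // getBatches() is hypothetical
		// ... perform this batch's writes ...
		// Commits any open transaction round and waits for replication
		$lbFactory->commitAndWaitForReplication( __METHOD__, $ticket );
	}
	return true;
}
```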

For mid-request queries and deferred updates, one shouldn't be processing multiple large batches or waiting for replication in the first place. Queue a job instead :)

Any violations of the above will trigger TransactionProfiler warnings, which are channeled to Logstash as "DBPerformance".

Restricted Application added a project: Core Platform Team. · Jul 23 2019, 6:45 PM