Page MenuHomePhabricator

Disallow db->update() without condition
Closed, ResolvedPublic1 Estimated Story Points

Description

This is follow-up from attending Conf42 (Chaos Engineering) in London this week.

Is there a use case for our db abstraction layer allowing $db->update() without any conditions? In other words to submit a query like UPDATE some_table SET x=y that will indescriminately substitute an entire column with a particular column.

While I can think of hypothetical edge cases (e.g where y is not a hardcoded value but something dynamic like z + 1), perhaps they are rare enough to require bypassing update() for (in favour of query()).

In addition, there is also the performance and replication-safe aspect. Not having conditions also implies not having a limit or determinstic outcome, which means it is subject to race conditions and cannot be safely replicated to other databases. Our general practice is that, even if we do want to change an entire table, to select its primary key and then do the update in batches, which means at the very least there would be a WHERE condition for _id IN (,,,,,,).

Event Timeline

Overall this seems sensible enough (although someone could just set a condition of 1=1 or the like to get around it). A few comments though:

perhaps they are rare enough to require bypassing update() for (in favour of query()).

IMO we should never encourage use of query(), that's a code smell for probable MySQLisms.

Not having conditions also implies not having a [...] determinstic outcome

How so? Our batching is generally for performance (i.e. avoiding locking the DB while updating millions of rows), as far as I know, not non-determinism.

Not having conditions also implies not having a [...] determinstic outcome

How so? Our batching is generally for performance (i.e. avoiding locking the DB while updating millions of rows), as far as I know, not non-determinism.

Replication safety, determinism, query performance, and lock contention avoidance do often go hand-in-hand in my experience.

Having said that, I do recall on several ocasions where batching and query perf were not the primary motivator for a change. For a table like thing (id, foo, bar, baz) a write query that does not utilize a primary key, like UPDATE thing SET bar=42 WHERE bar=x AND baz=y is as I understand it considered an anti-pattern by our DBAs. In part because they are not replication-safe and indeed not (obviously) deterministic. I believe it would be a improvement to change such query to first SELECT the identifiers and then perform the write based by primary key. (Even without batching, although batching might make it even better still.)

[…] perhaps they are rare enough to require bypassing update() for (in favour of query()).

IMO we should never encourage use of query(), that's a code smell for probable MySQLisms.

I agree. My suggestion was for cases where the query is already MySQL-specific. The ability to bypass our abstractions is not unique to query(), one can do this for most parameters to update() and select() as well, e.g. by passing strings instead of descriptive arrays for the value, or conditional etc. If these raw cases already use raw queries, it might be fair to require them to bypass it entirely. The idea being to "make the common thing easy, and the rest possible" (but harder). Another way to accomplish that would be to introduce some kind of "I know what I'm doing" flag, e.g. update( 'things', [ 'key' => 'newval' ], $db::UNCONDITIONAL_UPDATE ) where the constant takes the place of the $conds parameter.

Krinkle triaged this task as Medium priority.Jan 27 2020, 5:32 PM

Not having conditions also implies not having a [...] determinstic outcome

How so? Our batching is generally for performance (i.e. avoiding locking the DB while updating millions of rows), as far as I know, not non-determinism.

Replication safety, determinism, query performance, and lock contention avoidance do often go hand-in-hand in my experience.

For a table like thing (id, foo, bar, baz) a write query that does not utilize a primary key, like UPDATE thing SET bar=42 WHERE bar=x AND baz=y is as I understand it considered an anti-pattern by our DBAs. In part because they are not replication-safe and indeed not (obviously) deterministic. I believe it would be a improvement to change such query to first SELECT the identifiers and then perform the write based by primary key. (Even without batching, although batching might make it even better still.)

I don't that kind of query problematic in itself, in terms of replication safety, unless there was a LIMIT or AUTOINCREMENT column getting manually changed.

IMO we should never encourage use of query(), that's a code smell for probable MySQLisms.

I agree with this and I've been trying to discourage this method for some time.

Anyway, disallowing empty conditions seems fine to me.

@CCicalese_WMF and I think this is atomic enough that we can manage it in Clinic Duty.

daniel renamed this task from Consider disallowing db->update() without condition to Disallow db->update() without condition.Apr 7 2020, 8:54 PM
Peter.ovchyn set the point value for this task to 1.Apr 16 2020, 9:33 AM

Change 592629 had a related patch set uploaded (by Peter.ovchyn; owner: Peter.ovchyn):
[mediawiki/core@master] database: Disallow db->update() without condition

https://gerrit.wikimedia.org/r/592629

In T243619#5834813, @Krinkle wrote:

I agree. My suggestion was for cases where the query is already MySQL-specific. The ability to bypass our abstractions is not unique to query(), one can do this for most parameters to update() and select() as well, e.g. by passing strings instead of descriptive arrays for the value, or conditional etc. If these raw cases already use raw queries, it might be fair to require them to bypass it entirely. The idea being to "make the common thing easy, and the rest possible" (but harder). Another way to accomplish that would be to introduce some kind of "I know what I'm doing" flag, e.g. update( 'things', [ 'key' => 'newval' ], $db::UNCONDITIONAL_UPDATE ) where the constant takes the place of the $conds parameter.

I added assert on $conds argument, but left Database::update accepting '*' as no condition, so that update( 'things', [ 'key' => 'newval' ], '*'); works as expected.

@daniel @Krinkle

  1. Wikibase/client calls the update function in the tests:
$dbw->update( $termCache->getTableName(), [ 'term_search_key' => '' ], [], __METHOD__ );

So, first we have to define how express confirmation that we really want to update all table.
It could be either term like '*' or some constant provided in IDatabase interface like mentioned above: IDatabase::UNCONDITIONAL_UPDATE.

The second, I think, looks even better and more correct as we could provide some unique constant. Note that '*' is used in many places and can't be easily found across the code.

  1. I correct me if I'm wrong, but raising an exception isn't right here, as other stuff could be dependant on current behavior (like Wikibase/Client). I believe that the correct way is to hard deprecate current behavior and in next version change it to exception.

So, In general, the sequence of 3 patches here:

  1. Introduce IDatabase::UNCONDITIONAL_UPDATE constant
  2. Replace existing/known places of usage of unconditional update
  3. Implement param validation and hard deprecate current behavior.

Correct me if I missed something.

So, first we have to define how express confirmation that we really want to update all table.
It could be either term like '*' or some constant provided in IDatabase interface like mentioned above: IDatabase::UNCONDITIONAL_UPDATE.

How about both: create IDatabase::UNCONDITIONAL (for use with both, deletion and update), and make it equal to '*'. We already support '*' to mean "everything", so we need to keep supporting it.

  1. I correct me if I'm wrong, but raising an exception isn't right here, as other stuff could be dependant on current behavior (like Wikibase/Client). I believe that the correct way is to hard deprecate current behavior and in next version change it to exception.

I agree that hard deprecation is the way to go. Of course after fixing any known callers.

So, In general, the sequence of 3 patches here:

  1. Introduce IDatabase::UNCONDITIONAL_UPDATE constant
  2. Replace existing/known places of usage of unconditional update
  3. Implement param validation and hard deprecate current behavior.

Sounds like a good plan!

The only addition I would make is that we should make sure that validation of the condition works exactly the same for update() and delete().

Change 593189 had a related patch set uploaded (by Peter.ovchyn; owner: Peter.ovchyn):
[mediawiki/extensions/Wikibase@master] database: Use IDatabase::ALL_ROWS instead of empty conditions

https://gerrit.wikimedia.org/r/593189

Change 593215 had a related patch set uploaded (by Peter.ovchyn; owner: Peter.ovchyn):
[mediawiki/core@master] database: Disallow db->update() without condition

https://gerrit.wikimedia.org/r/593215

Change 592629 merged by jenkins-bot:
[mediawiki/core@master] database: Introduce IDatabase::ALL_ROWS constant

https://gerrit.wikimedia.org/r/592629

Change 593189 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] database: Use IDatabase::ALL_ROWS instead of empty conditions

https://gerrit.wikimedia.org/r/593189

Change 593215 merged by jenkins-bot:
[mediawiki/core@master] database: Disallow db->update() without condition

https://gerrit.wikimedia.org/r/593215