Page MenuHomePhabricator

Deprecate raw SQL conditions for IDatabase methods (select, insert, etc.)
Open, MediumPublic

Description

There is currently a fairly common use case for which we rely on raw SQL, despite having a pretty good abstraction layer in place.

For example:

$res = $db->select(
    # table(s)
    'revision',
    # select field(s)
    [
        'rev_id',
        'rev_user',
    ],
    # condition(s)
    [
        'rev_page' => $pageId,
        'rev_timestamp > ' . $db->addQuotes( $db->timestamp( $since ) )
    ],
    # descriptive name of caller for logging and debugging
    __METHOD__,
    [ 'ORDER BY' => 'rev_timestamp ASC', 'LIMIT' => 50 ]
);

Goals

  1. Able to achieve the above without using raw SQL.
  2. As consumer, able to guarantee that my use of IDatabase cannot accidentally cause raw SQL to be used (e.g. I want an

"opt-in" or "opt-out" way to say that I promise to never use this insecure method and would expect failure instead of insecurity if something it is accidentally used in my call, no matter of hooks or indirect supply of parts of the conds array.)

Prior art and considerations

We currently have two pieces of a query builder. We have the outer structure of the query created via methods such as $db->select() and $db->insert(). For the inner segments, mainly for the conditions, we also have helper methods such as $db->makeList(), $db->buildConcat(), and $db->buildSelectSubquery().

The idea is to bring these together in an interface for query building, that different backends can implement based on the syntax for that particular RDBMS backend.

The Wikimedia\Rdbms\Subquery class and other encasing classes also resemble the direction of typed value objects in favour of strings, which this would build upon.

The proposed QueryInfo class (see https://gerrit.wikimedia.org/r/459242) also seems relevant here. It's orthogonal in so far that it could co-exist alongside the query builder, but may want to avoid having two very similar concepts of a "query" value object.

Proposal 1: Query builder

I think this may be a good opportunity to separate the query builder from the Database class, and move it to its own class. I envision it may look something like this:

$query = $db->newSelect();
$query-> /*
    fields ...
    from  ...,
    where  ... 
        $query->buildGreaterThan( 'rev_timestamp', $sinceTimestamp )
   etc
*/;
$res = $db->query( $query );

Proposal 2: Alternate

A much simpler approach (in terms of implementation) would be to build on what we have. This would involve the following three changes, instead:

  1. Add the missing utility methods to IDatabase (buildGreatherThan, buildLessThan; most others we have already).
  2. Add some kind of flag to IDatabase::select()and others that would disable support for raw condition strings. For example, an optional first parameter like $db->select( $db::COND_SAFE, 'revision', [ .. ], [ .. ] );
  3. Change all utility methods to return encased value objects, similar to Wikimedia\Rdbms\LikeMatch, and in the selectSqlText method, throw if any of the condition are not formattable arrays and not encased objects.

Below is what that might look like:

$res = $db->select(
   $db::COND_SAFE, # name TBD (strict mode, safe mode, non-raw, ..)
    ...,
    [
        $db->buildGreaterThan( 'rev_timestamp', $sinceTimestamp )
    ],
    ..
);

The downside is that this doesn't make for a good migration target.

The parameter will either always remain optional and rarely used, or if we do decide to deprecate non-safe mode, we'd have to first add it everywhere, and then either keep the pointless option forever, or remove it everywhere again.

One way to avoid that could be to use a different method name, something that doesn't look optional but is just different for no particular reason, e.g. $db->selectQuery() that wouldn't look weird in the long-term, but would however make for two confusingly similar methods until the migration is complete.

One way to avoid that, could be to have the parameters be encapsulated. E.g. we'd keep $db->select() and make the deprecation based on signature. E.g. the legacy signature is the current position arguments, and the new signature could be something like: $db->select( $db->makeSelectQuery( ... ) );. That is basically the query builder idea.

Event Timeline

Fluent

I think in terms of readability, the fluent interface naturally seems appealing for this use case. For example:

$res = $db->select() // returns builder (chainable / fluent)
  ->fields( .. )
  ->from( .. )
  ->where( [
     ..
  ] )
  ->orderBy( .. )
  ->execute();

But, I think we should avoid this, or at the very least contain it slightly better so that we isolate responsibilities. For example. the "Query" object probably shouldn't have an "execute" method. Ideally, Query would be a stateless value object that you'd pass to a $db->execute() method, or something like that.

To allow easy testing and mocking, we may also want to separate "Query builder" from "Query" (value). The artile at https://ocramius.github.io/blog/fluent-interfaces-are-evil/) talks about the problem with mutable value objects, which interestingly enough are not avoided by making the objects immutable (see the article for details, but in a nutshell, as long as there are methods on the same class related to modifying or creating a derivative it makes mocking a mess).

We could use the fluent interface for the building of the query, with a final method that returns separate object that is just an immutable query value with only a constructor and getSql method.

$query = $db->newSelect() // returns query builder
  ->fields( .. )
  ->from( .. )
  ->where( [
     ..
  ] )
  ->orderBy( .. )
  ->getQuery();
$res = $db->execute($query);

We'd address testability and ease of mocking by only passing the Query object around, not the builder.

This is just a quick dump based on something I saw at a conference. Open for all kinds of other ideas :)

I was asked to comment on how considerations from phan-taint-check (static analysis) could help with this effort.

Basically, static analysis will never catch everything, but reducing usage of raw html, and having more things call methods as opposed to string concatenation, as suggested above, make it easier to statically verify that nothing bad is getting put in the wrong place.

In our current DB abstraction layer, often different parts get bundled into an array to be applied later (getQueryInfo()) type methods. The current version of phan-taint-check can't distinguish the taint status of different array elements from each other (arrays are treated as one value). This is arguably a limitation of the specific tool, and not one of static analysis in general, so improvements to the tool may fix this. But as it stands that's a major limitation with using the current static analysis tool with the current db abstraction layer.

Imarlier added subscribers: aaron, Imarlier.

@aaron Would be great to have your input on this

Krinkle triaged this task as Medium priority.Jul 23 2019, 4:40 PM
Krinkle renamed this task from Deprecate raw SQL for IDatabase query conditions (select, insert, etc.) to Deprecate raw SQL conditions for IDatabase methods (select, insert, etc.).Feb 24 2021, 2:51 AM
Krinkle updated the task description. (Show Details)
Krinkle added subscribers: tstarling, daniel.

The $conds parameter currently takes an array (or a string?!). We could make it take a Condition object instead, and deprecate the array. Condition could take an array in the constructor but be stricter about the contents of that array. And it could offer fluent-style setters to construct complex conditions safely. Could be used with SelectQueryBuilder as well, and for join conditions.

Moving to the PET "icebox", since we have no concrete plans to work on this in the foreseeable future. That doesn't mean we think it's a bad idea. This seems nice to have.