Page MenuHomePhabricator

A query builder for MediaWiki core
Closed, ResolvedPublic

Description

As part of my review of data access abstractions in MediaWiki core, I'm introducing a query builder module. The idea is to encapsulate the parameters to IDatabase::select(), as seen in the getQueryInfo() pattern, into an object.

The class has chainable mutator methods, similar to those in Doctrine's QueryBuilder. Unlike in Doctrine, to help with migration from existing code, it is necessary to allow callers to directly set the underlying arrays. Certain implementation details are exposed.

The query builder object does not require an IDatabase for its construction, only for its execution. This gives it a similar state to the getQueryInfo() arrays, which eases migration. The disadvantage of this approach is that it is difficult to add an expression builder. Ideally, the arrays passed to IDatabase::select() would be DBMS-independent, including not having any quoted strings in them. For example, instead of IDatabase::buildLike(), which returns a DBMS-dependent string, you would have a LikeExpression value object which would be resolved during the execution of IDatabase::select(). But I think that's a project for another day. For now, the syntax will be $queryBuilder->where( 'user_name' . $db->buildLike(...) ).

The query builder's join methods, which are aimed at new code, require a modern style in which all joins have aliases and join conditions. Doctrine's join interface is very similar. Instead of

$tables[] = 'user';
$join_conds['user'] = [ 'JOIN', 'rev_user=user_id' ];

We would have

$builder->join( 'user', 'user1', 'rev_user=user_id' );

I'm adding structured support for parenthesized join expressions and joining on a subquery.

Doctrine uses the same QueryBuilder class for select, update and delete queries, which I don't think is advisable since the allowed methods are different. I'm adding a newable SelectQueryBuilder; we could later add UpdateQueryBuilder and DeleteQueryBuilder.

Instead of a single execute() function, I'm planning to have wrappers for select(), selectField(), selectFieldValues(), selectRow() and selectSQLText(), potentially called fetchResultSet(), fetchField(), fetchFieldValues(), fetchRow() and getSQL() respectively.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 17 2020, 4:25 AM
Tgr added a subscriber: Tgr.Jan 17 2020, 8:07 AM

T203712: Formalize getQueryInfo() usage tried to ease migration by implementing ArrayAccess.

I note that getQueryInfo() has been fairly successful because it defines the what (tables, fields, join conditions) without trying to specify the how (WHERE conditions, options), meaning that the code actually handling the how can easily mix-and-match and add special cases as needed. In the past I've had some concern that trying to replace it with an actual query builder would lose this property, and we'd wind up with code that's too inflexible to serve the various use cases.

On the other hand, use of a query builder (as implemented by ApiQueryBase) for the code actually implementing the how for specific use cases has generally been a success in the Action API's query modules. It seems to have been less of a success in special page Pager implementations, which I attribute to a design that makes too many assumptions about the queries (particularly in how limits and ordering works) and so winds up being too inflexible for some use cases.

We could always have an interface that returns a partially filled SelectQueryBuilder, with empty conditions. I'm not sure if it's worth worrying about right now. The low hanging fruit is one-off queries that aren't part of any larger query building system, and direct callers of Database::select() like QueryPage::reallyDoQuery(). I mentioned getQueryInfo() because it's a good analogy, I'm not planning to get rid of it immediately. The point of introducing SelectQueryBuilder is to split a useful concept out of ApiQueryBase so that it can be used in pure backends.

A goal I have is to try to put table options close to the tables they serve. So far, I have formal parameters:

$queryBuilder->table( $tableName, $alias, $useIndex );

Doctrine does the same. But to reach feature parity with Database::select(), we also need IGNORE INDEX, and a few other options would need to be added if we wanted to expose all MySQL features. Having lots of formal parameters doesn't seem to take advantage of the fluent style. How about

$queryBuilder ->table( $tableName, $alias )->useIndex( $index )->ignoreIndex( $otherIndex );

Where useIndex() and ignoreIndex() act on the most recently declared table? It's implicit but brief, and you could be creative with line breaks to help readability. Note that there is also $queryBuilder->options() as a last resort array hacking function, so it's never necessary to parse an options array and convert it to a call chain.

How about

$queryBuilder ->table( $tableName, $alias )->useIndex( $index )->ignoreIndex( $otherIndex );

Where useIndex() and ignoreIndex() act on the most recently declared table? It's implicit but brief, and you could be creative with line breaks to help readability. Note that there is also $queryBuilder->options() as a last resort array hacking function, so it's never necessary to parse an options array and convert it to a call chain.

I like the "fluent" style for builders in general, for something like $builder->select(...)->from( ... )->where( ... ). In fact, because of this, I'd prefer from over tables (or maybe support both). Making this stateful, using implicit context, has some potential for confusion though... Would it be terrible to have from( $table, $alias, $options ) as a signature? They are rather uncommon, right? To me it would make more sense to have dedicated methods for group, sort, and limit, rather than USE INDEX and IGNORE INDEX.

By the way, we currently support USE INDEX and IGNORE INDEX as on option that applies to the entire query, right? I think it would be fine to just continue doing this. Or do we need to be able to apply these to individual joins or grouping operations?

I like the "fluent" style for builders in general, for something like $builder->select(...)->from( ... )->where( ... ). In fact, because of this, I'd prefer from over tables (or maybe support both).

I believe fluent interfaces often use aliases so that you can choose whatever terminology best suits the call site. I already have two aliases for where(): andWhere(), by analogy with Doctrine, and conds(), which was a late addition but it makes the code so much more obvious when the same terminology is retained throughout, and currently this concept is almost always called conds. So I can certainly add a from() alias. Note that there is both table(), which adds a single table, and tables(), which adds an array of tables, but I'm discouraging the latter for new code since it makes more sense to use join(). I imagine from() would be an alias for table().

Making this stateful, using implicit context, has some potential for confusion though... Would it be terrible to have from( $table, $alias, $options ) as a signature? They are rather uncommon, right? To me it would make more sense to have dedicated methods for group, sort, and limit, rather than USE INDEX and IGNORE INDEX.

USE INDEX appears in 21 files in core, excluding the rdbms library and tests. That's pretty common. IGNORE INDEX appears in 5 files. I'll consider your suggestion.

By the way, we currently support USE INDEX and IGNORE INDEX as on option that applies to the entire query, right? I think it would be fine to just continue doing this. Or do we need to be able to apply these to individual joins or grouping operations?

In terms of the SQL query grammar, index hints apply to individual tables, there's no concept of an index to use for an entire query. If $options['USE INDEX'] is a string, it is simply appended to the FROM clause, meaning it will apply to the last table in the FROM list.

Is it easier to accept a context-sensitive useIndex() function if you consider it to be effectively appending to a query string?

How about accepting either an array or a string? If a string is given, it is applied to the most recent table, whereas if an array is given, it is merged with the raw option array meaning that the aliases will be given in the array keys.

Change 566126 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] In Database::select() allow an empty array for $table

https://gerrit.wikimedia.org/r/566126

Change 566127 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Add SelectQueryBuilder

https://gerrit.wikimedia.org/r/566127

Change 566128 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Use a SelectQueryBuilder in SpecialWhatLinksHere

https://gerrit.wikimedia.org/r/566128

Is it easier to accept a context-sensitive useIndex() function if you consider it to be effectively appending to a query string?

Yea, it's not so terrible. It's a builder afterall, things being sensitive to the order of calls isn't so surprising.

How about accepting either an array or a string? If a string is given, it is applied to the most recent table, whereas if an array is given, it is merged with the raw option array meaning that the aliases will be given in the array keys.

I like the flexibility, but I'm afraid it's more confusing...

We could always have an interface that returns a partially filled SelectQueryBuilder, with empty conditions. I'm not sure if it's worth worrying about right now.

If the plan isn't to replace getQueryInfo(), we can leave this point until then. When we do come back to it, the issue IMO is more in "combining" than "filling".

If the plan isn't to replace getQueryInfo(), we can leave this point until then. When we do come back to it, the issue IMO is more in "combining" than "filling".

I'd like to understand better what you mean. Do you have an example where returning a pre-filled SelectQueryBuilder instead of a QueryInfo array would make it difficult to construct the question? Seeing this in code would make it more clear to me.

For a getQueryInfo() array you can generally just array_merge(), and you can manipulate the array in other ways (e.g. reordering tables because the query wants to use the STRAIGHT_JOIN hint, like 9e871e05b73). For objects with internal state, such things need explicit support in the object, or you need to access the "internal" state which often doesn't really seem intended to be accessed in that way.

tstarling added a comment.EditedJan 23 2020, 10:54 PM

In a meeting, @daniel convinced me that there's not much to gain by making SelectQueryBuilder be a value object as opposed to having IDatabase act as a factory. It's likely that anything that constructs a SelectQueryBuilder will also need access to an IDatabase, at least for addQuotes(). If IDatabase is a factory, then SelectQueryBuilder can potentially be extended in future with expression builder functions that depend on addQuotes() etc.

Maybe we could have a way to change the database a SelectQueryBuilder is attached to, so that if there is a module boundary between SelectQueryBuilder construction and execution, the caller of the execution function can still be in control of the connection parameters.

I'm not convinced that an intra-module circular dependency is of sufficient concern to warrant splitting IDatabase, i.e. having IDatabase extend from an interface that only provides the query execution functions, so that the SelectQueryBuilder constructor can type hint an interface that does not include the SelectQueryBuilder factory function. That's additional complexity for everyone who wants to find the documentation or use IDatabase.

Note that it will be necessary to manually break the circular reference in SelectQueryBuilder::__destruct() so that DBConnRef will work as expected.

See also:

I see two good opportunities for splitting the IDatabase interface:

  1. […]
  1. Separate logic for database interaction from logic for building sql text.

This is a proposal for a possible next refactor, which would reduce IDatabase to only the state management of the connection and the logic needed to send and interpret queries.

The stateless logic for building sql text (e.g. "query builder") could be separated. The main reason I'm proposing this is so that we could group together all the stateless utility methods for rdbms-backend specific SQL text manipulation. As well as in context of deprecating raw SQL chunks in queries (see T210206).

Have you considered the trade-off between keeping all logic specific to one backend together vs having the query builder be a standalone child object consumed by (and not aware of) the IDatabase class? This could still be done later, so it's not mutually exclusive with your current approach. Just curious about your thoughts on this. I'm feel a bit torn between them myself actually. Could use a second opinion :)

Change 570517 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] In ApiQueryBase, use a SelectQueryBuilder to store query information

https://gerrit.wikimedia.org/r/570517

Change 566126 merged by jenkins-bot:
[mediawiki/core@master] In Database::select() allow an empty array for $table

https://gerrit.wikimedia.org/r/566126

Change 566127 merged by jenkins-bot:
[mediawiki/core@master] Add SelectQueryBuilder

https://gerrit.wikimedia.org/r/566127

daniel assigned this task to tstarling.Apr 7 2020, 12:48 PM
daniel triaged this task as Medium priority.Apr 7 2020, 12:50 PM

Change 566128 merged by jenkins-bot:
[mediawiki/core@master] Use a SelectQueryBuilder in SpecialWhatLinksHere

https://gerrit.wikimedia.org/r/566128

Change 570517 merged by jenkins-bot:
[mediawiki/core@master] In ApiQueryBase, use a SelectQueryBuilder to store query information

https://gerrit.wikimedia.org/r/570517

tstarling closed this task as Resolved.May 22 2020, 1:13 AM
Aklapper removed a subscriber: Anomie.Oct 16 2020, 5:41 PM