Page MenuHomePhabricator

Stop doing regex on SQL in Database
Closed, ResolvedPublic

Description

Currently Database and SQLPlatform classes have horrors such as:

		static $regexes = null;
		if ( $regexes === null ) {
			// Regex with a group for quoted table 0 and a group for quoted tables 1..N
			$qts = '((?:\w+|`\w+`|\'\w+\'|"\w+")(?:\s*,\s*(?:\w+|`\w+`|\'\w+\'|"\w+"))*)';
			// Regex to get query verb, table 0, and tables 1..N
			$regexes = [
				// DML write queries
				"/^(INSERT|REPLACE)\s+(?:\w+\s+)*?INTO\s+$qts/i",
				"/^(UPDATE)(?:\s+OR\s+\w+|\s+IGNORE|\s+ONLY)?\s+$qts/i",
				"/^(DELETE)\s+(?:\w+\s+)*?FROM(?:\s+ONLY)?\s+$qts/i",
				// DDL write queries
				"/^(CREATE)\s+TEMPORARY\s+TABLE(?:\s+IF\s+NOT\s+EXISTS)?\s+$qts/i",
				"/^(DROP)\s+(?:TEMPORARY\s+)?TABLE(?:\s+IF\s+EXISTS)?\s+$qts/i",
				"/^(TRUNCATE)\s+(?:TEMPORARY\s+)?TABLE\s+$qts/i",
				"/^(ALTER)\s+TABLE\s+$qts/i"
			];
		}

		$queryVerb = null;
		$queryTables = [];
		foreach ( $regexes as $regex ) {
			if ( preg_match( $regex, $sql, $m, PREG_UNMATCHED_AS_NULL ) ) {
				$queryVerb = $m[1];
				$allTables = preg_split( '/\s*,\s*/', $m[2] );
				foreach ( $allTables as $quotedTable ) {
					$queryTables[] = trim( $quotedTable, "\"'`" );
				}
				break;
			}
		}

or

		return !preg_match(
			'/^\s*(BEGIN|ROLLBACK|COMMIT|SAVEPOINT|RELEASE|SET|SHOW|EXPLAIN|USE)\b/i',
			$sql
		);

These are extremely hard to understand, prone to errors and mistakes and given that they are run on every query mediawiki makes (and I haven't even listed all regexes), they are quite taxing.

A rather easy solution is to introduce a new class called Query (as a data value) that has query verb in it as an attribute (and e.g. selectSqlText sets the verb properly). We probably have to keep these regexes for a while until we deprecate and remove support for string for ::query() but at least they won't be run for majority of cases, slowly we can start making Query object take query and value and start using prepared statements.

Does that sound like a good idea?

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 910756 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] [Very WIP] Introduce Query object

https://gerrit.wikimedia.org/r/910756

Basic idea seems good. It will take a while, but I'd love to get rid of those regexes eventually. The Database CRUD wrapper methods already bypass them with $flags, but something like a Query class could solve the case for query().

Basic idea seems good. It will take a while, but I'd love to get rid of those regexes eventually. The Database CRUD wrapper methods already bypass them with $flags, but something like a Query class could solve the case for query().

Yes and no. Yes. $flags bypasses the case of isWriteQuery() but $flags doesn't do anything in the following cases:

  • getQueryVerb(). Happens on every query. Multiple times.
  • getTempTableWrites() ran on every write query. I despise this one the most and by a large margin. That's the first regex at the top of this ticket.
  • makeCommentedSql() on every query.
  • TrxProfiler::getGeneralizedSql() on every slow query and every write query.

Probably there are more. Some could be fixed with Query object, some probably not.

Oh, Query could definitely improve those things, except maybe makeCommentedSql(). Is that method actually slow? The rest could be Query fields easily provided by the CRUD wrappers.

I'm a bit weary of using prepared statements due to extra queries, statefulness (what happens on connection loss?) and handling of x=? when ? is sometimes NULL. They only seem useful for heavily repeated queries that can just be batched.

AFAIK, and grepping around, we don't use temp tables with DB_REPLICA at WMF (we used to via SMW when it was on wikitech). If we treated temp table creation and writes as "permanent writes" nothing bad would happen. Various schema migration scripts uses __temp* tables, but those are writing to the primary anyway.

We use bare strings a lot with sql.php and schema scripts, so I don't think it's possible to remove string support queries. I think we can migrate in steps, something like:

  • Create query() support Query instances and make existing CRUD methods use them. If given a string, have query() make the Query instance using regexes.
  • Update duplicateTableStructure() to use Query and deprecate QUERY_PSEUDO_PERMANENT.
  • Warn when query() is used without a Query instance. Make sql.php avoid warnings by blindly creating a Query object with empty info except the raw query.
  • Remove the regexes for "DML write queries" and treat string queries as writes based only on the query verb.
  • Treat tables as "temporary" only if a Query object had a "CREATE TEMPORARY" verb and that table set. This includes isPristineTemporaryTable() too.
  • Remove the regexes for "DDL write queries". Temp tables have to use the prefix or be seen as permanent writes (disallowed with DB_REPLICA role handles).

Oh, Query could definitely improve those things, except maybe makeCommentedSql(). Is that method actually slow?

It's not really about speed/perf. It's about readability/maintainability and resilience. If the regex is buggy, it can eat into the SQL or cause a large scale outage by making invalid SQLs. Plus it's hard to understand to debug.

I'm a bit weary of using prepared statements due to extra queries, statefulness (what happens on connection loss?) and handling of x=? when ? is sometimes NULL. They only seem useful for heavily repeated queries that can just be batched.

You need to see where is the trade-off. Prepared statements are powerful for security (against SLQ injection), the perf regression might be worth it it might not. We need to run tests and make sure before deciding for or against it. The work I'm mentioning here is that it's worth doing regardless since you can resue the prepared statement directly in logging (in trx profiler/db error) and getting rid of reverse engineering of values in SQL. So it's worth going in that direction even if we don't end up using it.

AFAIK, and grepping around, we don't use temp tables with DB_REPLICA at WMF (we used to via SMW when it was on wikitech). If we treated temp table creation and writes as "permanent writes" nothing bad would happen. Various schema migration scripts uses __temp* tables, but those are writing to the primary anyway.

I'm not following this, the temp write section only gets triggered if isWriteQuery() is true which should not happen in replicas.

I'm slowly doing the work on this in the patch in this ticket.

Change 910756 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Introduce Query object

https://gerrit.wikimedia.org/r/910756

Ladsgroup added a project: DBA.

I'm doing the clean up. Mostly done now!

Change 913641 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Switch calls to ::query() to use Query object in DatabaseMysqlBase

https://gerrit.wikimedia.org/r/913641

Change 913644 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Avoid running GeneralizedSql::stringify()

https://gerrit.wikimedia.org/r/913644

Change 913641 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Switch calls to ::query() to use Query object in DatabaseMysqlBase

https://gerrit.wikimedia.org/r/913641

Change 913644 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Avoid running GeneralizedSql::stringify()

https://gerrit.wikimedia.org/r/913644

AFAIK, and grepping around, we don't use temp tables with DB_REPLICA at WMF (we used to via SMW when it was on wikitech). If we treated temp table creation and writes as "permanent writes" nothing bad would happen. Various schema migration scripts uses __temp* tables, but those are writing to the primary anyway.

I'm not following this, the temp write section only gets triggered if isWriteQuery() is true which should not happen in replicas.

I'm slowly doing the work on this in the patch in this ticket.

CREATE/INSERT triggers isWriteQuery(), but we only error out with DB_REPLICA if it is a "permanent" write (e.g. not temp table creation nor a write to a temp table). SMW* initially broke until those temp table exceptions were added (requiring regexes).

Removing the DML regexes from QueryBuilderFromRawSql is the easiest next step since temp tables writes should use the insert() wrapper. We don't have a create() wrapper though, so getting rid of the DDL regexes will be trickier. I don't know if it's worth making create(), built of the DBAL stuff, just for that. Maybe just requiring explicit Query arguments to query() is the easiest route for "CREATE TEMPORARY TABLE", though that goes against the idea of Query being internal.

A more opinionated approach is to not support temp tables with DB_REPLICA. Instead, things like subqueries and common table expressions would be used. The use of temp tables on regular requests involves extra rountrips, state (which confounds silent reconnection), and is often engine-specific. I don't know exactly what SMW needs temp tables for, but it could probably be changed. We should probably talk to them about it though.

Maybe the simplest way to not break SMW is:

  • Add a QUERY_TEMP_ONLY flag that means "trust that this write is to a temp table"
  • Allow passing it for insert()/InsertQueryBuilder calls
  • Update release notes, announce, and give time for callers to be migrated to use this flag

Change 914281 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Stop supporting MySQL in SQLite via regex

https://gerrit.wikimedia.org/r/914281

Maybe the simplest way to not break SMW is:

  • Add a QUERY_TEMP_ONLY flag that means "trust that this write is to a temp table"
  • Allow passing it for insert()/InsertQueryBuilder calls
  • Update release notes, announce, and give time for callers to be migrated to use this flag

I'm happy with that solution but I'm weary of flags in general. They are not easy to understand to a newcomer, they are not well-defined (what are we going to flag and what we don't?), they don't form a MECE and prone to mistakes by using the wrong ones. Now that we have an object that can have multiple attributes, we can go with several ones, e.g. an attribute on if it's affecting temp tables.

I really want to rework how flags work inside rdbms (query and db object as well)

Change 916458 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Introduce a proper generalized SQL instead of running regex on string

https://gerrit.wikimedia.org/r/916458

Change 914281 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Stop supporting MySQL in SQLite via regex

https://gerrit.wikimedia.org/r/914281

Maybe the simplest way to not break SMW is:

  • Add a QUERY_TEMP_ONLY flag that means "trust that this write is to a temp table"
  • Allow passing it for insert()/InsertQueryBuilder calls
  • Update release notes, announce, and give time for callers to be migrated to use this flag

I'm happy with that solution but I'm weary of flags in general. They are not easy to understand to a newcomer, they are not well-defined (what are we going to flag and what we don't?), they don't form a MECE and prone to mistakes by using the wrong ones. Now that we have an object that can have multiple attributes, we can go with several ones, e.g. an attribute on if it's affecting temp tables.

I really want to rework how flags work inside rdbms (query and db object as well)

What do you think of the idea of requiring DB_REPLICA temp tables to have an obvious prefix? e.g. a tempTableName() method that adds a prefix like __tmp_session_*? Any such tables could be except from the "is permanent write" checks. This would avoid adding a new QUERY_* flag and would only require changes to methods already doing raw CREATE TEMPORARY query() calls.

Change 920277 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Clean and comment SQLs in delete during build of the SQL

https://gerrit.wikimedia.org/r/920277

Change 920277 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Build generalized SQL while building the main SQL in DELETE

https://gerrit.wikimedia.org/r/920277

For delete queries, it now works well without doing regex to build the normalized query: https://logstash.wikimedia.org/goto/17af87981006469b62f5ade3d91d9765

I'll make patches for the rest of clean up later this week.

Change 932017 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Build generalized SQL while building the main SQL in UPDATE

https://gerrit.wikimedia.org/r/932017

Change 932017 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Build generalized SQL while building the main SQL in UPDATE

https://gerrit.wikimedia.org/r/932017

Change 942511 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Build clean SQL in SQLPlatfrom instead of regex for INSERTs

https://gerrit.wikimedia.org/r/942511

Change 942511 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Build clean SQL in SQLPlatfrom instead of regex for INSERTs

https://gerrit.wikimedia.org/r/942511

Change 944935 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Add tests for building insert SQL

https://gerrit.wikimedia.org/r/944935

Change 944935 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Add tests for building insert SQL

https://gerrit.wikimedia.org/r/944935

Change 947017 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] [POC] [DNM] [WIP] rdbms: Build clean SELECT SQL alongside the main one

https://gerrit.wikimedia.org/r/947017

Change 947017 abandoned by Ladsgroup:

[mediawiki/core@master] [POC] [DNM] [WIP] rdbms: Build clean SELECT SQL alongside the main one

Reason:

https://gerrit.wikimedia.org/r/947017

Ladsgroup moved this task from In progress to Done on the DBA board.

Any code path that would have run regex on majority or even significant proportion of SQLs are now migrated to pass around the information instead of reverse engineering it via regexes afterwards. All regexes have been moved to dedicated class to improve readability of the main codebase.

Another one that can be done is doing clean regex by default for SELECT SQLs, it is much more complicated than it looks due to DatabaseMysqlBase::selectSQLText() being weird (see the patch for details). But given that no regex will be triggered unless the query errors or be slow, it's not really worth the time trying to figure out what to do with DatabaseMysqlBase::selectSQLText() without breaking the max exec time functionality. We want to eventually get back to fixing this as I dream that one day we would use prepared statements.

Change 916458 abandoned by Krinkle:

[mediawiki/core@master] Introduce a proper generalized SQL instead of running regex on string

Reason:

Was since merged and squashed into Ic8820fd993f142507870573.

https://gerrit.wikimedia.org/r/916458