Page MenuHomePhabricator

Clean up changed/temp table tracking in Database/Query and document table name expectations
Open, In Progress, MediumPublic

Description

Problem:
The use of table name strings in rdbms lacks a consistent, documented, convention on quoting (e.g. backticks), prefixing (e.g. $wgDBprefix), and hierarchy qualification (e.g. $wgDBname and $wgDBmwschema). This includes method parameters and keys/values in temp table tracking arrays. This makes it hard to tell when the same table is referenced twice, since there are many ways to refer to the same table. The tableName() method is not reliably reversible, so we cannot just aggressively normalize all table arguments (and it would add overhead/complexity even if we could). We also do not track that database in which each temp table resides on.

Impact:
We have fragility in CI since integration tests need *reliable* table use tracking in order to reset tables between tests (usually via TRUNCATE). Bugs can occur when the same table is referenced with different strings (e.g. "a" and "currentdb.a") or with a string having unexpected quoting/qualification (e.g prefix_a instead of "a" or "currentdb.a" instead of "a"). Historically, these bugs manifest as confusing CI failures (e.g. AuthManager tests failing due to an actor_id values being too high or DatabaseBlockTest failing with duplicate key errors).

The use of "<db>.table" by callers to rdbms methods is idiosyncratic and makes it less clear from the code how the tables and databases form "datasets" (related data that can can be cross-referenced in single queries and must atomically change/replicate together). Getting a handle for database X only to query database Y requires a more complex mental model for developers.

Querying remote database or JOINing across databases can easily result in needless presumptions about what databases reside together on the rdbms servers, which should be left up to DBAs to decide. At Wikimedia, the core databases are grouped into clusters s1 through s8 in order to scale the amount of data and traffic, meaning there is no single server that contains all of these databases. Local testing environments will not likely catch wrong presumptions of database colocation. Tightly related data should generally just use a global database, rather than be spread across multiple databases only to be awkwardly queried with UNION/JOIN/subqueries. For cross-database maintenance scripts, such queries are usually a premature optimization anyway; they can easily use multiple/iterative single-database queries instead.

One bug that can arise is the incorrect throwing of a "read-only" exception upon a write to a temp table from a DB_REPLICA handle. This can happen when using "CREATE TEMPORARY TABLE a" with query() and then calling select() on table "currentdb.prefix_a" (instead of just table "a"). Wikimedia no longer uses temp tables for production queries the undeployment of SemanticMediaWiki, so this bug does not effect us.

Another bug can arise when the selected database changes. Tables can incorrectly be seen as temp tables and thus (a) writable by DB_REPLICA handles even if permanent/pseudo-permanent and (b) existing when they do not, via tableExists(). This bug is rare and mostly effects tests that use the legacy MediaWikiIntegrationTestCase::db field (T316841) while triggering a downstream DBConnRef for a different database.

Solution
We should simplify and clarify the expectations around DB table name strings in the Database, SQLPlatform, and the basic core *QueryBuilder classes. Specifically:

  • Document the expected format of all table name arguments to methods. The vast majority require tables names to have no quotes, no table prefix, nor any db/schema qualifiers. The arguments for tables targeted for writes must be table names. The arguments for tables targeted for reads are either table names or, if derived tables are allowed, table references. Using "unqualified table name" or "unqualified table reference" seems apt enough for quick doc block comments. Put detailed documentation in IDatabase::select() that can be referenced from elsewhere.
  • Fix any methods that expect different table name formats and update the callers. There are only a few and they are used for internal things or some obscure bits in DatabaseInstaller/DatabaseUpdater.
  • Warn/log when qualified table names are provided in the wrong places.
  • In Database, track which database each temp table belongs to.
  • Properly handle and document the oddities around temp table schemas in sqlite/postgres vs mysql.

Background:
The table names in the Query class are mostly used for two things:
a) For tracking what session temp tables exists, which effects whether a query counts a permanent write or not. This is used for logging and safety checks. Accommodations are made for the TEMPORARY tables used to mock permanent tables during testing, so they are treated similarly by these checks.
b) For supplying names to ChangedTablesTracker. This is used for tracking what tables changing during integration tests, so that we can reset them. These tables are usually though not always TEMPORARY.

For the session temp table tracking, we use a field in Database called "sessionTempTables", keyed by table name. Such table names should to be fully qualified before any simple equality comparisons. Currently, registerTempTables() and hasPermanentTable() just use tableName(...,'raw'), which resolves table prefixes/$wgSharedTables. There are a few problems here:

  • This does not qualify the table with the db name. Thus, it does not account for the fact temp tables on different dbs can have the same name in mysql.
  • We use schemas in postgres, but temp tables always go in the pg_temp schema, which makes it hard to use fully qualified table names in "sessionTempTables". Determining when to use db.pg_temp.table vs db.schema.table would be messy. Technically, sqlite uses a similar TEMP schema for temp table too, but doesn't practically matter for us unless ATTACH is used (e.g. $wgSharedTables).
  • These methods compare "raw" tables names with those from Query::getTable(), yet the corresponding "new Query()" callers do not prefix/qualify the table name. This means that, normally, a magic "base" table name (no prefix/$wgSharedTables applied) will be compared to the tableName(...,'raw') keys in "sessionTempTables", which does not make sense.

It isn't clear how much ChangedTablesTracker::getTables( $domainId ) is supposed to be for "changed tables belonging to this domain" or "table changed while the current domain was this domain", though it seems to mostly do the later. $wgSharedTables muddles the waters too much for the first case. In the second case, it's still odd that a table from $domainId could be changed while the current domain was $otherDomain, which would then not show in ChangedTablesTracker::getTables( $domainId ). This would only happen with something like $dbw->insert( "foriegn_db.table", ... ), which shouldn't really happen, but isn't proscribed by a comments nor safety checks.

It isn't documented whether ChangedTablesTracker is supposed to work with "base" table names (no prefix/$wgSharedTables applied yet). That seems to be the expectation though. For example, MediaWikiIntegrationTestCase::resetDB() checking against the string "user". When MediaWikiIntegrationTestCase::truncateTables() runs on those entries, this works as long as all callers of the RDBMS query wrapper methods provided "base" table names. The duplicateTableStructure() methods are special case, using a dot prefix to treat the table name as full (bypass prefix/$wgSharedTables).

If we want ChangedTablesTracker to use "base" table names, then pretty much all of the $table arguments to RDBMs methods should be documented as requiring "base" names and made to refer to some common documentation of the concept. This would still run into the problem of queries doing cross-table joins, which involve using remote DB table names (e.g. "db2.table"). SQLPlatform::tableName() will assumed that such tables are already prefixed and $wgSharedTables was resolved. Getting "base" names from qualified names is not possible (similar to "is this string encoded?" questions). This case would really only matter for cross-DB JOINs, which are rarely a thing in MediaWiki. It would otherwise be silly to get a db1 handle to do "SELECT ... FROM d2.table" and even more so for writes. The write methods could actually be made to disallow table names with "db.table" in them, and these are the ones that matter with regard to temp/write table tracking. QueryBuilderFromRawSql should be removed, with every query() using Query. This would avoid having to guess "base" table names from table names that are possibly prefixed, qualified or transformed by $wgSharedTable.

We should document correct use, log misuse, fix misuse, and eventually enforce against misuse. There is already more than enough accidental complexity around table name handling and unit test table reset logic (setup/teardown). It's become a time sink on various occasions. The goal is to reduce developer confusion and prevent the risk of mistakes propagating, which can be difficult to fix and become self-perpetuating: it's hard to add enforcement against bad assumptions later on if it breaks stuff <=> lack of enforcement makes it easy to add bad assumptions.

Event Timeline

Change #976368 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: improve Query/QueryBuilderFromRawSql handling of table tracking

https://gerrit.wikimedia.org/r/976368

aaron triaged this task as Medium priority.
aaron moved this task from Incoming (Needs Triage) to In Progress on the MW-Interfaces-Team board.
FJoseph-WMF changed the task status from Open to In Progress.Apr 5 2024, 2:43 PM

Change #1017422 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: document where unprefixed/unqualified table are expected

https://gerrit.wikimedia.org/r/1017422

Change #1019137 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: track session temp tables by DB name in Database

https://gerrit.wikimedia.org/r/1019137

Change #1017422 merged by jenkins-bot:

[mediawiki/core@master] rdbms: document where unprefixed/unqualified table names are expected

https://gerrit.wikimedia.org/r/1017422

Change #976368 merged by jenkins-bot:

[mediawiki/core@master] rdbms: use qualifiedTableComponents() in more places

https://gerrit.wikimedia.org/r/976368

Change #1032568 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: rename and clarify various table name parameters

https://gerrit.wikimedia.org/r/1032568

Change #1034425 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: clean up indexExists() and indexUnique() in Database

https://gerrit.wikimedia.org/r/1034425

Change #1032568 merged by jenkins-bot:

[mediawiki/core@master] rdbms: rename and clarify various table name parameters

https://gerrit.wikimedia.org/r/1032568