Page MenuHomePhabricator

Changing the PostgreSQL MediaWiki schema to use BYTEA where MySQL uses VARBINARY / VARCHAR BINARY would break SQL queries
Open, LowPublic

Description

When a field is varbinary in mysql (e.g. actor_name), the postgres version to be TEXT and not binary (BYTEA) because INSERT SELECT from other columns that are string break, or insert with integer values (while it doesn't break with TEXT type). For now, these columns are diverging.

Maybe it needs to do proper casting before inserting anything and then we can fix it.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

There are more details in the discussion on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/595289/, but basically identifier fields like username need to use a deterministic collation (meaning that two strings are only equal if they are bytewise equal), otherwise all kinds of security issues might arise. Common MySQL text types with non-binary collations are not deterministic (test), so one needs to use either VARBINARY or VARCHAR BINARY (which are not too different, especially given that on a wiki with default $wgDBTableOptions the default charset is binary so both will become VARBINARY in practice). There might be other RDBMS-es with a similar situation, so for an abstract schema a binary type is the right representation.

PostgreSQL's TEXT type on the other hand uses deterministic collation (Postgres 12 introduced nondeterministic collations but they require some upfront configuration) so it's fine to use that. In theory BYTEA would be fine too, but most of the existing Postgres table definitions use TEXT, and the switch is very disruptive (all SQL commands which compare a TEXT field with a BYTEA one or assign the value of one to the another break - example, example).

So either we need to add explicit cast to a bunch of places or switch all tables which interact with each other at the same time.

Tgr renamed this task from Postgres breaks with BYTEA type to Changing the PostgreSQL MediaWiki schema to use BYTEA where MySQL uses VARBINARY / VARCHAR BINARY would break SQL queries.Jul 12 2020, 7:38 PM
BPirkle subscribed.

Please retag if there's work for CPT on this.

Please retag if there's work for CPT on this.

Yes, it's core handling different databases, how it's not related to platform?

I actually found another case of this, it even has a todo in tables.sql of Postgers:

-- TODO: Change CHAR/SMALLINT to BOOL (still used in a non-bool fashion in PHP code)

When I set a datatype to boolean, it turns to TINYINT(1) in Mysql which is expected but in Postgres, it turns to boolean because it supports it but tests fail with this:

01:01:49 7) MediaWiki\ParamValidator\TypeDef\TagsDefTest::testValidCovers
01:01:49 Wikimedia\Rdbms\DBQueryError: Error 42804: ERROR:  column "ctd_user_defined" is of type boolean but expression is of type integer
01:01:49 LINE 1: ... "unittest_change_tag_def" SET ctd_user_defined = 1 WHERE (c...
01:01:49                                                              ^
01:01:49 HINT:  You will need to rewrite or cast the expression.
01:01:49
AMooney moved this task from Inbox to Tech Planning Review on the Platform Engineering board.

After informal discussion with a few folks in Platform-Engineering, I think what we'll need to move this along at all is the following

  • @daniel Please add a comment with a brain dump of the considerations that came to mind when you were looking at this task, there were a number of things you mentioned. It doesn't have to be a pretty or nice list.
  • We will want a DBA in the loop.
  • We need someone at this org who knows PostgreSQL well.
  • Then we should sit down and go through the list of concerns and figure out a plan.

My gut reaction after the discussion and reading this is that we'll want to choose the "switch all tables" option (see comment T257755#6299419 ) but that can turn out to be totally wrong.

Any BOOL field in the PostgreSQL schema should be altered to be an integer, for the same reasons as the rest of the schema consistency changes I recommended in T164898. That's an easy one, unless there's some detail I'm missing.

As for the rest of it, yes I agree that MySQL's VARBINARY fields should not be BYTEA in PostgreSQL. On T164898 the only field I recommended for BYTEA was old_text. BYTEA is only really needed when the text is gzip-compressed or otherwise contains control characters, TEXT is good enough for UTF-8. In MySQL we use VARBINARY for valid UTF-8 text. Maybe the name caused some confusion -- varbinary is just an 8-bit clean varchar, which we use to work around problems with MySQL's unicode support.

There's no need for casting if you just use the right type for actor_name, right?

Moving to tracking/watching. Per Tim's comment, there is a clear recommendation now. I'm assuming Amir is going to work on it, and we can lend a hand i need be. If this is not the case, we'll have to think about how to prioritize this and route it through the team boards.

Thank you. What @tstarling recommended is what we are doing currently and nothing would change except handling old_text. It's not super easy to have an exception in abstract schema but not too complicated either (similar to what we did for timestamp fields). I will make a patch about it soon. Also, I want to apologize for the task description, it's barely understandable. I assume I was really tired when I wrote it.