Page MenuHomePhabricator

RFC: Discourage use of MySQL's ENUM type
Open, MediumPublic

Description

  • Affected components: MediaWiki core and WMF-maintained extensions.
  • Policy steward: DBA
  • Stakeholders:
    • DBA
    • TechCom
    • Maintainers of core components and WMF-deployed extensions.

Motivation

The MediaWiki database schema contains several ENUM columns (see below in Current uses).

Issues:

  1. ENUM columns can make keyset pagination difficult, as illustrated in the Oct 2015 example below.
  2. Not all DBMSes support ENUMs.
  3. Changing an ENUM requires a schema change, which slows down development.
Requirements
  • Agreement that ENUM() is rarely needed in WMF-deployed MediaWiki database schemas.
  • Document some of the things it is currently used for and what we recommend instead.
  • Document some of the typical use cases for it in the industry, and what we recommend instead.
  • Decide whether to allow it at all for new schemas, and if so what the process for that should be.

Exploration

Oct 2015 example

Below is from an enwiki DB dump from October 2015:

1MariaDB [enwiki_subset]> select cl_type, cl_sortkey, cl_from from categorylinks where cl_to = 'A.F.C._Emley' order by cl_type, cl_sortkey, cl_from\G
2*************************** 1. row ***************************
3 cl_type: page
4cl_sortkey:
5A.F.C. EMLEY
6 cl_from: 5664196
7*************************** 2. row ***************************
8 cl_type: subcat
9cl_sortkey: PLAYERS
10A.F.C. EMLEY PLAYERS
11 cl_from: 18585943
12*************************** 3. row ***************************
13 cl_type: file
14cl_sortkey: AFCEMLEY.PNG
15 cl_from: 19575733
163 rows in set (0.00 sec)
17
18MariaDB [enwiki_subset]> select cl_type, cl_sortkey, cl_from from categorylinks where cl_to = 'A.F.C._Emley' order by cl_type, cl_sortkey, cl_from limit 1\G
19*************************** 1. row ***************************
20 cl_type: page
21cl_sortkey:
22A.F.C. EMLEY
23 cl_from: 5664196
241 row in set (0.00 sec)
25
26MariaDB [enwiki_subset]> select cl_type, cl_sortkey, cl_from from categorylinks where cl_to = 'A.F.C._Emley' and cl_type > 'page' order by cl_type, cl_sortkey, cl_from limit 1\G
27*************************** 1. row ***************************
28 cl_type: subcat
29cl_sortkey: PLAYERS
30A.F.C. EMLEY PLAYERS
31 cl_from: 18585943
321 row in set (0.00 sec)
33
34MariaDB [enwiki_subset]> select cl_type, cl_sortkey, cl_from from categorylinks where cl_to = 'A.F.C._Emley' and cl_type > 'subcat' order by cl_type, cl_sortkey, cl_from limit 1\G
35Empty set (0.00 sec)
36
37MariaDB [enwiki_subset]> select cl_type, cl_sortkey, cl_from from categorylinks where cl_to = 'A.F.C._Emley' and cl_type > 'file' order by cl_type, cl_sortkey, cl_from\G
38*************************** 1. row ***************************
39 cl_type: page
40cl_sortkey:
41A.F.C. EMLEY
42 cl_from: 5664196
43*************************** 2. row ***************************
44 cl_type: subcat
45cl_sortkey: PLAYERS
46A.F.C. EMLEY PLAYERS
47 cl_from: 18585943
482 rows in set (0.00 sec)

The comparison in the WHERE clause apparently is using the string value, yet the ORDER BY clause is using the internal ID, so the category link from the file page gets missed. In fact, ApiQueryCategoryMembers has a workaround for this involving multiple queries. This was noticed as part of T58041. The change that likely caused that issue not only severely hurt the performance of the script, but also its correctness.

Current uses

In core, there are eight ENUM columns in five different tables (see tables.sql):

  • categorylinks.cl_type
  • image.img_media_type
  • image.img_major_mime
  • oldimage.oi_media_type
  • oldimage.oi_major_mime
  • filearchive.fa_media_type
  • filearchive.fa_major_mime
  • uploadstash.us_media_type.

There are also some ENUM columns in extension tables:

  • ArticleFeedbackv5: aft_feedback.aft_discuss
  • CentralAuth: globaluser.gu_enabled_method, localuser.lu_attached_method, wikiset.ws_type, renameuser_status.ru_status, renameuser_queue.rq_status,
  • CentralNotice: cn_notice_log.notlog_action, cn_template_log.tmplog_action
  • ContentTranslation: cx_translations.translation_status
  • OAI: updates.up_action
  • SacredText: sacredtext_verses.st_religious_text
  • WikiLog: wikilog_comments.wlc_status

I'm not sure whether changing the data types of any of these columns (whether in core or extensions) could ever happen given the backlog of schema changes at WMF, though it's worth considering separately from this RFC.

Alternatives

Indexed columns could use tinyint, varbinary, or binary(1) (or more appropriate type).

Encourage normalization of schemas, with justification needed for exceptions.

Virtual columns.

Event Timeline

PleaseStand raised the priority of this task from to Needs Triage.
PleaseStand updated the task description. (Show Details)
PleaseStand added a project: Wikimedia-Rdbms.
PleaseStand added subscribers: PleaseStand, aaron.
Krinkle added a subscriber: Krinkle.

Tagging DBA to make a decision on whether or not this is a Bad Thing (TM), then resourcing/migration can follow or be declined.

Note this comment was when a different context and scope was given to the task. I mention to resolve it as invalid when it had the name "Avoid MySQL's ENUM type, which makes keyset pagination difficult" AND was not an RFC.

I think it is wrong to blankly ban enum usage. Sure, ORDER BY uses its internal id, but that is a widely known feature of MySQL enums, that in certain cases may be desired (e.g. custom ordering, not necessarily alphabetical). Other arrangements, such as numeric alternatives would only have a strinc numeric ordering, while enums could be ordered in any way it was desired (including alphabetic order). Please know I understand that that would be undesirable or surprising in certain cases, in which- sure, request the alter if they simplify code, but giving a general direction valid everywhere seems wrong.

In any case, IF a general direction would have to be given, it should be to avoid enums and replace with a foreign key (column reference) to another table where results are normalized. This will allow to modify the allowed properties without later doing extra schema changes, and similar storage requirements. Note that would NOT fix the main issue indicated here, which is the internal numeric ordering.

Two additional things to point here: if the enum was defined in lexicographical order, the issue would not be happening. In which case, the suggestion should be "create/maintain ENUM definitions in lexicographical order" (of course, that will be more maintenance-heavy). The other thing is that string ordering wouldn't be able to use the index (as it actually orders by its enum order). This, I think, is not relevant for the given query, but it would be important to not create bad queries in other cases.

So, in summary, while I understand the original request, I would only suggest to impose such a restriction on an individual case. Certainly, enums are not a good candidate for primary methods of indexing anyway (in mysql), but I think they are ok for certain properties, more expressive than arbitrary numbers or configuration that can be lost/not obvious. I am also ok with, in a case by case bases, when code can be simplified and avoid performance issues and missunderstanding/bugs, to substitute them -but not by tinyints, but whatever is the right type for references to extra tables, where the string value is conserved in a normalized schema. When we are on higher versions of MySQL/MariaDB, virtual columns may also be helpful here, for both indexing speedup (virtual indexes) and legibility- aka having enum-like strings on numeric columns, allowing all methods of ordering.

I suggest to close this as rejected (in spirit, as a general rule, with no judgement on individual cases which could still be filled, assuming they are advantageous on code). I would, however add a statement to encourage normalization of schemas (which enums generally are not), and to justify denormalization for new tables with explicit performance or other reasons.

Krinkle triaged this task as Medium priority.May 5 2020, 2:34 AM

@jcrespo Would if we said "using ENUM is discouraged. Use plain integers with mapping elsewhere is preferred (e.g. application constants or mapping table). If you think using ENUM is needed to achieve your functional or perf requirements, get DBA approval first."

Is that more or less accurate, or too strong? (wording to be improved etc.)

Independently of the "strength", I think it could be missunderstood, the same way now many people think "all primary keys should be autoincremental integers" instead of "if there is no good options for a PK, just add a new autoinc".

What about something like:

ENUMs maybe be useful in some particular cases, but please consider as a first option setting up a normalized schema, where strings that get repeated over many rows of a table are actually stored on a separate table, and just referred as identifiers. If you feel that enums are a best fit for your use case consider getting general consensus from several other mediawiki maintainers and developers".

I don't think "DBAs must approve" is a good way to go about this. I consider DBAs as part of mediawiki maintainers/mw community, I just don't think we should be the only gatekeepers (feel free to rewrite that in any other way).

Of course feel free to translate into proper English.

Krinkle renamed this task from Avoid MySQL's ENUM type, which makes keyset pagination difficult to RFC: Discourage use of MySQL's ENUM type.May 21 2020, 12:36 AM
Krinkle removed Krinkle as the assignee of this task.
Krinkle updated the task description. (Show Details)
Krinkle updated the task description. (Show Details)
Krinkle updated the task description. (Show Details)
Krinkle moved this task from P1: Define to P3: Explore on the TechCom-RFC board.

While it doesn't necessarily address the overall use of enum, discussion above regarding encouraging normalization tables led me to a proposal to remove 7 of the 8 uses in core: add a media_types normalization table to replace 4 of the enums, and a major_mimes table for 3 more[1]

Replace all 4 *_media_type ENUMs with *_media_type_id keys, and the same for the major_mimes[2]
New id fields are smallint unsigned NOT NULL instead of ENUM, addressing the original concern of this task, while also reducing duplication of all of the values

Thoughts?

[1] Draft schema (structure based on https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema#Name_Tables)

CREATE TABLE /*_*/media_types (
	mt_id smallint PRIMARY KEY AUTO_INCREMENT
	mt_name varbinary(64) NOT NULL
) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/mt_name ON /*_*/media_types (mt_name);

CREATE TABLE /*_*/major_mimes (
	mm_id smallint PRIMARY KEY AUTO_INCREMENT
	mm_name varbinary(64) NOT NULL
) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/mm_name ON /*_*/major_mimes (mm_name);

[2] Though the *_major_mime fields all have a default of unknown which would need to be replaced with a default id value corresponding to a specific row of the major_mimes table (unless we had the default be 0 with the ids starting at 1)

I honestly support banning using enum altogether. @jcrespo's point of view is more on the DBA side. Let me share my developer point of view:

  • It has been brought up here but I need to emphasize it, Having ENUM is a big development hurdle, every change requires a schema change which is really really painful (I'm trying to address it in abstract schema and schema changes). A simple change can become ten times more complicated because of this. OTOH, When I changed rc_patrolled to have three values instead of two (unpatrolled, patrolled, auto-patrolled) it was super easy because the field was tiny int type and not enum, I just used a higher number.
  • Not all DBMSes support ENUM, for example sqlite that we officially support doesn't have enum datatype and the only way is to use "constraint" (https://stackoverflow.com/questions/5299267/how-to-create-enum-type-in-sqlite). This is a hack and if we want abstract schema and schema changes, this is going to require crazy gymnastics to support. They are basically drifting and not the same.
  • Currently doctrine DBAL doesn't support ENUM because it doesn't exist in all DBMSes. Either I have to find a solution, re-invent the wheel because of this data type, or stop using ENUM. For other reasons, I personally prefer the third option.

@Ladsgroup I think what you wrote above should be in the task description as well, especially the part that it is not supported by all DBMS.

Also, the example of broken pagination is not clear to me. It's just a wall of text. Would be helpful if it could be down in a clearer way.

  • Not all DBMSes support ENUM, for example sqlite that we officially support doesn't have enum datatype and the only way is to use "constraint" […]

We currently have ENUM in our core schema, e.g. categorylinks and image, afaik we don't have a separate schema for sqlite. So what do we do for these today? What are the caveats with/impact of that today?

  • Not all DBMSes support ENUM, for example sqlite that we officially support doesn't have enum datatype and the only way is to use "constraint" […]

We currently have ENUM in our core schema, e.g. categorylinks and image, afaik we don't have a separate schema for sqlite. So what do we do for these today? What are the caveats with/impact of that today?

Glad you asked. Sqlite currently takes the mysql DDL and replaces keywords using regex: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/includes/libs/rdbms/database/DatabaseSqlite.php#894

Besides all of the horrors it has (like one letter variable names), it just replaces ENUM with TEXT, meaning no constraint on the data that can be put there + performance/storage nightmare.

NameTableStore, added in 1.31, provides a convenient way of handling pseudo-enums from the PHP side. (Alternatively, you can just use numeric IDs.) It's a little less self-explanatory on the DB side, but that's definitely a good tradeoff for avoiding the need for schema changes with every new value.

NameTableStore, added in 1.31, provides a convenient way of handling pseudo-enums from the PHP side.

As far as I can see, only core code can use it – it doesn't have any extension points for extensions.

NameTableStore itself can be used just fine, it doesn't need any extension points. NameTableStoreFactory can only be used by core (that should be fixed IMO), but you only really need it for having multiple instances for the same table (ie. when you need do cross-wiki DB operations).

NameTableStore, added in 1.31, provides a convenient way of handling pseudo-enums from the PHP side.

As far as I can see, only core code can use it – it doesn't have any extension points for extensions.

The new term store of wikibase uses it for one of its tables: https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/lib/includes/Store/Sql/Terms/DatabaseTypeIdsStore.php

Clarification: I think we need to decide whether we need another table or hardcoded numbers in the code on case-by-case basis but we should not use enum for any of those.

NameTableStore, added in 1.31, provides a convenient way of handling pseudo-enums from the PHP side.

As far as I can see, only core code can use it – it doesn't have any extension points for extensions.

The new term store of wikibase uses it for one of its tables: https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/lib/includes/Store/Sql/Terms/DatabaseTypeIdsStore.php

I see, you are constructing the class directly. I wonder how this is going to work with the new stable interface policy, since if it doesn't get annotated with @newable for the 1.35 release, by strict reading of the policy, there is no guarantee it won't be changed in the future without deprecation.

I see, you are constructing the class directly. I wonder how this is going to work with the new stable interface policy, since if it doesn't get annotated with @newable for the 1.35 release, by strict reading of the policy, there is no guarantee it won't be changed in the future without deprecation.

IIRC the class was used before stable interface policy gets approved and enforced.

Yeah, we also used it in WikimediaEditorTasks and MachineVision pre-policy. But ultimately, I think, the factory should be fixed. Filed T254242: NameTableStoreFactory should be extensible about that.

I see, you are constructing the class directly. I wonder how this is going to work with the new stable interface policy, since if it doesn't get annotated with @newable for the 1.35 release, by strict reading of the policy, there is no guarantee it won't be changed in the future without deprecation.

The idea is to mark the relevant classes before 1.35 is released, see https://www.mediawiki.org/wiki/Core_Platform_Team/Initiative/Stability_annotations

According to SQL Antipatterns using an ENUM is an antipattern unless the values are set in stone. It seems like the better solution is to use a separate lookup table.

LSobanski added a subscriber: LSobanski.

Unsubscribing DBA as there is no action for us. Please add us back if you need any further assistance.

@LSobanski Actually, I would like to hear DBAs opinion on this. What they feel about using ENUM in maintenance of the databases.

Ladsgroup: I already gave an opinion at T119173#6064823 (please note that that was in context of the original proposal of the task "ban ENUMs", not the context of the current task/RFC. That and subsequent comments already capture the essence of the my suggestions (discourage them and encourage table normalization). Of course, maybe those DBAs acting on schema changes may have additional thoughts.

However, as long as RFC is now, I believe that wouldn't affect DBAs and platform engineering except if the result if to actually remove current, old usages, which could be a multi-year project and not worth it?

However, as long as RFC is now, I believe that wouldn't affect DBAs and platform engineering except if the result if to actually remove current, old usages, which could be a multi-year project and not worth it?

Given Doctrine doesn't support ENUM, I was actually planning to replace simple ones but OTOH we can get away with using custom data type which is less than ideal but it would work so *shrugs*

Please what's the status of this and how to move it forward? I guess this will currently block T230428 since Doctrine doesn't know this type.

Please what's the status of this and how to move it forward? I guess this will currently block T230428 since Doctrine doesn't know this type.

You can define a custom data type (similar to mwtimestamp) and accept the values as customOptions (similar to mwtimestamp again) for now but we really should avoid using in the future. Also, IIRC only four tables in core have ENUM and fixing those requires lots of work (which we should do eventually but not as part of abstracting the tables IMHO)

You can define a custom data type (similar to mwtimestamp) and accept the values as customOptions (similar to mwtimestamp again) for now but we really should avoid using in the future. Also, IIRC only four tables in core have ENUM and fixing those requires lots of work (which we should do eventually but not as part of abstracting the tables IMHO)

Yes, I was thinking that has to be done eventually, but also wondered whether this can move forward, it has been 5 years so far...

For this to move forward, a concrete wording for a new policy needs to be proposed. It doesn't have to be big, just a couple of sentences to be added to
https://www.mediawiki.org/wiki/Manual:Coding_conventions/Database.

It seems to me like the rough consensus is that yes, we want to discourage the use of ENUM, but we do not want to outright prohibit it, at least for core. But ENUMs should only be used if they are clearly superior to the available alternatives (using strings, integers, or normalized references to a name table).

It seems to me like the move towards abstract schema declaration would prevent the use of ENUMs naturally, or require the definition of a custom type to allow ENUMs to be used on some backends. This would not be possible for extensions, unless we create a hook for defining custom types. Which we probably only want to do if there is a strong and clear need.

Is that a fair summary?

Is that a fair summary?

Yes, two extra notes:

  • I personally support banning it outright (and slowly migrating away current usecases, in core only four tables use it). Since I yet to see a case that ENUM is clearly superior than other options (maybe I'm missing something obvious, it's easy to convince me).
  • Sqlite has no support ENUM and we are officially supporting Sqlite, meaning all tables having this are actually using simple TEXT instead which defies the point of ENUM in the first place (performance, proper constraints on the data against corruption)

For this to move forward, a concrete wording for a new policy needs to be proposed. It doesn't have to be big, just a couple of sentences to be added to
https://www.mediawiki.org/wiki/Manual:Coding_conventions/Database.
[…]

I think you may've meant https://www.mediawiki.org/wiki/MediaWiki_database_policy per last week's meeting?

I think you may've meant https://www.mediawiki.org/wiki/MediaWiki_database_policy per last week's meeting?

Oh, hmm... perhaps these two should be merged? It's not entirely clear to me where this would fir better.

Funnily enough those two are highly outdated and needs to change to reflect the abstract schema and schema changes RFC. We also have https://www.mediawiki.org/wiki/Manual:Schema_changes. Our database documentation needs an overhaul in general (with the json schema, we can also generate some of those documentation automatically)

My policy suggestion: "ENUM datatype is not supported in all database engines and also requires schema change for changing its accetable values, so using it should be avoided as much as possible. If it's really needed, the developer must show it's clearly superior to its alternative solutions such as normalization tables or hard-coding values in the code"

Does that sound good? Edit mercilessly, my writing is terrible.

Is the documentation merging / overhaul tracked anywhere (in a task, perhaps)? Seems like a good opportunity to ask :)

Is the documentation merging / overhaul tracked anywhere (in a task, perhaps)? Seems like a good opportunity to ask :)

Not yet but it's good to have and do any way.

I know it's a bit off-topic but since I have your attention I wanted to say that with migrating tables to abstract schema now it's easy to automatically generate some of the documentation. I quickly wrote something and now it generated this from the tables.json.

I know it's a bit off-topic but since I have your attention I wanted to say that with migrating tables to abstract schema now it's easy to automatically generate some of the documentation. I quickly wrote something and now it generated this from the tables.json.

cool (though there appears to be a bug - for the indexes all of the indexes for a table all give the same name)