
[PageAssessments] db/addProjectsTable.sql : Error: 1071 Specified key was too long; max key length is 767 bytes
Closed, Declined · Public

Description

When running MySQL/MariaDB with the charset utf8mb4, a character is 4 bytes. Hence a key created on a VARCHAR(255) takes 255 characters * 4 bytes/character = 1020 bytes. However, InnoDB defaults to a maximum of 767 bytes for an index key.

The extension fails to install on Debian Stretch which uses utf8mb4 as the default charset:

Query: CREATE UNIQUE INDEX pap_project_title ON  page_assessments_projects (pap_project_title)

Function: Wikimedia\Rdbms\Database::sourceFile( /workspace/src/extensions/PageAssessments/src/../db/addProjectsTable.sql )
Error: 1071 Specified key was too long; max key length is 767 bytes (localhost:/tmp/quibble-mysql-m3atn8f1/socket)
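
A minimal way to see the failure outside the installer (a sketch only; the column name is copied from the extension, but the standalone table pa_demo is hypothetical, and it assumes a server where the effective InnoDB index prefix limit is still 767 bytes, e.g. innodb_large_prefix = OFF or a COMPACT row format, as on Debian Stretch defaults):

CREATE TABLE pa_demo (
  pap_project_title VARCHAR(255) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- 255 characters * 4 bytes/character = 1020 bytes > 767, so this fails with
-- "ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes":
CREATE UNIQUE INDEX pap_project_title ON pa_demo (pap_project_title);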

Event Timeline

Restricted Application added a subscriber: Aklapper.

@hashar Can you provide some more context on this ticket?

Please don't take any action yet; while the bug is real, maybe it should be solved in core, not with a schema change.

Vvjjkkii renamed this task from [PageAssessments] db/addProjectsTable.sql : Error: 1071 Specified key was too long; max key length is 767 bytes to xudaaaaaaa. · Jul 1 2018, 1:12 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Mainframe98 renamed this task from xudaaaaaaa to [PageAssessments] db/addProjectsTable.sql : Error: 1071 Specified key was too long; max key length is 767 bytes. · Jul 1 2018, 11:45 AM
Mainframe98 raised the priority of this task from High to Needs Triage.
Mainframe98 updated the task description. (Show Details)
Mainframe98 added a subscriber: Aklapper.

@jcrespo - Any update on this? Should we go ahead and fix it with a schema change?

FYI, it looks like this bug affects at least 33 other extensions: https://phabricator.wikimedia.org/search/query/.toXYVsB2ZdB/. Some have already added schema changes to work around the problem.

So the thing is InnoDB DOES support longer index keys than 767 bytes. By default, on the latest versions of MariaDB (10.2 and up) and MySQL (8.0 and up), it supports up to 3072 bytes, and on supported versions before those it supports that when using the Barracuda file format with innodb_file_per_table and innodb_large_prefix = ON.
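
For reference, a sketch of the tuning described above for those older versions (the variable names are the standard MySQL/MariaDB ones; newer releases enable this by default, and the table must also use a Barracuda row format such as DYNAMIC or COMPRESSED for the 3072-byte limit to apply):

SET GLOBAL innodb_file_format = 'Barracuda';
SET GLOBAL innodb_file_per_table = ON;
SET GLOBAL innodb_large_prefix = ON;

-- Existing tables need a Barracuda row format before long index keys work:
ALTER TABLE page_assessments_projects ROW_FORMAT=DYNAMIC;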

The thing is that it seems (as an external observation) MediaWiki has decided to drop 'utf8' support (UTF-8, 3-byte implementation) and use only the binary collation moving forward, in which storage is measured in bytes, not characters: see T194125, closed as resolved.

Based on that, I am unsure how "extensions" should react to it: a char(10) (storing 10 characters) will automatically be interpreted as a char(10) binary (storing 10 bytes) if a binary collation is used, with no schema change (for new tables/installs), but it is unclear to me whether extensions should hardcode binary anyway. I can answer the technical aspects of this (DB consequences), but of course I cannot advise on the MediaWiki side. For example, should you support transitions to binary for utf8 installations? (I don't know.)
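
To illustrate the point about binary interpretation (a throwaway sketch, not part of the extension's schema; the table charset_demo is hypothetical): with a binary default charset, MySQL rewrites character types to their binary counterparts, so lengths count bytes rather than characters.

CREATE TABLE charset_demo (
  c CHAR(10),
  v VARCHAR(255)
) ENGINE=InnoDB DEFAULT CHARSET=binary;

-- SHOW CREATE TABLE charset_demo;
-- reports the columns as BINARY(10) and VARBINARY(255).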

What I definitely suggest is NOT to use utf8 (the 3-byte implementation of UTF-8).

> So the thing is InnoDB DOES support longer index keys than 767 bytes. By default, on the latest versions of MariaDB (10.2 and up) and MySQL (8.0 and up), it supports up to 3072 bytes, and on supported versions before those it supports that when using the Barracuda file format with innodb_file_per_table and innodb_large_prefix = ON.

@jcrespo - That's good to know. One thing I don't really understand is the actual severity of this bug. For example, if we want to enable PageAssessments on Russian Wikipedia (T184967), which doesn't have these tables yet, would this bug block that? Or are we already using the config options you mention above? In other words, does this bug only affect 3rd-party users, or also the WMF?

Only 3rd party users (not WMF production). However, CI was affected once; not sure how that ended up.

767 maps well to "utf8" limited to 255 characters coded on 3 bytes each (total is 765, plus 2 additional bytes... for a trailing null?), i.e. 256*3 - 1 (excluding an additional lead byte).
With "utf8mb4", the limit would then be 256*4 - 1 = 1023; we don't need 3072, but InnoDB should then be tuned to effectively support the "utf8mb4" that MediaWiki chose.


TL;DR (sorry):

With other SQL backends that have native internal support for Unicode collations, VARCHAR(255) would remain measured in characters, not bytes, and the engine will compute the relevant size limits in bytes for its indices. (Note that on most other SQL engines, indices and normal tables are the same thing: everything is coalesced into the internal structure of indices. The main table is itself organized as a "clustered" index whose rows start with the indexed columns, followed by a non-indexed padding area containing the other column data; or it uses external storage for those extra columns, itself organized as a clustered index on some leading columns, with a padding area containing the other non-indexed columns and a rowid pointing back to the master index/table. This saves a lot of storage space and allows multiple clusterings on the same logical table.)

Clustered indices are supported in Oracle, Sybase, SQL Server, Informix, PostgreSQL and many others (even "smaller" engines like dBase, various RDF stores, or Berkeley DB). MySQL itself (as well as Oracle) allows several backends for table/index storage with different organizations: with or without support for transactions and concurrent access, two-phase commits, or distributed and virtual stores behind an API, such as software registries for various OSes/distributions, local filesystems, mounted remote filesystems like NFS, or REST APIs over HTTP/HTTPS-based protocols or proxies where the data is physically located on one or more nodes, possibly cooperating over P2P protocols, including blockchains for security and cooperative verification of ordered transactions.

However, the proposed DDL does not include any way to tune the backing stores; it just models the "surface", and there is a wide variety of backing stores. Some are integrated into the SQL dialect with varying syntax, while others require installing a dialect translator or connector service, and possibly a transaction manager or coordinator, and the time to commit can vary widely: the model could be optimistic, or could require a full multi-phase commit with extensive work and long delays for rollbacks. I'm not sure that MediaWiki is really tuned to allow long commits, at least not without using asynchronous requests everywhere, i.e. by extending the "job queue" and providing callbacks that inform the frontend later, including the final users. This has severe consequences for usability if users can't see their edits validated and kept (at least in the history of changes, which would need to trace the automated rollbacks), and for cacheability of the delivered content on the web and in clients' browsers, if any commit takes dozens of minutes to become visible, at least in a temporary, not fully validated state. As we've seen in SemanticWiki, the delays can be very long, and with some bugs they can exceed weeks, months or years, while many pages stay locked for an undetermined time, possibly in a really harmful state if that state contains abusive content.

So can we really support multiple backends with a single abstract DDL only? I have doubts. A lot of the existing MediaWiki API would have to be changed to work only at this abstract level and prohibit any direct use of SQL, or the abstraction layer would have to invent and maintain its own SQL dialect (and that will be challenging to maintain, because we will also need to develop "connectors" or "translators"). It is possible as long as we stick with a pure relational model, but we already have other, non-relational models, including models where some data is replicated and needs to be replicated for performance reasons, even if that requires a maintenance task to keep it coherent or to recompute aggregated data, e.g. in a statistical MetaCube or an RDF store, with lots of custom scripts in various languages, notably for P2P-distributed backing stores and complex transactional models, even when they are tuned for the optimistic transaction mode, because of very costly and lengthy rollbacks and their impact on the live system in terms of locks.

And we already have challenging problems with multibase systems (e.g. coordination across wikis inside the WMF projects), and new ones will appear soon for Abstract MediaWiki, where it will be difficult to use a single merged history of events coming from many "coordinated" data sources. A single "pageid" would not be enough; we would need a "sourceid/pageid" reference model for all data, and possibly "sourceid/transactionid/pageid" for coordination, whereas a simple "rowid" will be insufficient, including in indices. The abstract DDL is then probably a dream for now, and not all of our new abstract data model will support it without maintaining a separate set of adapters/converters/coordinators with different capabilities, different delays and different transactional models.

The issue was that MediaWiki did not define a default charset; it thus used whatever the MySQL server has configured. On Debian Jessie that used to be Latin1; on Debian Stretch it changed to utf8mb4.
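
For anyone checking what default their server applies, the server-side charset can be inspected with a standard MySQL/MariaDB query (shown only as a sketch; the Jessie/Stretch values are the ones reported in this task):

SHOW VARIABLES LIKE 'character_set_server';
-- Debian Jessie packages reported latin1 here,
-- Debian Stretch reports utf8mb4.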

Eventually that got addressed by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/461279, which now defaults to a binary charset:

--- a/includes/DefaultSettings.php
+++ b/includes/DefaultSettings.php
@@ -1931,7 +1931,7 @@ $wgDBprefix = '';
 /**
  * MySQL table options to use during installation or update
  */
-$wgDBTableOptions = 'ENGINE=InnoDB';
+$wgDBTableOptions = 'ENGINE=InnoDB, DEFAULT CHARSET=binary';

See T193222#4247329 and following comments for more details.
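
With that default, new third-party installs end up with table options like the following, so the indexed column is stored as VARBINARY and the key is 255 bytes, well under the 767-byte limit (a simplified sketch, not the extension's exact DDL):

CREATE TABLE page_assessments_projects (
  pap_project_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  pap_project_title VARCHAR(255) NOT NULL
) ENGINE=InnoDB, DEFAULT CHARSET=binary;

CREATE UNIQUE INDEX pap_project_title ON page_assessments_projects (pap_project_title);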