|mediawiki/extensions/Wikibase : master||Fix truncated terms on the fly|
|Open||None||T142691 [Bug] wb_terms table truncates labels exceeding 255 bytes, possibly leaving invalid UTF-8|
|Resolved||thiemowmde||T142820 [Task] Decide on quick fix for "Max length of reference label in russian"|
- Mentioned In
- T162562: Choose a field length for entity id columns and stick to it
T225805: DatabaseTermIdsAcquirer fails on terms longer than 255 bytes
T208425: [EPIC] Kill the wb_terms table
T145932: [Bug] Special:NewItem does not validate maximum label length
T142820: [Task] Decide on quick fix for "Max length of reference label in russian"
- Mentioned Here
- T198866: Uses of wb_terms SQL table to be migrated away from
T75087: Efficient entity label lookup (tracking)
T86530: Replace wb_terms table with more specialized mechanisms for terms (tracking)
T99459: ips_site_page is too short to store some (full) page titles
T125500: [Epic] Index Wikidata labels and descriptions as separate fields in ElasticSearch
There are multiple things to consider here here:
- The label, description and alias fields are limited to 250 characters (see multilang-limits in Wikibase.default.php). Same for monolingual terms. But string values are limited to 400 characters. I'm not absolutely sure all this is as intended.
- We are currently using the term_text column from the wb_terms table to display these labels. It's set to varbinary(255). But:
- There are multiple problems with touching the term_text field:
- There is no absolute maximum byte length a 250 characters UTF-8 string can have. A sequence of emoji may even exceed 1000 bytes.
- [Solution: Increase length] but does MySQL still support all indexing features when we increase the length?
- [Solution: Use VARCHAR] But does MySQL still support all indexing features when we change the field from VARBINARY to VARCHAR?
- [Solution: Fallback to retrieving lookup] A possible workaround is to check the length of the strings returned by the EntityInfoTermLookup. If it's exactly 255 characters we repeat the lookup a second time with the EntityRetrievingTermLookup. (Or to add this fallback behavior directly to the EntityInfoTermLookup) The "retrieving" lookup was the one we used for all label lookups a few months ago. We started using the term table based lookup for performance reasons, to avoid deserializing hundreds of entity blobs just because we need a few labels.
- [Solution: Elastic lookup] Long term solution should be to "drop" the terms table (read: do not use it any more) in favor of a proper elastic based index (see T125500: [Epic] Index Wikidata labels and descriptions as separate fields in ElasticSearch).
We have 2133 affected terms:
MariaDB [wikidatawiki_p]> SELECT COUNT(*) FROM wb_terms WHERE LENGTH( term_text) > 254; +----------+ | COUNT(*) | +----------+ | 2133 | +----------+ 1 row in set (8 min 15.88 sec)
Some quick notes:
In repo/sql/Wikibase.sql, this is declared as term_text VARCHAR(255) BINARY NOT NULL, which is NOT the same as VARBINARY. VARCHAR(255) BINARY should use binary collation for UTF8 data, any would allow 255 unicode characters to be stored. The field was apparently changed to VARBINARY during deployment, possibly for backwards compat with MySQL4.
How ever we work around the issue that this field is VARBINARY(255) on the live system, we should strip any broken UTF8 from the end of the string, using StringNormalizer::removeBadCharLast.
It's VARBINARY in both my local database and on the live system (we can see it truncates at 255 bytes). I don't know how this happened. I assume the meaning of VARCHAR BINARY changed. It appears this was an alias for VARBINARY in the MySQL versions we use.
Here is an example search: https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%D0%92%D1%8B%D1%81%D0%BE%D1%87%D0%B0%D0%B9%D1%88%D0%B8%D0%B9+%D0%BC%D0%B0%D0%BD%D0%B8%D1%84%D0%B5%D1%81%D1%82+1+%D0%BC%D0%B0%D1%80%D1%82%D0%B0+1881+%D0%B3%D0%BE%D0%B4&format=json&language=ru&uselang=ru&type=item. Note that the search result ends with \ufffd. This should be stripped, as @daniel suggests.
During story time we found that:
- Certain closed subtasks of T75087: Efficient entity label lookup (tracking) are probably the reason for this bug.
- The remaining open subtasks of T75087 will solve this specific bug.
- A much wider parent task T86530: Replace wb_terms table with more specialized mechanisms for terms (tracking) also exists, outlining a series of individual replacements necessary to get rid of the terms table.
We said that we will, for now, solve this bug by implementing the "fallback" workaround described in T142691#2543325.