Two new secondary database tables are proposed:
- wbl_lemmas stores the lemma (a text value) for each Lexeme.
- wbl_item_references (or similar) stores two Item references (lexical category and language) for each Lexeme.
- For the current use case both tables need a primary index on the Lexeme ID.
- A fulltext index on the Lemma column is not needed for the current use case, but it is suggested to design the table so one can be added any time later.
- Indexes on the two Item reference columns are probably never needed. Backreferences from the Items to Lexemes that use these Items will be done via MediaWiki core's link table, and Wikibase usage tracking.
- The numbers of rows in the proposed wbl_item_references table is going to be identical to the total number of Lexemes. According https://www.mediawiki.org/wiki/Extension:WikibaseLexeme/Data_Model there is only one Item reference for the lexical category, and only one Item reference for the language. This makes the table significantly different from wb_terms, where each Item can have labels and descriptions in 300+ languages, and users are able to enter as many aliases as they want.
- A Lexeme can have multiple lemmas, one per language code. (Note: The exact set of language codes is currently unspecified, but is most probably going to be limited to a configured set of a few hundred.) While similar, lemmas are still distinctively different from Item labels. For Item labels users are expected to enter as many as they can, while for lemmas users will only enter multiple lemmas if the Lexeme requires it. Example: The American "color" and British "colour" are supposed to be modeled as lemmas on a single Lexeme. A best-case approximation are two lemmas per Lexeme, a worst-case approximation are ten lemmas per Lexeme.
- Item IDs are currently limited to 32 bits (signed). The maximum ID is Q2147483647, which is 11 characters.
- Some day we might need to switch to 64 bits (still signed). The maximum is then Q9223372036854775807. That's 20 characters.
- Since we know we are exclusively dealing with Item references, but no other entity types, we could store the references as integers. Is this worth it with regards to performance? Or is an indexed VARBINARY column as efficient as an indexed INT column?
Suggestion is to go with VARBINARY(20) for both columns.
Suggestion is to not reserve space for repository prefixes. Even on a multi-repository setting the Item references used for lexical categories and languages should always come from a single repository.
Lemma column length
- In contract to Item labels, lemmas are (mostly, but not exclusively) single words only.
- One of the longest words in an English dictionary is "Supercalifragilisticexpialidocious" (34 characters). Some other candidates for a longest word in other languages are about 70 characters long, see http://mentalfloss.com/article/50611/longest-word-in-the-world.
- Some chemicals have names with tens of thousands, even hundreds of thousands of characters. What such extreme examples basically mean is: Whatever limit we choose, it will be arbitrary. There will always be exceptions. We must always think about truncation.
- Rendering lemmas with thousands of characters untruncated in contexts that reference the Lexeme (but are not meant to represent the Lexeme like the Lexeme page itself does) certainly does not make sense. When a lemma is used in the visible text or tooltip of a link, some truncation must happen. Otherwise a single link would span multiple lines or even paragraphs. A trivial truncation algorithm that prooved to be sufficent many times (e.g. in TwoColConflict) is to hard truncate via the database, and apply a CSS ellipsis to hide the hard truncation.
- If we want to make sure MySQL can index all characters in a VARCHAR column, we should not go beyond VARCHAR(768). See https://phabricator.wikimedia.org/T154660#2936497 for a very closely related discussion.
- Labels, descriptions, and aliases are currently limited in two ways: to 250 Unicode characters via a setting "multilang-limits", as well as 255 bytes via a VARCHAR(255) in the wb_terms table.
- If we make sure we are able to expand the table structure later, we could start with VARCHAR(255), and later expand to VARCHAR(768) or further if needed.
I talked to PM (@Lydia_Pintscher) and we established the limit should not be 255 bytes, but 768.