Change Details

== Table size * WIP == Item references * WIP == Lemma column length * The numbers of rows in the proposed tables is going to be identical to the total number of Lexemes. There is only one lemma per Lexeme (in only one language), only one Item reference for the lexical category, and only one Item reference for the language. This makes the two database tables proposed here significantly different from wb_terms, where each Item can have labels and descriptions in 300+ languages, and users are able to enter as many aliases as they want. * In contract to Item labels, the lemma of a Lexeme is (by definition) a single word only. * One of the longest words in an English dictionary is "Supercalifragilisticexpialidocious" (34 characters). * Some chemicals have names with tens of thousands, even hundreds of thousands of characters. What such extreme examples basically mean is: Whatever limit we choose, it will be arbitrary. There will always be exceptions. We must always think about truncation. * Rendering lemmas with thousands of characters untruncated in contexts that reference the Lexeme (but are not meant to represent the Lexeme like the Lexeme page itself does) certainly does not make sense. When a lemma is used in the visible text or tooltip of a link, some truncation must happen. Otherwise a single link would span multiple lines or even paragraphs. A trivial truncation algorithm that prooved to be sufficent many times (e.g. in TwoColConflict) is to hard truncate via the database, and apply a CSS ellipsis to hide the hard truncation. * If we want to make sure MySQL can index all characters in a VARCHAR column, we should not go beyond VARCHAR(768). See https://phabricator.wikimedia.org/T154660#2936497 for a very closely related discussion. * Labels, descriptions, and aliases are currently limited in two ways: to 250 Unicode characters via a setting "multilang-limits", as well as 255 bytes via a VARCHAR(255) in the wb_terms table. * If we make sure we are able to expand the table structure later, we could start with VARCHAR(255), and later expand to VARCHAR(768) or further if needed. I talked to PM (@Lydia_Pintscher) and we established the limit should **not** be 255, but 768.

Two new secondary database tables are proposed: * wbl_lemmas stores the lemma (a text value) for each Lexeme. * wbl_item_references (or similar) stores two Item references (lexical category and language) for each Lexeme. == Indexes * For the current use case both tables need a primary index on the Lexeme ID. * A fulltext index on the Lemma column is not needed for the current use case, but it is suggested to design the table so one can be added any time later. * Indexes on the two Item reference columns are probably never needed. Backreferences from the Items to Lexemes that use these Items will be done via MediaWiki core's link table, and Wikibase usage tracking. == Table size * The numbers of rows in the proposed tables is going to be identical to the total number of Lexemes. There is only one lemma per Lexeme (in only one language), only one Item reference for the lexical category, and only one Item reference for the language. This makes the two database tables proposed here significantly different from wb_terms, where each Item can have labels and descriptions in 300+ languages, and users are able to enter as many aliases as they want. == Item references * Item IDs are currently limited to 32 bits (signed). The maximum ID is Q2147483647, which is 11 characters. * Some day we might need to switch to 64 bits (still signed). The maximum is then Q9223372036854775807. That's 20 characters. * Since we know we are exclusively dealing with Item references, but no other entity types, we could store the references as integers. Is this worth it with regards to performance? Or is an indexed VARCHAR column as efficient as an indexed INT column? Suggestion is to go with VARCHAR(20) for both columns. == Lemma column length * In contract to Item labels, the lemma of a Lexeme is (by definition) a single word only. * One of the longest words in an English dictionary is "Supercalifragilisticexpialidocious" (34 characters). * Some chemicals have names with tens of thousands, even hundreds of thousands of characters. What such extreme examples basically mean is: Whatever limit we choose, it will be arbitrary. There will always be exceptions. We must always think about truncation. * Rendering lemmas with thousands of characters untruncated in contexts that reference the Lexeme (but are not meant to represent the Lexeme like the Lexeme page itself does) certainly does not make sense. When a lemma is used in the visible text or tooltip of a link, some truncation must happen. Otherwise a single link would span multiple lines or even paragraphs. A trivial truncation algorithm that prooved to be sufficent many times (e.g. in TwoColConflict) is to hard truncate via the database, and apply a CSS ellipsis to hide the hard truncation. * If we want to make sure MySQL can index all characters in a VARCHAR column, we should not go beyond VARCHAR(768). See https://phabricator.wikimedia.org/T154660#2936497 for a very closely related discussion. * Labels, descriptions, and aliases are currently limited in two ways: to 250 Unicode characters via a setting "multilang-limits", as well as 255 bytes via a VARCHAR(255) in the wb_terms table. * If we make sure we are able to expand the table structure later, we could start with VARCHAR(255), and later expand to VARCHAR(768) or further if needed. I talked to PM (@Lydia_Pintscher) and we established the limit should **not** be 255, but 768

Two new secondary database tables are proposed: * wbl_lemmas stores the lemma (a text value) for each Lexeme. * wbl_item_references (or similar) stores two Item references (lexical category and language) for each Lexeme. == Indexes * For the current use case both tables need a primary index on the Lexeme ID. * A fulltext index on the Lemma column is not needed for the current use case, but it is suggested to design the table so one can be added any time later. * Indexes on the two Item reference columns are probably never needed. Backreferences from the Items to Lexemes that use these Items will be done via MediaWiki core's link table, and Wikibase usage tracking. == Table size * WIPThe numbers of rows in the proposed tables is going to be identical to the total number of Lexemes. There is only one lemma per Lexeme (in only one language), only one Item reference for the lexical category, and only one Item reference for the language. This makes the two database tables proposed here significantly different from wb_terms, where each Item can have labels and descriptions in 300+ languages, and users are able to enter as many aliases as they want. == Item references * WIPItem IDs are currently limited to 32 bits (signed). The maximum ID is Q2147483647, which is 11 characters. * Some day we might need to switch to 64 bits (still signed). The maximum is then Q9223372036854775807. That's 20 characters. * Since we know we are exclusively dealing with Item references, but no other entity types, we could store the references as integers. Is this worth it with regards to performance? Or is an indexed VARCHAR column as efficient as an indexed INT column? Suggestion is to go with VARCHAR(20) for both columns. == Lemma column length * The numbers of rows in the proposed tables is going to be identical to the total number of Lexemes. There is only one lemma per Lexeme (in only one language), only one Item reference for the lexical category, and only one Item reference for the language. This makes the two database tables proposed here significantly different from wb_terms, where each Item can have labels and descriptions in 300+ languages, and users are able to enter as many aliases as they want. * In contract to Item labels, the lemma of a Lexeme is (by definition) a single word only. * One of the longest words in an English dictionary is "Supercalifragilisticexpialidocious" (34 characters). * Some chemicals have names with tens of thousands, even hundreds of thousands of characters. What such extreme examples basically mean is: Whatever limit we choose, it will be arbitrary. There will always be exceptions. We must always think about truncation. * Rendering lemmas with thousands of characters untruncated in contexts that reference the Lexeme (but are not meant to represent the Lexeme like the Lexeme page itself does) certainly does not make sense. When a lemma is used in the visible text or tooltip of a link, some truncation must happen. Otherwise a single link would span multiple lines or even paragraphs. A trivial truncation algorithm that prooved to be sufficent many times (e.g. in TwoColConflict) is to hard truncate via the database, and apply a CSS ellipsis to hide the hard truncation. * If we want to make sure MySQL can index all characters in a VARCHAR column, we should not go beyond VARCHAR(768). See https://phabricator.wikimedia.org/T154660#2936497 for a very closely related discussion. * Labels, descriptions, and aliases are currently limited in two ways: to 250 Unicode characters via a setting "multilang-limits", as well as 255 bytes via a VARCHAR(255) in the wb_terms table. * If we make sure we are able to expand the table structure later, we could start with VARCHAR(255), and later expand to VARCHAR(768) or further if needed. I talked to PM (@Lydia_Pintscher) and we established the limit should **not** be 255, but 768.