Page MenuHomePhabricator

Investigation: Constraints for a database schema to store representations of a Lexeme
Closed, ResolvedPublic

Description

Two new secondary database tables are proposed:

  • wbl_lemmas stores the lemma (a text value) for each Lexeme.
  • wbl_item_references (or similar) stores two Item references (lexical category and language) for each Lexeme.

Indexes

  • For the current use case both tables need a primary index on the Lexeme ID.
  • A fulltext index on the Lemma column is not needed for the current use case, but it is suggested to design the table so one can be added any time later.
  • Indexes on the two Item reference columns are probably never needed. Backreferences from the Items to Lexemes that use these Items will be done via MediaWiki core's link table, and Wikibase usage tracking.

Table size

  • The numbers of rows in the proposed wbl_item_references table is going to be identical to the total number of Lexemes. According https://www.mediawiki.org/wiki/Extension:WikibaseLexeme/Data_Model there is only one Item reference for the lexical category, and only one Item reference for the language. This makes the table significantly different from wb_terms, where each Item can have labels and descriptions in 300+ languages, and users are able to enter as many aliases as they want.
  • A Lexeme can have multiple lemmas, one per language code. (Note: The exact set of language codes is currently unspecified, but is most probably going to be limited to a configured set of a few hundred.) While similar, lemmas are still distinctively different from Item labels. For Item labels users are expected to enter as many as they can, while for lemmas users will only enter multiple lemmas if the Lexeme requires it. Example: The American "color" and British "colour" are supposed to be modeled as lemmas on a single Lexeme. A best-case approximation are two lemmas per Lexeme, a worst-case approximation are ten lemmas per Lexeme.

Item references

  • Item IDs are currently limited to 32 bits (signed). The maximum ID is Q2147483647, which is 11 characters.
  • Some day we might need to switch to 64 bits (still signed). The maximum is then Q9223372036854775807. That's 20 characters.
  • Since we know we are exclusively dealing with Item references, but no other entity types, we could store the references as integers. Is this worth it with regards to performance? Or is an indexed VARBINARY column as efficient as an indexed INT column?

Suggestion is to go with VARBINARY(20) for both columns.

Suggestion is to not reserve space for repository prefixes. Even on a multi-repository setting the Item references used for lexical categories and languages should always come from a single repository.

Lemma column length

  • In contract to Item labels, lemmas are (mostly, but not exclusively) single words only.
  • One of the longest words in an English dictionary is "Supercalifragilisticexpialidocious" (34 characters). Some other candidates for a longest word in other languages are about 70 characters long, see http://mentalfloss.com/article/50611/longest-word-in-the-world.
  • Some chemicals have names with tens of thousands, even hundreds of thousands of characters. What such extreme examples basically mean is: Whatever limit we choose, it will be arbitrary. There will always be exceptions. We must always think about truncation.
  • Rendering lemmas with thousands of characters untruncated in contexts that reference the Lexeme (but are not meant to represent the Lexeme like the Lexeme page itself does) certainly does not make sense. When a lemma is used in the visible text or tooltip of a link, some truncation must happen. Otherwise a single link would span multiple lines or even paragraphs. A trivial truncation algorithm that prooved to be sufficent many times (e.g. in TwoColConflict) is to hard truncate via the database, and apply a CSS ellipsis to hide the hard truncation.
  • If we want to make sure MySQL can index all characters in a VARCHAR column, we should not go beyond VARCHAR(768). See https://phabricator.wikimedia.org/T154660#2936497 for a very closely related discussion.
  • Labels, descriptions, and aliases are currently limited in two ways: to 250 Unicode characters via a setting "multilang-limits", as well as 255 bytes via a VARCHAR(255) in the wb_terms table.
  • If we make sure we are able to expand the table structure later, we could start with VARCHAR(255), and later expand to VARCHAR(768) or further if needed.

I talked to PM (@Lydia_Pintscher) and we established the limit should not be 255 bytes, but 768.

Event Timeline

thiemowmde triaged this task as Medium priority.Feb 20 2018, 1:01 PM
thiemowmde created this task.
thiemowmde moved this task from Backlog to Review on the Wikidata-Sprint-2018-02-14 board.
thiemowmde moved this task from incoming to in progress on the Wikidata board.

There is only one lemma per Lexeme (in only one language)

Don’t we have something to support e. g. “color” and “colour” for the same lexeme? I’m not sure if that’s two lemmas or one lemma (multilingual text) with two spellings, but there seems to be some need for multiple texts and languages per lexeme (though the average number of texts per lexeme will probably only be slightly above one).

Since we know we are exclusively dealing with Item references, but no other entity types, we could store the references as integers.

What about federation? I guess it could still work, at least with the way federation currently works (configured per entity type), where we know that all items reside in one repository.

Am 20.02.2018 um 15:44 schrieb Lucas_Werkmeister_WMDE:

Lucas_Werkmeister_WMDE added a comment.

There is only one lemma per Lexeme (in only one language)

Don’t we have something to support e. g. “color” and “colour” for the same
lexeme? I’m not sure if that’s two lemmas or one lemma (multilingual text) with
two spellings, but there seems to be some need for multiple texts and languages
per lexeme (though the average number of texts per lexeme will probably only be
slightly above one).

Yes, that's one of the standard examples of a Lexeme with multiple lemma
variants. The representations of Forms of that lexeme can then also have
multiple variants (plural: colors/colors).

Variants are used to cover different spellings or alphabets. It's till one Lemma
(and one Form).

Other examples include stopp vs stopp, Tal vs Thal, etc.

Since we know we are exclusively dealing with Item references, but no other
entity types, we could store the references as integers.

What about federation? I guess it could still work, at least with the way
federation currently works (configured per entity type), where we know that all
items reside in one repository.

Federation is indeed currently limited to one repo per entity type. That
limitation is however not an assumption spread throughout the code, but is a
imposed by the implementation details of one or two classes. So it's (in theory)
easy to change. We should be very careful not to rely on this assumption anywhere.

But yes, for now, we rely on all items to reside in a single repo: Wikidata.
That should be sufficient for our use case.

One of the longest words in an English dictionary is "Supercalifragilisticexpialidocious" (34 characters).

General note: English is probably not the best language to look for in the context of long words (even German beats it easily).

In contract to Item labels, the lemma of a Lexeme is (by definition) a single word only.

I am not entirely sure to what conclusion this remark would lead us, but "single word" is pretty relative here. I imagine people might want to model things like idioms as lexemes, meaning there will be e.g. multi-word lemmas etc. That said, I guess 250 characters (i.e. Unicode characters, not bytes) seems like a good initial length limit. So your conclusion here seems reasonable to me.
If we feel like it should be investigated further though, it seems like we could fairly easily get some data from wiktionaries, and I would risk saying that data would give us a pretty good estimation of sizes of things people are going to put in as lexemes and lemmas.

A "word" is also not the most clear term from the linguistic point of view, BTW.

We should fix https://commons.wikimedia.org/wiki/File:Lexeme_data_model.png then, because it very prominently says there is only "one" lemma. It could be this is meant to be interpreted as "one" value that can somehow contain multiple values. I wonder what the benefit of phrasing it like this is.

Regarding number of lemmas per lexeme, @Lucas_Werkmeister_WMDE makes a good point. As far as I remember, @thiemowmde and I talked IRL last week about the number there, and we said something like that the secure guesstimate would be to say the total number of lemmas would be 10*N, where N is a number of lexemes.
While I think this is a fairly reasonable guess, it is still the guess. We might be able to collect more data on this topic if this seems necessary. I personally think it is not :)

I just had a chat with @Ladsgroup and he suggested regarding wbl_lemmas table the following: what about not putting this stuff in the database table but storing all lemmas for display in the cache (or cache them when they're used). I am bit ignorant, but as wbl_lemmas is made for display only (which is my understanding), this seems like a plausible thought. Of course as long as the amount of data is something which is fine for caching.

I am not aware of all constraints, so any thoughts on this @thiemowmde and @daniel?

Personally, I'm totally fine with using any kind of cache, might it be an in-memory one or something else. My worst-case scenario is as follows: Let's say we have 10 million Lexemes, 2 lemmas per Lexeme, 20 bytes per lemma. The cache would need to hold about 0.4 gigabytes. What kind of cache would be ok with such a size?

Our idea would be to have it as memcached.

@WMDE-leszek, something like this would be my draft:

CREATE TABLE IF NOT EXISTS wbl_lexemes (
  lex_lexeme_id VARBINARY(20) NOT NULL PRIMARY KEY,
  lex_lexical_category_id VARBINARY(20) NOT NULL,
  lex_language_item_id VARBINARY(20) NOT NULL
);

CREATE TABLE IF NOT EXISTS wbl_lemmata (
  lem_lexeme_id VARBINARY(20) NOT NULL,
  lem_language_code VARBINARY(32) NOT NULL,
  lem_lemma VARCHAR(768) BINARY NOT NULL
);
CREATE UNIQUE INDEX wbl_lem_lexeme_id_language_code ON wbl_lemmata (lem_lexeme_id, lem_language_code);

(Minor comment – the MediaWiki database coding conventions prefer singular table names, i. e. wbl_lexeme and wbl_lemma. But I don’t know if there’s a different convention within Wikibase.)

(Also, is the use of VARCHAR BINARY instead of VARBINARY for lem_lemma intentional? I don’t know what the difference between them is.)

  • wb_terms is plural. Most MediaWiki core tables are plural. I also like plural names for tables more. But in the end it really does not matter.
  • I used VARBINARY and VARCHAR BINARY as they currently are on other Wikibase tables. From https://dev.mysql.com/doc/refman/5.7/en/binary-varbinary.html: "The […] VARBINARY data types [is] distinct from the […] VARCHAR BINARY data type. […] the BINARY attribute does not cause the column to be treated as a binary string column. Instead, it causes the binary (_bin) collation for the column character set to be used, and the column itself contains nonbinary character strings rather than binary byte strings." https://dev.mysql.com/doc/refman/5.7/en/charset-binary-collations.html explains this in much more detail. Based on this I believe whats suggested above is correct: Use VARBINARY for Item IDs that are known to not contain multi-byte characters, but VARCHAR BINARY for values that will contain multi-byte characters.

Estimated table sizes:

  • wbl_lexemes
    • The latest Item ID is currently Q49977198. Thats 9 bytes.
    • 9 * 3 = 27 bytes per row.
    • 27 * 1 million Lexemes = 26 megabytes.
  • wbl_lemmata
    • Lexeme IDs will be similar to Item IDs, so 9 bytes again.
    • Lets say language codes are 5 bytes on average (e.g. stuff like "en-gb").
    • Lets say lemmas are 15 characters on average (see http://www.ravi.io/language-word-lengths).
    • Lemmas will use multi-byte UTF-8 characters in many cases. I suggest to assume a factor of 4 bytes per character, just to be sure.
    • Lets say a Lexeme does have 2 lemmas on average.
    • ( 9 + 5 + ( 15 * 4 ) ) * 2 * 1 million Lexemes = 141 megabytes.

This investigation has been concluded with stating that the table structure as outlined here will be used, with possible minor adjustments (table names, etc).
Before proceeding with the final design of the tables, and implementing the related code, it has been to decided to estimate whether the index-based display of lexemes is necessary for the needs of display in the statements at this point, or whether it could be prolonged until the later stage, see T188108.