
Investigation: How to format lexeme to be display in a statement value efficiently
Closed, Resolved (Public)

Description

This task follows T185332, which provided a possible way to display a formatted lexeme in a statement.

The approach outlined in T185332, however, requires the "whole" lexeme to be loaded from storage in order to display it.
This is not efficient enough at production scale.
Preferably, only the actual "labels" would be loaded from storage, and if the "labels" of multiple entities must be loaded, they would be loaded only once, all in a single batch.

The approach to be investigated is storing the data relevant for displaying lexemes in some data structure, possibly something resembling wb_terms for items and properties.
Data needed for searching for lexemes is NOT going to be stored there.

The outcome of the investigation would be some code that outlines a possible implementation (intentionally simplified) and makes it possible to answer the following questions: whether the approach is viable; whether localizable data (e.g. labels of language and lexical category items) is stored in a sufficiently scalable manner; whether the approach would also be satisfactory for other uses where a lexeme is "displayed" in some way; and/or whether the approach also "scales" to the similar use case of forms and senses.

Planned time of investigation is one day.

The alternative, which is out of scope of this investigation, would be to store the needed data in the wb_terms table somehow.

Event Timeline

WMDE-leszek triaged this task as High priority.
WMDE-leszek created this task.
WMDE-leszek moved this task from Backlog to In Progress on the Wikidata-Sprint-2018-01-31 board.
WMDE-leszek renamed this task from Investigation: How to format lexeme to be display in a statement value efficiently to Investigation: How to format lexeme to be display in a statement value efficiently (days: 1). Feb 8 2018, 8:28 AM
WMDE-leszek renamed this task from Investigation: How to format lexeme to be display in a statement value efficiently (days: 1) to Investigation: How to format lexeme to be display in a statement value efficiently (days: 2). Feb 8 2018, 11:42 AM
WMDE-leszek renamed this task from Investigation: How to format lexeme to be display in a statement value efficiently (days: 2) to Investigation: How to format lexeme to be display in a statement value efficiently (days: 3). Feb 12 2018, 9:26 AM

Change 409830 had a related patch set uploaded (by WMDE-leszek; owner: WMDE-leszek):
[mediawiki/extensions/WikibaseLexeme@master] [DNM] Add SQL-index-based LexemePresenter

https://gerrit.wikimedia.org/r/409830

WMDE-leszek renamed this task from Investigation: How to format lexeme to be display in a statement value efficiently (days: 3) to Investigation: How to format lexeme to be display in a statement value efficiently. Feb 12 2018, 10:56 AM
WMDE-leszek renamed this task from Investigation: How to format lexeme to be display in a statement value efficiently to Investigation: How to format lexeme to be display in a statement value efficiently (days: 1). Feb 12 2018, 11:06 AM

Regarding the patch: it is not meant as a perfect solution, but rather to give a general overview of the possible options.

I’ve tried to look at the patch (and chain) and I think I’m starting to understand your proposal…

Question: if I remember correctly, there was some talk about fetching terms / lemmas in bulk, so that on a page listing 50 lexemes you wouldn’t make 3×50 database requests (lemma text, language item label and lexical category item label for each lexeme). I’m not seeing anything in that direction in the patch – is that correct or am I missing something? (Not judging, just checking :) )

So yes, the patch handles a single lexeme only. Bulk fetching would be the next step. You're right, there is no POC for this; it has been left to the imagination of the careful reader.

WMDE-leszek renamed this task from Investigation: How to format lexeme to be display in a statement value efficiently (days: 1) to Investigation: How to format lexeme to be display in a statement value efficiently (days: 2). Feb 13 2018, 11:08 AM

Story time questions

Questions we collected on 2018-02-06, copied from the PM/Engineering time document for reference:

  • Q: When listing multiple lemmas, how does listing work?
    • A: PM says “unordered” is fine for now, until otherwise demanded. "Unordered" currently means the order in which the lemmas were created.
  • Q: In which language is the “/” between multiple lemmas?
    • A: PM suggests using the user's language.
  • Q: Do we ever need a derived label to contain links? Or is plain text always enough?
    • A: PM thinks links inside a derived label are almost always more confusing than helpful. Possible exception: Summaries. But this is outside of the scope of the current story.
  • Q: How to apply language fallbacks on the individual parts?
    • A: PM wants this to be consistent with how fallback chains work everywhere else: all fall back to English, or the item ID if English is missing.
  • Q: Store “Ladder (English, Noun)” as one string, or individually?
    • A: Must be stored individually, for various reasons. One is that the individual elements must be marked with <span lang="…">…</span>.
  • Q: Store “English, Noun” as one string, or individually?
    • A: PM does not care that much. Probably needs to be stored individually for the same reason as above.
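
The answers above (parts stored individually, each wrapped in its own lang-tagged span, fallback to English and then to the item ID) can be illustrated with a minimal Python sketch. It is not taken from the patch: the function names, the dictionary shapes, and the item IDs Q111 and Q222 are hypothetical, and the fallback is deliberately reduced to "user language, then English, then item ID" rather than a full fallback chain.

```python
from html import escape


def pick_with_fallback(labels, user_lang, item_id):
    """Return (text, lang): user language first, then English, then the bare item ID."""
    for lang in (user_lang, "en"):
        if lang in labels:
            return labels[lang], lang
    return item_id, None  # no usable label: show the item ID, without a lang attribute


def span(text, lang):
    """Wrap text in <span lang="..."> so each part carries its own language."""
    if lang is None:
        return escape(text)
    return '<span lang="{}">{}</span>'.format(escape(lang), escape(text))


def format_lexeme(lemmas, language_item, category_item, user_lang, separator="/"):
    """Build e.g. 'ladder (English, noun)' from individually stored parts.

    lemmas is a list of (text, lang) pairs in creation order; separator is
    assumed to have been chosen for the user's language by the caller.
    """
    lemma_html = separator.join(span(text, lang) for text, lang in lemmas)
    language = span(*pick_with_fallback(language_item["labels"], user_lang, language_item["id"]))
    category = span(*pick_with_fallback(category_item["labels"], user_lang, category_item["id"]))
    return "{} ({}, {})".format(lemma_html, language, category)


# Hypothetical example data: Q111 and Q222 are placeholder item IDs.
english = {"id": "Q111", "labels": {"en": "English", "de": "Englisch"}}
noun = {"id": "Q222", "labels": {"en": "noun"}}
print(format_lexeme([("ladder", "en")], english, noun, user_lang="de"))
```

In this sketch a German-speaking user gets the German label for the language item, while the lexical category falls back to English because no German label exists in the example data.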

Other questions not relevant for PM:

  • Q: Store derived labels for all languages we support in advance?
  • Q: We are going to have stuff like “English, Noun” repeated a lot. Is it worth optimizing the storage layer for duplications?
  • Q: Can the same solution we investigate here work for MediaInfo?
  • Q: Can the solution we investigate here replace Label/DescriptionLookups in Wikibase? See T163538.

Proof of concept

My review of https://gerrit.wikimedia.org/r/409830 and related:

  • Two new secondary tables are introduced:
    • One stores the individual lemmas from a Lexeme, as strings. These can be used directly, similar to how wb_terms is used.
    • One stores the lexical categories and languages, as item IDs. These references are used to query wb_terms, where the Item labels are stored.
  • The current implementation does not do any prefetching for multiple Lexeme references.
  • It also does not do any prefetching for multiple Item references.
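
To make the two-table layout described above concrete, here is a rough Python/SQLite sketch. It is not the schema from the patch: all table and column names are hypothetical, and item_labels is only a simplified stand-in for wb_terms. Like the proof of concept, it looks up one lexeme at a time and does no prefetching.

```python
import sqlite3

# Hypothetical schema; names are illustrative, not the ones used in the patch.
SCHEMA = """
CREATE TABLE lexeme_lemmas (        -- one row per lemma, stored as a plain string
    lexeme_id  TEXT NOT NULL,
    lemma_lang TEXT NOT NULL,
    lemma_text TEXT NOT NULL
);
CREATE INDEX lexeme_lemmas_id ON lexeme_lemmas (lexeme_id);
CREATE TABLE lexeme_item_refs (     -- one row per lexeme, item IDs only
    lexeme_id     TEXT PRIMARY KEY,
    language_item TEXT NOT NULL,
    category_item TEXT NOT NULL
);
CREATE TABLE item_labels (          -- simplified stand-in for wb_terms
    item_id    TEXT NOT NULL,
    label_lang TEXT NOT NULL,
    label_text TEXT NOT NULL,
    PRIMARY KEY (item_id, label_lang)
);
"""


def open_demo_db():
    """Create an in-memory database with one example lexeme (placeholder IDs)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO lexeme_lemmas VALUES ('L1', 'en', 'ladder')")
    conn.execute("INSERT INTO lexeme_item_refs VALUES ('L1', 'Q111', 'Q222')")
    conn.executemany(
        "INSERT INTO item_labels VALUES (?, ?, ?)",
        [("Q111", "en", "English"), ("Q222", "en", "noun")],
    )
    conn.commit()
    return conn


def load_display_data(conn, lexeme_id, label_lang="en"):
    """Fetch display data for ONE lexeme: three queries, no prefetching."""
    lemmas = conn.execute(
        "SELECT lemma_lang, lemma_text FROM lexeme_lemmas"
        " WHERE lexeme_id = ? ORDER BY rowid",  # rowid order approximates creation order
        (lexeme_id,),
    ).fetchall()
    refs = conn.execute(
        "SELECT language_item, category_item FROM lexeme_item_refs WHERE lexeme_id = ?",
        (lexeme_id,),
    ).fetchone()
    if refs is None:
        return None  # unknown lexeme
    labels = dict(conn.execute(
        "SELECT item_id, label_text FROM item_labels"
        " WHERE label_lang = ? AND item_id IN (?, ?)",
        (label_lang, refs[0], refs[1]),
    ).fetchall())
    return {
        "lemmas": lemmas,
        "language": labels.get(refs[0], refs[0]),  # fall back to the item ID
        "category": labels.get(refs[1], refs[1]),
    }


print(load_display_data(open_demo_db(), "L1"))
```

Called once per lexeme, this costs three queries each, which is exactly the 3×50 pattern raised in the question earlier in this task.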

My impression is that this approach is the one we should follow, and build up as we need to. Things to consider:

  • Can we estimate how big the two new secondary tables might grow?
  • We must think about prefetching or something else to avoid querying the database one (or even multiple) times for each Lexeme reference individually. Can we already write down a story and actionable tasks for this?

WMDE-leszek claimed this task.
WMDE-leszek moved this task from Review to Done on the Wikidata-Sprint-2018-01-31 board.

I believe this concludes this investigation. The next step would be creating the actual implementation, and one of the requirements would be to somehow "batch" the fetching of data for all lexemes on the page, to minimize the number of DB queries.
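
One way the "batch" requirement could look is sketched below in Python with SQLite. This is an assumption about a possible shape, not something the patch contains: table and column names are hypothetical, item_labels stands in for wb_terms, and Q111/Q222 are placeholder item IDs. The point is that the number of queries stays at three regardless of how many lexemes are on the page.

```python
import sqlite3


def make_demo_db():
    """In-memory database with two example lexemes sharing language and category."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE lexeme_lemmas (lexeme_id TEXT, lemma_lang TEXT, lemma_text TEXT);
        CREATE TABLE lexeme_item_refs (lexeme_id TEXT PRIMARY KEY,
                                       language_item TEXT, category_item TEXT);
        CREATE TABLE item_labels (item_id TEXT, label_lang TEXT, label_text TEXT);
    """)
    conn.executemany("INSERT INTO lexeme_lemmas VALUES (?, ?, ?)", [
        ("L1", "en", "ladder"), ("L2", "en-gb", "colour"), ("L2", "en", "color"),
    ])
    conn.executemany("INSERT INTO lexeme_item_refs VALUES (?, ?, ?)", [
        ("L1", "Q111", "Q222"), ("L2", "Q111", "Q222"),
    ])
    conn.executemany("INSERT INTO item_labels VALUES (?, ?, ?)", [
        ("Q111", "en", "English"), ("Q222", "en", "noun"),
    ])
    conn.commit()
    return conn


def placeholders(count):
    """Return '?, ?, ?' with one bound-parameter marker per value."""
    return ", ".join("?" * count)


def load_many(conn, lexeme_ids, label_lang="en"):
    """Fetch display data for all given lexemes with three queries in total."""
    ids = list(dict.fromkeys(lexeme_ids))  # de-duplicate, keep order
    if not ids:
        return {}
    marks = placeholders(len(ids))
    result = {i: {"lemmas": [], "language": None, "category": None} for i in ids}

    # Query 1: all lemmas of all requested lexemes.
    for lexeme_id, lang, text in conn.execute(
        f"SELECT lexeme_id, lemma_lang, lemma_text FROM lexeme_lemmas"
        f" WHERE lexeme_id IN ({marks}) ORDER BY rowid", ids
    ):
        result[lexeme_id]["lemmas"].append((lang, text))

    # Query 2: language and lexical category item IDs of all requested lexemes.
    refs = conn.execute(
        f"SELECT lexeme_id, language_item, category_item FROM lexeme_item_refs"
        f" WHERE lexeme_id IN ({marks})", ids
    ).fetchall()

    # Query 3: each distinct item label is loaded once, however often it is referenced.
    item_ids = sorted({item for _, lang_item, cat_item in refs for item in (lang_item, cat_item)})
    labels = {}
    if item_ids:
        labels = dict(conn.execute(
            f"SELECT item_id, label_text FROM item_labels"
            f" WHERE label_lang = ? AND item_id IN ({placeholders(len(item_ids))})",
            [label_lang, *item_ids],
        ).fetchall())

    for lexeme_id, lang_item, cat_item in refs:
        result[lexeme_id]["language"] = labels.get(lang_item, lang_item)
        result[lexeme_id]["category"] = labels.get(cat_item, cat_item)
    return result


print(load_many(make_demo_db(), ["L1", "L2", "L1"]))
```

A real implementation would also need to chunk very long ID lists, since databases limit the number of bound parameters per statement.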

Regarding the estimation of the table size: a very rough but safe estimate would be, IMO, that the lemma table would be of size 10 × the number of all lexemes, and the item reference (language and lexical category) table would have exactly as many rows as there are lexemes.
More accurate and thorough estimates can be provided once we have an actual DB schema draft (I don't consider the proof-of-concept code to be one) and have discussed it with people with more DB expertise.
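
As a quick illustration of that rough estimate, the arithmetic can be written down as follows. The sketch assumes that "size" means row count and treats the factor 10 as an upper bound per lexeme; the figure of one million lexemes is a made-up input, not a measurement.

```python
def estimate_rows(num_lexemes, lemma_rows_per_lexeme=10):
    """Rough row-count estimate for the two secondary tables.

    lemma_rows_per_lexeme=10 is the rough factor from the estimate above;
    the item reference table has exactly one row per lexeme.
    """
    if num_lexemes < 0:
        raise ValueError("num_lexemes must not be negative")
    return {
        "lemma_rows": num_lexemes * lemma_rows_per_lexeme,
        "item_ref_rows": num_lexemes,
    }


# Hypothetical input: one million lexemes.
print(estimate_rows(1_000_000))
```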

WMDE-leszek renamed this task from Investigation: How to format lexeme to be display in a statement value efficiently (days: 2) to Investigation: How to format lexeme to be display in a statement value efficiently. Feb 14 2018, 9:39 AM
WMDE-leszek removed a project: Patch-For-Review.

Change 409830 abandoned by WMDE-leszek:
[DNM] Add SQL-index-based LexemePresenter

https://gerrit.wikimedia.org/r/409830