Get baseline measurements/expectations for splitting lexemes from Wikidata graph
Closed, Resolved | Public | 3 Estimated Story Points

Description

As a product manager for Wikidata and WDQS, I want to know what quantifiable benefits to service reliability and quality I might expect to gain (or lose) by splitting Lexemes out from the Wikidata graph, so that I can decide whether to move ahead with this plan and how to communicate it.

In order to move ahead with splitting out Lexemes from WD, communicate this decision, and set expectations around the benefits of implementing this change, we should get some baseline measurements of the current state of Lexemes in Wikidata and WDQS, and estimates about the effects of splitting them off.

AC:
Get the numbers for the following metrics:

  • percentage, number of Wikidata entities that are Lexemes
  • percentage, number of WDQS queries per month that involve Lexemes
    • percentage, number of the above queries that only involve Lexemes (i.e. that don't require anything from the larger Wikidata graph)
  • percentage, number of Lexemes that are connected to non-Lexeme items in WD
  • given the current rate of growth of Wikidata, approximately how much time it would take for non-Lexeme Wikidata to grow back to its current size
  • potential upper limit of how many Lexemes there could be

Summary of results from this ticket: https://docs.google.com/document/d/1N2ludK2QllzndrlQiQ7c6V1dT3NZBiABQL_kZH1P5Io/edit?usp=sharing

Event Timeline

percentage, number of Wikidata entities that are Lexemes

Wikidata Datamodel currently has 92212288 Items, 404946 Lexemes, and 8450 Properties, so that would be 0.4% Lexemes across all “top-level” entities. If we include Forms and Senses in the count (7053785 and 90003, respectively), all lexicographical entities make up 7.6% of all entities.
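
For anyone who wants to re-run this arithmetic with fresher counts, here is a minimal sketch. The counts are the ones quoted above; the variable and function names are only illustrative.

```python
# Entity counts as quoted in the comment above.
ITEMS = 92_212_288
LEXEMES = 404_946
PROPERTIES = 8_450
FORMS = 7_053_785
SENSES = 90_003


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Lexemes as a share of "top-level" entities (Items, Lexemes, Properties).
top_level = ITEMS + LEXEMES + PROPERTIES
lexeme_share = share(LEXEMES, top_level)

# All lexicographical entities (Lexemes, Forms, Senses) as a share of everything.
lexicographical = LEXEMES + FORMS + SENSES
all_entities = top_level + FORMS + SENSES
lexicographical_share = share(lexicographical, all_entities)

print(f"Lexemes among top-level entities: {lexeme_share:.1f}%")
print(f"Lexicographical entities among all entities: {lexicographical_share:.1f}%")
```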

percentage, number of Lexemes that are connected to non-Lexeme items in WD

Every Lexeme is connected to at least two Items: its language and its lexical category. Additionally, Forms typically have several grammatical feature Items.

potential upper limit of how many Lexemes there could be

My earlier napkin math on this:

For the record, as hard as it is to quantify words/lemmas/lexemes cross-linguistically, I think one needs to know on the order of 10^4 words in a language as a typical speaker, with the upper bound of that on the order of 10^5 [1]. There are ~7k living languages in the world today (~300 Wikipedias), so 10^5 x 10^3 = 10^8 is the very generous upper bound for lexemes according to my count, and 10^4 x 10^2 = 10^6 is on the more realistic end of total wiki coverage. Trey can correct my napkin math 😅. Anyway, 10^8 I think is the order of magnitude of total Wikidata items currently (including lexemes), so there is some potential for lexemes to comprise a reasonably sized subgraph if people use it.

with Trey's follow up:

As for Mike's napkin math, it seems to be in the right ballpark for sure—but... (many) lexicographers are by their nature completionists, so I could see more than 100K lexemes for languages with a very well-established lexicographic tradition (i.e., any major world or regional language). English Wiktionary has more than 350K nouns, 130K adjectives, and 44K verbs—so > 500K total (plus 77K proper nouns.. not sure what to make of that for Wikidata Lexemes). I expect we'll only hit that kind of volume for dozens of languages, though, at least in the early days.
I'm more concerned about forms. (I'm not too worried about senses because I think the average number of senses per word is < 2, though "set" may have as many as 150 in the OED, but only ~90 in Wiktionary.) Anyway, back to forms: verbs in Romance languages can have ~50 forms. Finnish nouns have ~2200 forms. (OTOH, I checked a couple of Finnish lexemes and they have ~10 forms on them.) Not sure how forms are represented internally, but there can certainly be a lot of them for any given Lexeme.
What happens to the size of the Lexeme subgraph if we assume, say, 20 forms on average for every Lexeme?
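
A rough way to play with that question, as a sketch only: it counts entities (Lexemes plus Forms), not triples, and it ignores Senses. The 10^6 and 10^8 Lexeme totals are the napkin-math bounds quoted above, 20 is the hypothetical average from the question, and the function name is made up for illustration. For comparison, the counts quoted earlier (7053785 Forms over 404946 Lexemes) work out to about 17.4 Forms per Lexeme today.

```python
def subgraph_entities(lexemes, forms_per_lexeme):
    """Entity count of a Lexeme subgraph: each Lexeme plus its Forms."""
    return lexemes * (1 + forms_per_lexeme)


# Current average, from the counts quoted earlier in this thread.
current_forms_per_lexeme = 7_053_785 / 404_946

# Napkin-math bounds on the number of Lexemes, with an assumed 20 Forms each.
for lexemes in (10**6, 10**8):
    total = subgraph_entities(lexemes, 20)
    print(f"{lexemes:.0e} Lexemes x 20 Forms -> {total:.1e} entities")

print(f"Current Forms per Lexeme: {current_forms_per_lexeme:.1f}")
```

Under these assumptions the Forms dominate the subgraph, so the average Forms-per-Lexeme figure matters about as much as the Lexeme count itself.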

percentage, number of WDQS queries per month that involve Lexemes

percentage, number of the above queries that only involve Lexemes (i.e. that don't require anything from the larger Wikidata graph)

With very naive heuristics, and for one day, I extracted 529097 queries involving lexemes.
357917 seemed to require data from Wikidata, but I would not trust this too much: since the language is a Wikidata item, a query requesting labels in a language using its language code rather than its QID falls into the category of queries requiring the Wikidata graph.
I did not run the analysis on the full month because it's rather slow, and given the precision of the heuristics I chose I would not trust these numbers anyway.

If we need more precise numbers, the analysis will have to be more involved.

For ref, here is the list of predicates I used to detect a lexeme query: wikibase:lemma, ontolex:lexicalForm, ontolex:representation, ontolex:LexicalEntry, ontolex:sense, dct:language, wikibase:lexicalCategory, wikibase:grammaticalFeature.
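
To make the heuristic concrete, here is a minimal sketch of what such a predicate match could look like. This is an assumption, not the actual analysis code: it does plain substring matching on the prefixed names listed above, and the function names are hypothetical. A query that spells out full IRIs instead of prefixes would be missed, which is one reason to treat the resulting counts as ballpark figures.

```python
# Predicates listed above as markers of a lexeme query.
LEXEME_PREDICATES = (
    "wikibase:lemma",
    "ontolex:lexicalForm",
    "ontolex:representation",
    "ontolex:LexicalEntry",
    "ontolex:sense",
    "dct:language",
    "wikibase:lexicalCategory",
    "wikibase:grammaticalFeature",
)


def involves_lexemes(query: str) -> bool:
    """Naive check: does the SPARQL text mention any lexeme predicate?"""
    return any(predicate in query for predicate in LEXEME_PREDICATES)


def count_lexeme_queries(queries) -> int:
    """Count how many query strings match the naive lexeme check."""
    return sum(1 for query in queries if involves_lexemes(query))


sample = [
    "SELECT ?l ?lemma WHERE { ?l wikibase:lemma ?lemma }",
    "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 }",
]
print(count_lexeme_queries(sample), "of", len(sample), "sample queries match")
```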

given the current rate of growth of Wikidata, approximately how much time it would take for non-Lexeme Wikidata to grow back to its current size

The lexemes RDF dataset is about 77M triples (0.6% of the total size of the graph).
If we were to remove lexemes from the main graph, at the current growth rate it would take ~10 days for Wikidata to grow back to the equivalent size.
Note that in the current graph "only" 29316 distinct Wikidata items are referenced from the lexemes.
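
The growth rate behind the ~10 days figure is not stated here, but it can be backed out from the quoted numbers. The sketch below does only that: the derived total size and daily growth are implications of the rounded figures above (77M triples, 0.6%, ~10 days), not separate measurements, and the names are illustrative.

```python
# Rounded figures quoted in the comment above.
LEXEME_TRIPLES = 77_000_000  # "about 77M triples"
LEXEME_SHARE = 0.006         # "0.6% of the total size of the graph"
REGROW_DAYS = 10             # "~10 days"

# Implied by the figures above, not measured independently.
total_triples = LEXEME_TRIPLES / LEXEME_SHARE
implied_daily_growth = LEXEME_TRIPLES / REGROW_DAYS


def days_to_regrow(removed_triples, triples_per_day):
    """Days for the graph to grow back by the number of removed triples."""
    return removed_triples / triples_per_day


print(f"Implied total graph size: {total_triples:.2e} triples")
print(f"Implied growth: {implied_daily_growth:.2e} triples/day")
print(f"Days to regrow: {days_to_regrow(LEXEME_TRIPLES, implied_daily_growth):.0f}")
```

Plugging a measured growth rate into `days_to_regrow` would give a firmer estimate than the back-calculated one.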

Thanks, @dcausse!
Do you know what percentage of total queries 529097 and 357917 are? I hear you on not trusting these numbers, and I think ballparking is fine for now.

Sorry, just realized my numbers were completely off (it was scanning the whole dataset, not just one day...).

So out of 225,359,379 queries for March 2021, the simple pattern detected 206,612 queries involving lexemes (~0.09%).
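
For reference, the share and a per-day average can be reproduced from these two totals. This is a small sketch with illustrative names; the only added assumption is that March has 31 days and that the queries are spread over the whole month.

```python
# Totals quoted above for March 2021.
TOTAL_QUERIES = 225_359_379
LEXEME_QUERIES = 206_612
DAYS_IN_MARCH = 31

# Share of all WDQS queries that matched the simple lexeme pattern.
lexeme_query_share = 100.0 * LEXEME_QUERIES / TOTAL_QUERIES

# Average matching queries per day, assuming an even spread over the month.
lexeme_queries_per_day = LEXEME_QUERIES / DAYS_IN_MARCH

print(f"Lexeme queries: {lexeme_query_share:.2f}% of all queries")
print(f"About {lexeme_queries_per_day:.0f} lexeme queries per day on average")
```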