[EPIC] Access to Wikidata's lexicographical data from Wiktionaries and other WMF sites
Open, Needs Triage, Public

Description

Now that we have lexicographical data on Wikidata, it should be possible to reuse it on Wiktionaries with Lua functions.

Open questions for the communities:

  • what kind of data do you want to access?
  • what are the use cases you could imagine?
  • what Lua functions would be helpful for you?

Event Timeline

deryckchan renamed this task from "[EPIC] Access to Wikidata's lexicographical data from Wiktionaries" to "[EPIC] Access to Wikidata's lexicographical data from Wiktionaries and other WMF sites". Jan 26 2019, 11:09 PM

Hello @deryckchan, and thanks for creating this task!
We're actually considering this for the year to come, but before we start developing anything, we need to better understand what people would like to do with the data, how they would like to display it on their wiki, and what kind of Lua functions they would need.
If you already have some ideas, or use cases, feel free to share :)

As I wrote on T213941:

I would envisage this being done using parser functions, similar to {{#property:}} for Q-items.

As the first step, we should make {{#statements:}} and {{#property:}} work for Lexemes too. For example, we should make these work:

  • {{#statements:P5974|from=Q4115189}} (i.e. domain:item; datatype:Lexeme/Sense/Form) currently outputs the Lexeme-Sense ID "L123-S2". It should output the lemma or gloss instead.
  • {{#statements:P5974|from=L123-S2}} (i.e. domain:Lexeme/Sense/Form; datatype:anything) throws a parser error at the moment.

This will address migration blocks like https://www.wikidata.org/wiki/Wikidata:Properties_for_deletion#Property:P2521, where the lack of a feature to call Lexemes in Wikipedias is blocking the migration of a property.

My suggestion would be to simply mirror the functions that are currently available for Q-items - either by duplicating the code that does that and changing "Q" to "L", or better, by generalizing it so that it works for all of Wikibase's namespaces (P/Q/L/M/...). Then it can be built upon on-wiki as needed (e.g., through Module:WikidataIB). That would also help structured data on Commons, and future projects using Wikibase.

Pamputt added a subscriber: Pamputt. Jun 4 2019, 8:25 PM

For the French Wiktionary, I do not know what the community will decide, but if we decide one day to use the Lexeme data from Wikidata, it will most probably be for the Forms (conjugation, inflection, declension, etc.). I think we will never use the Senses. So what Mike Peel proposed just above makes sense for full flexibility.

This will address migration blocks like https://www.wikidata.org/wiki/Wikidata:Properties_for_deletion#Property:P2521, where the lack of a feature to call Lexemes in Wikipedias is blocking the migration of a property.

Note that the discussion has been archived. It is now available here: https://www.wikidata.org/wiki/Wikidata:Requests_for_deletions/Archive/2019/Properties/1#female_form_of_label_(P2521)

RexxS added a subscriber: RexxS. Jun 4 2019, 9:53 PM

I'd like to have a complete collection of API calls exposed to Scribunto. I should be able to get the following:

getEntity - the whole object (probably expensive, but would mostly be used to look at structures)
getLanguage - entity ID like Q1860 for 'English'
getLexicalCategory - entity ID like Q24905 for 'verb'
getStatements - table
getSenses - table
getForms - table (each value is an entity ID along with qualifiers 'Grammatical features', a table of entity IDs like Q110786 for 'singular', etc.)

That would be enough, in my opinion, for me to write almost any Scribunto code that the folks at the Wiktionaries and other sites could ask for (until you start changing the structure of the lexemes, of course). If all of these returned values are normal q-numbers (entity IDs), I already have plenty of code to handle getting labels, sitelinks, etc. to display in the local or preferred language, so we probably wouldn't need to worry about further internationalisation.
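As a rough illustration of the accessor list above, here is a minimal sketch in Python rather than Scribunto Lua, so it can be run offline. It assumes the lexeme is available as a dict shaped like the Wikibase Lexeme JSON (keys such as `language`, `lexicalCategory`, `claims`, `forms`, `senses`). The snake_case function names and the sample lexeme are hypothetical and are not an existing API.

```python
from typing import Any, Dict, List

Lexeme = Dict[str, Any]

# Hypothetical sample lexeme: the ID, lemma and form are made up for illustration.
SAMPLE_LEXEME: Lexeme = {
    "id": "L123",
    "language": "Q1860",          # 'English', as in the comment above
    "lexicalCategory": "Q24905",  # 'verb', as in the comment above
    "lemmas": {"en": {"language": "en", "value": "walk"}},
    "claims": {},
    "forms": [
        {
            "id": "L123-F1",
            "representations": {"en": {"language": "en", "value": "walks"}},
            "grammaticalFeatures": ["Q110786"],  # 'singular'
            "claims": {},
        }
    ],
    "senses": [
        {
            "id": "L123-S1",
            "glosses": {"en": {"language": "en", "value": "to move on foot"}},
            "claims": {},
        }
    ],
}


def get_language(lexeme: Lexeme) -> str:
    """Return the entity ID of the lexeme's language."""
    return lexeme["language"]


def get_lexical_category(lexeme: Lexeme) -> str:
    """Return the entity ID of the lexical category."""
    return lexeme["lexicalCategory"]


def get_statements(lexeme: Lexeme) -> Dict[str, Any]:
    """Return the lexeme-level statements, keyed by property ID."""
    return lexeme.get("claims", {})


def get_senses(lexeme: Lexeme) -> Dict[str, Any]:
    """Return the senses keyed by sense ID."""
    return {sense["id"]: sense for sense in lexeme.get("senses", [])}


def get_forms(lexeme: Lexeme) -> Dict[str, List[str]]:
    """Return a mapping of form ID to its grammatical-feature entity IDs."""
    return {
        form["id"]: list(form.get("grammaticalFeatures", []))
        for form in lexeme.get("forms", [])
    }
```

Every returned value is either an entity ID or a table keyed by one, which is the property that would let existing label and sitelink code be reused unchanged.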

We have a bunch of words and forms uploaded in Basque (there should be at least 5,000), and as euwikt is quite dead, this could be a good boost to the project.

If someone wants to use the Basque Wiktionary for testing purposes, let's talk about it.

Iniquity added a subscriber: Iniquity.

The Basque collection is even more complete now!
I do think some customization may be needed for Lexemes due to the different structure - the forms and senses etc. Perhaps the most useful link for a wiktionary may be from words to senses to wikidata items via the "item for this sense" property. That in principle allows translations to be provided, grouped by sense.

One UI suggestion would be: when searching for a word in a wiktionary, if it is NOT found, any matching Wikidata forms from that or any other language could be shown, so this provides an immediate supplement to small Wiktionaries, and there may even be a few words missing from enwikt that could be found in Wikidata.

Iniquity added a comment. Edited Sep 12 2019, 9:18 PM

One UI suggestion would be: when searching for a word in a wiktionary, if it is NOT found, any matching Wikidata forms from that or any other language could be shown, so this provides an immediate supplement to small Wiktionaries, and there may even be a few words missing from enwikt that could be found in Wikidata.

Perhaps ArticlePlaceholder will be interested in this? @Lydia_Pintscher what do you think about this idea?
We can also use best practices from https://tools.wmflabs.org/hauki/. cc @Vesihiisi.

Yurik added a subscriber: Yurik. Edited Sep 13 2019, 2:53 AM

I have imported some Russian nouns (~20,000 so far, with more to come soon), and added links from Wiktionary's pages to the corresponding Lexemes. I think the simplest use case for Lexemes would be to allow a Wiktionary Lua script to load a Lexeme by its ID. This will instantly make Lexemes useful to Wiktionary because the Lua script will be able to:

  • generate table of the word forms
  • generate etymology and pronunciation sections
  • do the above for every lexeme if more than one is used on the page.

Note that the last point makes it substantially different from the regular Wikipedia usage because it is likely that more than one Lexeme corresponds to a single Wiktionary page. Also, while nice to have, it is not really required for Wiktionary to be able to read Wikidata Q items because those could be hardcoded in Lua (the list of used Q-IDs is not too big - under a thousand).

Yurik added a comment. Edited Sep 13 2019, 7:14 PM

P.S. to sum up -- Wiktionary needs just a single Lua function for the minimum viable product: getEntity('L100000') that simply returns the whole Lexeme JSON. Everything else is optional.
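To make the "whole Lexeme JSON" idea concrete, here is a minimal sketch of how a forms table could be built once the full entity is in hand. It is written in Python instead of Lua so it can be run standalone, and it assumes the Wikibase Lexeme JSON layout (`forms`, `representations`, `grammaticalFeatures`). The sample lexeme, its forms and the helper names `form_rows` and `wikitext_table` are hypothetical, and the Q-ID used for 'plural' is an assumption to be checked on Wikidata.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical lexeme JSON, standing in for what getEntity('L100000') might return.
SAMPLE_LEXEME: Dict[str, Any] = {
    "id": "L100000",
    "lemmas": {"ru": {"language": "ru", "value": "стол"}},
    "forms": [
        {
            "id": "L100000-F1",
            "representations": {"ru": {"language": "ru", "value": "стол"}},
            "grammaticalFeatures": ["Q110786"],  # 'singular'
        },
        {
            "id": "L100000-F2",
            "representations": {"ru": {"language": "ru", "value": "столы"}},
            "grammaticalFeatures": ["Q146786"],  # assumed ID for 'plural'
        },
    ],
}


def form_rows(lexeme: Dict[str, Any], lang: str) -> List[Tuple[str, str, List[str]]]:
    """Collect (form ID, representation, sorted feature IDs) for one language code."""
    rows = []
    for form in lexeme.get("forms", []):
        rep = form.get("representations", {}).get(lang)
        if rep is None:
            continue  # this form has no spelling in the requested language code
        features = sorted(form.get("grammaticalFeatures", []))
        rows.append((form["id"], rep["value"], features))
    return rows


def wikitext_table(lexeme: Dict[str, Any], lang: str) -> str:
    """Render the forms of a lexeme as a simple wikitext table."""
    lines = ['{| class="wikitable"', "! Form ID !! Representation !! Grammatical features"]
    for form_id, text, features in form_rows(lexeme, lang):
        lines.append("|-")
        lines.append(f"| {form_id} || {text} || {', '.join(features)}")
    lines.append("|}")
    return "\n".join(lines)
```

A page with several lexemes would simply call `wikitext_table` once per lexeme and join the results, which covers the "more than one lexeme per page" case mentioned earlier.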

RexxS added a comment. Sep 23 2019, 2:28 PM

We started using Scribunto to read Wikidata items in exactly that way - just loading the entire entity as an object and working from that. There are two downsides that became apparent:

First, the resources consumed made this an "expensive" call unless it was done from the page that was already linked to the Wikidata item.

Second, because all of the Wikidata object was loaded, including descriptions, aliases, labels, etc. in every language, any change to any of those in any language threw up an entry in the watchlist for anybody watching the Wikipedia article where that item was loaded. That swamped watchlists with irrelevant entries and caused many Wikipedia editors to turn off monitoring of Wikidata changes.

Unless anyone can think of a good reason not to, having calls that return single items from a large entity is far more efficient and can make watchlisting feasible. We should definitely be planning to achieve that functionality, even if we begin by loading the entire entity in order to get started.

@RexxS you do bring up a valid point about watchlists. The minor difference here is that a lexeme is tied to a specific language, so it is less likely to have content not relevant to that one language / Wiktionary. The only exception might be the description of senses in other languages. TBH, I am not sure that adding sense descriptions in a non-native language is a scalable solution -- we are repeating the issue of sitelinks, where every wiki page referenced all other wiki pages on the same subject. But this is a separate discussion, unrelated to this ticket.

Performance-wise, there is not much difference -- lexemes are not attached (yet) to Wiktionary pages the way Wikidata items are attached through their sitelinks, so every lexeme retrieval will be "expensive". On the other hand, getting just a handful of lexemes (at most) per Wiktionary page should not affect performance in a significant way. And since most of the content will be relevant to the page generation, having multiple calls might actually be slower than rendering a large chunk of the page in a single template with a module, where that module would get the whole lexeme content.

Lastly, we could always optimize the process, but remember that a simple interface to get the entire lexeme is far quicker to implement than a very complex system -- so in the end the complex system might be better, but in the meantime you won't have it for several years (?), and you may need to allocate resources to this project at the expense of another project.

Thank you for all your input so far. That's really helpful.
I have one more question: How many Lexemes would you expect to load on a single Wiktionary page on average? How many Lexemes would you need to load for it to be useful for you?

I count over 30 basic lexemes on https://en.wiktionary.org/wiki/for while there may be more when we start to count inflections and derived terms ...

Yurik added a comment. Fri, Sep 27, 3:16 PM

@Lydia_Pintscher most of the Wiktionary pages have just one corresponding lexeme - and that's all I would expect to load.

Some statistics: https://w.wiki/8xw (note that this is per language, not just when lemmas match)

lexemes_per_word    words
1                   173680
2                   4659
3                   351
4                   65
5                   15
6                   8
7                   3
8                   1

The tricky bit comes when a page has multiple associated lexemes -- yes, in theory there could be up to 8 (per the query result), but I think it is a mistake to store so many lexemes per word -- most of them have identical forms, pronunciation, and top-level claims. They only differ in their meaning - and as such, we should put that meaning inside the senses.

Yurik added a comment. Fri, Sep 27, 3:25 PM

P.S. @Fnielsen does raise a valid point about the various linked lexemes, and that might be useful -- for example, if a lexeme lists another lexeme as being a synonym, it would be good to show it as a word rather than an L-number.

That said, I do not believe we need it just yet -- it will take a while for the synonyms to be populated to the level of wiktionary, so for now lexemes will be needed just for the "infoboxes" -- e.g. list all forms and basic info, not the advanced features.

At this point, I can easily replace the {{noun ru|...}} template (generates morphology summary and a forms table), but I won't be able to easily replace the synonyms section with the auto-generated content, and thus, linked lexemes are somewhat useless until they have much better coverage.

Here is a variation on @Yurik's query with a count of within-language forms: https://w.wiki/8y8 (the current count is obfuscated by Tamil annotation). For instance, 'led' in Ordia shows 9 lexemes from 3 languages: https://tools.wmflabs.org/ordia/representation/led

lexemes_per_representation    number_of_representations    example_representation
55                            1                            பெயர்ச்சொல்
47                            1                            ஒருமை
44                            1                            noun
33                            1                            singular
8                             10                           сибирка
7                             26                           led
6                             61                           かえる
5                             191                          かえ
4                             678                          lede
3                             2885                         engagerat
2                             30482                        bager
1                             1572338                      كتبت

Yurik added a comment. Fri, Sep 27, 5:15 PM

@Fnielsen I am not sure I understand what that query does, could you elaborate? In particular, I am confused about why you look at the forms -- from the perspective of Wiktionary, you request a single Lexeme, not individual forms. (By the way, the query times out for me.)

Also, I just realized that I shouldn't have grouped by language, because in Wiktionary each page is per lemma, regardless of which language contains it. So if Wiktionary wants to show data about all lexemes spelled a certain way, the query becomes https://w.wiki/8yY (the results are nearly identical -- the vast majority of words are still unique, at least with what we currently have in WD):

lexemes_per_word    words
1                   173657
2                   4670
3                   351
4                   66
5                   15
6                   8
7                   3
8                   1
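
For readers who want to reproduce this kind of histogram offline, here is a minimal Python sketch of the same grouping idea: count how many lexemes share a lemma spelling, then count how many spellings fall into each bucket. It is not the SPARQL behind the short link above. The function name and the sample input are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def lexemes_per_word_histogram(lemmas: Iterable[str]) -> Dict[int, int]:
    """Map 'number of lexemes sharing a spelling' to 'number of such spellings'.

    `lemmas` holds one lemma string per lexeme, regardless of language.
    """
    per_word = Counter(lemmas)                # spelling -> number of lexemes
    histogram = Counter(per_word.values())    # lexeme count -> number of spellings
    return dict(sorted(histogram.items()))


# Hypothetical input: one entry per lexeme, not real Wikidata counts.
sample_lemmas = ["led", "led", "led", "bager", "bager", "tirsdag"]
sample_histogram = lexemes_per_word_histogram(sample_lemmas)
```

Grouping by spelling alone, without the language, mirrors how a Wiktionary page collects every lexeme written the same way.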

The query is a bit hard on WDQS. If one executes it twice, the second run can apparently use some caching from the first.

The query counts, e.g., 7 Danish lexemes for 'led', which Ordia shows as having 9 different forms. In Wiktionary, I suppose we would like to have all 9 forms shown - either fully or just as a redirect. The 9 forms can be fetched from 7 different Wikidata lexeme pages. That is just for one language. I have a problem formulating a language-agnostic query that doesn't time out in WDQS. The Ordia page https://tools.wmflabs.org/ordia/representation/led shows that there are two more lexemes we should get to make the full Wiktionary page for 'led': one Czech lexeme (with two forms) and one English lexeme.

I would like to be able to access all forms matching a particular set of grammatical features from Wiktionary, so that, for example, a template can be made that takes a lexeme ID and returns a table with all the forms, as per the Wiktionary info. For a very basic example, see the table on https://no.wiktionary.org/wiki/tirsdag#Grammatikk
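
A minimal sketch of that kind of lookup, in Python so it can be run standalone: given a lexeme dict in the Wikibase Lexeme JSON layout, return the forms whose grammatical features include every requested feature. The lexeme ID, the helper name `forms_with_features` and the sample data are hypothetical. Q110786 for 'singular' comes from this thread; the other Q-IDs are assumed for illustration and should be checked on Wikidata.

```python
from typing import Any, Dict, Iterable, List

SINGULAR = "Q110786"      # 'singular', as mentioned earlier in the thread
PLURAL = "Q146786"        # assumed ID for 'plural'
DEFINITE = "Q53997851"    # assumed ID for 'definite'
INDEFINITE = "Q53997857"  # assumed ID for 'indefinite'


def _form(form_id: str, text: str, features: List[str]) -> Dict[str, Any]:
    """Build one form entry in the assumed lexeme JSON shape."""
    return {
        "id": form_id,
        "representations": {"nb": {"language": "nb", "value": text}},
        "grammaticalFeatures": features,
    }


# Hypothetical lexeme for Norwegian Bokmål 'tirsdag'; the ID is made up.
TIRSDAG: Dict[str, Any] = {
    "id": "L999999",
    "forms": [
        _form("L999999-F1", "tirsdag", [SINGULAR, INDEFINITE]),
        _form("L999999-F2", "tirsdagen", [SINGULAR, DEFINITE]),
        _form("L999999-F3", "tirsdager", [PLURAL, INDEFINITE]),
        _form("L999999-F4", "tirsdagene", [PLURAL, DEFINITE]),
    ],
}


def forms_with_features(lexeme: Dict[str, Any], required: Iterable[str], lang: str) -> List[str]:
    """Return representations of forms that carry all of the required features."""
    wanted = set(required)
    matches = []
    for form in lexeme.get("forms", []):
        rep = form.get("representations", {}).get(lang)
        if rep is not None and wanted <= set(form.get("grammaticalFeatures", [])):
            matches.append(rep["value"])
    return matches
```

A declension template could call this once per table cell, for example with `[SINGULAR, DEFINITE]` for the definite singular cell.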