[EPIC] Access to Wikidata's lexicographical data from Wiktionaries and other WMF sites
Open, Needs Triage, Public

Assigned To

None

Authored By

	Lea_Lacroix_WMDE
	Jan 3 2019, 10:55 AM

Description

Now that we have lexicographical data on Wikidata, it should be possible to reuse it on Wiktionaries with Lua functions.

Open questions for the communities:

what kind of data do you want to access?
what are the use cases you could imagine?
what Lua functions would be helpful for you?

Related Objects

Status	Assigned	Task
Open	None	T212843 [EPIC] Access to Wikidata's lexicographical data from Wiktionaries and other WMF sites
Resolved	Lucas_Werkmeister_WMDE	T195895 Lua function to get Lemma of a Lexeme
Resolved	Lydia_Pintscher	T235901 Implement Lua access to Lexemes, Senses and Forms
Resolved	Lucas_Werkmeister_WMDE	T294224 mw.wikibase.lexeme is nil on Beta Wikidata, and mw.wikibase.mediainfo is nil on Commons (beta+prod)
Resolved	Lucas_Werkmeister_WMDE	T294637 Improvements to the WikibaseLexeme Lua interface (before full rollout)
Resolved	Lucas_Werkmeister_WMDE	T297024 Add methods to get lemma, representation, gloss by language code
Resolved	Lucas_Werkmeister_WMDE	T297404 Remove most of mw.wikibase.lexeme module (remove getLemmas, getLanguage, getLexicalCategory; keep splitLexemeId)
Resolved	Lucas_Werkmeister_WMDE	T297478 Add form:hasGrammaticalFeature( itemId ) Lua method
Resolved	Lucas_Werkmeister_WMDE	T239633 Enable mw.wikibase.getEntity() to load forms and senses
Resolved	Lydia_Pintscher	T294159 Enable Lexeme access on first set of projects
Invalid	None	T294571 Enable access to Lexemes on bn.wikisource.org
Invalid	None	T203220 Enable access to Lexemes on fr.wikisource.org
Resolved	Lydia_Pintscher	T309593 enable Lexeme Lua access on remaining Wikimedia projects

Event Timeline

Lea_Lacroix_WMDE created this task.Jan 3 2019, 10:55 AM

Lea_Lacroix_WMDE added a subtask: T195895: Lua function to get Lemma of a Lexeme.

Addshore awarded a token.Jan 3 2019, 1:44 PM

Addshore rescinded a token.

Addshore awarded a token.

Bugreporter merged a task: T213941: Allow other Wikimedia sites to use Lexeme data.Jan 20 2019, 6:55 AM

Bugreporter added subscribers: deryckchan, Jdforrester-WMF, Addshore.

deryckchan renamed this task from [EPIC] Access to Wikidata's lexicographical data from Wiktionaries to [EPIC] Access to Wikidata's lexicographical data from Wiktionaries and other WMF sites.Jan 26 2019, 11:09 PM

In T213941#4885091, @Lea_Lacroix_WMDE wrote:

Hello @deryckchan, and thanks for creating this task!
We're actually considering this for the year to come, but before we start developing anything, we need to better understand what people would like to do with the data, how they would like to display it on their wiki, and what kind of Lua functions they would need.
If you already have some ideas, or use cases, feel free to share :)

As I wrote on T213941

I would envisage this being done using parser functions, similar to {{#property:}} for Q-items.

This will address migration blocks like https://www.wikidata.org/wiki/Wikidata:Properties_for_deletion#Property:P2521, where the lack of a feature to call Lexemes in Wikipedias is blocking the migration of a property.

Liuxinyu970226 subscribed.Feb 7 2019, 10:18 AM

Liuxinyu970226 awarded a token.Mar 23 2019, 11:44 AM

Geertivp subscribed.May 27 2019, 8:46 AM

MarcoSwart subscribed.Jun 4 2019, 11:54 AM

My suggestion would be to simply mirror the functions that are currently available for Q-items - either by duplicating the code that does that and changing "Q" to "L", or better, generalizing it so that it works for all of Wikibase's namespaces (P/Q/L/M/...). Then it can be built upon on-wiki as needed (e.g., through Module:WikidataIB). That would also help structured data on commons, and future projects using wikibase.
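As an illustration of the "generalize over P/Q/L/M" idea above, here is a minimal sketch of prefix-based dispatch on entity IDs. It is written in Python rather than Lua, and the names `ENTITY_TYPES` and `classify_entity_id` are hypothetical, not part of any Wikibase API. It assumes the usual ID shapes (`Q1860`, `L100000`, and `L100000-F1` / `L100000-S1` for forms and senses).

```python
import re

# Hypothetical prefix table; the comment above only lists P/Q/L/M as examples.
ENTITY_TYPES = {"P": "property", "Q": "item", "L": "lexeme", "M": "mediainfo"}

# A top-level ID, optionally followed by a form (-F) or sense (-S) suffix.
_ID_PATTERN = re.compile(r"^([PQLM])[1-9]\d*(?:-([FS])[1-9]\d*)?$")


def classify_entity_id(entity_id):
    """Return (kind, top_level_id) for an entity ID string."""
    match = _ID_PATTERN.match(entity_id)
    if match is None:
        raise ValueError(f"not a recognised entity ID: {entity_id!r}")
    prefix, sub_prefix = match.groups()
    if sub_prefix is None:
        return ENTITY_TYPES[prefix], entity_id
    if prefix != "L":
        # Only lexemes carry forms and senses as sub-entities.
        raise ValueError(f"only lexemes have forms and senses: {entity_id!r}")
    kind = "form" if sub_prefix == "F" else "sense"
    return kind, entity_id.split("-", 1)[0]
```

A generic loader could call something like this first and then route the request to the right store, so that on-wiki modules would not need one code path per namespace.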

Pamputt subscribed.Jun 4 2019, 8:25 PM

For the French Wiktionary, I do not know what the community will decide, but if we decide one day to use the Lexeme data from Wikidata, it will most probably be for the Forms (conjugation, inflection, declension, etc). I think we will never use the Senses. So what Mike Peel proposed just before makes sense, as it gives full flexibility.

In T212843#4912902, @deryckchan wrote:

This will address migration blocks like https://www.wikidata.org/wiki/Wikidata:Properties_for_deletion#Property:P2521, where the lack of a feature to call Lexemes in Wikipedias is blocking the migration of a property.

Note that the discussion has been archived. It is now available here: https://www.wikidata.org/wiki/Wikidata:Requests_for_deletions/Archive/2019/Properties/1#female_form_of_label_(P2521)

I'd like to have a complete collection of api calls exposed to Scribunto. I should be able to get the following:

getEntity - the whole object (probably expensive, but would mostly be used to look at structures)
getLanguage - entity ID like Q1860 for 'English'
getLexicalCategory - entity ID like Q24905 for 'verb'
getStatements - table
getSenses - table
getForms - table (each value is an entity ID along with qualifiers 'Grammatical features', a table of entity IDs like Q110786 for 'singular', etc.)

That would be enough, in my opinion, for me to write almost any Scribunto code that the folks at the Wiktionaries and other sites could ask for (until you start changing the structure of the lexemes, of course). If all of these returned values are normal q-numbers (entity IDs), I already have plenty of code to handle getting labels, sitelinks, etc. to display in the local or preferred language, so we probably wouldn't need to worry about further internationalisation.
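The list of calls requested above can be sketched as thin accessors over a lexeme's JSON. This is a Python illustration, not the Lua interface that was eventually built: the function names mirror the wish list, the lexeme ID `L100000` and the sample content are made up, and the dictionary layout assumes the standard Wikibase lexeme JSON (`language`, `lexicalCategory`, `claims`, `senses`, `forms`).

```python
# Hand-written sample shaped like Wikibase lexeme JSON (illustrative data only).
SAMPLE_LEXEME = {
    "id": "L100000",
    "type": "lexeme",
    "language": "Q1860",         # English, as in the comment above
    "lexicalCategory": "Q24905",  # verb, as in the comment above
    "lemmas": {"en": {"language": "en", "value": "walk"}},
    "claims": {},
    "senses": [
        {"id": "L100000-S1",
         "glosses": {"en": {"language": "en", "value": "to move on foot"}},
         "claims": {}},
    ],
    "forms": [
        {"id": "L100000-F1",
         "representations": {"en": {"language": "en", "value": "walks"}},
         "grammaticalFeatures": ["Q110786"],
         "claims": {}},
    ],
}


def get_language(lexeme):
    """Entity ID of the lexeme's language item."""
    return lexeme["language"]


def get_lexical_category(lexeme):
    """Entity ID of the lexical category item."""
    return lexeme["lexicalCategory"]


def get_statements(lexeme, property_id):
    """Statements for one property, or an empty list if there are none."""
    return lexeme.get("claims", {}).get(property_id, [])


def get_senses(lexeme):
    """Senses keyed by sense ID."""
    return {sense["id"]: sense for sense in lexeme.get("senses", [])}


def get_forms(lexeme):
    """Form IDs mapped to their grammatical feature item IDs."""
    return {form["id"]: list(form.get("grammaticalFeatures", []))
            for form in lexeme.get("forms", [])}
```

Because every accessor returns plain entity IDs, existing label and sitelink helpers for Q-items could be reused on the results, which is the point made in the comment.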

There's a Wikimania 2019 submission about Lexemes, by @Fnielsen : https://wikimania.wikimedia.org/wiki/2019:Languages/Wikidata_lexemes

Mike_Peel mentioned this in T223792: Extend mw.wikibase.getEntity lua function to allow accessing Structured Data on Commons items.Jun 27 2019, 4:55 PM

We have a bunch of words and forms uploaded in Basque (there should be at least 5,000), and as euwikt is quite dead, this could be a good boost to the project.

If someone wants to use basque wiktionary for testing purposes, let's talk about it.

Tobias1984 subscribed.Jul 28 2019, 7:29 AM

Iniquity awarded a token.Aug 12 2019, 5:53 AM

Iniquity subscribed.

The Basque collection is even more complete now!
I do think some customization may be needed for Lexemes due to the different structure - the forms and senses etc. Perhaps the most useful link for a wiktionary may be from words to senses to wikidata items via the "item for this sense" property. That in principle allows translations to be provided, grouped by sense.
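A rough sketch of the "words to senses to items" link, in Python rather than Lua. It assumes the "item for this sense" property is P5137 (worth verifying), uses the standard statement JSON layout, and works on two hand-written lexemes. `sense_items` and `translations_by_item` are hypothetical helper names.

```python
from collections import defaultdict

ITEM_FOR_THIS_SENSE = "P5137"  # assumed property ID for "item for this sense"


def _sense_statement(item_id):
    """Build a minimal statement linking a sense to an item (test data helper)."""
    return {"mainsnak": {"snaktype": "value",
                         "property": ITEM_FOR_THIS_SENSE,
                         "datavalue": {"type": "wikibase-entityid",
                                       "value": {"entity-type": "item",
                                                 "id": item_id}}},
            "type": "statement", "rank": "normal"}


# Illustrative data: two lexemes whose senses point at the same item.
SAMPLE_LEXEMES = [
    {"lemmas": {"en": {"language": "en", "value": "dog"}},
     "senses": [{"id": "L1-S1",
                 "claims": {ITEM_FOR_THIS_SENSE: [_sense_statement("Q144")]}}]},
    {"lemmas": {"fr": {"language": "fr", "value": "chien"}},
     "senses": [{"id": "L2-S1",
                 "claims": {ITEM_FOR_THIS_SENSE: [_sense_statement("Q144")]}}]},
]


def sense_items(sense):
    """Item IDs a sense points to through 'item for this sense'."""
    ids = []
    for statement in sense.get("claims", {}).get(ITEM_FOR_THIS_SENSE, []):
        snak = statement["mainsnak"]
        if snak.get("snaktype") == "value":
            ids.append(snak["datavalue"]["value"]["id"])
    return ids


def translations_by_item(lexemes):
    """Group (language code, lemma) pairs by the item their senses share."""
    grouped = defaultdict(list)
    for lexeme in lexemes:
        for sense in lexeme.get("senses", []):
            for item_id in sense_items(sense):
                for lemma in lexeme["lemmas"].values():
                    grouped[item_id].append((lemma["language"], lemma["value"]))
    return dict(grouped)
```

Lemmas that end up under the same item ID are candidate translations of each other for that sense.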

One UI suggestion would be: when searching for a word in a wiktionary, if it is NOT found, any matching Wikidata forms from that or any other language could be shown, so this provides an immediate supplement to small Wiktionaries, and there may even be a few words missing from enwikt that could be found in Wikidata.

In T212843#5488427, @ArthurPSmith wrote:

One UI suggestion would be: when searching for a word in a wiktionary, if it is NOT found, any matching Wikidata forms from that or any other language could be shown, so this provides an immediate supplement to small Wiktionaries, and there may even be a few words missing from enwikt that could be found in Wikidata.

Perhaps ArticlePlaceholder will be interested in this? @Lydia_Pintscher what do you think about this idea?
We can also use best practices from https://tools.wmflabs.org/hauki/. cc @Vesihiisi.

Iniquity added a subscriber: Vesihiisi.Sep 12 2019, 9:22 PM

I have imported some Russian nouns (~20,000 so far, but will be more soon), plus added links from Wiktionary's pages to the corresponding Lexemes. I think the simplest use case for Lexemes would be to allow Wiktionary Lua script to be able to load Lexeme by its ID. This will instantly make Lexemes useful to Wiktionary because the Lua script will be able to:

generate table of the word forms
generate etymology and pronunciation sections
do the above for every lexeme if more than one is used on the page.

Note that the last point makes it substantially different from the regular Wikipedia usage because it is likely that more than one Lexeme corresponds to a single Wiktionary page. Also, while nice to have, it is not really required for Wiktionary to be able to read Wikidata Q items because those could be hardcoded in Lua (the list of used Q-IDs is not too big - under a thousand)

P.S. to sum up -- Wiktionary needs just a single Lua function for the minimum viable product: getEntity('L100000') that simply returns the whole Lexeme JSON. Everything else is optional.
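To make the "load the whole Lexeme and build the forms table" use case concrete, here is a minimal Python sketch (the real consumer would be a Lua module). It assumes the standard lexeme JSON layout for forms, and it hardcodes labels for grammatical-feature items, as the comment above suggests. `FEATURE_LABELS`, `forms_table` and the sample lexeme are illustrative.

```python
# Hardcoded labels for the few feature items used here (illustrative subset).
FEATURE_LABELS = {"Q110786": "singular", "Q146786": "plural"}

# Hand-written sample shaped like the JSON getEntity('L...') would return.
SAMPLE_LEXEME = {
    "id": "L100000",
    "lemmas": {"en": {"language": "en", "value": "dog"}},
    "forms": [
        {"id": "L100000-F1",
         "representations": {"en": {"language": "en", "value": "dog"}},
         "grammaticalFeatures": ["Q110786"]},
        {"id": "L100000-F2",
         "representations": {"en": {"language": "en", "value": "dogs"}},
         "grammaticalFeatures": ["Q146786"]},
    ],
}


def forms_table(lexeme, language_code):
    """Return (representation, feature labels) rows for one spelling variant."""
    rows = []
    for form in lexeme.get("forms", []):
        representation = form.get("representations", {}).get(language_code)
        if representation is None:
            continue  # this form has no representation in the requested variant
        labels = ", ".join(FEATURE_LABELS.get(item_id, item_id)
                           for item_id in form.get("grammaticalFeatures", []))
        rows.append((representation["value"], labels))
    return rows
```

Unknown feature IDs fall back to the raw Q-ID, so a missing label shows up in the output instead of being dropped silently.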

TomT0m subscribed.Sep 22 2019, 6:26 PM

This comment was removed by TomT0m.

We started using Scribunto to read Wikidata items in exactly that way - just loading the entire entity as an object and working from that. There are two downsides that became apparent:

First, the resources consumed made this an "expensive" call unless it was done from the page that was already linked to the Wikidata item.

Second, because all of the Wikidata object was loaded, including descriptions, aliases, labels, etc. in every language, any change to any of those in any language threw up an entry in the watchlist for anybody watching the Wikipedia article where that item was loaded. That swamped watchlists with irrelevant entries and caused many Wikipedia editors to turn off monitoring of Wikidata changes.

Unless anyone can think of a good reason not to, having calls that return single items from a large entity is far more efficient and can make watchlisting feasible. We should definitely be planning to achieve that functionality, even if we begin by loading the entire entity in order to get started.
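The argument for fine-grained calls can be sketched as follows: if each accessor records which part of the entity it touched, a change elsewhere in the entity need not notify the page. This is a toy Python model, not how Wikibase usage tracking is actually implemented. `TrackedLexeme`, `change_is_relevant` and the aspect tuples are hypothetical.

```python
class TrackedLexeme:
    """Wraps lexeme JSON and records which aspects a page actually read."""

    def __init__(self, data):
        self._data = data
        self.used_aspects = set()

    def lemma(self, language_code):
        self.used_aspects.add(("lemma", language_code))
        entry = self._data.get("lemmas", {}).get(language_code)
        return entry["value"] if entry else None

    def form_representations(self, language_code):
        self.used_aspects.add(("forms", language_code))
        return [form["representations"][language_code]["value"]
                for form in self._data.get("forms", [])
                if language_code in form.get("representations", {})]


def change_is_relevant(lexeme, changed_aspect):
    """A change should reach watchlists only if the page read that aspect."""
    return changed_aspect in lexeme.used_aspects


# Illustrative data only.
SAMPLE = {
    "lemmas": {"en": {"language": "en", "value": "dog"}},
    "forms": [{"representations": {"en": {"language": "en", "value": "dogs"}}}],
}
```

A page that only reads the English lemma would then ignore an edit to, say, a French gloss, whereas loading the whole entity would have to treat every edit as relevant.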

@RexxS you do bring up a valid point about watchlists. The minor difference here is that a lexeme is tied to a specific language, so it is less likely to have content not relevant to that one language / wiktionary. The only exception might be the descriptions of senses in other languages. TBH, I am not sure that adding sense descriptions in a non-native language is a scalable solution -- we are repeating the issue of sitelinks, where every wiki page referenced all other wiki pages on the same subject. But this is a separate discussion, unrelated to this ticket.

Performance-wise, there is not much difference -- lexemes are not attached (yet) to wiktionary pages the way wikidata items are attached with their sitelinks, so every lexeme retrieval will be "expensive". On the other hand, getting just a handful of lexemes (at most) per wiktionary page should not affect performance in a significant way. And since most of the content will be relevant to the page generation, having multiple calls might actually be slower than rendering a large chunk of the page in a single template with a module, where that module would get the whole lexeme content.

Lastly, we could always optimize the process, but remember that a simple interface to get the entire lexeme is far quicker to implement than a very complex system -- so in the end the complex system might be better, but in the meantime you won't have it for several years (?), and you may need to allocate resources to this project at the expense of another project.

Thank you for all your input so far. That's really helpful.
I have one more question: How many Lexemes would you expect to load on a single Wiktionary page on average? How many Lexemes would you need to load for it to be useful for you?

I count over 30 basic lexemes on https://en.wiktionary.org/wiki/for while there may be more when we start to count inflections and derived terms ...

@Lydia_Pintscher most of the Wiktionary pages have just one corresponding lexeme - and that's all I would expect to load.

Some statistics: https://w.wiki/8xw (note that this is per language, not just when lemmas match)

lexemes_per_word	words
1	173680
2	4659
3	351
4	65
5	15
6	8
7	3
8	1

The tricky bit comes when a page has multiple associated lexemes -- yes, in theory there could be up to 8 (per the query result), but I think it is a mistake to store so many lexemes per word -- most of them have identical forms, pronunciation, and top-level claims. They only differ in their meaning, and as such, we should put that meaning inside the senses.

P.S. @Fnielsen does raise a valid point about various linked lexemes, and that might be useful -- for example, if a lexeme lists another lexeme as being a synonym, it would be good to show it as a word rather than an L-number.

That said, I do not believe we need it just yet -- it will take a while for the synonyms to be populated to the level of wiktionary, so for now lexemes will be needed just for the "infoboxes" -- e.g. list all forms and basic info, not the advanced features.

At this point, I can easily replace the {{noun ru|...}} template (generates morphology summary and a forms table), but I won't be able to easily replace the synonyms section with the auto-generated content, and thus, linked lexemes are somewhat useless until they have much better coverage.

Here is a variation on @Yurik's query that counts within-language forms: https://w.wiki/8y8 (the current count is obfuscated by Tamil annotation). For instance, 'led' in Ordia shows 9 lexemes from 3 languages: https://tools.wmflabs.org/ordia/representation/led

lexemes_per_representation	number_of_representations	example_representation
55	1	பெயர்ச்சொல்
47	1	ஒருமை
44	1	noun
33	1	singular
8	10	сибирка
7	26	led
6	61	かえる
5	191	かえ
4	678	lede
3	2885	engagerat
2	30482	bager
1	1572338	كتبت

@Fnielsen I am not sure I understand what that query does, could you elaborate? In particular, I am confused about why you look at the forms -- from the perspective of Wiktionary, you request a single Lexeme, not individual forms. (btw, the query times out for me).

Also, I just realized that I shouldn't have grouped by the language, because in Wiktionary each page is per Lemma, regardless of which language contains it. So if Wiktionary wants to show data about all lexemes spelled a certain way, the query becomes https://w.wiki/8yY (the results are nearly identical -- words are still by far unique, at least with what we currently have in WD):

lexemes_per_word	words
1	173657
2	4670
3	351
4	66
5	15
6	8
7	3
8	1
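The two groupings discussed above (per language and lemma, versus per lemma alone) can be reproduced offline on any list of lexemes. The sketch below is a Python stand-in for the SPARQL queries, whose text is not quoted in this thread. The function names and the toy lexemes are hypothetical, and `REPORTED` simply copies the language-agnostic table above.

```python
from collections import Counter


def lexemes_per_word(lexemes, per_language=True):
    """Histogram: number of lexemes sharing a word -> how many such words."""
    per_word = Counter()
    for lexeme in lexemes:
        for lemma in lexeme["lemmas"].values():
            # Group by (language item, spelling) or by spelling alone.
            key = (lexeme["language"], lemma["value"]) if per_language else lemma["value"]
            per_word[key] += 1
    return dict(Counter(per_word.values()))


def single_lexeme_share(histogram):
    """Fraction of words that map to exactly one lexeme."""
    return histogram.get(1, 0) / sum(histogram.values())


# Toy data: two Danish 'led' lexemes, one English 'led', one Danish 'hund'.
TOY_LEXEMES = [
    {"language": "Q9035", "lemmas": {"da": {"language": "da", "value": "led"}}},
    {"language": "Q9035", "lemmas": {"da": {"language": "da", "value": "led"}}},
    {"language": "Q1860", "lemmas": {"en": {"language": "en", "value": "led"}}},
    {"language": "Q9035", "lemmas": {"da": {"language": "da", "value": "hund"}}},
]

# Counts copied from the language-agnostic table above.
REPORTED = {1: 173657, 2: 4670, 3: 351, 4: 66, 5: 15, 6: 8, 7: 3, 8: 1}
```

Applied to the reported counts, `single_lexeme_share(REPORTED)` comes out at roughly 97%, which matches the remark that words are still by far unique.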

The query is a bit hard on WDQS. If one executes it twice, the second run can apparently use some caching from the first.

The query counts, e.g., 7 Danish lexemes for 'led', which account for what Ordia shows as 9 different forms. In Wiktionary, I suppose we would like to have all 9 forms shown, either fully or just as a redirect. The 9 forms can be fetched from 7 different Wikidata lexeme pages. That is just for one language. I have a problem with formulating a language-agnostic query that doesn't time out in WDQS. The Ordia page https://tools.wmflabs.org/ordia/representation/led shows that there are two more lexemes we should get to make the full Wiktionary page for 'led': one Czech lexeme (with two forms) and one English lexeme.

I would like to be able to access all forms matching a particular set of grammatical features from Wiktionary, so that a template can be made for example where a lexeme ID is given and a table will be returned with all the forms as per Wiktionary info. For a very basic example, see the table on https://no.wiktionary.org/wiki/tirsdag#Grammatikk
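Selecting forms by a set of grammatical features comes down to a subset test on each form's feature list. Here is a minimal Python sketch (a Lua module would do the same over the loaded entity). The Q-IDs for definite and indefinite are assumptions that should be checked, the lexeme ID is made up, and `forms_with_features` is a hypothetical name.

```python
SINGULAR = "Q110786"
PLURAL = "Q146786"
DEFINITE = "Q53997851"    # assumed item ID, verify before use
INDEFINITE = "Q53997857"  # assumed item ID, verify before use


def _form(form_id, text, features):
    """Build one form entry in lexeme-JSON shape (test data helper)."""
    return {"id": form_id,
            "representations": {"nb": {"language": "nb", "value": text}},
            "grammaticalFeatures": features}


# Hand-written sample for Norwegian Bokmål 'tirsdag' (illustrative only).
TIRSDAG = {
    "id": "L100000",
    "forms": [
        _form("L100000-F1", "tirsdag", [SINGULAR, INDEFINITE]),
        _form("L100000-F2", "tirsdagen", [SINGULAR, DEFINITE]),
        _form("L100000-F3", "tirsdager", [PLURAL, INDEFINITE]),
        _form("L100000-F4", "tirsdagene", [PLURAL, DEFINITE]),
    ],
}


def forms_with_features(lexeme, required_features):
    """Forms whose grammatical features include every required feature."""
    required = set(required_features)
    return [form for form in lexeme.get("forms", [])
            if required <= set(form.get("grammaticalFeatures", []))]


def representations(forms, language_code):
    """Spellings of the given forms in one language variant."""
    return [form["representations"][language_code]["value"]
            for form in forms if language_code in form["representations"]]
```

A table template would call the selector once per cell, for example with `[PLURAL, DEFINITE]` for the definite plural, and leave a cell empty when nothing matches.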

Lucas_Werkmeister_WMDE mentioned this in T235901: Implement Lua access to Lexemes, Senses and Forms.Oct 18 2019, 5:10 PM

So9q subscribed.Nov 18 2019, 3:16 PM

Marsupium subscribed.Nov 19 2019, 1:55 PM

mxn subscribed.Nov 28 2019, 12:51 AM

@Lucas_Werkmeister_WMDE thank you for all the hard work on this task! Do you have any approximate timeline of the getEntity() returning all lexeme forms, or is that already implemented? How significant of a challenge is it? I have been spending considerable time updating Lexicator bot to parse multiple Wiktionary languages, and handle multiple linguistic types, but all that work is mostly pointless until Wiktionaries can access that data.

Thanks!

Scott_WUaS subscribed.Jan 22 2020, 6:17 PM

He7d3r subscribed.Jun 18 2020, 1:31 PM

darthmon_wmde subscribed.Sep 22 2020, 12:55 PM

Bodhisattwa added a subtask: T203220: Enable access to Lexemes on fr.wikisource.org.Nov 7 2020, 6:09 PM

Bodhisattwa subscribed.

Just a reminder that community people regularly come to me asking when we can implement the integration of Lexicographical data on Wiktionaries. Example from today on French Wiktionary.

Pols12 subscribed.Dec 3 2020, 5:59 AM

Mugli subscribed.Dec 6 2020, 4:14 AM

Amire80 subscribed.Dec 19 2020, 8:06 AM

How is this going? Any progress? I would like to help on this, but I dunno where to start.

matej_suchanek closed subtask T195895: Lua function to get Lemma of a Lexeme as Resolved.Dec 27 2020, 12:52 PM

Mahir256 added a subscriber: Lepticed7.Feb 16 2021, 7:36 PM

A use case recently discussed here with @Lepticed7: the idea of creating conjugation tables on Wiktionary using Lexemes. One would need to retrieve the Forms of a Lexeme (ideally by giving specific grammatical features as input and getting back the corresponding Form, or just by getting a list of all the Forms and grammatical features, which would then be organized by the Lua module on the client).

Quiddity subscribed.Mar 24 2021, 5:31 PM

So9q added a subscriber: Salgo60.Mar 24 2021, 6:49 PM

Jsamwrites subscribed.Mar 24 2021, 7:35 PM

Alicia_Fagerving_WMSE subscribed.Apr 15 2021, 9:33 AM

As a Wiktionarian, I would probably want to edit lexicographical data entirely on Wikidata, leaving Wiktionary as the "presentation layer".

So Wikidata would be the "data layer"/backend and Wiktionary the "presentation layer"/frontend.

So, maybe, automatic generation of Wiktionary pages directly from Wikidata? We are going to move our project and are editing on Wikidata right now, and would let bots automatically generate the Wiktionary pages there.

abian subscribed.Oct 5 2021, 4:14 PM

Lucas_Werkmeister_WMDE closed subtask T239633: Enable mw.wikibase.getEntity() to load forms and senses as Resolved.Oct 21 2021, 9:38 AM

DVrandecic mentioned this in T203220: Enable access to Lexemes on fr.wikisource.org.Oct 22 2021, 7:17 PM

DVrandecic mentioned this in T294159: Enable Lexeme access on first set of projects.Oct 22 2021, 10:34 PM

DVrandecic added a subtask: T294159: Enable Lexeme access on first set of projects.

Bodhisattwa added a subtask: T294571: Enable access to Lexemes on bn.wikisource.org.Oct 28 2021, 2:51 PM

Lucas_Werkmeister_WMDE removed a subtask: T294571: Enable access to Lexemes on bn.wikisource.org.Oct 29 2021, 11:45 AM

Lucas_Werkmeister_WMDE removed a subtask: T203220: Enable access to Lexemes on fr.wikisource.org.

Hello all,

We are very happy to announce that this week we will deploy a first beta version of Lua access to Lexemes on Bengali and Basque Wiktionaries!

Full announcement: https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/thread/SHWAELL327NNOJEELAYPBQCDCCNFLLNS/
Related tasks: T294159 , T294637

If other people are interested in trying the feature, please let us know after talking to your fellow Wiktionarists, and we can add you to our list of future deployments. You can also wait until the Lua interface becomes more stable before enabling it on your wiki.

Mike_Peel unsubscribed.May 2 2022, 9:13 PM

In T212843#7566433, @Lea_Lacroix_WMDE wrote:

Hello all,

We are very happy to announce that this week we will deploy a first beta version of Lua access to Lexemes on Bengali and Basque Wiktionaries!

Full announcement: https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/thread/SHWAELL327NNOJEELAYPBQCDCCNFLLNS/
Related tasks: T294159 , T294637

If other people are interested in trying the feature, please let us know after talking to your fellow Wiktionarists, and we can add you to our list of future deployments. You can also wait until the Lua interface becomes more stable before enabling it on your wiki.

Lea, hi! :) Can you please tell me how the testing of the functionality went? :)

Lydia_Pintscher closed subtask T294159: Enable Lexeme access on first set of projects as Resolved.Jul 15 2022, 9:39 AM

Lydia_Pintscher closed subtask T235901: Implement Lua access to Lexemes, Senses and Forms as Resolved.

Addshore unsubscribed.Jun 27 2023, 12:38 PM
