
Provide a Wikibase instance where we can import Wiktionary materials and that can be queried from Wiktionaries
Open, Needs TriagePublic


Although Wikidata already provides a way to carry some lexicological data through the Lexeme extension, we lack a way to import existing definitional material from Wiktionaries into a relational database that can be queried from all Wikimedia projects.

NOTE: Wiktionaries are CC-BY-SA licensed, so importing their definitions into Wikidata is not permitted by the license. This ticket is about a distinct Wikibase instance to store definitions taken from Wiktionaries; it is not a place to discuss Wikidata license policy. Thank you in advance for staying focused.

The requested Wikibase instance aims to provide the following features:

  • import definitions from Wiktionaries
  • link each definition to samples of use (text quotes of course, but also links to media on Commons such as audio recordings or videos of signed languages; media should be associated with a textual transcription)
  • translate definitions (and transcriptions where relevant)
  • add statements to each definition (for example the domain/context where the definition applies, register, and so on)
  • edit items through an API that eases contributions with external tools such as a Toolforge service (this would probably be a huge plus for attracting new contributors, as editing Wiktionaries is currently difficult for newcomers because of the many implicit structures)
  • let Wiktionaries (and other sister projects) query the instance directly

The instance would NOT aim to:

  • incorporate grammatical paradigms (each definition might be linked to one or several lexical items, themselves linked to zero or several examples of use, but inflections should not be part of the model)
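
To make the requested scope more concrete, here is a minimal sketch of the kind of record the features above describe. It is not a Wikibase schema: every class and field name below is hypothetical and only mirrors the bullet points (definitions, lexical items, samples of use with transcriptions, translations, statements). Inflections are deliberately absent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class UsageExample:
    """A sample of use: a text quote, or a Commons media file with its transcription."""
    transcription: str                  # the quote itself, or the transcription of the media
    commons_file: Optional[str] = None  # hypothetical: file name on Commons, if any


@dataclass
class Definition:
    """One definition imported from a Wiktionary (all field names are illustrative)."""
    source_wiki: str  # e.g. "fr.wiktionary"
    language: str     # language the gloss is written in
    gloss: str        # the definition text itself
    # Lexical items this definition applies to (no inflection tables).
    lexical_items: List[str] = field(default_factory=list)
    examples: List[UsageExample] = field(default_factory=list)
    # Language code -> translated gloss.
    translations: Dict[str, str] = field(default_factory=dict)
    # Free-form statements, e.g. {"domain": ["botany"], "register": ["informal"]}.
    statements: Dict[str, List[str]] = field(default_factory=dict)
```

In an actual Wikibase instance these fields would presumably map to items, properties and statements; the sketch only shows which pieces of information need a home.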

Event Timeline

So, to answer a few questions I received as feedback on this ticket:

  • The reasons for this ticket:
    • enable better sharing and coordination between language versions, including
      • translations of definitions
      • sharing the pool of samples showing the use of words linked to a specific definition
      • sharing etymologies written in prose (as opposed to explicit etymological relational trees)
    • bring more flexibility to the reuse of definitions inside Wikimedia wikis, allowing the same definition to be shown in several places or in different ways.
      • An example would be glossaries that gather on a single page the terms and definitions of a topic, reusing the same definitions that are on each term page, restricted to those pertaining to the topic.
      • Another possible use would be to have definitions not only on the lemma article but also on each inflected form, as long as the definition holds, while only displaying examples of use that pertain to the queried form.
      • These are only examples; possibilities of use are open, and these specific possibilities are only illustrative. In any case, communities decide what they want to use and how.
    • ease the use by external projects (through a query service, a unified structured downloadable dump, etc.)
    • we can't import Wiktionary definitions into Wikidata due to license incompatibility, but this won't be a problem with a separate instance using an appropriate license, so we can benefit from both the power and flexibility of Wikibase and the already large knowledge base of Wiktionaries
  • Focus of efforts and resources
    • with such a project, existing communities wouldn't have to change anything if they don't want to, but would have the possibility to do so
    • it would make it easier to share data across language versions
    • it would allow nicer interfaces to be developed on top of such a Wikibase instance, including within the Wiktionaries, but also in Toolforge for example, while keeping the whole resulting database accessible from everywhere in the Wikimedia infrastructure through query facilities
      • that means the possibility of creating easier-to-use interfaces and lowering the barrier to start contributing
    • the ticket is about a single Wikibase instance, dedicated to definition materials
      • it's not about one instance for each existing Wiktionary
    • the same result cannot be reached through downstream structuring like Dbnary and GLAWI, but both could possibly be used to populate the Wikibase instance

Thank you to those who already pointed out gaps in my initial request; I hope this helps to clarify a bit.

Some use cases inspiring the request

As an example of the limits of our current infrastructure, even within a single Wiktionary, think about creating glossaries from existing definitions. Another example: unlike traditional paper dictionaries, Wiktionaries could give the definitions of a word in every article, rather than pointing to the lemma where definitions are stored. Being able to reuse the same definition in several places would also be useful in such a scenario.
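
As a rough illustration of the glossary use case, assuming definitions were available as structured records carrying a domain statement, building a topic glossary becomes a simple filter. The record layout (`term`, `gloss`, `domains`) and the sample entries are made up for this sketch; they are not taken from any Wiktionary.

```python
from typing import Dict, List, Tuple


def build_glossary(definitions: List[Dict], topic: str) -> List[Tuple[str, str]]:
    """Return (term, gloss) pairs whose 'domains' include `topic`, sorted by term."""
    entries = [
        (d["term"], d["gloss"])
        for d in definitions
        if topic in d.get("domains", [])
    ]
    return sorted(entries)


# Made-up sample records, only to show the shape of the data.
SAMPLE_DEFINITIONS = [
    {"term": "stamen", "gloss": "placeholder gloss for stamen", "domains": ["botany"]},
    {"term": "kernel", "gloss": "placeholder gloss for kernel", "domains": ["computing"]},
    {"term": "petal", "gloss": "placeholder gloss for petal", "domains": ["botany"]},
]

if __name__ == "__main__":
    for term, gloss in build_glossary(SAMPLE_DEFINITIONS, "botany"):
        print(f"{term}: {gloss}")
```

The same records could feed both the term pages and the glossary page, which is the reuse the paragraph above has in mind.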

Current (im)possibilities of implementing that in the current infrastructure

Maybe Extension:TextExtracts might help somewhat here, but at least from the documentation it doesn't appear to be usable through wikicode calls. A more invasive solution would be to use Extension:Labeled Section Transclusion, which requires explicitly marking each element that should be extractable.

One way to do it with currently available services would be to use external bots that browse lexical categories, fetch each matching article, grab matching definitions (if tagged in the article) and generate distinct pages, for example in a dedicated namespace.
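
The "grab matching definitions" step of such a bot could look roughly like the sketch below. It assumes the layout used on several Wiktionaries, where definitions are numbered list lines starting with `#`, while `#:` and `#*` lines carry examples and quotations; other editions may mark definitions differently. Templates and wiki markup are left unexpanded, and fetching pages (for instance through the MediaWiki API's category member listing) is left out. The function name is hypothetical.

```python
import re
from typing import List

# A definition line starts with one or more '#', not followed by '#', ':' or '*'
# ('#:' and '#*' introduce examples and quotations), then the definition text.
_DEFINITION_LINE = re.compile(r"^#+(?![#:*])\s*(.+)$")


def extract_definitions(wikitext: str) -> List[str]:
    """Return the raw text of each definition line found in `wikitext`."""
    definitions = []
    for line in wikitext.splitlines():
        match = _DEFINITION_LINE.match(line.strip())
        if match:
            definitions.append(match.group(1).strip())
    return definitions


# Made-up sample article body, only to exercise the function.
SAMPLE_WIKITEXT = """==English==
===Noun===
# First sense.
#: ''An example sentence.''
#* A quotation line.
## A subsense.
# Second sense.
"""

if __name__ == "__main__":
    for definition in extract_definitions(SAMPLE_WIKITEXT):
        print(definition)
```

A real bot would also need to track which language section and part of speech each definition belongs to, which this sketch ignores.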

NOTE: As Lua modules can't fetch the list of elements in a category, it's not possible to do the previous transformation through modules alone, and it would probably be too resource-consuming anyway.

It would also be possible to parse all articles, as previously mentioned, and put their definitions into data modules. That would at least make this data easy to provide for in-wiki consumption in various cases, potentially avoiding data duplication within a single instance, as well as for external queries, since the module could generate various output formats.

That said, there is no guarantee that the current community would want to use these data modules: they might propose to delete the transformed material, just ignore it and recommend not using it within the main space, or just as well embrace it and migrate massively to such an approach. Also, such an approach shouldn't go without a rethought UX that enables editing data modules without editing the LSON, with the aim of gaining the support of both current communities and newcomers.

But even if we had all that, it wouldn't allow data sharing across language versions, as we currently have no way to share modules across instances.


The question of community endorsement is the same whether the data is stored in a data module or in an external Wikibase instance. Sharing data across wikis can't currently be solved with data modules, and Wikidata comes to mind of course. However, as this is about sharing existing Wiktionary definitions, which are covered by CC-BY-SA-3.0-unported, Wikidata, which accepts only CC0-compatible material, can't host this data.

This hopefully lays out the reasons for this request and makes clear how it supports the aims of our movement.

Lydia_Pintscher subscribed.

I'm sorry, @Psychoslave, but we are not going to set up a second Wikibase instance to basically do what can be done with the lexicographical data support available on Wikidata. It'd be a waste of valuable resources.

Hi @Lydia_Pintscher, I share your concern about not wasting valuable resources.

I would be happy to know how we could achieve the stated goals with the lexicographical data support available on Wikidata, including the very first point: importing definitions from Wiktionaries.

I hope we can all agree that Wiktionary communities and the work they have achieved so far, including the set of definitions they collectively created, are also valuable resources that we shouldn't waste.

I would like us to leverage these resources to make the best of the synergy between Wikibase technologies and the Wiktionary communities and their work. If this is not through the launch of a distinct Wikibase instance, where we could better coordinate existing valuable resources, any other implementable proposal is welcome, of course.

I am re-opening the task: from my understanding of the documentation, this is a legitimate action for me as the task creator, all the more so when questions are pending.

Please let me know if I misinterpreted the documentation and this is perceived as impolite behaviour; that is not my intention. I'm just not familiar enough with local habits and customs regarding status changes, and with what a closed status might or might not change in terms of the ticket's visibility, and so on. Pointers to documentation I might have missed are welcome.

I support the idea of experimenting with a separate Wikibase instance for Wiktionary texts.

Something with all Wiktionary content, including full-text material such as definitions, examples and etymologies, as finely written as they are in Wiktionaries.

Some people outside of the wikiverse want to reuse definitions and etymologies in prose (like websites that compile dictionaries), and there will be none in Wikidata (only glosses, paradigms and semantic networks). A separate instance would resolve this issue and help show the power of Wikibase to shape data and ease querying.

It could be a support for Wiktionary and Wiktionarians, whereas Wikidata is not yet a support for Wiktionaries but rather a separate project with other purposes for other customers. The kind of support Psychoslave mentioned in the first message is not planned by the Wikidata product team, and both instances could be complementary. I know there is no Wiktionary product manager, and I am sad about that, but if one existed, they would be the person to whom this proposal should be directed.

It could be a reshaping/conversion of Wiktionary dumps into a Wikibase format in read-only mode, only allowing querying and exploring the content. This could be amazing and very different from Wikidata LexData. It could greatly help the diffusion of Wiktionary content.

Finally, I know for sure that Wiktionary data can be integrated into databases easily, because scripts already exist, like Anagrimes (by @Darkdadaah, for French), GLAWI (for French) and Dbnary (for twelve languages). So a separate instance under CC BY-SA could be filled without much investment of time, and there will be volunteers to do so, as the scope and goal are clearly defined.

I am convinced WikidataLex (or whatever name this instance would have) is a robust project with a clear scope.

(As my language is French, I summarized my opinion in French in the French Wiktionary Beer Parlour.)

NOTE: Mapping Wikidata to other ontologies might become relevant once we have the instance running and populated. We are still far from that; this is just to keep a trace of it as possible further work for integration with the rest of the Wikimedia infrastructure.