Choose pilot languages for Abstract Wikipedia and improvements for Wikidata Lexicographic extensions
Closed, Resolved (Public)


What is the problem?

For Abstract Wikipedia and the improvements to Wikidata's Lexicographical data, we want to focus on a small number (3-4?) of pilot languages. These languages need to be chosen together with the communities, WMDE, and the Abstract Wikipedia team.

How can we help you?

Coordinate and facilitate the community decision.

What does success look like?

The pilot languages are chosen in a transparent manner.

What is your deadline?

(Haven't clarified this with Lydia yet; she said "in a few weeks", so let's say end of March, if that works for everyone?)

(Other tasks in the Abstract Wikipedia timeline may be moved to give Nick time for this)

Ongoing at

Event Timeline

Do let us know which tasks you want to de-prioritize!

Some thoughts I had upon learning of this task (which may or may not be agreed with, but so it may go):

  • The four languages should satisfy the following:
    • One should have a well-developed Wikipedia and a sufficiently active userbase, so that we can more readily compare the quality of the generated output against a wide variety of base cases and get plenty of feedback from that language community on the development of that language's implementations.
    • One should have a less well-developed Wikipedia but still a sufficiently active userbase, so that, starting from the sorts of text examples available in that Wikipedia, we can gather enough feedback from that language community on that language's implementations to generate more "complex passages" than may otherwise be present on that wiki (where "complex" depends on the current state of that wiki's typical pages).
    • One should either have a Wikipedia much less developed than that of the previous language (such as one recently created) or have an active Incubator project, so that we can demonstrate the ability of this system to bootstrap new Wikimedia projects with preliminary text passages that other editors can later expand manually if desired.
    • One should not have a Wikipedia (and should have either an inactive Incubator project or none at all), so that we can demonstrate the ability of this system to represent knowledge anew in a language for which no encyclopedic effort has ever been sufficiently undertaken.
  • The languages should all be rather disparate, in order to address as wide a range as possible of the issues that may crop up in other languages. Just being geographically or genealogically separate may not be enough here.
  • At least one of these languages should not be written in the Latin script, not just to put some attention into script conversion tools (à la T32759 and friends) but also to avoid "cheating" (the temptation to simply draw in English terms wholesale to save time and effort on the implementation). If two of them do not use the Latin script, then the two types of writing systems should be different enough (e.g. not two alphabets or two abugidas).
  • At least one of these languages should have sufficient dialect variation, reflected either in the lexicographical data for that language or in that language's implementation, so that it becomes conceivable to control the passages produced based on the dialect to be expressed.

Some sets of languages that in my view satisfy these include the following:

  • German, Bengali, Dagbani, Miami-Illinois
  • Swedish, Malayalam, Moroccan Arabic, Osage
  • French, Marathi, Southern Altai, Yucatec Maya
  • Danish, Swahili, Manipuri, Ainu
  • Hebrew, Quechua, Manchu, Southern Ndebele
  • Basque, Kurmanji, Atayal, Cherokee

(I may come back and revise these proposed criteria, at which point these possible sets may also change.)

Do let us know which tasks you want to de-prioritize!

Yes, we deprioritized T271242 in order to free up resources for this task.

[update] Currently ongoing at

Over the last month, text requests/outreach went out to various communication/news channels, including: abstract-wikipedia@ (and the Abstract Wikipedia newsletter), wikidata@, langcom@, languages@, wiktionary-l@, african-Wikimedians@, wikimediaindia-l@, [[d:Wikidata:Project chat]], [[d:Wikidata talk:Lexicographical data]], [[m:Wikimedia Forum]]; Telegram channels: Wikidata & Lexicographical data, and Wikifunctions; Wikidata social media accounts; and Abstract Wikipedia social media accounts.

In addition, 2 virtual meetings are happening within the 30 lexic-o-days schedule (one earlier this week, and one early next week).

DVrandecic raised the priority of this task from Medium to High. Apr 7 2021, 4:44 AM

The languages chosen are Bengali, Malayalam, Hausa, Igbo, and Dagbani as a stretch language.

Results here: