
Measure and indicate Lexeme language completeness, and prompt editors with what more might need doing
Open, Needs Triage, Public

Description

Principal user story:

  • As a re-user of Lexeme content in all languages, I want comprehensive coverage in as many languages as possible.

Secondary user stories:

  • As a Lexeme editor, I want to help work on the most needed Lexemes in my language(s).
  • As a community organiser, I want to know in what languages I should advocate for more contributions.

We should provide first-party, prominent encouragement of productive user behaviour by measuring Lexeme language completeness and providing listings of missing or incomplete Lexemes that editors (or tools built for editors) can use to guide their efforts.
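A minimal sketch of the measurement half, assuming the public Wikidata Query Service endpoint: it only counts Lexemes per language entity; a completeness figure would then compare those counts (or the lemmas behind them) against a word list from a corpus.

```python
import requests

# Sketch only: count Lexemes per language entity via the public
# Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?language (COUNT(?lexeme) AS ?lexemes) WHERE {
  ?lexeme a ontolex:LexicalEntry ;
          dct:language ?language .
}
GROUP BY ?language
ORDER BY DESC(?lexemes)
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "lexeme-coverage-sketch/0.1 (example only)"},
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["language"]["value"], row["lexemes"]["value"])
```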

Starting corpora could include:

Event Timeline

Denny has posted this notebook: https://public.paws.wmcloud.org/User:DVrandecic_(WMF)/Lexicographic%20coverage.ipynb which does pretty much the above for each language's Wikipedia corpus. Results are at https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage
However, I don't think he wants to keep running it, so can we move it somewhere it will be run regularly by a bot or something? The coverage/completeness data is helpful, and the 'missing' page for each language is a great guide for editors, if it can be kept reasonably up to date.
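For reference, the core of that kind of coverage calculation could look something like the sketch below; it assumes a tokenised corpus and a set of lemmas already covered by Lexemes, and the notebook itself remains the authoritative version.

```python
from collections import Counter

def coverage(corpus_tokens, known_lemmas):
    """Return (share of corpus tokens matching a known lemma,
    frequency-ranked list of forms with no matching Lexeme)."""
    counts = Counter(token.lower() for token in corpus_tokens)
    total = sum(counts.values())
    covered = sum(n for form, n in counts.items() if form in known_lemmas)
    missing = [(form, n) for form, n in counts.most_common()
               if form not in known_lemmas]
    return (covered / total if total else 0.0), missing

# Tiny illustrative run: two thirds of the tokens are covered.
ratio, missing = coverage("the cat sat on the mat".split(), {"the", "cat", "on"})
print(f"coverage: {ratio:.0%}")   # coverage: 67%
print(missing)                    # [('sat', 1), ('mat', 1)]
```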

That's a one-off report rather than a "first-party, prominent" dynamic message to users on the front page / etc. of Wikidata.org, though.

Hello, I read that you were interested in corpora other than Wikipedia. I think the Swedish Wikipedia is a skewed source, since so many articles were started by bots, and the frequency of odd formulations remains high even after they are manually cleaned up. The Swedish Gigaword Corpus contains one billion words from 1950-2015, analysed with NLP and stored in XML format: https://spraakbanken.gu.se/en/resources/gigaword
A presentation: http://www.ep.liu.se/ecp/126/002/ecp16126002.pdf

The license is CC BY, which is incompatible with Wikidata's CC0 requirement. But, just like with the Leipzig Corpora Collection, it would be possible to extract lists of missing word forms.
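A rough sketch of that extraction, assuming (purely for illustration) token elements named `<w>` whose text is the surface form, and a set of forms already covered by Swedish Lexemes; the actual element names and attributes would need to be checked against the corpus documentation.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def missing_forms(xml_path, known_forms, token_tag="w"):
    """Frequency-ranked surface forms in the corpus that no Lexeme form covers.

    `token_tag` and the use of element text are assumptions about the XML
    schema; adjust them to match the corpus documentation.
    """
    counts = Counter()
    # iterparse streams the file, so a billion-word corpus need not fit in memory.
    for _event, element in ET.iterparse(xml_path, events=("end",)):
        if element.tag == token_tag and element.text:
            counts[element.text.strip().lower()] += 1
        element.clear()  # discard processed elements to keep memory flat
    return [(form, n) for form, n in counts.most_common()
            if form not in known_forms]

# e.g. missing_forms("gigaword-part.xml", swedish_lexeme_forms)[:100]
# (swedish_lexeme_forms is a hypothetical set of forms fetched from Wikidata)
```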