
Measure and indicate Lexeme language completeness, and prompt editors with what more might need doing
Open, Needs Triage, Public

Description

Principal user story:

  • As a re-user of Lexeme content in all languages, I want comprehensive coverage in as many languages as possible.

Secondary user stories:

  • As a Lexeme editor, I want to help work on the most needed Lexemes in my language(s).
  • As a community organiser, I want to know in what languages I should advocate for more contributions.

We should provide first-party, prominent encouragement of productive user behaviour by measuring Lexeme language completeness and providing listings of missing or incomplete Lexemes that editors (or tools built for editors) can use to guide their efforts.
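A minimal sketch of the measurement half, assuming the public Wikidata Query Service endpoint: it only counts Lexemes per language entity; a completeness figure would then compare those counts (or the lemmas behind them) against a word list from a corpus.

```python
import requests

# Sketch only: count Lexemes per language entity via the public
# Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?language (COUNT(?lexeme) AS ?lexemes) WHERE {
  ?lexeme a ontolex:LexicalEntry ;
          dct:language ?language .
}
GROUP BY ?language
ORDER BY DESC(?lexemes)
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "lexeme-coverage-sketch/0.1 (example only)"},
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["language"]["value"], row["lexemes"]["value"])
```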

Starting corpora could include:

Event Timeline

Denny has posted this notebook: https://public.paws.wmcloud.org/User:DVrandecic_(WMF)/Lexicographic%20coverage.ipynb which does pretty much the above for each language's Wikipedia corpus. Results are at https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage
However, I don't think he wants to keep running it, so can we move it somewhere it will be run regularly by a bot or something? The coverage/completeness data is helpful, and the 'missing' page for each language is a great guide for editors, if it can be kept reasonably up to date.
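For reference, the core of that kind of coverage calculation could look something like the sketch below; it assumes a tokenised corpus and a set of lemmas already covered by Lexemes, and the notebook itself remains the authoritative version.

```python
from collections import Counter

def coverage(corpus_tokens, known_lemmas):
    """Return (share of corpus tokens matching a known lemma,
    frequency-ranked list of forms with no matching Lexeme)."""
    counts = Counter(token.lower() for token in corpus_tokens)
    total = sum(counts.values())
    covered = sum(n for form, n in counts.items() if form in known_lemmas)
    missing = [(form, n) for form, n in counts.most_common()
               if form not in known_lemmas]
    return (covered / total if total else 0.0), missing

# Tiny illustrative run: two thirds of the tokens are covered.
ratio, missing = coverage("the cat sat on the mat".split(), {"the", "cat", "on"})
print(f"coverage: {ratio:.0%}")   # coverage: 67%
print(missing)                    # [('sat', 1), ('mat', 1)]
```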

That's a one-off report rather than a "first-party, prominent" dynamic message to users on the front page / etc. of Wikidata.org, though.

Hello, I read that you were interested in corpora other than Wikipedia. I think the Swedish Wikipedia is a skewed source, since so many articles were started by bots, and the frequency of odd formulations remains high even after they are manually cleaned up. The Swedish Gigaword Corpus contains one billion words from 1950-2015, analysed with NLP and stored in XML format: https://spraakbanken.gu.se/en/resources/gigaword
A presentation: http://www.ep.liu.se/ecp/126/002/ecp16126002.pdf

The license is CC BY, which is incompatible with Wikidata's CC0 requirement. But, just like with the Leipzig Corpora Collection, it would be possible to extract lists of missing word forms.
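A rough sketch of that extraction, assuming (purely for illustration) token elements named `<w>` whose text is the surface form, and a set of forms already covered by Swedish Lexemes; the actual element names and attributes would need to be checked against the corpus documentation.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def missing_forms(xml_path, known_forms, token_tag="w"):
    """Frequency-ranked surface forms in the corpus that no Lexeme form covers.

    `token_tag` and the use of element text are assumptions about the XML
    schema; adjust them to match the corpus documentation.
    """
    counts = Counter()
    # iterparse streams the file, so a billion-word corpus need not fit in memory.
    for _event, element in ET.iterparse(xml_path, events=("end",)):
        if element.tag == token_tag and element.text:
            counts[element.text.strip().lower()] += 1
        element.clear()  # discard processed elements to keep memory flat
    return [(form, n) for form, n in counts.most_common()
            if form not in known_forms]

# e.g. missing_forms("gigaword-part.xml", swedish_lexeme_forms)[:100]
# (swedish_lexeme_forms is a hypothetical set of forms fetched from Wikidata)
```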