[Epic] querying for lexicographical data
Closed, Resolved · Public

Description

We want to be able to query lexicographical data on query.wikidata.org to find out all the things.

Open questions:

  • Do we add Lexemes to the main Wikidata dump or make a separate dump (or both)?
    • We may need a separate dump, at least for the initial data load.

Open TODOs (that are not in subtasks)

Event Timeline

Restricted Application added a subscriber: Aklapper.

I think lex data dumps should be available independently of the other Wikidata data. For example, https://sklonenie-slov.ru/ shows all Russian noun declensions (30,000+), and I think such sites can greatly benefit from the community work.
P.S. I have begun a discussion with the site authors, trying to get them to donate their database to Wikidata.

@Yurik Yes, we're going in that direction, also because having items+Lexemes in one dump would be waaay too big :)

Thanks for your work! If you need any support in your discussion with this organization, feel free to contact my colleague @johl (jens.ohlig@wikimedia.de), who's an expert in partnerships and data donations.

Right now the full lexeme dump is just 2.1M compressed, so adding it to the main dump would not be a big deal for dump size. However, without a separate dump, you'd always have to download the huge one, which is why I still support the separate dump route.