Lexical Data user scenario: Exporting Lexical Data to create morphology and stemming for search engines
Open, Needs Triage, Public

Description

The English language has very simple morphology, and this makes it relatively easy to build search engines that can find different forms of a word with no effort from the end user.

Many other languages have complex morphology, with declensions, conjugations, clitics, agglutination, etc. Some search engines can plug in morphology and stemming support for particular languages, but support for each language must be developed and maintained independently.

Once Wikidata's Lexical Data is able to produce all declined forms of a word, the output can be used in both directions to build stemming engines: to find the base form (or forms) of a word from a declined form, and to find the declined forms from a base form. Wikibase should provide APIs that make such usage as easy as possible.
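As a rough illustration of the two lookup directions, here is a minimal Python sketch. It assumes the exported data has already been reduced to simple records with a `lemma` and a list of `forms`; that record shape, the `build_indexes` helper, and the hand-written English sample are all hypothetical and do not reflect any actual Wikibase API or export format.

```python
from collections import defaultdict

# Hand-written sample records (hypothetical shape). In practice these
# would be derived from a Lexical Data export, not typed in by hand.
SAMPLE_LEXEMES = [
    {"lemma": "see", "forms": ["see", "sees", "saw", "seen", "seeing"]},
    {"lemma": "saw", "forms": ["saw", "saws"]},
]


def build_indexes(lexemes):
    """Build both lookup directions from a list of lexeme records."""
    lemma_to_forms = defaultdict(set)   # base form -> all known forms
    form_to_lemmas = defaultdict(set)   # any form  -> possible base forms
    for lexeme in lexemes:
        lemma = lexeme["lemma"]
        for form in lexeme["forms"]:
            lemma_to_forms[lemma].add(form)
            form_to_lemmas[form].add(lemma)
    return lemma_to_forms, form_to_lemmas


lemma_to_forms, form_to_lemmas = build_indexes(SAMPLE_LEXEMES)

# A single surface form can map to more than one base form.
print(sorted(form_to_lemmas["saw"]))   # ['saw', 'see']
# The reverse direction expands a base form for query-time matching.
print(sorted(lemma_to_forms["see"]))   # ['saw', 'see', 'seeing', 'seen', 'sees']
```

A plain lookup table like this only covers forms that are present in the data; it does nothing for unseen words.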

Notes:

  • Like other subtasks of T186421, this is not a particular bug, but an idea for how Lexical Data can be useful in the long term. I am filing it in the hope that knowing the possible user scenarios will be useful to Wikibase developers when they are making decisions about developing the infrastructure, and to Wikidata community members when they are proposing properties, developing bots, and so on.
  • This is comparable to T186429 and T186420, but for search engines.
  • I'm subscribing @TJones and @Smalyshev, who know far more about stemming engines than I do.

Event Timeline

Amire80 created this task. May 24 2018, 8:24 AM

Thanks for this ticket, @Amire80! I've been thinking about building a tool to scrape lexical info from Wiktionary in order to feed it into a statistical model to build stemmers for languages that don't have them. Getting it from Wikibase would probably be a lot easier. I'm not sure whether an API would be able to keep up with the rate required for stemming while indexing wikis, but it would still be an awesome tool overall for other stemming applications that have lower throughput requirements. (An induced stemmer model, whether statistical or rule-based, would also be able to handle novel forms.)

Amire80 updated the task description. May 24 2018, 1:11 PM

Thanks for this ticket, @Amire80!

Thank you for the comment :)

I'm not sure whether an API would be able to keep up with the rate required for stemming while indexing wikis, but it would still be an awesome tool overall for other stemming applications that have lower throughput requirements. (An induced stemmer model, whether statistical or rule-based, would also be able to handle novel forms.)

Well, that's exactly why I create these tickets: so that the relevant people start thinking about future uses as early as possible. For example, we could imagine a stemming engine that has an interface to Lexical Wikidata and is continuously updated from it, and the developers could build the technical infrastructure to support the needed performance.

Vvjjkkii renamed this task from Lexical Data user scenario: Exporting Lexical Data to create morphology and stemming for search engines to 8dcaaaaaaa. Jul 1 2018, 1:08 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
TJones renamed this task from 8dcaaaaaaa to Lexical Data user scenario: Exporting Lexical Data to create morphology and stemming for search engines. Jul 2 2018, 3:00 PM
TJones raised the priority of this task from High to Needs Triage.
TJones updated the task description.
TJones added a subscriber: Aklapper.
Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. Jan 4 2019, 10:29 AM