Create basic Mirandese analysis chain
Closed, Resolved (Public)

Description

The goal is to create a basic Mirandese (mwl) analysis chain, with elision processing for d', l', and qu' and a basic stop word list, and to get it running on Mirandese Wikipedia data on RelForge.

See discussion on Mirandese Village Pump for more. @Athena has created a Mirandese stop word list (adapted from a Portuguese one), available on GitHub.

Once it is up on RelForge, if everything looks good, we'll work on getting it deployed to prod and then reindex the Mirandese Wikipedia!

Related Objects

Status: Resolved, Assigned: TJones
Status: Resolved, Assigned: TJones

Event Timeline

We have a working prototype on RelForge! Note that the prototype only includes the index, not the content of the Mirandese Wikipedia, so all links on the search results page are red. It's running in WMF Labs, so it has the unicorn logo instead of the Wikipedia logo.

The elision processing (handling l', d', and qu') improves recall: searching for acupa gives 115 results in prod and 122 in labs. Searching for l'acupa gives 0 results in prod, while in labs it gives the same 122 results.
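To illustrate why l'acupa and acupa end up matching the same documents, here is a minimal Python sketch of elision stripping. It is not the CirrusSearch or Elasticsearch implementation; the function names (strip_elision, analyze), the whitespace tokenizer, and the handling of the curly apostrophe are assumptions made for illustration. Only the three prefixes d', l', and qu' come from this task.

```python
# Elided prefixes named in this task: d', l', qu'
ELISION_PREFIXES = ("d", "l", "qu")
# Assumption: treat both the straight and the curly apostrophe as elision marks.
APOSTROPHES = ("'", "\u2019")


def strip_elision(token: str) -> str:
    """Remove a leading d', l', or qu' from a token, if present."""
    lowered = token.lower()
    for prefix in ELISION_PREFIXES:
        for apostrophe in APOSTROPHES:
            lead = prefix + apostrophe
            # Only strip when something remains after the elided prefix.
            if lowered.startswith(lead) and len(token) > len(lead):
                return token[len(lead):]
    return token


def analyze(text: str) -> list[str]:
    """Toy analysis chain: whitespace tokenize, strip elision, lowercase."""
    return [strip_elision(token).lower() for token in text.split()]


if __name__ == "__main__":
    print(analyze("l'acupa"))
    print(analyze("acupa"))
```

Because both the indexed text and the query go through the same chain, l'acupa and acupa reduce to the same term, which is consistent with both queries returning the same result set in labs.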

The stop words improve recall and change scoring, and thus the ranking of results. In prod, la almanha gives 278 results. In labs, it gives 282 results, and the article "Seclo XX" moves up from 6th on the list to 3rd, presumably because it has more matches to almanha, and la doesn't add to the full text scoring.
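The effect of stop words on scoring can be sketched with a toy term-count scorer. This is only an illustration under stated assumptions: real full text scoring in the search backend is more involved, the names query_terms and score are hypothetical, and the stop word set below contains only la (the one word the comment above discusses), not the actual list from GitHub.

```python
# Assumption: a placeholder stop word set; the real list lives on GitHub.
STOP_WORDS = {"la"}


def query_terms(query: str, stop_words: set[str] = STOP_WORDS) -> list[str]:
    """Lowercase and whitespace-split the query, dropping stop words."""
    return [term for term in query.lower().split() if term not in stop_words]


def score(doc_text: str, query: str, stop_words: set[str] = STOP_WORDS) -> int:
    """Toy score: count occurrences of the non-stop-word query terms."""
    tokens = doc_text.lower().split()
    return sum(tokens.count(term) for term in query_terms(query, stop_words))


if __name__ == "__main__":
    # With "la" as a stop word, only "almanha" contributes to the score.
    print(query_terms("la almanha"))
    print(score("almanha almanha la", "la almanha"))
    # Without stop words, every "la" in the document also adds to the score.
    print(score("almanha almanha la", "la almanha", stop_words=set()))
```

With la filtered out, a document is ranked by how often it mentions almanha rather than by how many times the very common la appears, which matches the kind of reordering described above.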

Change 441253 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Create basic Mirandese analysis chain

https://gerrit.wikimedia.org/r/441253

Change 441253 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Create basic Mirandese analysis chain

https://gerrit.wikimedia.org/r/441253

TJones renamed this task from ascaaaaaaa to Create basic Mirandese analysis chain. (Jul 2 2018, 2:56 PM)
TJones claimed this task.
TJones raised the priority of this task from High to Needs Triage.
TJones updated the task description.
TJones added subscribers: Aklapper, Gerrit.
TJones edited subscribers, added: GerritBot; removed: Gerrit.
TJones edited subscribers, added: gerritbot; removed: GerritBot.

So many things labelled "gerrit" and "gerritbot". Sorry for the extra notifications.

debt subscribed.

Closing this as it rides the train this week. The follow-up ticket is T194941 to re-index the Mirandese Wikipedia site.