Enable missing words collection for Wikimedia's Apertium server
Open, Medium, Public

Description

Apertium identifies words that it cannot translate and is able to log them. We should consider collecting this information and sending it to the Apertium developers.

Steps:

  1. Package python-toro (See: T101947)
  2. Determine the location of missingFreqs.db and how it will be accessed (it is an SQLite database).
  3. Puppet config.
  4. Deployment in Beta and Production.
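
As a rough illustration of what step 2 could enable, here is a minimal sketch of pulling the most frequent missing words out of the SQLite file so they can be reported to Apertium. The table and column names (`missingFreqs`, `pair`, `token`, `frequency`) are assumptions made for this example and are not confirmed by this task; inspect the real file (for example with `.schema` in the `sqlite3` shell) before relying on them.

```python
import sqlite3


def top_missing_words(conn, limit=100):
    """Return (pair, token, frequency) rows, most frequent first.

    ASSUMPTION: the database has a table
    missingFreqs(pair TEXT, token TEXT, frequency INTEGER).
    The real schema may differ.
    """
    cursor = conn.execute(
        "SELECT pair, token, frequency FROM missingFreqs "
        "ORDER BY frequency DESC, token ASC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


def format_report(rows):
    """Render rows as tab-separated lines, small enough to send by email."""
    return "\n".join(f"{pair}\t{token}\t{freq}" for pair, token, freq in rows)
```

A periodic job could open the database read-only, call `top_missing_words(conn, 500)`, and send the output of `format_report` instead of the whole file.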

From #apertium on IRC:
aharoni 2. How can Wikipedia help Apertium improve this? Can we report the most frequent missing words, for example?
TinoDidriksen Unhammer and jacobEo, the currently online maintainers of dan-nor; what say you?
aharoni I've been thinking how to report untranslated words from Wikipedia back to Apertium
TinoDidriksen Well, APY keeps a database of untranslated words, with frequency afaik.
aharoni Where is it collected?
TinoDidriksen Some SQLite db on the APY host.
aharoni [ hi kart_ ]
aharoni If we have our own package installed, do we already collect it?
aharoni kart_ handles all the packaging for us, I don't know the technical details.
TinoDidriksen Don't know what version you have packaged, or whether it has that part enabled.
aharoni kart_: do you know?
TinoDidriksen File is called missingFreqs.db in the APY folder.
aharoni OK, let's say that we do have it.
aharoni If we periodically send it to Apertium, will it be useful?
aharoni Will somebody bother to add the translations?
kart_ TinoDidriksen: you mean -apy?
kart_ TinoDidriksen: I think I need to update package then.
kart_ aharoni: ^^
kart_ aharoni: can I have task in Phab? :D
aharoni kart_: ack
aharoni TinoDidriksen: you know, you could just run Apertium over a dump of all Wikipedia articles and collect the most frequent untranslated words :)
kart_ aharoni: how to access is another subject, as we do run it on production service.
aharoni If you haven't already :)
aharoni kart_: How about just copying it once a month and emailing it to an Apertium contact :)
TinoDidriksen Whether anyone will care to look at the most frequently missing words is a whole other story. I guess it's a good incentive because there is a direct feedback loop.
aharoni It shouldn't be too big for email.
TinoDidriksen Ours is 130MB currently.
aharoni TinoDidriksen: If there is somebody who will care and add the translations, I'd gladly provide it.
TinoDidriksen Nobody is even looking at our own, currently...but it also hasn't been advertised to the mailing list. We should do that.
kart_ TinoDidriksen: Please do.
kart_ Even I came to know only today; we should have sent feedback earlier.
kart_ TinoDidriksen: can the location of the db be configured?
TinoDidriksen I don't maintain any of the Python code. Unhammer and sushain handle APY. But I can only assume the answer is yes, 'cause that sounds trivial.
TinoDidriksen Oh, it already is, with -f
TinoDidriksen There was also a cmdline flag to make it keep an in-memory buffer, so that it doesn't hog I/O with SQLite commits: -M1000
Unhammer hi
Unhammer yeah, probably good idea to use -M1000 (or some number like that)
Unhammer and yeah I'd like seeing the wp missingfreqs, it's probably more directly useful
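
The in-memory buffer mentioned in the log (-M1000) can be pictured roughly as follows. This is only a sketch of the general idea, counting unknown words in memory and writing them to SQLite in one transaction once a limit is reached; it is not APY's actual code. The class name, the table layout, and the choice to count distinct (pair, token) entries against the limit are all assumptions made for illustration.

```python
import sqlite3
from collections import Counter


class BufferedMissingWords:
    """Hypothetical helper: batch missing-word counts before writing to SQLite."""

    def __init__(self, conn, limit=1000):
        self.conn = conn
        self.limit = limit          # flush once this many distinct entries are pending
        self.pending = Counter()    # (pair, token) -> count not yet written
        # ASSUMPTION: table layout chosen for this example only.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS missingFreqs "
            "(pair TEXT, token TEXT, frequency INTEGER, UNIQUE(pair, token))"
        )
        conn.commit()

    def note(self, pair, token):
        """Record one untranslated token; touches the disk only on flush."""
        self.pending[(pair, token)] += 1
        if len(self.pending) >= self.limit:
            self.flush()

    def flush(self):
        """Write all pending counts in a single transaction (one commit)."""
        if not self.pending:
            return
        with self.conn:  # commits on success, rolls back on error
            for (pair, token), count in self.pending.items():
                self.conn.execute(
                    "INSERT OR IGNORE INTO missingFreqs VALUES (?, ?, 0)",
                    (pair, token),
                )
                self.conn.execute(
                    "UPDATE missingFreqs SET frequency = frequency + ? "
                    "WHERE pair = ? AND token = ?",
                    (count, pair, token),
                )
        self.pending.clear()
```

The trade-off is that counts still in memory are lost if the process dies before a flush, so a real service would presumably also flush on shutdown.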

Event Timeline

Amire80 raised the priority of this task to Needs Triage.
Amire80 updated the task description.
Amire80 added subscribers: Amire80, KartikMistry.
Amire80 triaged this task as Medium priority. (Mar 5 2015, 8:36 AM)
Amire80 set Security to None.
KartikMistry renamed this task from "enable missing words collection for Wikimedia's Apertium server" to "Enable missing words collection for Wikimedia's Apertium server". (Apr 14 2015, 11:30 AM)
KartikMistry updated the task description.
KartikMistry added a subscriber: santhosh.

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)