InTense: Amass suitable projects from available datasets
Open, Lowest, Public

Description

We need to add more projects to http://intense.wmflabs.org/
An initial goal that would close this task could be adding 100 projects. That should let us list the blockers to reaching 1,000, then 10,000, and so on.

For simplicity, we must focus on projects which:

  1. use supported formats;
  2. follow ISO 639 codes in a way easily mappable to ours.

We will start looking where it's easiest, i.e. sites with many repositories and a dataset we can easily query to find suitable projects.

Currently excluded:

On their own:

  • openhub.net says they index 667k projects, but they probably have no interest in offering downloads; it finds 34k pot files
  • ddtp.debian.net is stuck in ~2001: they call Pootle a «new internationalisation framework» and have no structured l10n format

Old datasets (2005–2010) are often mentioned in research.

Event Timeline

Nemo_bis raised the priority of this task from to Medium.
Nemo_bis updated the task description.
Nemo_bis added a project: translatewiki.net.
Nemo_bis updated the task description.
Nemo_bis set Security to None.
Nemo_bis added a subscriber: Nemo_bis.

Reina et al. 2013 made me take notice of Damned Lies. Probably thanks to it, https://git.gnome.org/browse/ seems to be in rather tidy shape, though not 100% consistent. I'm currently cloning 579 repos on nemobis@ttmserver-mediawiki01:~/gnome; I'll then identify all "po" directories and pot files to start with, and mass-add them to InTense.

When this is done I'll publish the "sneak preview" blog post, so it will be the last call for review of it. :)

I have to figure out how to get or produce the pot file for GNOME projects. In the meantime I cloned all GNU git repos and added some.

In short I more or less did:

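# Clone every repository named in gnu-repos (one name per line)
for repo in $(cat gnu-repos); do git clone "git://git.savannah.gnu.org/$repo"; done
# Remove a few repositories
rm -rf gettext/ gcl/ bash/ www-ja/ ocitysmap/ childsplay/ www-fr/
# Remove every directory named vendor
find . -type d -name vendor -exec rm -rf {} +
find . -type f -name '*.pot' > gnu-pot.txt
# Language codes are taken from the basenames of the .po files
find . -type f -name '*.po' | sed --regexp-extended 's/.+\/([^./]+)\.po$/\1/' | sort -u > languages
# manual cleanup of the languages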
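
Part of the manual cleanup of the languages file could probably be automated. The following is only a sketch, not what was actually run: filter_language_codes is a hypothetical helper, and the accepted shape (a 2-3 letter lowercase code, optional _REGION, optional @modifier) is an assumption about what counts as easily mappable to our codes.

```shell
# Hypothetical helper: read candidate names on stdin and keep only those
# shaped like gettext locale names, e.g. de, pt_BR, sr@latin.
# The accepted shape is an assumption; adjust it to the real code mapping.
filter_language_codes() {
    LC_ALL=C grep -E '^[a-z]{2,3}(_[A-Z]{2})?(@[A-Za-z0-9]+)?$' | LC_ALL=C sort -u
}
```

Usage would be something like filter_language_codes < languages > languages.ok, leaving everything else for a manual look.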

So I made the template P236 and, by replacing the pattern ^./([^/]+)/(.*po)/([^/]+).pot$, produced P237. With pagefromfile.py this should now add 36 more groups:

python pwb.py scripts/pagefromfile.py -start:START -end:END -file:gnu-pot.txt -family:intense -lang:en -notitle
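
To illustrate what the pattern captures, here is a minimal sketch that splits each matching path into its three groups. pot_path_fields is a hypothetical name, the tab-separated output is an assumption made for illustration (the actual replacement text is the P236 template, not reproduced here), and the dots are escaped, which the pattern as quoted above does not do.

```shell
# Hypothetical helper: read pot file paths on stdin and print, for each path
# matching the pattern, the three captured groups separated by tabs:
# top-level directory, path ending in "po", and pot file basename.
# Paths that do not match produce no output.
pot_path_fields() {
    tab=$(printf '\t')
    sed -E -n "s#^\./([^/]+)/(.*po)/([^/]+)\.pot\$#\1${tab}\2${tab}\3#p"
}
```

Usage would be pot_path_fields < gnu-pot.txt. A path whose template directory does not end in "po", such as ./scleaner/src/scleaner.pot, produces no output line.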

To do after those:

./maposmatic/www/locale/django.pot
./scleaner/src/scleaner.pot
./gibbon/help/gibbon.pot
./freedink-data/dink/l10n/dink.pot
./bibledit-web/web/pot/bibledit.pot

I have to figure out how to get or produce the pot file for GNOME projects.

Running intltool-update --pot inside the /po subfolder is one option. It will create the file modulename.pot.

Yep, that's what I did in the linked case, but the pot file wasn't recognised by Translate or something.
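
For reference, the intltool-update approach could be scripted over all the cloned repositories roughly as follows. This is a sketch under assumptions: find_po_dirs and generate_pots are hypothetical names, every directory named "po" is assumed to be an intltool-managed translation directory, and it does nothing about the resulting pot file not being recognised.

```shell
# List every directory named "po" below the given root (default: current directory).
find_po_dirs() {
    find "${1:-.}" -type d -name po | LC_ALL=C sort
}

# Run intltool-update --pot inside each po directory.
# A failure in one module is reported on stderr and does not stop the loop.
generate_pots() {
    find_po_dirs "${1:-.}" | while IFS= read -r dir; do
        ( cd "$dir" && intltool-update --pot ) || echo "failed: $dir" >&2
    done
}
```

Running generate_pots ~/gnome would then attempt to create modulename.pot in each po directory found.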

It is time to promote Wikimedia-Hackathon-2015 activities in the program (training sessions and meetings) and main wiki page (hacking projects and other ongoing activities). Follow the instructions, please. If you have questions about this message, ask here.

Do you think this project could be suitable for Possible-Tech-Projects and Outreachy-Round-11?

No. Among other things, we don't have time this autumn.

@Nemo_bis: Do you still work on this, as this is assigned to you?

Aklapper lowered the priority of this task from Medium to Lowest. (Nov 27 2019, 1:18 PM)

Lowering priority to reflect the reality that currently nobody is working on InTense (plus its test website is down).

@Nemo_bis: Do you still work on this, as this is assigned to you?

Yes, although the motivation has decreased since we stopped exposing the data. Probably the right way to do it in 2020 is to work with https://www.softwareheritage.org/ , create a reusable method to extract the coordinates of the various repositories, and publish it in a sufficiently clean format.

Then if/when InTense resumes the first output could be to import all these repositories and publish a dump of their parsed i18n, which would be quite useful for NLP purposes as well.

https://intense.wmflabs.org/ says "No proxy is configured for this host name."
Where is this located?

The labs project was deleted to free up unused resources until they are needed.

Removing task assignee due to inactivity, as this open task has been assigned to the same person for more than two years (see the emails sent to the task assignee on Oct27 and Nov23). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.
(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.)

Sounds like this task should be declined. Or depend on a "Set up intense on some servers" task.