We need to add more projects to http://intense.wmflabs.org/.
An initial goal, enough to close this task, could be to add 100 projects. That should let us list the blockers to reaching 1000, then 10000, etc.
For simplicity, we must focus on projects which:
- use supported formats;
- follow ISO 639 codes in a way easily mappable to ours.
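The two criteria above could be checked mechanically against a repository's file listing. What follows is only a rough sketch under stated assumptions: the set of supported extensions and the tiny code table are placeholders (this task does not list either), and `language_code` and `is_candidate` are hypothetical helpers, not part of any existing tool.

```python
import re
from pathlib import PurePosixPath

# Assumption: placeholder for the formats we actually support.
SUPPORTED_EXTENSIONS = {".po", ".pot"}

# Assumption: tiny illustrative sample, not the full ISO 639 table.
KNOWN_CODES = {"de", "en", "fr", "it", "pt", "zh"}

# Accepts file stems such as "pt", "pt_BR", "pt-BR" or "sr@latin";
# the leading 2-3 lowercase letters are taken as the language code.
LOCALE_RE = re.compile(r"^([a-z]{2,3})(?:[_-][A-Za-z0-9]+)*(?:@[A-Za-z0-9]+)?$")


def language_code(path):
    """Return the language code implied by a file name, or None."""
    p = PurePosixPath(path)
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        return None
    match = LOCALE_RE.match(p.stem)
    if match is None:
        return None
    code = match.group(1)
    return code if code in KNOWN_CODES else None


def is_candidate(paths, min_languages=2):
    """True if the listing has files for at least min_languages known codes."""
    codes = {language_code(p) for p in paths}
    codes.discard(None)
    return len(codes) >= min_languages
```

For example, `is_candidate(["po/de.po", "po/fr.po", "README"])` is true, while a listing with only `po/messages.pot` is rejected because the file name carries no language code.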
We will start looking where it's easiest, i.e. sites with many repositories and a dataset we can easily query to find suitable projects.
- http://github.com/ – tons, has a dump somewhere; cf. http://ghtorrent.org and http://2014.msrconf.org/challenge.php
- http://sourceforge.net/ – tons, used to have a dump; cf. [[ftp://ftp.mirrorservice.org/sites/dl.sourceforge.net/pub/sourceforge/|downloads]]
- http://svn.tuxfamily.org/ – 628 repos, probably approachable
- https://gitorious.org/ – many, dump availability unknown
- http://bitbucket.org/ – many, probably no dump
- http://launchpad.net/ – many, no idea; last full archive of MO files: http://launchpadlibrarian.net/172477963/ubuntu-trusty-translations.tar.gz
- git.gnome.org – easy to clone; a dataset DB would be overkill
- https://alioth.debian.org/ or https://sources.debian.net/ or https://codesearch.debian.net/perpackage-results/msgid%20path%3A.*%5C.pot/2/page_0
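The codesearch.debian.net link in the list above embeds a percent-encoded query, `msgid path:.*\.pot`. A minimal sketch of building the same kind of URL for other patterns follows; it assumes the path layout seen in that single link (including the `2` segment, copied verbatim without knowing its meaning) holds in general, which has not been checked against any documentation. `codesearch_url` is a hypothetical helper.

```python
from urllib.parse import quote

# Assumption: URL layout inferred from the one link in this task.
BASE = "https://codesearch.debian.net/perpackage-results"


def codesearch_url(query, page=0):
    """Build a per-package results URL for a search query."""
    # Keep "*" literal, as in the original link; "/" would be escaped.
    encoded = quote(query, safe="*")
    return f"{BASE}/{encoded}/2/page_{page}"


# Reproduces the link from the list above.
print(codesearch_url(r"msgid path:.*\.pot"))
```

Changing the `path:` pattern (for instance to `.*\.po`) would target other file types, assuming the site accepts the same query syntax.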
Currently excluded:
- https://savannah.gnu.org/ – 3600 (of which "452 Official GNU software")... but we need bzr support (or only for ~100 of them?); there is some Git hosting (729), an rsync mirror (300), some perhaps-complete FTP mirrors
On their own:
- openhub.net says it indexes 667k projects, but probably has no interest in offering downloads; a search there finds 34k pot files
- ddtp.debian.net is stuck in ~2001, they call Pootle a «new internationalisation framework» and have no structured l10n format
Old datasets (2005–2010) are often mentioned in research.