We need to add more projects to http://intense.wmflabs.org/
An initial goal, sufficient to close this task, could be to add 100 projects. Doing so should let us list the blockers to scaling up to 1,000, then 10,000, and so on.
For simplicity, we must focus on projects which:
# use [[https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Extension:Translate/File_format_support|supported formats]];
# follow ISO 639 codes in a way easily mappable to ours.
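The two criteria above could be checked mechanically on a repository's file listing. The following is a minimal sketch, not an existing tool: the extension set and the code list are small illustrative subsets (the authoritative lists are the linked "supported formats" page and ISO 639 itself), and the lowercase, hyphen-separated target form is an assumption about what "ours" looks like. The names `CANDIDATE_EXTENSIONS`, `ILLUSTRATIVE_CODES`, `normalise` and `candidate_language` are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Illustrative subset only; the real list is the "supported formats" page.
CANDIDATE_EXTENSIONS = {".po", ".properties", ".json", ".yaml", ".yml"}

# Illustrative subset only; a real check would load the full ISO 639 tables.
ILLUSTRATIVE_CODES = {"ar", "de", "es", "fr", "it", "ja", "pt", "ru", "zh", "fil"}


def normalise(token):
    """Map e.g. 'pt_BR' or 'zh-Hans' to 'pt-br' / 'zh-hans' (assumed target form).

    Returns None when the base code is not in ILLUSTRATIVE_CODES or the token
    has more than one subtag.
    """
    parts = re.split(r"[_-]", token)
    base = parts[0].lower()
    if base not in ILLUSTRATIVE_CODES or len(parts) > 2:
        return None
    if len(parts) == 2:
        if not re.fullmatch(r"[A-Za-z]{2,4}", parts[1]):
            return None
        return base + "-" + parts[1].lower()
    return base


def candidate_language(path):
    """Return the normalised language code a file path seems to encode, or None."""
    p = PurePosixPath(path)
    if p.suffix.lower() not in CANDIDATE_EXTENSIONS:
        return None
    # Case 1: the file stem is the code itself, e.g. "po/de.po".
    code = normalise(p.stem)
    if code:
        return code
    # Case 2: the code is a trailing part of the stem,
    # e.g. "messages_pt_BR.properties".
    for i, ch in enumerate(p.stem):
        if ch in "_.-":
            code = normalise(p.stem[i + 1:])
            if code:
                return code
    # Case 3: a parent directory is the code, e.g. "locale/fr/LC_MESSAGES/app.po".
    for part in reversed(p.parts[:-1]):
        code = normalise(part)
        if code:
            return code
    return None
```

A project whose translation files mostly yield a code from `candidate_language` would satisfy both criteria; files that return `None` would need manual inspection or a project-specific mapping.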
We will start looking where it's easiest, i.e. sites with many repositories and a dataset we can easily query to find suitable projects.
* http://github.com/ – tons, has a dump somewhere; cf. http://ghtorrent.org and http://2014.msrconf.org/challenge.php
* http://sourceforge.net/ – tons, used to have a dump
* http://svn.tuxfamily.org/ – 628 repos, probably approachable
* https://gitorious.org/ – many, dump availability unknown
* http://bitbucket.org/ – many, probably no dump
* http://launchpad.net/ – many, no idea; last full archive of MO files: http://launchpadlibrarian.net/172477963/ubuntu-trusty-translations.tar.gz
* git.gnome.org – easy to clone; the [[https://bitbucket.org/mgoeminne/sgl-flossmetric-dbmerge|dataset DB]] is overkill
* https://alioth.debian.org/
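Once a dump is available for one of these sites, finding suitable projects could be as simple as counting translation files per repository. The sketch below assumes a hypothetical two-column listing (`repo,path`); none of the dumps mentioned above is known to have this shape, so a real one would first need converting. The function name `repos_with_translations` and the `min_files` threshold are likewise illustrative, and only gettext files (`.po`, `.pot`) are counted, to keep the example short.

```python
import csv
import io
from collections import Counter


def repos_with_translations(rows, min_files=2):
    """Return the sorted names of repos with at least min_files gettext files.

    rows is an iterable of (repo, path) pairs from a hypothetical dump.
    """
    counts = Counter()
    for repo, path in rows:
        if path.lower().endswith((".po", ".pot")):
            counts[repo] += 1
    return sorted(repo for repo, n in counts.items() if n >= min_files)


# Made-up sample data in the assumed "repo,path" format.
SAMPLE = """repo,path
alpha/app,po/de.po
alpha/app,po/fr.po
beta/lib,src/main.c
gamma/tool,po/tool.pot
"""

rows = [(r["repo"], r["path"]) for r in csv.DictReader(io.StringIO(SAMPLE))]
candidates = repos_with_translations(rows)
```

Raising `min_files` favours projects that already have several translations; setting it to 1 also catches projects that only ship a template.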
Currently excluded:
* https://savannah.gnu.org/ – 3600 (of which "452 Official GNU software")... but we need bzr support (or [[http://bzr.savannah.gnu.org/r/|only for ~100 of them]]?); there is [[https://savannah.gnu.org/maintenance/UsingGit/|some Git hosting]] (729), an [[https://www.gnu.org/server/mirror.html|rsync mirror]] (300), some [[http://download.savannah.gnu.org/mirmon/savannah/|perhaps-complete]] FTP mirrors
On their own:
* openhub.net says they index 667k projects, but they probably have no interest in offering downloads; their code search [[https://code.openhub.net/search?s=pot&pp=0&fe=pot&ff=1&mp=1&ml=0&me=0&md=1&filterChecked=true|finds 34k pot files]]
* ddtp.debian.net is stuck in ~2001: they call Pootle a «new internationalisation framework» and have [[http://ftp.debian.org/debian/dists/sid/main/i18n/|no structured l10n format]]
[[http://books.google.it/books?hl=it&lr=&id=mQRJTi08npwC&oi=fnd&pg=PA24|Old datasets (2005–2010)]] are often mentioned in research.