We need to add more projects to http://intense.wmflabs.org/
An initial goal, sufficient to close this task, could be to add 100 projects. That should let us list the blockers to scaling up to 1000, then 10000, etc.
For simplicity, we must focus on projects which:
# use [[https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Extension:Translate/File_format_support|supported formats]];
# follow ISO 639 codes in a way easily mappable to ours.
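
As a rough illustration of the two criteria above, here is a minimal sketch of how a repository's file listing could be pre-screened. Everything in it is an assumption made for illustration: the extension set is only a small sample (the linked "supported formats" page is authoritative), the names `language_codes` and `looks_suitable` are hypothetical, and the pattern only checks that a name has the shape of an ISO 639 code, not that it is a real one. Short names such as `src` will also match, so this is only a first filter before manual review.

```python
import re
from pathlib import PurePosixPath

# Assumption: an illustrative subset of extensions, not the full list
# from the "supported formats" page.
SUPPORTED_EXTENSIONS = {".po", ".json", ".properties", ".yaml", ".yml"}

# Assumption: language codes appear as the file stem or the parent directory
# name, e.g. "locale/pt_BR.po" or "i18n/de/messages.json".
# Matches the *shape* of a code: 2-3 letters plus an optional subtag.
LANG_CODE = re.compile(r"^[a-z]{2,3}(?:[-_][A-Za-z]{2,4})?$")


def language_codes(paths):
    """Return normalised code-shaped names found in localisation file paths."""
    codes = set()
    for raw in paths:
        path = PurePosixPath(raw)
        if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            continue
        for candidate in (path.stem, path.parent.name):
            if LANG_CODE.match(candidate):
                # Normalise "pt_BR" to "pt-br" so variants compare equal.
                codes.add(candidate.replace("_", "-").lower())
    return codes


def looks_suitable(paths, minimum_languages=2):
    """Heuristic: a project qualifies if it ships at least a few languages."""
    return len(language_codes(paths)) >= minimum_languages
```

For example, `language_codes(["locale/pt_BR.po", "locale/de.po", "src/main.c"])` yields the codes `pt-br` and `de`, while a repository containing only `README.md` and `package.json` is rejected by `looks_suitable`.
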
We will start looking where it's easiest, i.e. sites with many repositories and a dataset we can easily query to find suitable projects.
* http://github.com/ – tons; a dump exists somewhere, but it is not easy to find and may not be public
* http://sourceforge.net/ – tons, used to have a dump
* http://svn.tuxfamily.org/ – 628 repos, probably approachable
* https://gitorious.org/ – many, dump availability unknown
* http://bitbucket.org/ – many, probably no dump
* http://launchpad.net/ – many, no idea
* [[https://bitbucket.org/mgoeminne/sgl-flossmetric-dbmerge|git.gnome.org dataset]]
Currently excluded:
* https://savannah.gnu.org/ – 3600, [[https://www.gnu.org/server/mirror.html|can just rsync it]]... but no bzr support yet
On their own:
* openhub.net says they index 667k projects, but probably no interest in offering downloads
* ddtp.debian.net is stuck in ~2001, they call Pootle a «new internationalisation framework» and have [[http://ftp.debian.org/debian/dists/sid/main/i18n/|no structured l10n format]]
[[http://books.google.it/books?hl=it&lr=&id=mQRJTi08npwC&oi=fnd&pg=PA24|Old datasets (2005–2010)]] are often mentioned in research.