
Prepare plain-text Wikipedia dumps for Språkbanken
Closed, Resolved (Public)

Description

[Språkbanken](http://språkbanken.no) is a state-run database for linguists and language technology developers. They have approached us to request plain-text dumps of Wikipedia in Norwegian Bokmål, Norwegian Nynorsk and Northern Sami. The normal dumps aren't very useful for their purposes, so we are building an alternative system to extract these texts for them.

  • Get specifications for how the data should be formatted
  • Create script to get the dumps
  • Publish script on GitHub (T218990)
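
As a rough illustration of what the "get the dumps" step could look like, the sketch below builds MediaWiki API request URLs that ask for a plain-text extract of a single page on each of the three wikis. This is not the script from this task: the function name `extract_url`, the `WIKIS` mapping and the example page title are hypothetical, and it assumes the TextExtracts API (`prop=extracts` with `explaintext`) is an acceptable source of plain text. The format Språkbanken actually needs is still to be specified in the first checklist item.

```python
from urllib.parse import urlencode

# Wikipedia subdomains for the three requested languages.
WIKIS = {
    "Norwegian Bokmål": "no",
    "Norwegian Nynorsk": "nn",
    "Northern Sami": "se",
}


def extract_url(lang_code: str, title: str) -> str:
    """Build an API URL requesting a plain-text extract of one page.

    Hypothetical helper: it only constructs the URL and does not fetch it.
    """
    params = {
        "action": "query",
        "prop": "extracts",   # TextExtracts extension
        "explaintext": 1,     # plain text instead of HTML
        "format": "json",
        "titles": title,
    }
    return f"https://{lang_code}.wikipedia.org/w/api.php?{urlencode(params)}"


if __name__ == "__main__":
    # "Oslo" is only an example title.
    for name, code in WIKIS.items():
        print(f"{name}: {extract_url(code, 'Oslo')}")
```

Fetching one page at a time like this would be slow for a full wiki, so a real script would more likely batch requests or process the regular XML dumps; the sketch only shows the shape of a plain-text request.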