
Create monthly raw text dumps for corpus linguistics
Open, Needs Triage, Public

Description

The desired output is raw text, without wiki markup and with no or minimal template content.
This should cover all pages of each wiki.

The API provides prop=extracts
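
For reference, a minimal sketch of what the existing per-page route looks like, assuming the wiki has the TextExtracts extension enabled and using `https://en.wikipedia.org/w/api.php` only as an example endpoint. The function names and the User-Agent string are hypothetical, chosen for illustration. It requests one page at a time, which is why it does not scale to a whole wiki and a dump would be preferable.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(api_url, title):
    """Build an action=query URL asking for a plain-text extract of one page."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "explaintext": 1,  # plain text instead of limited HTML
        "titles": title,
    }
    return f"{api_url}?{urlencode(params)}"


def parse_extracts(response):
    """Map page title -> extract text from a decoded JSON response."""
    pages = response.get("query", {}).get("pages", {})
    return {page["title"]: page.get("extract", "") for page in pages.values()}


def fetch_plain_text(api_url, title):
    """Fetch and parse the extract for one page (performs a network request)."""
    # Hypothetical User-Agent; replace with one that identifies your tool.
    request = Request(
        build_extract_url(api_url, title),
        headers={"User-Agent": "corpus-dump-sketch/0.1 (example)"},
    )
    with urlopen(request, timeout=30) as resp:
        return parse_extracts(json.load(resp))
```

Example call (needs network access): `fetch_plain_text("https://en.wikipedia.org/w/api.php", "Corpus linguistics")`.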

Is there something cleaner that is available in dump format?
This would help corpus linguistics, human language studies, the creation of wordlists, Wikidata lexemes, and #Lingualibre.

The toolchain used in T127793 could likely be reused here.

See also:

Event Timeline

Are you familiar with the work of the OKAPI team? T273585 might be of interest to you.