The desired output is raw text, without wikicode and with no or minimal template residue.
This should cover all pages of each wiki.
The API already provides plain text per page via prop=extracts.
Is there something cleaner, available in dump format?
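
For reference, the per-page route via prop=extracts can be sketched as follows. This is a minimal illustration and not a proposed implementation. It assumes the target wiki has the TextExtracts extension enabled, and the endpoint URL, the User-Agent string, and the helper names (`build_extract_url`, `parse_extracts`, `fetch_extract`) are hypothetical choices made for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: any wiki exposing prop=extracts; English Wikipedia is used here.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, api_url=API_URL):
    """Build a query URL asking for the plain-text extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,    # plain text instead of limited HTML
        "titles": title,
        "format": "json",
        "formatversion": 2,  # 'pages' is returned as a list
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def parse_extracts(response):
    """Map page title -> plain-text extract from a decoded API response."""
    pages = response.get("query", {}).get("pages", [])
    return {page["title"]: page.get("extract", "") for page in pages}


def fetch_extract(title, api_url=API_URL):
    """Fetch one page's extract over the network (hypothetical User-Agent)."""
    request = urllib.request.Request(
        build_extract_url(title, api_url),
        headers={"User-Agent": "plain-text-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as resp:
        return parse_extracts(json.load(resp))
```

Calling `fetch_extract("Corpus linguistics")` would issue one request for one page. Repeating that for every page of every wiki is exactly the cost a ready-made plain-text dump would avoid.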
This would help corpus linguistics, human language studies, the creation of wordlists, Wikidata lexemes and #Lingualibre.
The toolchain used in T127793 could likely be reused here.
See also:
- https://opus.nlpl.eu/wikimedia.php : example of a research center processing Wikipedia dumps to provide text dumps with partial cleaning.