Create monthly raw text dumps for corpus linguistics
Open, Needs Triage, Public

Assigned To

None

Authored By

	Yug
	Mar 7 2021, 9:20 PM

Description

The desired output is raw text, without wiki-code and with no or minimal templates,
for all pages of each wiki.

The API provides prop=extracts.
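
For reference, a minimal sketch of what fetching plain text through prop=extracts looks like today. The endpoint URL, the User-Agent string, and the function names are assumptions chosen for illustration; the query parameters (action=query, prop=extracts, explaintext, formatversion=2) are the ones I understand the API to accept.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: any wiki's api.php endpoint can be substituted here.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_query(title):
    """Build the query parameters for a plain-text extract of one page."""
    return {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,      # plain text instead of limited HTML
        "titles": title,
        "format": "json",
        "formatversion": 2,    # "pages" is returned as a list
    }


def extracts_from_response(payload):
    """Map page title -> extract text from a decoded API response."""
    pages = payload.get("query", {}).get("pages", [])
    return {page["title"]: page.get("extract", "") for page in pages}


def fetch_plain_text(title, api_url=API_URL):
    """Fetch the plain-text extract for a single page (network call)."""
    url = api_url + "?" + urlencode(build_extract_query(title))
    # Hypothetical User-Agent; replace with real contact details.
    request = Request(url, headers={"User-Agent": "corpus-sketch/0.1 (example)"})
    with urlopen(request) as response:
        return extracts_from_response(json.load(response))
```

Calling `fetch_plain_text("Corpus linguistics")` would return a one-entry dict for that page. Going page by page like this is slow for a whole wiki, hence the question below.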

Is there something cleaner, and available in dump format?
This would help corpus linguistics, human language studies, the creation of wordlists, Wikidata lexemes and #Lingualibre.

The toolchain used in T127793 could likely be reused here.

Related Objects

Mentioned Here: T273585: Host OKAPI HTML dumps on public-facing labstore servers
T127793: Create Content Translation Parallel Corpora dumps

Event Timeline

Yug created this task. Mar 7 2021, 9:20 PM

Are you familiar with the work of the OKAPI team? T273585 might be of interest to you.
