Script sqooping mediawiki tables into hdfs
caeb2bb51dbe (Unpublished)


Not On Permanent Ref: This commit is not an ancestor of any permanent ref.
This commit has been deleted in the repository: it is no longer reachable from any branch, tag, or ref.

Description

Script sqooping mediawiki tables into hdfs

For each wiki configured in a grouped wiki file, sqoop the archive,
page, user, revision, and logging tables from its MediaWiki database
into HDFS.
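
As a rough illustration of the per-wiki, per-table loop described above, the sketch below builds `sqoop import` command lines. It is not the script from this commit. The CSV layout (`dbname,group`), the helper names, the JDBC host, and the target directory layout are all assumptions made for the example; only the table list comes from the description.

```python
import csv
import io

# Tables named in the commit description.
TABLES = ["archive", "page", "user", "revision", "logging"]


def read_grouped_wikis(csv_text):
    """Parse rows of 'dbname,group' into {group: [dbname, ...]}.

    The two-column layout is an assumption; the real grouped wiki file
    may carry different or additional columns.
    """
    groups = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        dbname, group = row[0].strip(), int(row[1])
        groups.setdefault(group, []).append(dbname)
    return groups


def sqoop_command(dbname, table, jdbc_host, target_root, user, password_file):
    """Build one 'sqoop import' argument list for a single wiki table.

    The host, credentials, and directory layout are hypothetical.
    """
    return [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://{jdbc_host}/{dbname}",
        "--username", user,
        "--password-file", password_file,
        "--table", table,
        "--target-dir", f"{target_root}/{table}/wiki_db={dbname}",
        "--num-mappers", "1",
    ]


def commands_for_group(groups, group, **kwargs):
    """Return the commands for every table of every wiki in one group."""
    return [
        sqoop_command(dbname, table, **kwargs)
        for dbname in groups.get(group, [])
        for table in TABLES
    ]
```

Each argument list could be handed to `subprocess.run`, with one process per group so that groups run side by side.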

Also, download the Wikimedia project site matrix and annotate each
project with its namespace prefixes and translations.

To create a grouping of wikis in which the groups can run in parallel
and each group pulls a comparable amount of data from MediaWiki, use
the HTML visualization in diagrams/group-wikis-by-estimated-size.html.
That visualization was used to generate the grouped_wikis.csv file.
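
The grouping in this commit was produced with the HTML visualization. As an alternative illustration of the same goal (groups of roughly equal estimated size), here is a minimal greedy sketch: sort wikis from largest to smallest and always add the next wiki to the currently lightest group. The function name, the input dictionary, and the size figures are hypothetical.

```python
import heapq


def group_wikis(estimated_sizes, n_groups):
    """Greedily split wikis into n_groups with similar total size.

    estimated_sizes maps a wiki name to an estimated size (any unit).
    Returns {group_index: [wiki, ...]}.
    """
    # Min-heap of (running total, group index): the lightest group is on top.
    heap = [(0, g) for g in range(n_groups)]
    heapq.heapify(heap)
    groups = {g: [] for g in range(n_groups)}

    # Largest wikis first; ties broken by name for a deterministic result.
    ordered = sorted(estimated_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for wiki, size in ordered:
        total, g = heapq.heappop(heap)
        groups[g].append(wiki)
        heapq.heappush(heap, (total + size, g))
    return groups


if __name__ == "__main__":
    # Made-up sizes, for illustration only.
    sizes = {"a": 10, "b": 7, "c": 5, "d": 4, "e": 2}
    print(group_wikis(sizes, 2))
```

This heuristic does not guarantee an optimal split, but it keeps the largest wikis apart, which is usually what matters when groups run in parallel.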

Bug: T141476
Change-Id: Id712979079a0d7a6263abcee83b6c7368aa5bf90

Details

Provenance
Milimetric authored on Jul 26 2016, 9:38 PM
Parents
rANRE983774d9b237: Add refinery-source jars for v0.0.34 to artifacts
Branches
Unknown
Tags
Unknown
ChangeId
Id712979079a0d7a6263abcee83b6c7368aa5bf90