
Develop data loading scripts for the Research Cluster (altiscale)
Closed, Resolved · Public

Description

This task is done when we have a single-command ETL for converting a new XML dump into a queryable set of Hive tables on the altiscale "Research" cluster.

See current work here: https://github.com/wikimedia-research/research-cluster

See old notes here: https://etherpad.wikimedia.org/p/research_cluster_loading

Full process: [XML Dump] --> [JSON files] --> [Hive Table] --> [Metadata Hive Table]

[XML Dump] --> [JSON files] is handled by dump2revdocs.py
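
For orientation, here is a minimal sketch of what that step does, using only the Python standard library. This is not dump2revdocs.py itself (see the repo above for the real script); the namespace URL and the fields emitted are assumptions for illustration.

```python
"""
Hypothetical sketch of the [XML Dump] --> [JSON files] step: stream the
dump and write one JSON "revision document" per line. dump2revdocs.py
in the repo is the real implementation and may differ in interface.
"""
import json
import sys
import xml.etree.ElementTree as ET

# MediaWiki export namespace; the version may differ per dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"


def revision_docs(xml_file):
    """Stream <revision> elements and yield flat, JSON-able dicts."""
    page_title, page_id = None, None
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == NS + "title":
            page_title = elem.text
        elif elem.tag == NS + "id" and page_id is None:
            # First <id> under a <page> is the page id.
            page_id = int(elem.text)
        elif elem.tag == NS + "revision":
            yield {
                "page_title": page_title,
                "page_id": page_id,
                "rev_id": int(elem.findtext(NS + "id")),
                "timestamp": elem.findtext(NS + "timestamp"),
                "text": elem.findtext(NS + "text") or "",
            }
            elem.clear()  # keep memory bounded while streaming
        elif elem.tag == NS + "page":
            page_title, page_id = None, None
            elem.clear()


if __name__ == "__main__":
    # Usage: python sketch.py < dump.xml > revdocs.json
    for doc in revision_docs(sys.stdin.buffer):
        sys.stdout.write(json.dumps(doc) + "\n")
```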

[Hive Table] & [Metadata Hive Table] are handled by HiveQL scripts.
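
The HiveQL side might look roughly like this; the table names, columns, and HDFS location are illustrative assumptions, not the actual scripts. The JSON SerDe typically requires the hive-hcatalog-core jar to be available on the Hive classpath.

```sql
-- Hypothetical sketch of the [JSON files] --> [Hive Table] step.
-- Schema and location are illustrative only; see the HiveQL scripts
-- in the repo for the real table definitions.
CREATE EXTERNAL TABLE IF NOT EXISTS enwiki_revdocs (
  page_title  STRING,
  page_id     BIGINT,
  rev_id      BIGINT,
  `timestamp` STRING,
  text        STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/halfak/revdocs/enwiki';  -- assumed HDFS path

-- A metadata-only table could then be derived from it, e.g.:
CREATE TABLE IF NOT EXISTS enwiki_revdocs_meta
STORED AS PARQUET AS
SELECT page_title, page_id, rev_id, `timestamp`,
       length(text) AS text_chars  -- character length of the revision text
FROM enwiki_revdocs;
```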

Event Timeline

Halfak claimed this task.
Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description.
Halfak added a project: Research.
Halfak moved this task to In Progress on the Research board.
Halfak added subscribers: Halfak, JAllemandou.

I finally got Python installed on the workbench! (Every time I tried over the last month, the workbench's drive was full and I couldn't do anything.)

Next steps are tests with real data.

Halfak set Security to None.
Halfak updated the task description.
Halfak updated the task description.
Halfak added a subscriber: schana.

I just gave a quick intro to @schana and we worked for a little while on getting the script up and running. See our work here: https://github.com/wikimedia-research/research-cluster

@JAllemandou, it seems like this is done now. What do you think?

ggellerman reassigned this task from schana to JAllemandou.
ggellerman moved this task from In Progress to Done (current quarter) on the Research board.

Agreed @Halfak, the first version is done and merged!