
Develop data loading scripts for the Research Cluster (altiscale)
Closed, Resolved · Public

Description

This task is done when we have a single-command ETL for converting a new XML dump into a queryable set of Hive tables on the altiscale "Research" cluster.

See current work here: https://github.com/wikimedia-research/research-cluster

See old notes here: https://etherpad.wikimedia.org/p/research_cluster_loading

Full process: [XML Dump] --> [JSON files] --> [Hive Table] --> [Metadata Hive Table]

[XML Dump] --> [JSON files] is handled by dump2revdocs.py
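
For orientation, here is a minimal sketch of what that step does, using only the Python standard library. This is not dump2revdocs.py itself (see the repo above for the real script); the namespace URL and the fields emitted are assumptions for illustration.

```python
"""
Hypothetical sketch of the [XML Dump] --> [JSON files] step: stream the
dump and write one JSON "revision document" per line. dump2revdocs.py
in the repo is the real implementation and may differ in interface.
"""
import json
import sys
import xml.etree.ElementTree as ET

# MediaWiki export namespace; the version may differ per dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"


def revision_docs(xml_file):
    """Stream <revision> elements and yield flat, JSON-able dicts."""
    page_title, page_id = None, None
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == NS + "title":
            page_title = elem.text
        elif elem.tag == NS + "id" and page_id is None:
            # First <id> under a <page> is the page id.
            page_id = int(elem.text)
        elif elem.tag == NS + "revision":
            yield {
                "page_title": page_title,
                "page_id": page_id,
                "rev_id": int(elem.findtext(NS + "id")),
                "timestamp": elem.findtext(NS + "timestamp"),
                "text": elem.findtext(NS + "text") or "",
            }
            elem.clear()  # keep memory bounded while streaming
        elif elem.tag == NS + "page":
            page_title, page_id = None, None
            elem.clear()


if __name__ == "__main__":
    # Usage: python sketch.py < dump.xml > revdocs.json
    for doc in revision_docs(sys.stdin.buffer):
        sys.stdout.write(json.dumps(doc) + "\n")
```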

[Hive Table] & [Metadata Hive Table] are handled by HiveQL scripts.
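
The HiveQL side might look roughly like this; the table names, columns, and HDFS location are illustrative assumptions, not the actual scripts. The JSON SerDe typically requires the hive-hcatalog-core jar to be available on the Hive classpath.

```sql
-- Hypothetical sketch of the [JSON files] --> [Hive Table] step.
-- Schema and location are illustrative only; see the HiveQL scripts
-- in the repo for the real table definitions.
CREATE EXTERNAL TABLE IF NOT EXISTS enwiki_revdocs (
  page_title  STRING,
  page_id     BIGINT,
  rev_id      BIGINT,
  `timestamp` STRING,
  text        STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/halfak/revdocs/enwiki';  -- assumed HDFS path

-- A metadata-only table could then be derived from it, e.g.:
CREATE TABLE IF NOT EXISTS enwiki_revdocs_meta
STORED AS PARQUET AS
SELECT page_title, page_id, rev_id, `timestamp`,
       length(text) AS text_chars  -- character length of the revision text
FROM enwiki_revdocs;
```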

Event Timeline

Halfak claimed this task.
Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description.
Halfak added a project: Research.
Halfak moved this task to In Progress on the Research board.
Halfak added subscribers: Halfak, JAllemandou.

I finally got Python installed on the workbench! (Every time I tried over the last month, the workbench's drive was full and I couldn't do anything.)

Next steps are tests with real data.

Halfak set Security to None.
Halfak updated the task description.
Halfak updated the task description.
Halfak added a subscriber: schana.

I just gave a quick intro to @schana and we worked for a little while on getting the script up and running. See our work here: https://github.com/wikimedia-research/research-cluster

@JAllemandou, it seems like this is done now. What do you think?

ggellerman reassigned this task from schana to JAllemandou.
ggellerman moved this task from In Progress to Done (current quarter) on the Research board.

Agreed @Halfak, the first version is done and merged!