
Copy monthly XML files from public-dumps to HDFS
Closed, Resolved · Public · 5 Estimated Story Points

Event Timeline

Oo, ok, how do you do this usually? Can you list the paths and files and how you get them?

What I've done manually until now is referenced here: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/409960
I have started a python script to do pretty much the same thing with more flexibility and incremental growth.
Comments/ideas welcome :)
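
For illustration, here is a minimal sketch of that incremental approach. The paths, wiki list, and file naming below are assumptions for the example, not the actual refinery configuration:

```
#!/usr/bin/env python3
"""Sketch: copy one month of XML dump files into HDFS, skipping files that are
already there. Paths, wiki list, and file naming are illustrative assumptions."""
import subprocess
from pathlib import Path

LOCAL_DUMPS = Path("/mnt/data/xmldatadumps/public")   # assumed NFS mount on the stat box
HDFS_BASE = "/wmf/data/raw/mediawiki/xmldumps"        # assumed HDFS target directory


def hdfs_exists(path):
    """True if the HDFS path exists (hdfs dfs -test -e)."""
    return subprocess.call(["hdfs", "dfs", "-test", "-e", path]) == 0


def import_wiki_month(wiki, snapshot):
    """Copy one wiki's pages-meta-history files for a snapshot, incrementally."""
    local_dir = LOCAL_DUMPS / wiki / snapshot
    hdfs_dir = "{}/{}/{}".format(HDFS_BASE, snapshot, wiki)
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir])
    for dump_file in sorted(local_dir.glob("{}-{}-pages-meta-history*.bz2".format(wiki, snapshot))):
        target = "{}/{}".format(hdfs_dir, dump_file.name)
        if hdfs_exists(target):
            continue  # already imported: re-runs only pick up new files
        subprocess.check_call(["hdfs", "dfs", "-put", str(dump_file), target])


if __name__ == "__main__":
    for wiki in ("enwiki", "frwiki"):  # a real run covers many more projects
        import_wiki_month(wiki, "20180901")
```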

Cool, I wonder if we can somehow schedule this with Oozie instead of cron. Can Oozie look for local path creation instead of HDFS? I would assume so.

I think the file dependencies are too complex for Oozie here. Many projects!

Ah ok right. Not just a single new time period directory?
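
To make the "many projects" point concrete: the trigger is not one new snapshot directory appearing, but a readiness check across every wiki, roughly like the sketch below (paths and file naming are assumptions for illustration):

```
# Sketch of the readiness check that a single Oozie dataset does not express well:
# the import should only start once every project has published its files for the
# month. Wiki list, paths, and file naming are illustrative assumptions.
from pathlib import Path

LOCAL_DUMPS = Path("/mnt/data/xmldatadumps/public")  # assumed NFS mount


def snapshot_ready(wikis, snapshot):
    """True only if every wiki has at least one history file for the snapshot."""
    for wiki in wikis:
        pattern = "{}-{}-pages-meta-history*.bz2".format(wiki, snapshot)
        if not any((LOCAL_DUMPS / wiki / snapshot).glob(pattern)):
            return False  # this project has not finished dumping yet
    return True
```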

I think Apache NiFi is the usual way to move large local directories to HDFS.

@diego I agree we could use NiFi to copy the files, but it seems like too much work in terms of system setup for this single use case.

Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Change 456654 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] [WIP] Add python script importing xml dumps onto hdfs

https://gerrit.wikimedia.org/r/456654

Change 459780 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update python/refinery/utils/HdfsUtils

https://gerrit.wikimedia.org/r/459780

Change 459780 merged by Joal:
[analytics/refinery@master] Update python/refinery/utils/HdfsUtils

https://gerrit.wikimedia.org/r/459780
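
The refinery HdfsUtils API itself is not shown in this thread; for readers unfamiliar with it, the general shape of such a helper is a thin wrapper around the hdfs CLI, for example (method names here are illustrative, not the actual refinery interface):

```
# Illustrative only: not the actual python/refinery/utils HdfsUtils interface,
# just the general shape of a helper that wraps the hdfs CLI.
import subprocess


class HdfsUtils(object):

    @staticmethod
    def ls(path):
        """List HDFS paths under `path` (wraps `hdfs dfs -ls -C`)."""
        output = subprocess.check_output(["hdfs", "dfs", "-ls", "-C", path])
        return output.decode("utf-8").split()

    @staticmethod
    def mkdir(path):
        """Create an HDFS directory, parents included."""
        subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", path])

    @staticmethod
    def put(local_path, hdfs_path):
        """Copy a local file into HDFS."""
        subprocess.check_call(["hdfs", "dfs", "-put", local_path, hdfs_path])
```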

mforns set the point value for this task to 5. · Oct 8 2018, 4:07 PM

Change 456654 merged by Joal:
[analytics/refinery@master] Add python script importing xml dumps onto hdfs

https://gerrit.wikimedia.org/r/456654

Change 472472 had a related patch set uploaded (by Elukey; owner: Joal):
[operations/puppet@production] Add timer importing page-history dumps to hadoop

https://gerrit.wikimedia.org/r/472472

Change 472472 merged by Elukey:
[operations/puppet@production] Add timer importing page-history dumps to hadoop

https://gerrit.wikimedia.org/r/472472

Excuse me for butting in at this late date but these files are already available from labstore1006,7 to labs instances and on stats100? (I forget which one now). Do you need them to be available somewhere else?

I don't have any objections in principle to having another copy floating around; it would just be nice to make sure it's not redundant.


Hi @ArielGlenn, the reason we need the files on an-coord1001 is that it is the machine responsible for cron/systemd-timer jobs in our infra, while the stat1005 machine is oriented toward user jobs.
Since this task is about productionizing the import of the files, it involves the machine responsible for prod jobs. OK on your side?

I mean, it's fine, but maybe it's better to just provide them as is done on stat100? (5? 7?) via NFS mount from labstore1006 (7?). Looping @Bstorm in for her opinion, as she is one of the point people for those servers. I don't mean to slow this down at all; it's just that if there's a simpler solution than copying them over, maybe we should go for it.

@ArielGlenn, they need to be copied into HDFS inside of Hadoop, not just available on a regular filesystem.
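
In other words, an NFS mount only makes the files readable on the host that mounts it; Hadoop jobs read their input from HDFS. As a rough illustration of what consumes the imported data (the path and tool choice here are assumptions, not the actual downstream job):

```
# Cluster jobs (Spark, Hive, Oozie-launched workflows) read from HDFS, not from a
# filesystem mounted on a single stat box. The HDFS path below is an assumption.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xmldumps-sanity-check").getOrCreate()

# Count raw lines of the imported dump files straight from HDFS.
lines = spark.sparkContext.textFile(
    "hdfs:///wmf/data/raw/mediawiki/xmldumps/20180901/enwiki/*.bz2"
)
print(lines.count())
```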

@JAllemandou, I think it would be fine to schedule the load of the dumps into HDFS from a stat box instead of an-coord1001, if @elukey is ok with it.

@Ottomata, @ArielGlenn - I'm ok with copying the dumps from a stat machine.
Let's see what @elukey thinks of it.

Change 473161 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move import_wikitext_dumps to stat1007

https://gerrit.wikimedia.org/r/473161

Change 473161 merged by Elukey:
[operations/puppet@production] Move import_wikitext_dumps to stat1007

https://gerrit.wikimedia.org/r/473161