
Copy monthly XML files from public-dumps to HDFS
Closed, Resolved · Public · 5 Estimated Story Points

Event Timeline

Oo, ok, how do you do this usually? Can you list the paths and files and how you get them?

What I've done manually until now is referenced here: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/409960
I have started a python script to do pretty much the same thing with more flexibility and incremental growth.
Comments/ideas welcome :)
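
For illustration, here is a minimal sketch of that incremental approach. The paths, wiki list, and file naming below are assumptions for the example, not the actual refinery configuration:

```
#!/usr/bin/env python3
"""Sketch: copy one month of XML dump files into HDFS, skipping files that are
already there. Paths, wiki list, and file naming are illustrative assumptions."""
import subprocess
from pathlib import Path

LOCAL_DUMPS = Path("/mnt/data/xmldatadumps/public")   # assumed NFS mount on the stat box
HDFS_BASE = "/wmf/data/raw/mediawiki/xmldumps"        # assumed HDFS target directory


def hdfs_exists(path):
    """True if the HDFS path exists (hdfs dfs -test -e)."""
    return subprocess.call(["hdfs", "dfs", "-test", "-e", path]) == 0


def import_wiki_month(wiki, snapshot):
    """Copy one wiki's pages-meta-history files for a snapshot, incrementally."""
    local_dir = LOCAL_DUMPS / wiki / snapshot
    hdfs_dir = "{}/{}/{}".format(HDFS_BASE, snapshot, wiki)
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir])
    for dump_file in sorted(local_dir.glob("{}-{}-pages-meta-history*.bz2".format(wiki, snapshot))):
        target = "{}/{}".format(hdfs_dir, dump_file.name)
        if hdfs_exists(target):
            continue  # already imported: re-runs only pick up new files
        subprocess.check_call(["hdfs", "dfs", "-put", str(dump_file), target])


if __name__ == "__main__":
    for wiki in ("enwiki", "frwiki"):  # a real run covers many more projects
        import_wiki_month(wiki, "20180901")
```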

Cool, I wonder if we can somehow schedule this with Oozie instead of cron. Can Oozie look for local path creation instead of HDFS? I would assume so.

I think the file dependencies are too complex for Oozie here. Many projects!

Ah ok right. Not just a single new time period directory?
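
To make the "many projects" point concrete: the trigger is not one new snapshot directory appearing, but a readiness check across every wiki, roughly like the sketch below (paths and file naming are assumptions for illustration):

```
# Sketch of the readiness check that a single Oozie dataset does not express well:
# the import should only start once every project has published its files for the
# month. Wiki list, paths, and file naming are illustrative assumptions.
from pathlib import Path

LOCAL_DUMPS = Path("/mnt/data/xmldatadumps/public")  # assumed NFS mount


def snapshot_ready(wikis, snapshot):
    """True only if every wiki has at least one history file for the snapshot."""
    for wiki in wikis:
        pattern = "{}-{}-pages-meta-history*.bz2".format(wiki, snapshot)
        if not any((LOCAL_DUMPS / wiki / snapshot).glob(pattern)):
            return False  # this project has not finished dumping yet
    return True
```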

I think Apache NiFi is the usual way to move large local directories to HDFS.

@diego I agree we could use NiFi to copy the files, but it seems like too much work in terms of system setup for this single use case.

Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Change 456654 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] [WIP] Add python script importing xml dumps onto hdfs

https://gerrit.wikimedia.org/r/456654

Change 459780 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update python/refinery/utils/HdfsUtils

https://gerrit.wikimedia.org/r/459780

Change 459780 merged by Joal:
[analytics/refinery@master] Update python/refinery/utils/HdfsUtils

https://gerrit.wikimedia.org/r/459780
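
The refinery HdfsUtils API itself is not shown in this thread; for readers unfamiliar with it, the general shape of such a helper is a thin wrapper around the hdfs CLI, for example (method names here are illustrative, not the actual refinery interface):

```
# Illustrative only: not the actual python/refinery/utils HdfsUtils interface,
# just the general shape of a helper that wraps the hdfs CLI.
import subprocess


class HdfsUtils(object):

    @staticmethod
    def ls(path):
        """List HDFS paths under `path` (wraps `hdfs dfs -ls -C`)."""
        output = subprocess.check_output(["hdfs", "dfs", "-ls", "-C", path])
        return output.decode("utf-8").split()

    @staticmethod
    def mkdir(path):
        """Create an HDFS directory, parents included."""
        subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", path])

    @staticmethod
    def put(local_path, hdfs_path):
        """Copy a local file into HDFS."""
        subprocess.check_call(["hdfs", "dfs", "-put", local_path, hdfs_path])
```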

mforns set the point value for this task to 5. · Oct 8 2018, 4:07 PM

Change 456654 merged by Joal:
[analytics/refinery@master] Add python script importing xml dumps onto hdfs

https://gerrit.wikimedia.org/r/456654

Change 472472 had a related patch set uploaded (by Elukey; owner: Joal):
[operations/puppet@production] Add timer importing page-history dumps to hadoop

https://gerrit.wikimedia.org/r/472472

Change 472472 merged by Elukey:
[operations/puppet@production] Add timer importing page-history dumps to hadoop

https://gerrit.wikimedia.org/r/472472

Excuse me for butting in at this late date but these files are already available from labstore1006,7 to labs instances and on stats100? (I forget which one now). Do you need them to be available somewhere else?

I don't have any objections in principle to having another copy floating around; it would just be nice to make sure it's not redundant.


Hi @ArielGlenn, the reason we need the files on an-coord1001 is that it is the machine responsible for cron/systemd-timer jobs in our infra, while the stat1005 machine is oriented toward user jobs.
Since this task is about productionizing the import of the files, it involves the machine responsible for prod jobs. OK on your side?

I mean, it's fine, but maybe it's better to just provide them as is done on stat100? (5? 7?) via NFS mount from labstore1006 (7?). Looping @Bstorm in for her opinion, as she is one of the point people for those servers. I don't mean to slow this down at all; it's just that if there's a simpler solution than copying them over, maybe we should go for it.

@ArielGlenn, they need to be copied into HDFS inside of Hadoop, not just available on a regular filesystem.
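
In other words, an NFS mount only makes the files readable on the host that mounts it; Hadoop jobs read their input from HDFS. As a rough illustration of what consumes the imported data (the path and tool choice here are assumptions, not the actual downstream job):

```
# Cluster jobs (Spark, Hive, Oozie-launched workflows) read from HDFS, not from a
# filesystem mounted on a single stat box. The HDFS path below is an assumption.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xmldumps-sanity-check").getOrCreate()

# Count raw lines of the imported dump files straight from HDFS.
lines = spark.sparkContext.textFile(
    "hdfs:///wmf/data/raw/mediawiki/xmldumps/20180901/enwiki/*.bz2"
)
print(lines.count())
```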

@JAllemandou, I think it would be fine to schedule the load of the dumps into HDFS from a stat box instead of an-coord1001, if @elukey is ok with it.

@Ottomata, @ArielGlenn - I'm ok with copying the dumps from a stat machine.
Let's see what @elukey thinks of it.

Change 473161 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move import_wikitext_dumps to stat1007

https://gerrit.wikimedia.org/r/473161

Change 473161 merged by Elukey:
[operations/puppet@production] Move import_wikitext_dumps to stat1007

https://gerrit.wikimedia.org/r/473161