Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Milimetric | T186559 Provide data dumps in the Analytics Data Lake |
| Resolved | | JAllemandou | T202489 Copy monthly XML files from public-dumps to HDFS |
Event Timeline
Oo, ok, how do you do this usually? Can you list the paths and files and how you get them?
What I've done manually until now is referenced here: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/409960
I have started a Python script to do pretty much the same thing, with more flexibility and incremental growth.
Comments/ideas welcome :)
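A minimal sketch of the idea (not the actual refinery script — the paths, file pattern, and HDFS layout below are assumptions): list the local XML dump files for one wiki/snapshot, skip the ones already in HDFS, and put the rest, so re-runs are incremental.

```python
import os
import subprocess

# Hypothetical locations for illustration; the real ones are defined in the
# analytics/refinery script and its puppet configuration.
LOCAL_DUMPS_DIR = "/mnt/data/xmldatadumps/public"
HDFS_BASE_DIR = "/wmf/data/raw/mediawiki/xmldumps"


def hdfs_path_exists(path):
    # `hdfs dfs -test -e` returns 0 when the path exists.
    return subprocess.call(["hdfs", "dfs", "-test", "-e", path]) == 0


def import_snapshot(wiki, snapshot):
    """Copy the pages-meta-history files of one wiki/snapshot into HDFS,
    skipping files already imported so the job can be re-run cheaply."""
    local_dir = os.path.join(LOCAL_DUMPS_DIR, wiki, snapshot)
    hdfs_dir = "{}/{}/{}".format(HDFS_BASE_DIR, wiki, snapshot)
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir])

    for name in sorted(os.listdir(local_dir)):
        if "pages-meta-history" not in name or not name.endswith(".bz2"):
            continue
        hdfs_path = "{}/{}".format(hdfs_dir, name)
        if hdfs_path_exists(hdfs_path):
            continue  # already imported in a previous run
        subprocess.check_call(
            ["hdfs", "dfs", "-put", os.path.join(local_dir, name), hdfs_path]
        )


if __name__ == "__main__":
    import_snapshot("enwiki", "20180901")
```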
Cool, I wonder if we can somehow schedule this with Oozie instead of cron. Can Oozie look for local path creation instead of HDFS? I would assume so.
@diego I agree we could use NiFi to copy the files, but it seems too much work in terms of system setup for this single use-case.
Change 456654 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] [WIP] Add python script importing xml dumps onto hdfs
Change 459780 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update python/refinery/utils/HdfsUtils
Change 459780 merged by Joal:
[analytics/refinery@master] Update python/refinery/utils/HdfsUtils
Change 456654 merged by Joal:
[analytics/refinery@master] Add python script importing xml dumps onto hdfs
Change 472472 had a related patch set uploaded (by Elukey; owner: Joal):
[operations/puppet@production] Add timer importing page-history dumps to hadoop
Change 472472 merged by Elukey:
[operations/puppet@production] Add timer importing page-history dumps to hadoop
Excuse me for butting in at this late date, but these files are already available from labstore1006/7 to labs instances and on stat100? (I forget which one now). Do you need them to be available somewhere else?
I don't have any objections on principle to having another copy floating around, it would just be nice to make sure it's not redundant.
Hi @ArielGlenn, the reason we need the files on an-coord1001 is that it is the machine responsible for cron/systemd-timer jobs in our infra, while the stat1005 machine is oriented toward user jobs.
Since this task is about productionizing the import of the files, it involves the machine responsible for prod jobs. Is that ok on your side?
I mean, it's fine, but maybe it's better to just provide them as is done on stat100? (5? 7?) via NFS mount from labstore1006 (7?). Looping @Bstorm in for her opinion, as she is one of the point people for those servers. I don't mean to slow this down at all; it's just that if there's a simpler solution than copying them over, maybe we should go for it.
@ArielGlenn, they need to be copied into HDFS inside of Hadoop, not just available on a regular filesystem.
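For instance (assuming a Spark reader and made-up paths here; any Hadoop job behaves the same way), a cluster job can only read the dumps once they are in HDFS — an NFS mount on a single stat host is not visible to the Hadoop workers:

```python
# Illustration only: downstream jobs in the cluster read from HDFS,
# not from the local/NFS filesystem of one stat host.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xmldumps-check").getOrCreate()

# This works only once the dumps have been put into HDFS; a local path such as
# /mnt/data/xmldatadumps/... on stat1007 is invisible to the Hadoop workers.
df = spark.read.text("hdfs:///wmf/data/raw/mediawiki/xmldumps/enwiki/20180901/")
print(df.count())
```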
@JAllemandou, I think it would be fine to schedule the load of the dumps into HDFS from a stat box instead of an-coord1001, if @elukey is ok with it.
@Ottomata, @ArielGlenn - I'm ok with copying the dumps from a stat machine.
Let's see what @elukey thinks of it.
Change 473161 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move import_wikitext_dumps to stat1007
Change 473161 merged by Elukey:
[operations/puppet@production] Move import_wikitext_dumps to stat1007