
Moving analysis data from fluorine to analytics cluster
Closed, Resolved · Public

Description

I want to be able to run analysis on the api.log files currently stored on fluorine and get my data to datasets.wikimedia.org, which, as far as I know, means getting my data onto stat1002 or stat1003.
The data I want to move contains no private info:

action=wbgetclaims property stats
11019262 property=P373
  11761 property=P227
   2177 property=P735
   2176 property=P27
    536 property=P1630
    156 property=P31
    150 property=P625
    146 property=P657
     73 property=P715
     73 property=P683
     73 property=P665
     73 property=P662
     73 property=P661
     73 property=P592
     73 property=P235
     73 property=P234
     73 property=P233
     73 property=P232
     73 property=P231
     50 property=P22
     21 property=P25
      8 property=P345
      6 property=P569
      6 property=P40
      4 property=P21
      3 property=P297
      2 property=P35
      2 property=P2
      1 property=P3
      1 property=P1

I briefly spoke to @jcrespo about this, and he said that the best way forward would be to file a ticket with the details of what I need.
My plan is for these stats to be extracted by a cron job and then either written to a file that can be transferred to the analytics cluster, or perhaps written straight into a database on the analytics cluster (see the sketch below).
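
For illustration, the counts above could be produced by a pipeline along these lines (a minimal sketch; the log path and exact line format are assumptions, not the actual setup):

    # Count wbgetclaims requests per property; /a/mw-log/api.log is an assumed path
    grep 'action=wbgetclaims' /a/mw-log/api.log \
      | grep -oE 'property=P[0-9]+' \
      | sort | uniq -c | sort -rn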

So the above is what I would like to be able to do!

I see that some other log archives are copied to the stat servers (but they are much smaller); doing this with the api.log archives would result in 800GB of wasted space.
I would guess it is not possible to access the analytics DBs from fluorine.
Perhaps an rsync from somewhere on fluorine to somewhere on the analytics cluster might be best? Then I could do my analysis and put the output there.
@jcrespo also mentioned firewall rules or custom ssh keys, but I think some sort of rsync might make the most sense?
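
For instance, a pull from the analytics side might look something like this (purely illustrative; the rsync module name and both paths are assumptions, not the real configuration):

    # Hypothetical pull of generated stats from fluorine to a stat host
    rsync -avz fluorine.eqiad.wmnet::mw-log/wbgetclaims-stats/ /srv/wbgetclaims-stats/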

All comments welcome! :)

Event Timeline

Addshore raised the priority of this task from to Medium.
Addshore updated the task description. (Show Details)
Addshore added projects: acl*sre-team, Analytics.
Addshore added subscribers: Addshore, jcrespo.
Addshore updated the task description. (Show Details)
Addshore added a subscriber: Ottomata.

Yes, we can do this. fluorine already has an rsyncd running that allows stat1002 to copy files. This would just be a matter of adding a cron job to rsync them to stat1002.
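
On stat1002, such a cron entry might look roughly like the following (a sketch only; the schedule, rsync module name, and paths are all assumptions):

    # Nightly pull of api.log archives from fluorine; names and paths are assumed
    0 4 * * * rsync -rt fluorine.eqiad.wmnet::mw-log/archive/api.log-* /a/mw-log/archive/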

I'm guessing we don't want to rsync the archived log files themselves (as that is basically 800GB of duplicated data), or does 800GB not matter?

And if we just set up some other directory to rsync across, where should it be and what should it be called? :)

Change 238798 had a related patch set uploaded (by Addshore):
Rsync api log archives from fluorine to stat1002

https://gerrit.wikimedia.org/r/238798

This will actually result in roughly 2.4T at present, since the retention on stat1002 is 90 days (the ~800GB of archives appear to cover about a third of that window, so a 90-day retention roughly triples the footprint).

Change 239830 had a related patch set uploaded (by Ottomata):
Rename fluorine api rsync job to mw-api to avoid conflict with webrequest api log rsync job

https://gerrit.wikimedia.org/r/239830

Change 239830 merged by Ottomata:
Rename fluorine api rsync job to mw-api to avoid conflict with webrequest api log rsync job

https://gerrit.wikimedia.org/r/239830

Change 239840 had a related patch set uploaded (by Ottomata):
Fix wildcard in rsync for api.log

https://gerrit.wikimedia.org/r/239840

Change 239840 merged by Ottomata:
Fix indentation and wildcard in rsync for api.log

https://gerrit.wikimedia.org/r/239840

Addshore claimed this task.