
Moving analysis data from fluorine to analytics cluster
Closed, Resolved · Public

Description

I want to be able to run analysis on the api.log files currently stored on fluorine and get my data to datasets.wikimedia.org, which, as far as I know, means getting my data onto stat1002 or stat1003.
The data I want to move contains no private info:

action=wbgetclaims property stats
11019262 property=P373
  11761 property=P227
   2177 property=P735
   2176 property=P27
    536 property=P1630
    156 property=P31
    150 property=P625
    146 property=P657
     73 property=P715
     73 property=P683
     73 property=P665
     73 property=P662
     73 property=P661
     73 property=P592
     73 property=P235
     73 property=P234
     73 property=P233
     73 property=P232
     73 property=P231
     50 property=P22
     21 property=P25
      8 property=P345
      6 property=P569
      6 property=P40
      4 property=P21
      3 property=P297
      2 property=P35
      2 property=P2
      1 property=P3
      1 property=P1

I briefly spoke to @jcrespo about this, and he said that the best way forward would be to file a ticket with the details of what I need.
My plan is for these stats to be extracted by a cron job and then either written to a file that can be transferred to the analytics cluster, or perhaps written straight into a database on the analytics cluster (see the sketch below).
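
For illustration, the counts above could be produced by a pipeline along these lines (a minimal sketch; the log path and exact line format are assumptions, not the actual setup):

    # Count wbgetclaims requests per property; /a/mw-log/api.log is an assumed path
    grep 'action=wbgetclaims' /a/mw-log/api.log \
      | grep -oE 'property=P[0-9]+' \
      | sort | uniq -c | sort -rn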

So the above is what I would like to be able to do!

I see that some other log archives are copied to the stat servers (but they are much smaller); doing this with the api.log archives would result in 800GB of wasted space.
I would guess it is not possible to access the analytics DBs from fluorine.
Perhaps an rsync from somewhere on fluorine to somewhere on the analytics cluster might be best? Then I could do my analysis and put the output there.
@jcrespo also mentioned firewall rules or custom ssh keys, but I think some sort of rsync might make the most sense?
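
For instance, a pull from the analytics side might look something like this (purely illustrative; the rsync module name and both paths are assumptions, not the real configuration):

    # Hypothetical pull of generated stats from fluorine to a stat host
    rsync -avz fluorine.eqiad.wmnet::mw-log/wbgetclaims-stats/ /srv/wbgetclaims-stats/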

All comments welcome! :)

Event Timeline

Addshore raised the priority of this task from to Medium.
Addshore updated the task description. (Show Details)
Addshore added projects: acl*sre-team, Analytics.
Addshore added subscribers: Addshore, jcrespo.
Addshore updated the task description. (Show Details)
Addshore added a subscriber: Ottomata.

Yes, we can do this. fluorine already has an rsyncd running that allows stat1002 to copy files. This would just be a matter of adding a cron job to rsync them to stat1002.
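
On stat1002, such a cron entry might look roughly like the following (a sketch only; the schedule, rsync module name, and paths are all assumptions):

    # Nightly pull of api.log archives from fluorine; names and paths are assumed
    0 4 * * * rsync -rt fluorine.eqiad.wmnet::mw-log/archive/api.log-* /a/mw-log/archive/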

I'm guessing we don't want to rsync the archived log files themselves (as that is basically 800GB of duplicated data), or does 800GB not matter?

And if we just set up some other directory to rsync across, where should it be and what should it be called? :)

Change 238798 had a related patch set uploaded (by Addshore):
Rsync api log archives from fluorine to stat1002

https://gerrit.wikimedia.org/r/238798

This will actually result in roughly 2.4T at present, since the retention on stat1002 is 90 days (the ~800GB of archives appear to cover about a third of that window, so a 90-day retention roughly triples the footprint).

Change 239830 had a related patch set uploaded (by Ottomata):
Rename fluorine api rsync job to mw-api to avoid conflict with webrequest api log rsync job

https://gerrit.wikimedia.org/r/239830

Change 239830 merged by Ottomata:
Rename fluorine api rsync job to mw-api to avoid conflict with webrequest api log rsync job

https://gerrit.wikimedia.org/r/239830

Change 239840 had a related patch set uploaded (by Ottomata):
Fix wildcard in rsync for api.log

https://gerrit.wikimedia.org/r/239840

Change 239840 merged by Ottomata:
Fix indentation and wildcard in rsync for api.log

https://gerrit.wikimedia.org/r/239840

Addshore claimed this task.