I want to be able to run analysis on the api.log files currently stored on fluorine and get my data onto datasets.wikimedia.org, which as far as I know means getting it to stat1002 or stat1003.
The data I want to move contains no private info:
action=wbgetclaims property stats (count, property):
11019262 property=P373
11761 property=P227
2177 property=P735
2176 property=P27
536 property=P1630
156 property=P31
150 property=P625
146 property=P657
73 property=P715
73 property=P683
73 property=P665
73 property=P662
73 property=P661
73 property=P592
73 property=P235
73 property=P234
73 property=P233
73 property=P232
73 property=P231
50 property=P22
21 property=P25
8 property=P345
6 property=P569
6 property=P40
4 property=P21
3 property=P297
2 property=P35
2 property=P2
1 property=P3
1 property=P1
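For reference, the counts above look like `sort | uniq -c` output. A minimal sketch of how they could be produced, assuming api.log lines contain `action=wbgetclaims` and a `property=Pnnn` parameter (the real log format on fluorine may differ, and the sample file here is made up):

```shell
#!/bin/sh
# Hypothetical sample of api.log lines; real entries on fluorine will differ.
cat > /tmp/api.sample.log <<'EOF'
2015-01-01 api: action=wbgetclaims&property=P373&entity=Q1
2015-01-01 api: action=wbgetclaims&property=P373&entity=Q2
2015-01-01 api: action=wbgetclaims&property=P227&entity=Q3
2015-01-01 api: action=query&titles=Foo
EOF

# Keep only wbgetclaims requests, extract the property parameter,
# and count occurrences, most frequent first.
grep 'action=wbgetclaims' /tmp/api.sample.log \
  | grep -o 'property=P[0-9]*' \
  | sort | uniq -c | sort -rn
```

On the sample above this prints `2 property=P373` and `1 property=P227`.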
I briefly spoke to @jcrespo about this and he said that the best way forward would be to file a ticket with the details of what I need.
I would plan on these stats being pulled out by a cron job and then either written to a file that can be transferred to the analytics cluster, or perhaps written straight into a db on the analytics cluster.
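As a sketch, the cron entry might look something like the following; the script name and output path are hypothetical placeholders, not files that exist on fluorine today:

```shell
# Hypothetical crontab entry: run the stats extraction daily at 03:00
# and append to a dated file. Note that % must be escaped in crontab.
# m h dom mon dow  command
0 3 * * *  /usr/local/bin/wbgetclaims-stats.sh >> /srv/stats/wbgetclaims-$(date +\%F).tsv 2>&1
```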
So the above is what I would like to be able to do!
I see that some other log archives are copied to the stat servers (but they are much smaller); doing the same with the api.log archives would result in 800GB of wasted space.
I would guess it is not possible to access the analytics dbs from fluorine.
Perhaps an rsync from somewhere on fluorine to somewhere on the analytics cluster might be best? Then I could do my analysis on fluorine and put only the output there.
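Something along these lines, purely as a sketch; the hostname and both paths are assumptions, and whether this works at all depends on the firewall rules / ssh keys mentioned below:

```shell
# Hypothetical one-way push of the generated stats files from fluorine
# to a stat box; -a preserves attributes, -z compresses in transit.
rsync -avz /srv/stats/ stat1002.eqiad.wmnet:/srv/incoming/wbgetclaims-stats/
```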
@jcrespo also mentioned firewall rules or custom ssh keys but I think some sort of rsync might make the most sense?
All comments welcome! :)