Set up cron job on labstore to rsync data from stat* boxes into labs.
Closed, Resolved · Public

Description

Ellery (and other researchers) have asked for a way to get large public datasets generated in the production analytics network into labs, where they can easily host their own interfaces to this data.

Yuvi and I talked about this, and we decided to set up a cron job on labstore that rsyncs data from stat1002.

statistics rsync server modules will need to allow reads from labstore, and the cron job will rsync a directory in /srv on stat1002.

Ottomata created this task. Jul 31 2015, 1:53 PM
Ottomata updated the task description.
Ottomata raised the priority of this task to Normal.
Ottomata claimed this task.
Ottomata added subscribers: Ottomata, yuvipanda, ellery.
Restricted Application added a project: Labs. Jul 31 2015, 1:53 PM
Restricted Application added a subscriber: Aklapper.

Change 228251 had a related patch set uploaded (by Ottomata):
Allow labstore1003 to rsync from stat servers

https://gerrit.wikimedia.org/r/228251

That patch ^ will allow labstore1003 to rsync from stat*::srv/...

No idea where to put a cronjob on labstore1003 in puppet.

Pinging @ArielGlenn to ask about rsync - where are the rsyncs for the dumps?

yuvipanda set Security to None.

We'll need to have strict and clear guidelines to make sure we don't leak private data. Paging @DarTar, @Halfak and @leila

Oof, am looking at all that. I really am not excited about setting up another cron job + script that uses ps to double-check that it is not already running before starting.
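For illustration only, a minimal sketch of one alternative to the ps check: taking a non-blocking exclusive file lock at the start of the job. This is not what the thread ended up doing. The helper name `try_lock` and the lock file path are hypothetical, and `fcntl.flock` is only available on Unix-like systems.

```python
import fcntl
import os
import tempfile


def try_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


if __name__ == "__main__":
    # Hypothetical lock file location, chosen only for this example.
    path = os.path.join(tempfile.gettempdir(), "stat-rsync-example.lock")
    lock = try_lock(path)
    if lock is None:
        print("previous run still active, skipping")
    else:
        print("lock acquired, the rsync would run here")
        lock.close()  # closing the file releases the lock
```

The lock disappears when the process exits, so a crashed run cannot leave a stale marker behind the way a pid file can.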

@yuvipanda, the more I think about it, the more I would like to keep this simple. I'm again wanting to just set up an rsync module on labstore1003 that avoids NFS and crons. We should just let people rsync there from stat1002/stat1003. That would also keep us from having to sync multiple directories on two (or more) stat hosts.

@Ottomata it isn't super hard to do, I can write it up if you'd like :) I think the 'one on one' correspondence between a stat host and the labstore will be super valuable and we should stick to it. We can just sync them to different folders from different machines.

Change 229262 had a related patch set uploaded (by Ottomata):
Set up writeable rsync module and NFS export of /srv/statistics to allow sharing of public data from stat boxes to labs

https://gerrit.wikimedia.org/r/229262

Change 229265 had a related patch set uploaded (by Yuvipanda):
labs: Allow projects to opt into a 'statistics' NFS mount

https://gerrit.wikimedia.org/r/229265

Change 229265 merged by Yuvipanda:
labs: Allow projects to opt into a 'statistics' NFS mount

https://gerrit.wikimedia.org/r/229265

Change 229262 merged by Ottomata:
labs: Setup /srv/statistics for rsync from stats hosts

https://gerrit.wikimedia.org/r/229262

@mark or @akosiaris or @faidon, we'll need a hole punched in the Analytics VLAN ACL for this.

Can you allow Analytics network to talk to labstore1003.eqiad.wmnet 10.64.4.10 to rsync on TCP port 873?
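Once the ACL is changed, a check along these lines, run from a host in the Analytics network, would show whether the port is reachable. This is a sketch under assumptions: the function name `can_connect` is made up for the example, and it only tests that a TCP connection opens, not that the rsync daemon accepts the module.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Usage would be something like `can_connect("labstore1003.eqiad.wmnet", 873)`, using the host and port from the request above.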

Danke!

I am gonna recap this a bit just to make sure I've understood correctly.

  • People are already moving data from the production stat* boxes to labs via scp or other means
  • The data is already aggregated/sanitized/anonymized/whatever so that's deemed OK
  • Since that already happens via other means anyway, we seek to endorse it and allow it to happen easier directly in our infrastructure. For this to happen, we setup rsync between labstore (NFS for labs projects) and analytics network.

What I am concerned about though is non-aggregated/sanitized/anonymized/whatever data making it to labs. And if the process of moving data around and rsyncing it is manual, that is pretty much bound to happen by mistake at some point. So, I am kind of worried about this. Is there any way we could set up a process that minimizes the chances of this happening?
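As a purely illustrative sketch of one such guard (not something proposed or adopted in this thread): a wrapper around the push could refuse any source path that is not beneath a single designated public directory. The function name and the example directory `/srv/public-datasets` are hypothetical. The check is lexical only, so it does not follow symlinks, and it cannot tell whether a file placed inside the public directory is actually safe to publish.

```python
from pathlib import PurePosixPath


def is_under_public_root(candidate, public_root):
    """Return True if candidate is public_root itself or lies beneath it."""
    root = PurePosixPath(public_root)
    path = PurePosixPath(candidate)
    # Refuse relative paths and any ".." component rather than resolving them.
    if not path.is_absolute() or ".." in path.parts:
        return False
    return path == root or root in path.parents
```

A guard like this only narrows where mistakes can happen; the review practice the researchers describe would still be what decides whether the data is public.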

Gonna have to ping @kevinator and @DarTar on that one.

Note that people already have the ability to rsync things to http://datasets.wikimedia.org from stat boxes.

Also, note many of the researchers that are asking for this are ones who are intensely involved in privacy policy discussion and formation, so I think they know what can and can't be public, more so than I do. I know that isn't much consolation, because so many people have stat box access and who knows who will in the future.

Change 228251 abandoned by Ottomata:
Allow labstore1003 to rsync from stat servers

Reason:
We did this a slightly different way in another change.

https://gerrit.wikimedia.org/r/228251

Halfak added a comment. Aug 7 2015, 9:03 PM

Hey folks. I just wanted to hop in to +1. I put a new dataset up on datasets.wikimedia.org on a weekly basis. See the dates associated with the files here: http://datasets.wikimedia.org/public-datasets/enwiki/etc/

We have a practice of discussing anything with potential privacy issues before posting anything publicly. It seems like this labs Rsync could apply to the same policy.

Really, maybe we should just sync the 'public-datasets' directory and call it done since that directory is already 100% public anyway -- we wouldn't be introducing any new privacy/security concerns.

Hey @Halfak, glad you joined us on this one.

> Hey folks. I just wanted to hop in to +1. I put a new dataset up on datasets.wikimedia.org on a weekly basis. See the dates associated with the files here: http://datasets.wikimedia.org/public-datasets/enwiki/etc/

Just curious, manually or automagically?

> We have a practice of discussing anything with potential privacy issues before posting anything publicly. It seems like this labs Rsync could apply to the same policy.

Yeah sure, my point is not about policies though. Please do note I am not against this task happening. It will obviously be useful or else it would not be asked for. It's about how we can minimize the chances of a mistake happening.

> Really, maybe we should just sync the 'public-datasets' directory and call it done since that directory is already 100% public anyway -- we wouldn't be introducing any new privacy/security concerns.

Not sure I follow here, care to elaborate?

My understanding is that halfak is suggesting that we automatically sync datasets.wikimedia.org (which lives on dataset*** hosts, IIRC?) with the labstore1003 mount, instead of having it be a separate rsync module that people can push to.

> My understanding is that halfak is suggesting that we automatically sync datasets.wikimedia.org (which lives on dataset*** hosts, IIRC?) with the labstore1003 mount, instead of having it be a separate rsync module that people can push to.

If that's the case, I am fine with that.

which directories do you want synced over?

datasets.wikimedia.org lives on stat1001. The contents of it are rsynced from various /srv locations on stat1002 and stat1003. Halfak is talking about syncing the public-datasets directory, which originates from stat1003. @ellery wants a way to get data from stat1002. It looks like aggregate-datasets comes from stat1002. We could just sync the whole of /srv/datasets.wikimedia.org from stat1001 as Yuvi suggests, but that would introduce an even larger lag waiting for 2 or 3 crons to run (stat1002 -> stat1001 -> labstore1003).

I think that folks would like to be able to push data on their own schedules, so they don't have to wait for some cron to run in order to continue their work in labs.
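A rough sketch of what such an on-demand push from a stat host could look like, written as a small helper that only builds the rsync command line. The module name `statistics`, the source directory, and the helper itself are assumptions made for the example; the thread only establishes that the target is a writeable rsync module backed by /srv/statistics on labstore1003.

```python
import shlex


def build_push_command(source_dir, host, module, dest_subdir):
    """Return an argv list that pushes source_dir to an rsync daemon module."""
    # A trailing slash on the source copies the directory's contents.
    source = source_dir.rstrip("/") + "/"
    # host::module/path is rsync's syntax for talking to an rsync daemon.
    target = f"{host}::{module}/{dest_subdir.strip('/')}/"
    return ["rsync", "--archive", "--verbose", source, target]


if __name__ == "__main__":
    # Hypothetical source directory and module name, for illustration only.
    cmd = build_push_command(
        "/srv/example-dataset",
        "labstore1003.eqiad.wmnet",
        "statistics",
        "example-dataset",
    )
    # Print the command instead of running it.
    print(shlex.join(cmd))
```

Building the argv list separately from running it keeps the example safe to execute anywhere and makes the exact command easy to review before it touches production data.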

And poked again. I'd be lying if I said I am not still ambivalent on this. I finally understood the use cases in mind. I was derailed for a while by the datasets comments but it's obvious now to me it's not the best way forward. In fact it became obvious we are talking mostly about speeding up one-off moves of public data into labs.

Looking back at the discussion both on this task as well as on IRC, my concern about private data making it into labs by mistake has not been addressed. That being said, I came to understand that every researcher/analyst already takes the burden of not allowing that mistake to happen when publishing public datasets.

With all that in mind, I think we should allow this to happen. So I am gonna open up the network hole in analytics VLAN to labstore. Don't make me regret this please.

akosiaris closed this task as Resolved. Aug 21 2015, 4:27 PM

Done and checked. Resolving

ellery added a comment. Sep 8 2015, 3:43 PM

Thanks Otto!