
Setup a mirror for R language dependencies (CRAN)
Closed, Declined | Public | 10 Estimated Story Points

Description

We are setting up CI jobs that install dependencies from http://cran.us.r-project.org . On review, @mpopov suggested setting up a mirror to save some bandwidth and remove a dependency on a third party.

https://cran.r-project.org/mirror-howto.html

The mirror used 120GB in October 2012; I have sent an email to ask for the actual size.

Need to send a public SSH key and then we can:

rsync -e "ssh" -rtlzv --delete cran-rsync@cran.r-project.org: /srv/mirrors/cran

We already have puppet logic to add that and cron it.
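
A full sync can take longer than the cron interval, so two runs might overlap. A minimal sketch of a wrapper that guards against that, assuming a hypothetical lock path and function name (neither comes from our Puppet code); the rsync invocation in the usage comment is the one above:

```shell
#!/bin/sh
# Hypothetical cron wrapper: lets only one mirror sync run at a time.
# LOCK_DIR is an assumed path, not something defined in the Puppet repo.
LOCK_DIR="${LOCK_DIR:-/tmp/cran-mirror.lock}"

run_locked() {
    # mkdir is atomic: it fails if another run already holds the lock.
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "cran mirror sync already running, skipping" >&2
        return 1
    fi
    "$@"
    status=$?
    rmdir "$LOCK_DIR"
    return "$status"
}

# Usage from cron (not executed here):
#   run_locked rsync -e "ssh" -rtlzv --delete cran-rsync@cran.r-project.org: /srv/mirrors/cran
```

If a run is killed before it reaches rmdir, the lock directory stays behind and has to be removed by hand; that is a known limitation of this simple approach.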

SRE: would it make sense to mirror it and consume a few hundred GBytes of disk?

Event Timeline

mpopov added subscribers: Gehel, Ottomata.

@hashar Thank you for making this ticket and emailing the R Foundation/R Development Core Team! Heh, yesterday I emailed @Ottomata & @Gehel asking if setting up our own CRAN mirror would be a reasonable thing.

In summary it was:

  • Our Puppet code relies on external CRAN mirrors (e.g. UC Berkeley) that we can't guarantee to be up, available, and secure
  • We probably don't want to use up other institutions' bandwidth with our CI jobs

From @Ottomata at https://gerrit.wikimedia.org/r/#/c/366170/:

Unless ops policy/preferences has changed, I dunno if this is gonna fly! :o

We don't automate pulling in any remote dependencies except from a few trusted vendors (mostly just Debian). For everything else, ops requires that the deps are somehow statically locked down on our servers.

For Java, we use archiva to semi-manually mirror artifacts from remote maven repos.

Setting up an R mirror could work, but I don’t think we’d be allowed to just blindly rsync everything from upstream. Instead, packages and dependencies would have to be (semi?) manually copied to the mirror.
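
One way the semi-manual approach could look is a reviewed list of pinned packages that gets turned into download URLs. This is only a sketch: the list format ("name version" per line), the function names and the CRAN_BASE variable are all made up for illustration, and nothing here downloads anything.

```shell
#!/bin/sh
# Sketch of a "locked down" package list: one "name version" pair per line.
# CRAN_BASE and the list format are assumptions for illustration.
CRAN_BASE="${CRAN_BASE:-https://cran.r-project.org}"

# Print the source tarball URL for a package at its current CRAN version.
tarball_url() {
    printf '%s/src/contrib/%s_%s.tar.gz\n' "$CRAN_BASE" "$1" "$2"
}

# Read "name version" lines from stdin, skipping blank lines and # comments,
# and print one URL per reviewed package.
allowlist_urls() {
    while read -r name version; do
        case "$name" in ''|'#'*) continue ;; esac
        tarball_url "$name" "$version"
    done
}

# Example with a made-up pin:
#   printf 'shiny 1.0.3\n' | allowlist_urls
```

Superseded versions live under src/contrib/Archive/<name>/ on CRAN rather than directly in src/contrib/, so a real tool would need to handle both locations, and the local repository would still need its PACKAGES index regenerated after copying.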

From @Gehel:

To answer Ottomata: the shiny_server module is used only on labs at the moment (to expose discovery dashboards). Labs constraints are more lax than production, and this code is fine there (as discussed with Moritzm). Downloading from external sources will fail in production (it would need a proxy) so we are good so far.

SRE: Since there are potential security issues with running rsync, would it be possible to use Microsoft's CRAN snapshots? https://mran.microsoft.com/timemachine/

Update on size: According to an acquaintance, as of 2017-07-25:

/Volumes/cran on my home CRAN (which is a full mirror now) is 153 GB

He also showed the breakdown:

@bearloga
104 ./bin/
1 ./contrib/
1 ./doc/
0 ./help/
1 ./html/
1 ./mirmon/
45 ./src/
4 ./web/

./bin/
8 ./linux/
1 ./macos/
53 ./macosx/
44 ./windows/
So apparently the bulk of that is binary builds for macOS/Linux/Windows. Since we install from source on our machines anyway, we could have a much slimmer source-only mirror, and it'd be around 50GB! O_O
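
To double check the "around 50GB" figure, here is a small sketch that sums the top-level entries other than ./bin/ and prints what a source-only sync command might look like. The figures are copied from the breakdown and assumed to be in GB; the --exclude=/bin/ filter is my assumption about how to skip the binaries and has not been tried against the real CRAN rsync server.

```shell
#!/bin/sh
# Top-level figures copied from the breakdown above (assumed to be GB).
BIN=104; CONTRIB=1; DOC=1; HELP=0; HTML=1; MIRMON=1; SRC=45; WEB=4

# Sum of everything except ./bin/.
non_bin_total() {
    echo $((CONTRIB + DOC + HELP + HTML + MIRMON + SRC + WEB))
}

# Print (do not run) the rsync command from the description with ./bin/ excluded.
source_only_cmd() {
    echo 'rsync -e "ssh" -rtlzv --delete --exclude=/bin/ cran-rsync@cran.r-project.org: /srv/mirrors/cran'
}

echo "without ./bin/: $(non_bin_total)"
source_only_cmd
```

This reports 53 for everything outside ./bin/, which matches the rough 50GB estimate.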

@mpopov using twitter to get the size was a smart move :-]

Assuming 150GB: mirrors.wikimedia.org is hosted on sodium, and the grafana server board shows its /srv has 2.4 TBytes used and 8.4 TBytes free.
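
For a sense of scale, a quick sketch of how much of that free space a mirror would take. The helper name is made up, and 1 TByte is taken as 1000 GB:

```shell
#!/bin/sh
# Share of the free space on /srv that a mirror of a given size would take.
# Arguments: mirror size in GB, free space in TB (1 TB taken as 1000 GB).
percent_of_free() {
    LC_ALL=C awk -v gb="$1" -v tb="$2" 'BEGIN { printf "%.1f\n", gb / (tb * 1000) * 100 }'
}

percent_of_free 150 8.4
```

That comes out at roughly 1.8% of the free space on /srv.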

What prompted me to file this task was getting our CI to hit a local resource, after Mikhail pointed out it would save some bandwidth. Most importantly, it stops CI from depending on a third-party mirror. I will probably file tasks to request mirroring of other repos such as rubygems/pypi/nodejs, depending on whether that fits/makes sense.


For production, the packages will certainly not be installed from a mirror of files uploaded by random people. Besides the Archiva solution for Java, we could either:

  • use a git repo holding the compiled packages and deploy using scap (as done by the nodejs services and the Python-based ORES)
  • build Debian packages for the dependencies

But I think that is slightly out of scope of this task ;o]

maybe one day if we look again at R