We need an easy workflow for moving data files from the analytics cluster to production. These files could be anything, but most often they will be model results shipped to a production system that makes use of them. While we use Kafka as a data bridge in many instances, Kafka is not an option for this use case, as most of the files we are producing or will produce are larger than 10 MB.
For an example of a process that currently produces files which must be pushed to prod manually, see the Search Platform team's workflow for shipping model results calculated in the cluster to Elasticsearch: https://wikitech.wikimedia.org/wiki/Search/MLR_Pipeline
One possible way to do this would be rsync. Analytics would need to provide a workflow, usable by Oozie jobs or similar, in which files are stored in HDFS once they are produced and pushed from there via an rsync mount to a known location in prod (this could be a box similar to a stats box, dedicated to this purpose). Production systems would then pull the files once they are ready. Ideally the workflow would also publish a message to a known Kafka topic (or similar) that notifies consumers when a file is ready to be pulled. This way daemons running on, for example, the Elasticsearch fleet could pull data when it becomes available and upload it.
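To make the notification step more concrete, here is a rough sketch of what the "file is ready" message could contain and how a consumer might check a pulled file against it. Everything in it is an assumption for illustration only: the function names, the JSON fields (`rsync_url`, `size_bytes`, `sha256`), and the idea of including a checksum are not part of any existing workflow, and the actual rsync pull and Kafka produce/consume calls are deliberately left out.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files are never read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_ready_message(path: Path, rsync_url: str) -> bytes:
    """Build the (hypothetical) notification payload for a published file.

    The returned bytes would be the value of a message sent to the agreed topic.
    """
    payload = {
        "rsync_url": rsync_url,  # where consumers can pull the file from
        "size_bytes": path.stat().st_size,
        "sha256": sha256_of(path),
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def verify_pulled_file(message: bytes, local_copy: Path) -> bool:
    """Consumer side: confirm the pulled copy matches what the message announced."""
    payload = json.loads(message.decode("utf-8"))
    return (
        local_copy.stat().st_size == payload["size_bytes"]
        and sha256_of(local_copy) == payload["sha256"]
    )
```

The producer would call `build_ready_message` only after the push to the known location has finished, so that a consumer never sees a notification for a partially transferred file. The checksum lets a daemon on the consuming side reject a truncated or stale copy before uploading it.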