The data available on the public file server, https://analytics.wikimedia.org/published/, is rsync'ed to the webserver host from the /srv/published directories of a few instances (e.g. the stat machines). This is convenient for ad-hoc work but hard to automate: for example, the mechanism is not available on the Airflow instances.
The goal is to support moving data from HDFS to the file server in a standard way. Some options mentioned by @Ottomata: hdfs-rsync -> webserver host directly, or maybe hdfs-rsync -> some stat box’s /srv/published directory.
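To make the second option more concrete, here is a minimal sketch of a helper that an automated job could use to copy one HDFS path into a /srv/published subdirectory. It is illustrative only and not a proposed implementation: it swaps in the stock `hdfs dfs -get -f` command rather than hdfs-rsync, and the function name, the example HDFS path and the `datasets/example` subdirectory are all hypothetical.

```python
import posixpath
import shlex

# Directory named in this ticket; its contents are rsync'ed to the webserver host.
PUBLISHED_ROOT = "/srv/published"


def build_publish_command(hdfs_path, dataset_dir):
    """Build the argv that copies hdfs_path into PUBLISHED_ROOT/dataset_dir.

    Hypothetical helper: uses `hdfs dfs -get -f` (overwrite if present)
    as a stand-in for whatever tool is finally agreed on.
    """
    target = posixpath.normpath(posixpath.join(PUBLISHED_ROOT, dataset_dir))
    # Refuse targets that would land outside the published tree.
    if not target.startswith(PUBLISHED_ROOT + "/"):
        raise ValueError(f"target outside {PUBLISHED_ROOT}: {target}")
    return ["hdfs", "dfs", "-get", "-f", hdfs_path, target]


if __name__ == "__main__":
    # Hypothetical source path and destination subdirectory.
    cmd = build_publish_command("/user/example/report.tsv", "datasets/example")
    # Print the command instead of running it; a real job could pass
    # the list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Keeping the command construction separate from its execution makes the path check testable without a Hadoop client, whichever transfer tool ends up being chosen.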
This ticket will be broken into 2 parts:
- Decide on how we will implement this change (sprint 11)
- Implement the agreed-upon approach (sprint 12)