
Support for moving data from HDFS to public http file server
Closed, Resolved · Public · Estimated Story Points: 5

Description

The data available on the public file server, https://analytics.wikimedia.org/published/, is rsync'ed to the webserver host from the /srv/published directories of a few instances (e.g. the stat machines). This is convenient for ad-hoc work but hard to automate, e.g. that mechanism is not available on the airflow instances.

The goal is to support moving data from HDFS to the file server in a standard way. Some options mentioned by @Ottomata: hdfs-rsync -> webserver host directly? or maybe hdfs-rsync -> some stat box’s /srv/published directory.

This ticket will be broken into 2 parts:

  • Decide on how we will implement this change (sprint 11)
  • Implement the agreed-on approach (sprint 12)

Event Timeline

This use case is appearing more and more; we should prioritize it.

To support this use case we would use the hdfs-rsync tool, which mimics rsync (it skips files that have already been copied) but works to and from HDFS.

I suggest we use the /wmf/data/published folder on HDFS to synchronize externally.
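For illustration, a rough sketch of the kind of command this would enable, assuming hdfs-rsync accepts rsync-style source and destination arguments (the local destination path is a placeholder, not a decision; only the --delete flag is mentioned elsewhere in this task):

  # Mirror the HDFS publication folder to local storage on the web host.
  # --delete would remove published files once they are gone from HDFS.
  # (Whether the local side needs a file:// scheme depends on the tool.)
  hdfs-rsync --delete hdfs:///wmf/data/published/ /srv/published-from-hdfs/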

+1 to prioritizing this. My use case for publishing data from HDFS is the following:

  • I use the platform-eng Airflow instance to automate a daily differential-privacy (DP) aggregation of pageview activity.
  • The privatized outputs of the DP aggregations are saved as Hive tables in the differential_privacy database, with columns like:
    country | project | page_id | groupby_count
  • In order to make these tables legible to an outside consumer who doesn't have access to page_ids, I join the page_ids with their corresponding page_title and save the result as CSVs to hdfs:///tmp/country_project_page_daily_dag/<date> using join_titles.hql (see the sketch after this list). This yields a table with columns like:
    country | project | page_id | page_title | groupby_count
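As a purely illustrative sketch of that last step (join_titles.hql and the /tmp output path come from the bullets above; the hive invocation and the date variable name are assumptions):

  # Run the join that adds page_title to the DP output and writes CSVs to HDFS
  hive --hivevar day=2023-04-01 -f join_titles.hql

  # Inspect the CSVs produced for that day
  hdfs dfs -ls /tmp/country_project_page_daily_dag/2023-04-01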

Hope this helps explain why this would be very useful for me! :) Let me know if you have any questions.

Cool! ^ almost sounds like something it would be nice to have in AQS! Not sure if there is desire/bandwidth to do that though. @Milimetric ?

@Ottomata: @Milimetric and I have talked about adding this data to AQS at some point in the short-/mid-term future, but I think we're going to wait for AQS 2.0 to be released before we start work on that

Okay, yeah, then I think the best thing to do would be what @JAllemandou suggests. It is a solution that isn't specific to any Airflow node or user. @lbowmaker we should consider prioritizing this.

It is a solution that isn't specific to any Airflow node or user.

I'm confused; @Htriedman would like to use this from Airflow, so we do need an Airflow operator for it, correct?

We wouldn't need an Airflow operator. Saving the data in the 'published' folder would be enough: an hdfs-rsync mechanism would then copy the files to the serving web server.

Ah, got it. Way better!

JArguello-WMF set the point value for this task to 5.

Hi all! Any updates on this? I'd love to be able to publish the DP data that is currently stuck in the hdfs:///tmp folder :)

JAllemandou moved this task from Next Up to In Progress on the Data Pipelines (Sprint 11) board.

I just took the task - I hope to be done before end of week :)

Starting point for a discussion on how this should be implemented.

We currently have 2 ways of publishing data on the web:

1 - Using dumps.wikimedia.org webservers and storage
For each dataset to be published (pageview, mediacount, unique-devices...) we have a systemd timer set up on the dumps server to regularly synchronize data from a /wmf/data/archive/ HDFS subfolder to the webserver's local storage.
Code: https://github.com/wikimedia/operations-puppet/blob/e1e13a59de3021afaa43c31745abbe348a93017d/modules/dumps/manifests/web/fetches/stats.pp
Pros

  • synchronization jobs already exist
  • large storage available on the publishing host (~100TB as of now)
  • 1-to-1 mapping between the data on HDFS and the published data (this allows using the --delete option of hdfs-rsync to automatically delete data from the publishing side when it's gone from HDFS)

Cons

  • the dumps.wikimedia.org URL usually provides 'formally defined' data, not user one-offs
  • webserver infra is not managed by DE

2 - Using the analytics.wikimedia.org webserver and storage
From each cluster-client machine (statX, an-launcher, notebookX) the /srv/published/{datasets|notebooks} folders are regularly rsynced to the an-web1001 local storage, each client having its own synchronization folder in /srv/published-rsynced/{client}. Then the datasets and notebooks folders from each client are merged into the /srv/analytics.wikimedia.org/published/{datasets|notebooks} folders using hardlinks (this prevents data duplication), allowing a single-folder view of data coming from multiple machines.
Code: https://github.com/wikimedia/operations-puppet/blob/d86003c32fa1b6875140b2c5431bc22556a9382b/modules/statistics/manifests/rsync/published.pp and https://github.com/wikimedia/operations-puppet/blob/d86003c32fa1b6875140b2c5431bc22556a9382b/modules/statistics/manifests/published.pp
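In shell terms, the mechanism is roughly the following (simplified sketch; the host name, rsync module name and flags are placeholders, and the real logic lives in the puppet code linked above):

  # 1. Each client's /srv/published tree lands in its own staging folder on an-web1001
  rsync -a stat1008.eqiad.wmnet::published/ /srv/published-rsynced/stat1008/

  # 2. The per-client trees are merged into the public docroot using hardlinks,
  #    so files are not duplicated on disk (the real 'hardsync' script also handles
  #    conflicts and cleanup)
  cp -al /srv/published-rsynced/stat1008/datasets/. /srv/analytics.wikimedia.org/published/datasets/
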
Pros

  • The URL already hosts user one-off data
  • webserver infra is managed by DE

Cons

  • The file-merging process prevents automatically deleting published files when they are removed from HDFS; a manual operation is needed. This could be overcome by having a dedicated folder for HDFS synchronization (not the existing datasets or notebooks ones)
  • The available storage for publication is relatively small (~270GB as of now)

Let's discuss the above and make a decision :) Ping @Ottomata and @BTullis

I'd go with Option 2 for this if we can. Option 1 is nice, but I think putting things on dumps.wikimedia.org can and should require more 'formal' coordination as you say. We need more data product management! :)

For option 2, I think if we add (and maybe rename/refactor) a dumps::web::fetches::analytics::job to analytics-web that hdfs-rsyncs to /srv/published-rsynced/analytics-hadoop or something, the hardsync script will automatically make that stuff available at analytics.wikimedia.org
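Concretely, the new job on an-web1001 would boil down to something like this (sketch only; the "or something" staging path above is taken at face value, and the exact flags are assumptions pending the puppet change):

  # Sync the HDFS publication folder into a dedicated staging folder,
  # which the existing hardsync/hardlink merge already picks up
  hdfs-rsync --delete hdfs:///wmf/data/published/ /srv/published-rsynced/analytics-hadoop/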

I'd go with Option 2 for this if we can.

Great - let's go for that :)

I'll implement the solution as you mention.

The only downside I see in this approach is data deletion, if/when some needs to happen.
We'll reassess if this problem occurs more than we think is acceptable.

Yeah but I guess it is the same problem for the stat box synced data too :/

Absolutely - to me the fact that published data is not monitored/reviewed is problematic :)

Folders on HDFS created:
hdfs://analytics-hadoop/wmf/data/published/datasets
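(For reference, creating such a folder would look roughly like the following; the ownership and permissions actually used on the cluster are not shown in this task.)

  # Illustrative only
  hdfs dfs -mkdir -p /wmf/data/published/datasets
  hdfs dfs -chmod -R 775 /wmf/data/published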

Absolutely - to me the fact that published data is not monitored/reviewed is problematic :)

Lifecycle! Sunsetting! :D

Change 910761 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Refactor dumps::web::fetches::analytics::job

https://gerrit.wikimedia.org/r/910761

Change 911855 had a related patch set uploaded (by Ottomata; author: Ottomata):

[labs/private@master] Add dummy keytab an-web1001.eqiad.wmnet/analytics.keytab

https://gerrit.wikimedia.org/r/911855

Change 911855 merged by Ottomata:

[labs/private@master] Add dummy keytab an-web1001.eqiad.wmnet/analytics.keytab

https://gerrit.wikimedia.org/r/911855

Change 911875 had a related patch set uploaded (by Ottomata; author: Ottomata):

[labs/private@master] Move an-web1001 keytabs to proper directory

https://gerrit.wikimedia.org/r/911875

Change 911875 merged by Ottomata:

[labs/private@master] Move an-web1001 keytabs to proper directory

https://gerrit.wikimedia.org/r/911875

Change 910761 merged by Ottomata:

[operations/puppet@production] Refactor dumps::web::fetches::analytics::job

https://gerrit.wikimedia.org/r/910761

Change 911913 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] hdfs_rsync job fixes

https://gerrit.wikimedia.org/r/911913

Change 911913 merged by Ottomata:

[operations/puppet@production] hdfs_rsync job fixes

https://gerrit.wikimedia.org/r/911913

Change 911915 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] hdfs_rsync - ensure old renamed systemd timers and script are absent

https://gerrit.wikimedia.org/r/911915

Change 911915 merged by Ottomata:

[operations/puppet@production] hdfs_rsync - ensure old renamed systemd timers and script are absent

https://gerrit.wikimedia.org/r/911915

Change 911916 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] hdfs_rsync - Remove absented

https://gerrit.wikimedia.org/r/911916

Change 911916 merged by Ottomata:

[operations/puppet@production] hdfs_rsync - Remove absented

https://gerrit.wikimedia.org/r/911916

This is done :)
@Htriedman you can now move your files to hdfs:///wmf/data/published/datasets/... and they'll be synchronized to the https://analytics.wikimedia.org/published/datasets/... URL
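For example, publishing one day of the CSVs mentioned earlier could look roughly like this (the country_project_page subfolder name is illustrative; only the /wmf/data/published/datasets root and the /tmp source path come from this task):

  # Copy a day's output from the temporary location into the published tree
  hdfs dfs -mkdir -p /wmf/data/published/datasets/country_project_page
  hdfs dfs -cp /tmp/country_project_page_daily_dag/2023-04-01 \
      /wmf/data/published/datasets/country_project_page/

  # After the next sync run, it should be served under
  # https://analytics.wikimedia.org/published/datasets/country_project_page/2023-04-01/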

Change 912316 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] hdfs_tools - Remove reference to non existent profile::analytics::hdfs_tools

https://gerrit.wikimedia.org/r/912316

Change 912316 merged by Ottomata:

[operations/puppet@production] hdfs_tools - Remove reference to non existent profile::analytics::hdfs_tools

https://gerrit.wikimedia.org/r/912316