
[Commons Impact Metrics] Create Airflow job that generates the public dumps
Open, Low, Public, 8 Estimated Story Points

Description

After the base pipeline is ready (T358699), we should create an Airflow DAG that formats and writes the monthly dump files to a public location.
This work includes writing SparkSQL queries to format the data into the desired file shape.
The DAG should execute those queries, and then archive the results in their final place.
This DAG should also be monthly, and run right after the base pipeline finishes (a sensor should wait for the base datasets to be present for the month in question).
Initially, each base dataset should have its own dump, and the contents of each dump should be pretty much the same as the base dataset.

Maybe we don't need an extra DAG for this; maybe the dump creation can be done in the original pipeline DAG?
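
For illustration, here's a minimal sketch of what such a DAG could look like, using vanilla Airflow operators rather than the helpers our production airflow-dags repo would actually use; the table name, HQL file, HDFS paths and partition spec below are placeholder assumptions:

# Minimal sketch, NOT the production DAG: wait for the monthly base dataset,
# run the formatting query, then archive the result. The table name, HQL file
# name, and HDFS paths are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.hive.sensors.hive_partition import HivePartitionSensor

with DAG(
    dag_id="commons_impact_metrics_dumps_monthly",
    schedule="@monthly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:

    # Sensor: wait for the base dataset to have a partition for the month in question.
    wait_for_base = HivePartitionSensor(
        task_id="wait_for_category_metrics_snapshot",
        table="wmf.commons_category_metrics_snapshot",  # assumed table name
        partition="year_month='{{ data_interval_start.strftime('%Y-%m') }}'",  # assumed partition spec
        poke_interval=60 * 60,
    )

    # Run the SparkSQL query that reshapes the base dataset into the dump file shape.
    format_dump = BashOperator(
        task_id="format_category_metrics_dump",
        bash_command=(
            "spark3-sql -f commons_category_metrics_to_dump.hql "  # hypothetical query file
            "-d year_month={{ data_interval_start.strftime('%Y-%m') }}"
        ),
    )

    # Move the query output to its final public archive location on HDFS.
    archive_dump = BashOperator(
        task_id="archive_category_metrics_dump",
        bash_command=(
            "hdfs dfs -mv /tmp/commons_dumps/{{ data_interval_start.strftime('%Y-%m') }}/* "
            "/wmf/data/archive/commons/category_metrics_snapshot/"
        ),
    )

    wait_for_base >> format_dump >> archive_dump

Whether this lives in its own DAG or as extra tasks appended to the base pipeline DAG (per the question above) only changes where these tasks are declared, not their shape.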

Tasks:

  • Design the format of the dumps (TSV? compression? naming? location? etc.)
  • Write the queries that format the base Commons Impact Metrics datasets into the expected file shape (a rough sketch follows this list).
  • Write the Airflow DAG that waits for the base data to be present, executes the queries and archives the resulting files.
  • Test in Airflow's dev instance
  • Vet the generated files
  • Code-review and deploy
  • Make sure there are mechanisms in place to make the generated dump files public (rsync?)
  • Write the description page (readme) on dumps.wikimedia.org
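
For reference, a rough sketch of what one of the formatting queries could look like, expressed here in PySpark rather than the HQL files that would actually live in refinery; the base table, columns, and output path are assumptions:

# Rough sketch of the formatting step for one dataset; the real queries would be
# HQL files in analytics/refinery. Table, columns, and paths are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("commons_dumps_format_sketch").getOrCreate()

year_month = "2023-10"  # month being dumped

dump_df = spark.sql(f"""
    SELECT category, media_file_count, used_media_file_count  -- illustrative columns
    FROM wmf.commons_category_metrics_snapshot                 -- assumed base table
    WHERE year_month = '{year_month}'
""")

# One bzip2-compressed TSV file per dataset per month; a later step would rename
# the single part file to e.g. commons_category_metrics_snapshot_2023-10.tsv.bz2.
(dump_df
    .coalesce(1)
    .write
    .option("sep", "\t")
    .option("compression", "bzip2")
    .csv(f"/tmp/commons_dumps/category_metrics_snapshot/{year_month}", header=True))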

Definition of done:

  • The queries work properly and are in the corresponding repo (probably refinery?)
  • The DAG is in production and running
  • There's a first version of the dumps in the chosen public location (probably the same as other dumps like pageview dumps)

Event Timeline

mforns updated the task description.
mforns updated the task description.
mforns set the point value for this task to 8.

Change #1019845 had a related patch set uploaded (by Mforns; author: Mforns):

[analytics/refinery@master] Add queries to format commons impact metrics data as dumps

https://gerrit.wikimedia.org/r/1019845

Change #1019845 merged by Xcollazo:

[analytics/refinery@master] Add queries to format commons impact metrics data as dumps

https://gerrit.wikimedia.org/r/1019845

Regarding the size of the dumps, tl;dr:

  • category_metrics_snapshot, edits and pageviews_by_media_file are small enough that we don't have to worry.
  • commons_media_file_metrics_snapshot and pageviews_by_category are quite small now. Once we add the BaGLAMa2 categories to the allow-list, we modeled a multiplier factor of around x7, which is still a manageable size.
  • Now, considering progressive additions to the allow-list by community liaisons and the natural growth of Commons over the years, the last 2 datasets might get close to the limit of what we consider a manageable dump long-term (or maybe mid-term? It depends on the monthly rate of additions to the allow-list).
  • If we get to a point where those dumps are too big, there are things we could do:
    1. Split the dump files into chunks (not trivial, because items can have multiple primary_categories).
    2. Reduce the depth level of the category graph, e.g. from 10 to 5, which would roughly halve the size of the dumps.
    3. Replace the ancestor category names with category ids, which would drastically reduce the size of the dumps, but would require developers to download and join with the mediawiki_page table (a rough sketch of such a join follows this list).
    4. Establish a threshold for pageviews and only report on media files and categories that have more than N pageviews, e.g. N=10.
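
For illustration, a rough sketch of the consumer-side join that option 3 would push onto users; the file names, column names, and the one-ancestor-id-per-row shape are all assumptions:

# Hypothetical consumer-side join if option 3 were adopted: the dump would carry
# ancestor category ids instead of names, and users would resolve them against a
# page_id -> page_title mapping extracted from commonswiki. All names are assumed.
import pandas as pd

dump = pd.read_csv(
    "commons_pageviews_by_category_2023-10.tsv.bz2",
    sep="\t",
    compression="bz2",
)

# Hypothetical mapping file with columns: page_id, page_title
page_titles = pd.read_csv("commonswiki_page_id_to_title.tsv", sep="\t")

resolved = dump.merge(
    page_titles,
    left_on="ancestor_category_id",  # assumed column in the slimmed-down dump
    right_on="page_id",
    how="left",
)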

Here are the sizes of the monthly dump files without the BaGLAMa2 allow-listed categories.

mforns@stat1007:~/refinery/hql/commons_impact_metrics$ hdfs dfs -ls -h archive/commons/category_metrics_snapshot
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 1 items
-rw-r--r--   3 analytics-privatedata analytics-privatedata-users    399.4 K 2024-04-19 14:51 archive/commons/category_metrics_snapshot/commons_category_metrics_snapshot_2023-10.tsv.bz2

mforns@stat1007:~/refinery/hql/commons_impact_metrics$ hdfs dfs -ls -h archive/commons/media_file_metrics_snapshot
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 1 items
-rw-r--r--   3 analytics-privatedata analytics-privatedata-users    206.7 M 2024-04-19 15:00 archive/commons/media_file_metrics_snapshot/commons_media_file_metrics_snapshot_2023-10.tsv.bz2

mforns@stat1007:~/refinery/hql/commons_impact_metrics$ hdfs dfs -ls -h archive/commons/pageviews_by_category
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 1 items
-rw-r--r--   3 analytics-privatedata analytics-privatedata-users    125.6 M 2024-04-19 15:05 archive/commons/pageviews_by_category/commons_pageviews_by_category_2023-10.tsv.bz2

mforns@stat1007:~/refinery/hql/commons_impact_metrics$ hdfs dfs -ls -h archive/commons/pageviews_by_media_file
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 1 items
-rw-r--r--   3 analytics-privatedata analytics-privatedata-users     25.3 M 2024-04-19 15:06 archive/commons/pageviews_by_media_file/commons_pageviews_by_media_file_2023-10.tsv.bz2

mforns@stat1007:~/refinery/hql/commons_impact_metrics$ hdfs dfs -ls -h archive/commons/edits
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 1 items
-rw-r--r--   3 analytics-privatedata analytics-privatedata-users      4.6 M 2024-04-19 15:08 archive/commons/edits/commons_edits_2023-10.tsv.bz2

BTullis added subscribers: SGupta-WMF, BTullis.

@SGupta-WMF I'm claiming this task while I work on the publication part, if that's OK.

Change #1026162 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the commons impact metrics dumps fetcher and readme

https://gerrit.wikimedia.org/r/1026162

I have created the patch to do the publishing, but I am unsure what to write in the HTML parts. @mforns is going to ask someone to fill that part in, or supply the copy to me.
Until then, I will mark it as blocked/waiting on our workboard.

filling out the readme right now, thanks Ben!

ok, I didn't do much here, just provided a very short description and detailed the schemas as Marcel had them in the design doc. Please let me know if anyone was imagining something else.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026162/3/modules/dumps/files/web/html/commons_readme.html

Change #1026162 merged by Btullis:

[operations/puppet@production] Add the commons impact metrics dumps fetcher and readme

https://gerrit.wikimedia.org/r/1026162

I have manually tested that the hdfs-rsync works, as shown below.

btullis@clouddumps1001:~$ journalctl -u hdfs_rsync_commons_impact_metrics.service
-- Journal begins at Sat 2024-03-23 07:41:36 UTC, ends at Thu 2024-05-02 10:56:00 UTC. --
-- No entries --
btullis@clouddumps1001:~$ sudo systemctl restart hdfs_rsync_commons_impact_metrics.service
btullis@clouddumps1001:~$ journalctl -u hdfs_rsync_commons_impact_metrics.service
-- Journal begins at Sat 2024-03-23 07:41:36 UTC, ends at Thu 2024-05-02 11:04:00 UTC. --
May 02 10:56:23 clouddumps1001 systemd[1]: Starting Copy commons_impact_metrics files from Hadoop HDFS....
May 02 10:56:23 clouddumps1001 kerberos-run-command[2583478]: User dumpsgen executes as user dumpsgen the command ['/usr/local/bin/hdfs_rsync_commons_impact_metrics']
May 02 10:57:58 clouddumps1001 systemd[1]: hdfs_rsync_commons_impact_metrics.service: Succeeded.
May 02 10:57:58 clouddumps1001 systemd[1]: Finished Copy commons_impact_metrics files from Hadoop HDFS..
May 02 10:57:58 clouddumps1001 systemd[1]: hdfs_rsync_commons_impact_metrics.service: Consumed 1min 32.913s CPU time.

They are now present on the dumps server too.


We may still want to polish up the text of the readme a bit, including the link text on https://dumps.wikimedia.org/other/analytics, which I wrote and which lacks detail compared with the other dumps listed there.

Anyway, the key thing is that they have now been published. I'll let someone else close the ticket and/or submit any patches to change the link text etc.

Change #1026597 had a related patch set uploaded (by Milimetric; author: Milimetric):

[operations/puppet@production] Update commons impact metrics readme

https://gerrit.wikimedia.org/r/1026597

Change #1026611 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Change the destination directory for commons impact metrics dumps

https://gerrit.wikimedia.org/r/1026611

Change #1026611 merged by Btullis:

[operations/puppet@production] Change the destination directory for commons impact metrics dumps

https://gerrit.wikimedia.org/r/1026611

Change #1026613 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the link in the commons_impact_metrics readme file

https://gerrit.wikimedia.org/r/1026613

Change #1026613 merged by Btullis:

[operations/puppet@production] Update the link in the commons_impact_metrics readme file

https://gerrit.wikimedia.org/r/1026613

Change #1026618 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the path of the readme file for commons impact metrics

https://gerrit.wikimedia.org/r/1026618

Change #1026618 merged by Btullis:

[operations/puppet@production] Update the path of the readme file for commons impact metrics

https://gerrit.wikimedia.org/r/1026618

This seems to be done on the SRE side. @mforns I'll let you close when it's good for you.

Change #1026597 merged by Bking:

[operations/puppet@production] Update commons impact metrics readme

https://gerrit.wikimedia.org/r/1026597