
[Commons Impact Metrics] Create Airflow job that generates the public dumps
Open, Low, Public, 8 Estimated Story Points

Description

After the base pipeline is ready (T358699), we should create an Airflow DAG that formats and writes the monthly dump files to a public location.
This work includes writing SparkSQL queries to format the data into the desired file shape.
The DAG should execute those queries, and then archive the results in their final place.
This DAG should also be monthly, and run right after the base pipeline finishes (a sensor should wait for the base datasets to be present for the month in question).
Initially, each base dataset should have its own dump, and the contents of each dump should be pretty much the same as the base dataset.

Maybe we don't need an extra DAG for this; maybe the dump creation can be done in the original pipeline DAG?
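
For illustration, here's a minimal sketch of what such a DAG could look like, using vanilla Airflow operators rather than the helpers our production airflow-dags repo would actually use; the table name, HQL file, HDFS paths and partition spec below are placeholder assumptions:

# Minimal sketch, NOT the production DAG: wait for the monthly base dataset,
# run the formatting query, then archive the result. The table name, HQL file
# name, and HDFS paths are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.hive.sensors.hive_partition import HivePartitionSensor

with DAG(
    dag_id="commons_impact_metrics_dumps_monthly",
    schedule="@monthly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:

    # Sensor: wait for the base dataset to have a partition for the month in question.
    wait_for_base = HivePartitionSensor(
        task_id="wait_for_category_metrics_snapshot",
        table="wmf.commons_category_metrics_snapshot",  # assumed table name
        partition="year_month='{{ data_interval_start.strftime('%Y-%m') }}'",  # assumed partition spec
        poke_interval=60 * 60,
    )

    # Run the SparkSQL query that reshapes the base dataset into the dump file shape.
    format_dump = BashOperator(
        task_id="format_category_metrics_dump",
        bash_command=(
            "spark3-sql -f commons_category_metrics_to_dump.hql "  # hypothetical query file
            "-d year_month={{ data_interval_start.strftime('%Y-%m') }}"
        ),
    )

    # Move the query output to its final public archive location on HDFS.
    archive_dump = BashOperator(
        task_id="archive_category_metrics_dump",
        bash_command=(
            "hdfs dfs -mv /tmp/commons_dumps/{{ data_interval_start.strftime('%Y-%m') }}/* "
            "/wmf/data/archive/commons/category_metrics_snapshot/"
        ),
    )

    wait_for_base >> format_dump >> archive_dump

Whether this lives in its own DAG or as extra tasks appended to the base pipeline DAG (per the question above) only changes where these tasks are declared, not their shape.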

Tasks:

  • Design the format of the dumps (TSV? compression? naming? location? etc.)
  • Write the queries that format the base Commons Impact Metrics datasets into the expected file shape (a rough sketch follows this list).
  • Write the Airflow DAG that waits for the base data to be present, executes the queries and archives the resulting files.
  • Test in Airflow's dev instance
  • Vet the generated files
  • Code-review and deploy
  • Make sure there are mechanisms in place to make the generated dump files public (rsync?)
  • Write the description page (readme) on dumps.wikimedia.org
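
For reference, a rough sketch of what one of the formatting queries could look like, expressed here in PySpark rather than the HQL files that would actually live in refinery; the base table, columns, and output path are assumptions:

# Rough sketch of the formatting step for one dataset; the real queries would be
# HQL files in analytics/refinery. Table, columns, and paths are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("commons_dumps_format_sketch").getOrCreate()

year_month = "2023-10"  # month being dumped

dump_df = spark.sql(f"""
    SELECT category, media_file_count, used_media_file_count  -- illustrative columns
    FROM wmf.commons_category_metrics_snapshot                 -- assumed base table
    WHERE year_month = '{year_month}'
""")

# One bzip2-compressed TSV file per dataset per month; a later step would rename
# the single part file to e.g. commons_category_metrics_snapshot_2023-10.tsv.bz2.
(dump_df
    .coalesce(1)
    .write
    .option("sep", "\t")
    .option("compression", "bzip2")
    .csv(f"/tmp/commons_dumps/category_metrics_snapshot/{year_month}", header=True))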

Definition of done:

  • The queries work properly and are in the corresponding repo (probably refinery?)
  • The DAG is in production and running
  • There's a first version of the dumps in the chosen public location (probably the same as other dumps like pageview dumps)

Event Timeline

mforns updated the task description.
mforns updated the task description.
mforns set the point value for this task to 8.

Change #1019845 had a related patch set uploaded (by Mforns; author: Mforns):

[analytics/refinery@master] Add queries to format commons impact metrics data as dumps

https://gerrit.wikimedia.org/r/1019845

Change #1019845 merged by Xcollazo:

[analytics/refinery@master] Add queries to format commons impact metrics data as dumps

https://gerrit.wikimedia.org/r/1019845

Regarding the size of the dumps, tl;dr:

  • category_metrics_snapshot, edits and pageviews_by_media_file are small enough that we don't have to worry.
  • commons_media_file_metrics_snapshot and pageviews_by_category are quite small now. Once we add the BaGLAMa2 categories to the allow-list, we modeled a multiplier factor of around x7, which is still a manageable size.
  • Now, considering progressive additions to the allow-list by community liaisons and the natural growth of Commons over the years, the last 2 datasets might get close to the limit of what we consider a manageable dump long-term (or maybe mid-term? It depends on the monthly rate of additions to the allow-list).
  • If we get to a point where those dumps are too big, there are things we could do:
    1. Split the dump files into chunks (not trivial, because items can have multiple primary_categories).
    2. Reduce the depth level of the category graph, e.g. from 10 to 5, which would roughly halve the size of the dumps.
    3. Replace the ancestor category names with category ids, which would drastically reduce the size of the dumps, but would require developers to download and join with the mediawiki_page table (a rough sketch of such a join follows this list).
    4. Establish a threshold for pageviews and only report on media files and categories that have more than N pageviews, e.g. N=10.
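
For illustration, a rough sketch of the consumer-side join that option 3 would push onto users; the file names, column names, and the one-ancestor-id-per-row shape are all assumptions:

# Hypothetical consumer-side join if option 3 were adopted: the dump would carry
# ancestor category ids instead of names, and users would resolve them against a
# page_id -> page_title mapping extracted from commonswiki. All names are assumed.
import pandas as pd

dump = pd.read_csv(
    "commons_pageviews_by_category_2023-10.tsv.bz2",
    sep="\t",
    compression="bz2",
)

# Hypothetical mapping file with columns: page_id, page_title
page_titles = pd.read_csv("commonswiki_page_id_to_title.tsv", sep="\t")

resolved = dump.merge(
    page_titles,
    left_on="ancestor_category_id",  # assumed column in the slimmed-down dump
    right_on="page_id",
    how="left",
)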

Here are the sizes of the monthly dump files without the BaGLAMa2 allow-listed categories.

mforns@stat1007:~/refinery/hql/commons_impact_metrics$ hdfs dfs -ls -h archive/commons/category_metrics_snapshot
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 1 items
-rw-r--r--   3 analytics-privatedata analytics-privatedata-users    399.4 K 2024-04-19 14:51 archive/commons/category_metrics_snapshot/commons_category_metrics_snapshot_2023-10.tsv.bz2

mforns@stat1007:~/refinery/hql/commons_impact_metrics$ hdfs dfs -ls -h archive/commons/media_file_metrics_snapshot
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 1 items
-rw-r--r--   3 analytics-privatedata analytics-privatedata-users    206.7 M 2024-04-19 15:00 archive/commons/media_file_metrics_snapshot/commons_media_file_metrics_snapshot_2023-10.tsv.bz2

mforns@stat1007:~/refinery/hql/commons_impact_metrics$ hdfs dfs -ls -h archive/commons/pageviews_by_category
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 1 items
-rw-r--r--   3 analytics-privatedata analytics-privatedata-users    125.6 M 2024-04-19 15:05 archive/commons/pageviews_by_category/commons_pageviews_by_category_2023-10.tsv.bz2

mforns@stat1007:~/refinery/hql/commons_impact_metrics$ hdfs dfs -ls -h archive/commons/pageviews_by_media_file
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 1 items
-rw-r--r--   3 analytics-privatedata analytics-privatedata-users     25.3 M 2024-04-19 15:06 archive/commons/pageviews_by_media_file/commons_pageviews_by_media_file_2023-10.tsv.bz2

mforns@stat1007:~/refinery/hql/commons_impact_metrics$ hdfs dfs -ls -h archive/commons/edits
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 1 items
-rw-r--r--   3 analytics-privatedata analytics-privatedata-users      4.6 M 2024-04-19 15:08 archive/commons/edits/commons_edits_2023-10.tsv.bz2

BTullis added subscribers: SGupta-WMF, BTullis.

@SGupta-WMF I'm claiming this task while I work on the publication part, if that's OK.

Change #1026162 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the commons impact metrics dumps fetcher and readme

https://gerrit.wikimedia.org/r/1026162

I have created the patch to do the publishing, but I am unsure what to write in the HTML parts. @mforns is going to ask someone to fill that part in, or supply the copy to me.
Until then, I will mark it as blocked/waiting on our workboard.

filling out the readme right now, thanks Ben!

ok, I didn't do much here, just provided a very short description and detailed the schemas as Marcel had them in the design doc. Please let me know if anyone was imagining something else.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026162/3/modules/dumps/files/web/html/commons_readme.html

Change #1026162 merged by Btullis:

[operations/puppet@production] Add the commons impact metrics dumps fetcher and readme

https://gerrit.wikimedia.org/r/1026162

I have manually tested that the hdfs-rsync works, as shown below.

btullis@clouddumps1001:~$ journalctl -u hdfs_rsync_commons_impact_metrics.service
-- Journal begins at Sat 2024-03-23 07:41:36 UTC, ends at Thu 2024-05-02 10:56:00 UTC. --
-- No entries --
btullis@clouddumps1001:~$ sudo systemctl restart hdfs_rsync_commons_impact_metrics.service
btullis@clouddumps1001:~$ journalctl -u hdfs_rsync_commons_impact_metrics.service
-- Journal begins at Sat 2024-03-23 07:41:36 UTC, ends at Thu 2024-05-02 11:04:00 UTC. --
May 02 10:56:23 clouddumps1001 systemd[1]: Starting Copy commons_impact_metrics files from Hadoop HDFS....
May 02 10:56:23 clouddumps1001 kerberos-run-command[2583478]: User dumpsgen executes as user dumpsgen the command ['/usr/local/bin/hdfs_rsync_commons_impact_metrics']
May 02 10:57:58 clouddumps1001 systemd[1]: hdfs_rsync_commons_impact_metrics.service: Succeeded.
May 02 10:57:58 clouddumps1001 systemd[1]: Finished Copy commons_impact_metrics files from Hadoop HDFS..
May 02 10:57:58 clouddumps1001 systemd[1]: hdfs_rsync_commons_impact_metrics.service: Consumed 1min 32.913s CPU time.

They are now present on the dumps server too.


We may still want to polish up the text of the readme a bit, including the link text on https://dumps.wikimedia.org/other/analytics, which I wrote and which lacks detail compared with the other dumps listed there.

Anyway, the key thing is that they have now been published. I'll let someone else close the ticket and/or submit any patches to change the link text etc.

Change #1026597 had a related patch set uploaded (by Milimetric; author: Milimetric):

[operations/puppet@production] Update commons impact metrics readme

https://gerrit.wikimedia.org/r/1026597

Change #1026611 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Change the destination directory for commons impact metrics dumps

https://gerrit.wikimedia.org/r/1026611

Change #1026611 merged by Btullis:

[operations/puppet@production] Change the destination directory for commons impact metrics dumps

https://gerrit.wikimedia.org/r/1026611

Change #1026613 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the link in the commons_impact_metrics readme file

https://gerrit.wikimedia.org/r/1026613

Change #1026613 merged by Btullis:

[operations/puppet@production] Update the link in the commons_impact_metrics readme file

https://gerrit.wikimedia.org/r/1026613

Change #1026618 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the path of the readme file for commons impact metrics

https://gerrit.wikimedia.org/r/1026618

Change #1026618 merged by Btullis:

[operations/puppet@production] Update the path of the readme file for commons impact metrics

https://gerrit.wikimedia.org/r/1026618

This seems to be done on the SRE side. @mforns I'll let you close when it's good for you.

Change #1026597 merged by Bking:

[operations/puppet@production] Update commons impact metrics readme

https://gerrit.wikimedia.org/r/1026597