
Release edit data lake data as a public json dump /mysql dump, other?
Closed, Resolved · Public · 8 Estimated Story Points

Description

Release edit data lake data publicly.

Event Timeline

We think the research community can benefit from the edit data lake data in the form of a somewhat large text dump that contains JSON.

This dump will contain edit data denormalized for easy analytics calculations.

https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_user_history
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_page_history
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Metrics
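
As a sketch of how researchers might consume such a dump (the path, snapshot layout, and on-disk format below are assumptions made only for illustration; field names follow the schema pages linked above):

```python
# Hypothetical sketch: reading a denormalized mediawiki history dump with Spark.
# Field names follow the wikitech schema pages linked above; the on-disk format
# (Parquet vs. TSV, partitioning) is an assumption until the release format is decided.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mwh-example").getOrCreate()

# Path and snapshot layout are hypothetical.
history = spark.read.parquet("mediawiki_history/snapshot=2019-06/")

# Monthly edit counts per wiki, using only the denormalized table (no joins needed).
(history
    .filter(F.col("event_entity") == "revision")
    .filter(F.col("event_type") == "create")
    .groupBy("wiki_db", F.date_format("event_timestamp", "yyyy-MM").alias("month"))
    .count()
    .orderBy("wiki_db", "month")
    .show())
```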

Adding some subscribers who can provide input on whether this is a good idea.

fdans lowered the priority of this task from High to Medium.
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
fdans subscribed.

Let's make sure to collect use cases for this and talk to Research.

Nuria renamed this task from Release edit data lake data as a public json dump to Release edit data lake data as a public json dump /mysql dump, other?.May 28 2019, 9:18 AM
Nuria updated the task description.
Ottomata raised the priority of this task from Medium to High.
Ottomata added a project: Analytics-Kanban.

@leila / @nettrom_WMF: fyi I'm working on this now. I've started a draft page where I'm thinking out loud about how to publish: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Public

Any thoughts / use cases / wishes are welcome.

@Milimetric nice to see that we're here. :) I did one pass over the wikitech page. What kind of input is most helpful for you?

@leila Best would be use cases and how you would expect to use this data, which informs our ideas as to how to release it.

@Milimetric @Nuria: Ok. And by when do you need this input?

@leila we are working on this for the next couple of weeks, so the sooner the better.

@leila: we can of course iterate on the format in the future. Eventually we'll have a public API to query the whole dataset. But for now we just want some idea of common / high priority use cases that we can try to serve with a simpler release. Thank you so much for looking into it.

Rough draft of a blurb about why this dataset is useful:

NOTE: A history of activity on Wikimedia projects as complete and research-friendly as possible. We add context to edits, such as whether they were reverted, when they were reverted, how many bytes they changed, how many edits had the user made at that time, and much more, all in the same row as the edit itself. So you can focus more on what you want to find out instead of spending your time joining tables, writing code, and pre-processing large amounts of data.

(updated note per @nettrom_WMF's suggestions below)
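
To make the "no joins needed" point concrete, here is a rough sketch (file name is hypothetical; column names such as revision_is_identity_reverted are taken from the wikitech schema pages and should be treated as illustrative until the dump format is final):

```python
# Rough sketch: share of reverted edits on one wiki from the denormalized history,
# computed from single rows with no table joins. File name is hypothetical.
import pandas as pd

edits = pd.read_csv("simplewiki.2019-06.tsv.bz2", sep="\t")

revisions = edits[(edits["event_entity"] == "revision") &
                  (edits["event_type"] == "create")]

# The flag may arrive as the strings "true"/"false" in a TSV; normalise to bool first.
reverted = revisions["revision_is_identity_reverted"].astype(str).str.lower() == "true"
print(f"Share of edits later reverted: {reverted.mean():.1%}")
```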

Change 528504 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] [WIP] Add spark job to create mediawiki history dumps

https://gerrit.wikimedia.org/r/528504

+1 to @leila's rough draft. Could we add something to the last sentence to emphasize that these datasets remove the need for additional processing, to make the point about adding context stronger? E.g.:

So you can focus more on what you want to find out instead of spending your time joining tables, writing code, and pre-processing large amounts of data.

One thing I was thinking about with regard to the file format of the dumps is that a lot of the Wikipedia research I've seen that studies multiple wikis selects them based on their number of articles. I suspect that largely correlates with the amount of activity, meaning that if we group wikis by number of articles rather than number of events, the list looks mostly the same.

One thing we'd definitely want is some place to look up where to find a given wiki, though, so those who want to use these datasets can easily figure out which files to get based on which wiki they're studying. A sketch of what that lookup could look like follows.
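
Purely as an illustration of that lookup idea (neither the index file nor its name exists yet; this is an assumption about what a release could ship alongside the dump files):

```python
# Hypothetical sketch of a wiki -> dump file lookup. Assumes the release ships an
# index file mapping each wiki_db to the file(s) that contain it; the index name
# and the grouping scheme are invented here just to illustrate the idea.
import json

with open("mediawiki_history_dumps_index.json") as f:    # hypothetical index file
    wiki_to_files = json.load(f)                          # e.g. {"enwiki": ["enwiki.2019-06.tsv.bz2"], ...}

def files_for_wiki(wiki_db: str) -> list:
    """Return the dump files a researcher needs for a given wiki."""
    return wiki_to_files.get(wiki_db, [])

print(files_for_wiki("dewiki"))
```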

Change 530002 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add Oozie job for mediawiki history dumps

https://gerrit.wikimedia.org/r/530002

See the final format of the dumps, chosen after the community survey, here: T224459#5491080

Change 528504 merged by jenkins-bot:
[analytics/refinery/source@master] Add spark job to create mediawiki history dumps

https://gerrit.wikimedia.org/r/528504

Change 530002 merged by Joal:
[analytics/refinery@master] Add Oozie job for mediawiki history dumps

https://gerrit.wikimedia.org/r/530002

Change 538312 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Rsync analytics mediawiki history dumps to dumps.wikimedia.org

https://gerrit.wikimedia.org/r/538312

How big are these dumps for one set, and how many sets do we intend to keep? Adding @Bstorm since the host behind dumps.wikimedia.org is a WMCS server.

@ArielGlenn thanks for chiming in!

How big are these dumps for one set, and how many sets do we intend to keep?

Each dump set consists of about 2000 files; the biggest of them is no larger than ~2 GB.
The total size of a dump set is around 440 GB, and we'd like to keep 4 to 6 of them if possible.
Do you see any issues with that?
Cheers!
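
As a quick back-of-the-envelope check on the retention math (figures taken from the comment above; nothing here is measured):

```python
# Worst-case disk estimate for retaining mediawiki history dump sets,
# using the ~440 GB per set and "4 to 6 sets" figures quoted above.
set_size_gb = 440
sets_kept = 6

total_tb = set_size_gb * sets_kept / 1024
print(f"Worst-case retention footprint: ~{total_tb:.1f} TB")   # ~2.6 TB
```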

That size seems fine for the current disk available.

Change 539151 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job::data_purge: Add timer to delete old MWH dumps

https://gerrit.wikimedia.org/r/539151

Change 538312 merged by Ottomata:
[operations/puppet@production] Rsync analytics mediawiki history dumps to dumps.wikimedia.org

https://gerrit.wikimedia.org/r/538312

Change 539374 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] dumps::manifests::web::fetches::stats: correct path for mediawiki history

https://gerrit.wikimedia.org/r/539374

Change 539374 merged by Ottomata:
[operations/puppet@production] dumps::manifests::web::fetches::stats: correct path for mediawiki history

https://gerrit.wikimedia.org/r/539374

Change 540442 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] web::fetches::stats.pp Absent mediawiki history rsync

https://gerrit.wikimedia.org/r/540442

Change 540442 merged by Ottomata:
[operations/puppet@production] web::fetches::stats.pp Absent mediawiki history rsync

https://gerrit.wikimedia.org/r/540442

Nuria set the point value for this task to 8.

Change 539151 merged by Ottomata:
[operations/puppet@production] analytics::refinery::job::data_purge: Add timer to delete old MWH dumps

https://gerrit.wikimedia.org/r/539151

Change 559926 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job::data_purge: Correct timer syntax

https://gerrit.wikimedia.org/r/559926

Change 559926 merged by Ottomata:
[operations/puppet@production] analytics::refinery::job::data_purge: Correct timer syntax

https://gerrit.wikimedia.org/r/559926