Description

It would be very useful to have the user's historical edit count attached to each event. This would help with things like bucketing by user experience (for example, how does the rate at which a user is reverted change with their experience level?) and checking whether a user belonged to a 'virtual' user group like autoconfirmed or extendedconfirmed when they made their edit.
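As an illustration of the bucketing use case, a query along these lines could compute revert rate by experience level once such a field exists. This is only a sketch: event_user_revision_count is the field eventually added by this task, while revision_is_identity_reverted, event_entity, event_type, and the bucket boundaries are assumptions made for the example.

```sql
-- Illustrative sketch only: revert rate by user-experience bucket.
-- event_user_revision_count is the cumulative edit count requested by this task;
-- revision_is_identity_reverted, event_entity, event_type and the bucket
-- boundaries are assumptions for the example.
select experience_bucket,
       avg(if(was_reverted, 1, 0)) as revert_rate
from (
  select
    case
      when event_user_revision_count < 10   then '0-9 edits'
      when event_user_revision_count < 100  then '10-99 edits'
      when event_user_revision_count < 1000 then '100-999 edits'
      else '1000+ edits'
    end as experience_bucket,
    revision_is_identity_reverted as was_reverted
  from wmf.mediawiki_history
  where wiki_db = 'enwiki'
    and snapshot = '2017-07'
    and event_entity = 'revision'
    and event_type = 'create'
) revisions
group by experience_bucket;
```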
Details
| Project | Branch | Lines +/- | Subject |
|---|---|---|---|
| analytics/refinery/source | master | +363 -137 | Add new fields in mediawiki_history job |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | None | | T161147 Provide cumulative edit count in Data Lake edit data |
| Resolved | | Milimetric | T169782 Troubleshoot issues with sqoop of data not working for big tables |
Event Timeline
An approach would be to add a new computation step that computes these values in a smaller, denormalized dataset that is not split per year. Once we have calculated edit counts per user, that data can be joined with the partitioned dataset (data is partitioned per year); see the sketch below. The issue is that this "smaller" set where we do the computations might still be too big to fit in RAM; if so, this approach would not work.
We would need to test whether this approach is feasible on enwiki.
If this approach works:
- a unit test would need to be added
If it doesn't:
- we would need to compute the "edit count" as part of the initial computation
We can point the task as if this approach worked.
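For illustration only (this is not the actual refinery implementation, which lives in the patch linked below), the idea can be sketched in HiveQL as a window function over a denormalized revision set, joined back to the per-year-partitioned history. The table names and the rev_* fields here are assumptions made for the sketch.

```sql
-- Sketch only, not the refinery implementation. Table names (revisions_all,
-- history_by_year) and the rev_* fields are made up for illustration.
with per_user_counts as (
  select
    wiki_db,
    rev_id,
    -- cumulative number of edits the user had made, up to and including this one
    row_number() over (
      partition by wiki_db, rev_user
      order by rev_timestamp
    ) as event_user_revision_count
  from revisions_all
)
select h.*,
       c.event_user_revision_count
from history_by_year h
join per_user_counts c
  on h.wiki_db = c.wiki_db
 and h.rev_id = c.rev_id;
```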
Change 359019 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Add new fields in mediawiki_history job
Change 359019 merged by Mforns:
[analytics/refinery/source@master] Add new fields in mediawiki_history job
@Neil_P._Quinn_WMF We have added the cumulative edit count. Would you be so kind as to do some vetting of the data? (We have done some ourselves, but additional verification is always nice.)
@Nuria, I really apologize; this got swallowed by other work and then Wikimania.
I did some spot-checks today, and everything seems right so far. But I want to do some systematic random checks before calling this done; I hope to get to them by next week. Let me know if this work has any dependencies—as far as I know, there aren't any (though of course I still understand your desire to get this checked off reasonably quickly).
Also, a quick question: is this a good way to get a user's latest edit count using the data lake, or can you think of a faster one?
select max(event_user_revision_count) from wmf.mediawiki_history where event_user_text_latest = "{name}" and wiki_db = "{wiki}" and snapshot = "2017-07";
@Neil_P._Quinn_WMF Given that the user's edit count is cumulative, the latest count for a user will always be the maximum, right?
select event_user_revision_count from wmf.mediawiki_history where event_user_text_latest = "{name}" and wiki_db = "{wiki}" and snapshot = "2017-07" order by event_user_revision_count limit 1;
Not sure if this would be faster, but it seems it might. Will try to check.
@Neil_P._Quinn_WMF : This task is part of a quarterly goal, so we would like it to be fully resolved before the end of September. Thanks!
I actually started working on the systematic random checks today, but it looks like that will be a relatively large project that I unfortunately don't have time for.
Based on my spot-checks, this looks done correctly.
It actually took about half the time of my query (1 min, 8 s vs. 2 min, 20 s) when looking at a single user. Nice, I would not have guessed :)
It does need a desc in the order by clause, though:
select event_user_revision_count from wmf.mediawiki_history where event_user_text_latest = "{name}" and wiki_db = "{wiki}" and snapshot = "2017-07" order by event_user_revision_count desc limit 1;
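As a closing usage note tied back to the original motivation, the same field can also be used to check whether a user had already passed an edit-count threshold, such as the English Wikipedia thresholds for autoconfirmed (10 edits) and extendedconfirmed (500 edits), at the time of each edit. This is only a sketch: it ignores the account-age part of those requirements, and the group-by expression is repeated because the alias cannot be used there.

```sql
-- Illustrative only: count a user's edits by whether the autoconfirmed (10) or
-- extendedconfirmed (500) edit-count threshold was already met when the edit was
-- made. The account-age part of those requirements is ignored here.
select
  case
    when event_user_revision_count >= 500 then 'extendedconfirmed threshold met'
    when event_user_revision_count >= 10  then 'autoconfirmed threshold met'
    else 'below autoconfirmed threshold'
  end as experience_level,
  count(*) as edits
from wmf.mediawiki_history
where event_user_text_latest = "{name}"
  and wiki_db = "{wiki}"
  and snapshot = "2017-07"
group by
  case
    when event_user_revision_count >= 500 then 'extendedconfirmed threshold met'
    when event_user_revision_count >= 10  then 'autoconfirmed threshold met'
    else 'below autoconfirmed threshold'
  end;
```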