Page MenuHomePhabricator

Provide cumulative edit count in Data Lake edit data
Closed, ResolvedPublic21 Story Points

Description

It would be very useful to have the user's historical edit count attached to each event. This would help with things like bucketing by user experience (for example, how does the rate at which a user is reverted change with their experience level?) and checking whether a user belonged to a 'virtual' user group like autoconfirmed or extendedconfirmed when they made their edit.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 22 2017, 7:24 PM
Nuria triaged this task as Normal priority.Mar 27 2017, 3:44 PM
Nuria moved this task from Incoming to Dashiki on the Analytics board.
Nuria added a subscriber: Nuria.Apr 13 2017, 4:10 PM

An approach to do this would be to add a new computation step that computes this values in a smaller denormalized dataset that is not split per year. Once we calculate "edit counts per user" this data can be joined with the partitioned (data is partitioned per year) dataset. Issue might be that this "smaller" set where we do the computations might be too big to fit in ram. If so, this approach would not work.

We would need to test whether this approach is feasible in enwiki.

If this approach works:

  • unit test would need to be added

If it doesn't:

  • we would need to compute the "edit count" as part of the initial computation

We can point task as if this approach worked.

Nuria set the point value for this task to 13.Apr 13 2017, 4:11 PM
Nuria changed the point value for this task from 13 to 21.Apr 13 2017, 4:15 PM
Nuria raised the priority of this task from Normal to High.May 29 2017, 3:43 PM
JAllemandou edited projects, added Analytics-Kanban; removed Analytics.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 359019 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Add new fields in mediawiki_history job

https://gerrit.wikimedia.org/r/359019

Change 359019 merged by Mforns:
[analytics/refinery/source@master] Add new fields in mediawiki_history job

https://gerrit.wikimedia.org/r/359019

JAllemandou moved this task from Done to Paused on the Analytics-Kanban board.Jul 5 2017, 12:15 PM

@Neil_P._Quinn_WMF We have added the cumulative edit count, would you be so kind to do some vetting of data (we have done some ourselves but additional verification is always nice)

Nuria moved this task from Done to Ready to Deploy on the Analytics-Kanban board.Jul 12 2017, 7:18 PM
Nuria moved this task from Ready to Deploy to Done on the Analytics-Kanban board.

Assigning to neil to do final vetting.

Nuria added a comment.Aug 15 2017, 4:10 PM

Ping @Neil_P._Quinn_WMF again, do you have an ETA when you will be able to do this?

@Nuria, I really apologize; this got swallowed by other work and then Wikimania.

I did some spot-checks today, and everything seems right so far. But I want to do some systematic random checks before calling this done; I hope to get to them by next week. Let me know if this work has any dependencies—as far as I know, there aren't any (though of course I still understand your desire to get this checked off reasonably quickly).

Also, a quick question: is this a good way to get a user's latest edit count using the data lake, or can you think of a faster one?

select max(event_user_revision_count)
from wmf.mediawiki_history
where 
    event_user_text_latest = "{name}" and
    wiki_db = "{wiki}" and
    snapshot = "2017-07";
Nuria added a comment.Aug 31 2017, 1:50 PM

@Neil_P._Quinn_WMF given that user's latest edit count is cumulative the latest count for user will always be the maximum, right?

select event_user_revision_count from wmf.mediawiki_history event_user_text_latest = "{name}" and

wiki_db = "{wiki}" and
snapshot = "2017-07" order by  event_user_revision_count limit 1;

Not sure if this would be faster but it seems it might. Will try to check.

@Neil_P._Quinn_WMF : This task is part of a quarterly goal, so we would like it to be fully resolved before the end of September. Thanks!

Neil_P._Quinn_WMF closed this task as Resolved.Sep 8 2017, 9:38 PM

I actually started working on the systematic random checks today, but it looks like that will be a relatively large project which I unfortunately don't have time for.

Based on my spot-checks, this looks done correctly.

Not sure if this would be faster but it seems it might. Will try to check.

It actually took about half the time of my query (1 min, 8 s vs. 2 min, 20 s) when looking at a single user. Nice, I would not have guessed :)

It does need a desc in the order by clause though:

select event_user_revision_count 
from wmf.mediawiki_history
where
event_user_text_latest = "{name}" and
wiki_db = "{wiki}" and
snapshot = "2017-07"
order by event_user_revision_count desc
limit 1;
Neil_P._Quinn_WMF removed Neil_P._Quinn_WMF as the assignee of this task.Mar 30 2018, 10:13 AM
Neil_P._Quinn_WMF raised the priority of this task from High to Needs Triage.
Neil_P._Quinn_WMF moved this task from Neil's in progress to Done on the Contributors-Analysis board.
Neil_P._Quinn_WMF moved this task from Done to Radar on the Contributors-Analysis board.
Neil_P._Quinn_WMF moved this task from Radar to Done on the Contributors-Analysis board.