
Provide cumulative edit count in Data Lake edit data
Closed, Resolved | Public | 21 Estimated Story Points

Description

It would be very useful to have the user's historical edit count attached to each event. This would help with things like bucketing by user experience (for example, how does the rate at which a user is reverted change with their experience level?) and checking whether a user belonged to a 'virtual' user group like autoconfirmed or extendedconfirmed when they made their edit.
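As a rough illustration of the bucketing use case, the sketch below groups edits by the editor's cumulative edit count at the time of the edit and computes a revert rate per bucket. The bucket boundaries, the function names, and the toy data are all assumptions made for the example; nothing here reflects the actual Data Lake schema.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical experience buckets, defined by their lower bounds.
BUCKET_STARTS = [1, 10, 100, 1000]
BUCKET_LABELS = ["1-9", "10-99", "100-999", "1000+"]


def experience_bucket(edit_count):
    """Map a cumulative edit count (>= 1) to an experience bucket label."""
    if edit_count < 1:
        raise ValueError("cumulative edit count must be at least 1")
    return BUCKET_LABELS[bisect_right(BUCKET_STARTS, edit_count) - 1]


def revert_rate_by_bucket(edits):
    """edits: iterable of (cumulative_edit_count, was_reverted) pairs."""
    totals = defaultdict(int)
    reverted = defaultdict(int)
    for edit_count, was_reverted in edits:
        bucket = experience_bucket(edit_count)
        totals[bucket] += 1
        reverted[bucket] += int(was_reverted)
    return {bucket: reverted[bucket] / totals[bucket] for bucket in totals}


# Toy data: (editor's edit count when the edit was made, was it reverted?)
edits = [(1, True), (5, False), (12, False), (50, True), (150, False), (2000, False)]
rates = revert_rate_by_bucket(edits)
```

With the toy data above, the two lowest buckets each have one reverted edit out of two, and the two highest have none.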

Event Timeline

Nuria triaged this task as Medium priority. (Mar 27 2017, 3:44 PM)
Nuria moved this task from Incoming to Dashiki on the Analytics board.

One approach would be to add a new computation step that computes these values in a smaller denormalized dataset that is not split per year. Once we have calculated the "edit counts per user", this data can be joined with the partitioned dataset (the data is partitioned per year). The risk is that this "smaller" dataset on which we do the computations might still be too big to fit in RAM. If so, this approach would not work.

We would need to test whether this approach is feasible in enwiki.

If this approach works:

  • unit tests would need to be added

If it doesn't:

  • we would need to compute the "edit count" as part of the initial computation

We can point the task as if this approach works.
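The two-step idea described in this comment can be sketched in plain Python as follows. This is only a toy illustration under stated assumptions: the event tuples, the function names, and the in-memory dictionaries are hypothetical, and the real mediawiki_history job would need a distributed equivalent rather than anything held in one process.

```python
from collections import defaultdict

# Toy revision events: (wiki_db, user, timestamp, rev_id). Hypothetical data.
events = [
    ("enwiki", "Alice", "2016-01-01T10:00:00", 1),
    ("enwiki", "Bob", "2016-01-02T09:00:00", 2),
    ("enwiki", "Alice", "2016-03-05T12:00:00", 3),
    ("enwiki", "Alice", "2017-02-01T08:00:00", 4),
    ("enwiki", "Bob", "2017-02-03T08:00:00", 5),
]


def cumulative_edit_counts(events):
    """Step 1: build a compact (wiki_db, rev_id) -> running count table.

    This table is not split per year, so each user's count carries over
    from one year to the next.
    """
    running = defaultdict(int)
    counts = {}
    ordered = sorted(events, key=lambda e: (e[0], e[1], e[2]))
    for wiki_db, user, _timestamp, rev_id in ordered:
        running[(wiki_db, user)] += 1
        counts[(wiki_db, rev_id)] = running[(wiki_db, user)]
    return counts


def join_counts(events, counts):
    """Step 2: join the counts back onto the events, grouped per year."""
    by_year = defaultdict(list)
    for wiki_db, user, timestamp, rev_id in events:
        year = timestamp[:4]
        row = (wiki_db, user, timestamp, rev_id, counts[(wiki_db, rev_id)])
        by_year[year].append(row)
    return dict(by_year)


counts = cumulative_edit_counts(events)
partitioned = join_counts(events, counts)
```

In this toy version the `counts` dictionary plays the role of the "smaller" dataset; whether its real counterpart fits in memory for enwiki is exactly the open question raised above.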

Nuria set the point value for this task to 13. (Apr 13 2017, 4:11 PM)
Nuria changed the point value for this task from 13 to 21. (Apr 13 2017, 4:15 PM)
Nuria raised the priority of this task from Medium to High. (May 29 2017, 3:43 PM)
JAllemandou edited projects, added Analytics-Kanban; removed Analytics.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 359019 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Add new fields in mediawiki_history job

https://gerrit.wikimedia.org/r/359019

Change 359019 merged by Mforns:
[analytics/refinery/source@master] Add new fields in mediawiki_history job

https://gerrit.wikimedia.org/r/359019

@Neil_P._Quinn_WMF We have added the cumulative edit count. Would you be so kind as to do some vetting of the data? (We have done some ourselves, but additional verification is always nice.)

Assigning to Neil to do final vetting.

Ping @Neil_P._Quinn_WMF again, do you have an ETA when you will be able to do this?

@Nuria, I really apologize; this got swallowed by other work and then Wikimania.

I did some spot-checks today, and everything seems right so far. But I want to do some systematic random checks before calling this done; I hope to get to them by next week. Let me know if this work has any dependencies—as far as I know, there aren't any (though of course I still understand your desire to get this checked off reasonably quickly).

Also, a quick question: is this a good way to get a user's latest edit count using the data lake, or can you think of a faster one?

select max(event_user_revision_count)
from wmf.mediawiki_history
where 
    event_user_text_latest = "{name}" and
    wiki_db = "{wiki}" and
    snapshot = "2017-07";

@Neil_P._Quinn_WMF Given that the edit count is cumulative, the user's latest count will always be the maximum, right?

select event_user_revision_count
from wmf.mediawiki_history
where
    event_user_text_latest = "{name}" and
    wiki_db = "{wiki}" and
    snapshot = "2017-07"
order by event_user_revision_count
limit 1;

Not sure if this would be faster but it seems it might. Will try to check.

@Neil_P._Quinn_WMF : This task is part of a quarterly goal, so we would like it to be fully resolved before the end of September. Thanks!

I actually started working on the systematic random checks today, but it looks like that will be a relatively large project which I unfortunately don't have time for.

Based on my spot-checks, this looks done correctly.

> Not sure if this would be faster but it seems it might. Will try to check.

It actually took about half the time of my query (1 min, 8 s vs. 2 min, 20 s) when looking at a single user. Nice, I would not have guessed :)

It does need a desc in the order by clause though:

select event_user_revision_count
from wmf.mediawiki_history
where
    event_user_text_latest = "{name}" and
    wiki_db = "{wiki}" and
    snapshot = "2017-07"
order by event_user_revision_count desc
limit 1;
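To see why the `desc` matters, here is a small sketch that runs both query shapes against an in-memory SQLite table. It is an assumption-laden stand-in, not Hive: the table is named `mediawiki_history` without the `wmf.` prefix, it has only the four columns used in this thread, the rows are made up, and it says nothing about relative query speed on the cluster.

```python
import sqlite3

# In-memory stand-in for the Hive table; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table mediawiki_history (
        wiki_db text,
        snapshot text,
        event_user_text_latest text,
        event_user_revision_count integer
    )
    """
)
rows = [
    ("enwiki", "2017-07", "Alice", 1),
    ("enwiki", "2017-07", "Alice", 2),
    ("enwiki", "2017-07", "Alice", 3),
    ("enwiki", "2017-07", "Bob", 1),
    ("dewiki", "2017-07", "Alice", 7),
]
conn.executemany("insert into mediawiki_history values (?, ?, ?, ?)", rows)

params = {"name": "Alice", "wiki": "enwiki", "snapshot": "2017-07"}
where = (
    " where event_user_text_latest = :name"
    " and wiki_db = :wiki"
    " and snapshot = :snapshot"
)
base = "select event_user_revision_count from mediawiki_history" + where

# Variant 1: aggregate with max().
max_sql = "select max(event_user_revision_count) from mediawiki_history" + where
# Variant 2: sort descending and keep the first row.
desc_sql = base + " order by event_user_revision_count desc limit 1"
# Variant 3: ascending sort (no desc), which returns the smallest count.
asc_sql = base + " order by event_user_revision_count limit 1"

latest_via_max = conn.execute(max_sql, params).fetchone()[0]
latest_via_desc = conn.execute(desc_sql, params).fetchone()[0]
first_via_asc = conn.execute(asc_sql, params).fetchone()[0]
```

On this toy data the `max` and `order by ... desc limit 1` variants agree on Alice's latest enwiki count, while the ascending variant returns her first edit's count instead.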
nshahquinn-wmf raised the priority of this task from High to Needs Triage.
nshahquinn-wmf moved this task from Neil's in progress to Done on the Contributors-Analysis board.
nshahquinn-wmf moved this task from Done to Radar on the Contributors-Analysis board.
nshahquinn-wmf moved this task from Radar to Done on the Contributors-Analysis board.