Description

It would be very useful to have the user's historical edit count attached to each event. This would help with things like bucketing by user experience (for example, how does the rate at which a user is reverted change with their experience level?) and checking whether a user belonged to a 'virtual' user group like autoconfirmed or extendedconfirmed when they made their edit.
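As an illustration of the bucketing use case, a query along these lines could compute revert rate by experience level once such a field exists. This is only a sketch: event_user_revision_count is the field eventually added by this task, while revision_is_identity_reverted, event_entity, event_type, and the bucket boundaries are assumptions made for the example.

```sql
-- Illustrative sketch only: revert rate by user-experience bucket.
-- event_user_revision_count is the cumulative edit count requested by this task;
-- revision_is_identity_reverted, event_entity, event_type and the bucket
-- boundaries are assumptions for the example.
select experience_bucket,
       avg(if(was_reverted, 1, 0)) as revert_rate
from (
  select
    case
      when event_user_revision_count < 10   then '0-9 edits'
      when event_user_revision_count < 100  then '10-99 edits'
      when event_user_revision_count < 1000 then '100-999 edits'
      else '1000+ edits'
    end as experience_bucket,
    revision_is_identity_reverted as was_reverted
  from wmf.mediawiki_history
  where wiki_db = 'enwiki'
    and snapshot = '2017-07'
    and event_entity = 'revision'
    and event_type = 'create'
) revisions
group by experience_bucket;
```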
Details
| Project | Branch | Lines +/- | Subject |
|---|---|---|---|
| analytics/refinery/source | master | +363 -137 | Add new fields in mediawiki_history job |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | None | | T161147 Provide cumulative edit count in Data Lake edit data |
| Resolved | | Milimetric | T169782 Troubleshoot issues with sqoop of data not working for big tables |
Event Timeline
An approach would be to add a new computation step that computes these values in a smaller, denormalized dataset that is not split per year. Once we have calculated edit counts per user, that data can be joined with the partitioned dataset (data is partitioned per year); see the sketch below. The issue is that this "smaller" set where we do the computations might still be too big to fit in RAM; if so, this approach would not work.
We would need to test whether this approach is feasible on enwiki.
If this approach works:
- a unit test would need to be added
If it doesn't:
- we would need to compute the "edit count" as part of the initial computation
We can point the task as if this approach worked.
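For illustration only (this is not the actual refinery implementation, which lives in the patch linked below), the idea can be sketched in HiveQL as a window function over a denormalized revision set, joined back to the per-year-partitioned history. The table names and the rev_* fields here are assumptions made for the sketch.

```sql
-- Sketch only, not the refinery implementation. Table names (revisions_all,
-- history_by_year) and the rev_* fields are made up for illustration.
with per_user_counts as (
  select
    wiki_db,
    rev_id,
    -- cumulative number of edits the user had made, up to and including this one
    row_number() over (
      partition by wiki_db, rev_user
      order by rev_timestamp
    ) as event_user_revision_count
  from revisions_all
)
select h.*,
       c.event_user_revision_count
from history_by_year h
join per_user_counts c
  on h.wiki_db = c.wiki_db
 and h.rev_id = c.rev_id;
```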
Change 359019 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Add new fields in mediawiki_history job
Change 359019 merged by Mforns:
[analytics/refinery/source@master] Add new fields in mediawiki_history job
@Neil_P._Quinn_WMF We have added the cumulative edit count. Would you be so kind as to do some vetting of the data? (We have done some ourselves, but additional verification is always nice.)
@Nuria, I really apologize; this got swallowed by other work and then Wikimania.
I did some spot-checks today, and everything seems right so far. But I want to do some systematic random checks before calling this done; I hope to get to them by next week. Let me know if this work has any dependencies—as far as I know, there aren't any (though of course I still understand your desire to get this checked off reasonably quickly).
Also, a quick question: is this a good way to get a user's latest edit count using the data lake, or can you think of a faster one?
select max(event_user_revision_count) from wmf.mediawiki_history where event_user_text_latest = "{name}" and wiki_db = "{wiki}" and snapshot = "2017-07";
@Neil_P._Quinn_WMF Given that the user's edit count is cumulative, the latest count for a user will always be the maximum, right?
select event_user_revision_count from wmf.mediawiki_history where event_user_text_latest = "{name}" and wiki_db = "{wiki}" and snapshot = "2017-07" order by event_user_revision_count limit 1;
Not sure if this would be faster, but it seems it might. Will try to check.
@Neil_P._Quinn_WMF : This task is part of a quarterly goal, so we would like it to be fully resolved before the end of September. Thanks!
I actually started working on the systematic random checks today, but it looks like that will be a relatively large project that I unfortunately don't have time for.
Based on my spot-checks, this looks done correctly.
It actually took about half the time of my query (1 min, 8 s vs. 2 min, 20 s) when looking at a single user. Nice, I would not have guessed :)
It does need a desc in the order by clause, though:
select event_user_revision_count from wmf.mediawiki_history where event_user_text_latest = "{name}" and wiki_db = "{wiki}" and snapshot = "2017-07" order by event_user_revision_count desc limit 1;
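As a closing usage note tied back to the original motivation, the same field can also be used to check whether a user had already passed an edit-count threshold, such as the English Wikipedia thresholds for autoconfirmed (10 edits) and extendedconfirmed (500 edits), at the time of each edit. This is only a sketch: it ignores the account-age part of those requirements, and the group-by expression is repeated because the alias cannot be used there.

```sql
-- Illustrative only: count a user's edits by whether the autoconfirmed (10) or
-- extendedconfirmed (500) edit-count threshold was already met when the edit was
-- made. The account-age part of those requirements is ignored here.
select
  case
    when event_user_revision_count >= 500 then 'extendedconfirmed threshold met'
    when event_user_revision_count >= 10  then 'autoconfirmed threshold met'
    else 'below autoconfirmed threshold'
  end as experience_level,
  count(*) as edits
from wmf.mediawiki_history
where event_user_text_latest = "{name}"
  and wiki_db = "{wiki}"
  and snapshot = "2017-07"
group by
  case
    when event_user_revision_count >= 500 then 'extendedconfirmed threshold met'
    when event_user_revision_count >= 10  then 'autoconfirmed threshold met'
    else 'below autoconfirmed threshold'
  end;
```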