Get user/editcount data to determine count at percentiles
Closed, Resolved · Public

Description

User Story: As a UX Researcher, I want to know how "typical" a certain edit count is among editors who have been active in some timeframe in the last months.

Format: Ideally, I would have a table (CSV), or even just the query that I can put into Quarry to generate the table.

Columns:

  • (pseudonymous) user (some key/id is sufficient)
  • User edit count
  • Last active on [UNIX DATE] (basically I want to have active users, but it might be helpful to have users who did <5 edits in the last month)
  • ideally, a column indicating if the account is a bot or not (Bot: TRUE | FALSE or so).

This might be a very long table, so if it goes beyond 20 MB I'd also take a sampled or otherwise truncated version.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 13 2020, 4:10 PM
Jan_Dittrich updated the task description. · Jan 13 2020, 4:10 PM
Jan_Dittrich updated the task description. · Jan 13 2020, 4:15 PM

Any related project tag to add here, so this task could be found? :)

Data asked about here is actually pretty generic/makes sense in various projects. But given the original need came up in the Wikidata context, tagging as such.

@Jan_Dittrich

The only thing that I do not understand here is the following planned column:

(pseudonymous) users

Do you need (1) a split between anonymous vs. non-anonymous editors in this column, or (2) a column where each particular editor is represented by one row while the (pseudonymous) user column refers to some ID value which thus anonymizes the real user ID/username? I guess (2)?

I want to know how "typical" a certain edit count among editors who have been active in some timeframe in the last months.

This is doable as a monthly update from the wmf.mediawiki_history table and could be delivered as a Notebook Report + the data set.
I can provide visualizations and statistical summaries of the data set to help you address the "typicality" of particular edit count classes.
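A minimal sketch of how the "typicality" of an edit count could be quantified, assuming the data set has been reduced to a plain list of per-editor edit counts. The function names and the toy numbers are illustrative only and are not taken from the actual report:

```python
import math
from bisect import bisect_right


def percentile_rank(edit_counts, value):
    """Percentage (0-100) of editors whose edit count is <= value."""
    ordered = sorted(edit_counts)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


def count_at_percentile(edit_counts, pct):
    """Nearest-rank percentile: the smallest edit count such that
    at least pct percent of editors are at or below it."""
    ordered = sorted(edit_counts)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up toy data, NOT real editor statistics.
toy_counts = [1, 2, 2, 5, 10, 40, 100, 250, 1000, 5000]

print(percentile_rank(toy_counts, 10))      # 50.0
print(count_at_percentile(toy_counts, 90))  # 1000
```

The nearest-rank definition is used here because it always returns an edit count that actually occurs in the data; an interpolating definition would be an equally valid choice.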

a column where each particular editor is represented by one row while the (pseudonymous) user column refers to some ID value which thus anonymizes the real user ID/username? I guess (2)?

Thanks for pointing this out! It just indicates that, while one user should be one row, I do not need the user ID or user name, just some key that is unique in the table.
Also (but I think you guessed so already), it would be great if there were a bot/not-bot column OR if bots were directly excluded (if I get the data as CSV, I'll take the extra column).
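One way such a table could be produced is sketched below, assuming the rows arrive as dictionaries with hypothetical `user_name`, `edit_count` and `is_bot` fields. Rows are shuffled before keys are assigned so that the key carries no information about the original ordering. This is an illustration, not the procedure actually used for the shared results:

```python
import random


def pseudonymize(rows, exclude_bots=False, seed=None):
    """Replace user names with a key that is unique within the table.

    rows: iterable of dicts with 'user_name', 'edit_count', 'is_bot'
    (hypothetical field names). Returns new dicts without 'user_name'.
    """
    kept = [r for r in rows if not (exclude_bots and r["is_bot"])]
    rng = random.Random(seed)
    rng.shuffle(kept)  # break any link between key and input order
    return [
        {"user_key": key, "edit_count": r["edit_count"], "is_bot": r["is_bot"]}
        for key, r in enumerate(kept, start=1)
    ]


# Made-up example rows.
example_rows = [
    {"user_name": "ExampleUserA", "edit_count": 12, "is_bot": False},
    {"user_name": "ExampleBot", "edit_count": 90000, "is_bot": True},
    {"user_name": "ExampleUserB", "edit_count": 340, "is_bot": False},
]

for row in pseudonymize(example_rows):
    print(row)
```

Passing `exclude_bots=True` drops bot accounts instead of flagging them, which covers the second option mentioned above.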

Jan_Dittrich updated the task description. · Jan 14 2020, 8:03 PM
Addshore moved this task from incoming to in progress on the Wikidata board. · Jan 16 2020, 7:40 PM

@Jan_Dittrich @WMDE-leszek

  • results with anonymized user_ids shared with @Jan_Dittrich via e-mail (cc: @WMDE-leszek);
  • awaiting feedback;
  • no public results yet: if this is to go on crontab and produce regular updates, we first need to ask Analytics for a public data set review.

Thanks! This is what I needed. In case I need precise numbers for area-under-curve or so, I’ll create another task. (For now I "integrated" via cumulative sums in Excel, so I might be off a measurement or so, but the results are still good enough as estimates.)
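For reference, a cumulative-sum calculation of this kind can be sketched outside a spreadsheet as well. The example assumes a frequency table mapping each edit count to the number of editors with that count; the table layout, the function name and the numbers are assumptions made for illustration:

```python
from itertools import accumulate


def cumulative_share(freq):
    """freq: dict mapping edit count -> number of editors with that count.

    Returns a list of (edit_count, share) pairs, where share is the
    fraction of all editors with an edit count <= edit_count.
    """
    counts = sorted(freq)
    running = list(accumulate(freq[c] for c in counts))
    total = running[-1]
    return [(c, r / total) for c, r in zip(counts, running)]


# Made-up frequency table, NOT real data.
toy_freq = {1: 5, 2: 3, 10: 2}

print(cumulative_share(toy_freq))  # [(1, 0.5), (2, 0.8), (10, 1.0)]
```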

@Jan_Dittrich Great! Would you like to have the ETL procedure put on a crontab to run a regular monthly update, or shall we say you just ask me when you need the data again?

Jan_Dittrich added a comment. · Edited · Jan 21 2020, 1:12 PM

I'll just ask you – I guess it won't fluctuate widely from month to month.

GoranSMilovanovic closed this task as Resolved. · Jan 22 2020, 6:59 AM