
Mediawiki History dumps unique editors feature request
Open, Medium, Public

Description

A user of the MediaWiki history dumps points out that it's computationally expensive to find the number of unique editors of each page. I wonder where we stand on making something like this available vs. providing a queryable form of this dataset sooner rather than later. This particular user would be happy if we added three fields: page_unique_registered_editors, page_unique_bot_editors, page_unique_anon_editors. Those could be either counts or actual lists. If we had incremental updates, it would be awesome to store these as unique sets and add to them as edits come in.
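To make the request concrete, here is a minimal sketch of the "unique sets, updated as edits come in" idea. It is an illustration only, not the dump pipeline's implementation: the record keys (`page_id`, `user_id`, `user_text`, `is_bot`) are hypothetical stand-ins for whatever the real dump fields are, and anonymous editors are assumed to have no `user_id`.

```python
from collections import defaultdict


def editor_kind(edit):
    """Classify an edit as 'bot', 'anon', or 'registered'.

    Assumption: `is_bot` and a missing `user_id` are enough to tell the
    three groups apart; the real dumps may need different checks.
    """
    if edit["is_bot"]:
        return "bot"
    if edit["user_id"] is None:
        return "anon"
    return "registered"


def unique_editors_per_page(edits):
    """Build page_id -> kind -> set of editor identifiers, one edit at a time."""
    editors = defaultdict(
        lambda: {"registered": set(), "bot": set(), "anon": set()}
    )
    for edit in edits:
        kind = editor_kind(edit)
        # Anonymous editors have no user_id, so fall back to user_text.
        ident = edit["user_text"] if kind == "anon" else edit["user_id"]
        editors[edit["page_id"]][kind].add(ident)
    return editors


def to_counts(editors):
    """Reduce the sets to the three count fields proposed in the task."""
    return {
        page_id: {
            "page_unique_registered_editors": len(kinds["registered"]),
            "page_unique_bot_editors": len(kinds["bot"]),
            "page_unique_anon_editors": len(kinds["anon"]),
        }
        for page_id, kinds in editors.items()
    }


# Made-up sample records, for illustration only.
sample_edits = [
    {"page_id": 1, "user_id": 10, "user_text": "Alice", "is_bot": False},
    {"page_id": 1, "user_id": 10, "user_text": "Alice", "is_bot": False},
    {"page_id": 1, "user_id": None, "user_text": "192.0.2.1", "is_bot": False},
    {"page_id": 1, "user_id": 20, "user_text": "FixBot", "is_bot": True},
    {"page_id": 2, "user_id": 11, "user_text": "Bob", "is_bot": False},
]
counts = to_counts(unique_editors_per_page(sample_edits))
print(counts)
```

Keeping the sets rather than only the counts is what allows incremental updates: a new edit is a single `add`, whereas a bare count cannot tell whether the editor was already seen. The cost is memory proportional to the number of distinct (page, editor) pairs.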

Quote from the original email thread, to give more context about how this is useful:

"Thanks for creating the feature request on number of editors. This would be fantastic. The number of editors who intervened in an article is the feature most strongly correlated with the number of interwiki links. Topics interesting to many editors in one language are more likely to be translated.

I'm also computing the number of edits on the talk page of the article. To do this, I follow the same procedure of taking the edit_count of the last revision for a specific page_title and page_namespace = 1. I don't think it is necessary to create another field.

I also compute the number of edits done in the last month (a metric similar to seconds since the last edit, to see the amount of recent activity). We have the first_timestamp, and I also keep the last_timestamp for every article, but you can get this by reading all the revisions and keeping the latest timestamp (comparing as you go, since the revisions do not seem to be in order).

The different reorderings (by page, by edit timestamp and by user) would all be fantastic. I would only do the reorder by user, because that would greatly facilitate the analysis of each editor's lifecycle, and I would include the new features on editor_count (by type). So there is no need to reorder by page, and no need for a lighter dataset including all the unique editors by page. But that's according to the uses I need and the ones I see as possible."
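The per-page activity metrics described in the quote (edit counts for an article and its talk page, edits in the last month, first and last timestamps) can be sketched as a single pass that does not depend on revision order. This is a hedged illustration: the keys `page_namespace`, `page_title` and `timestamp` are hypothetical, and it counts revisions directly instead of reading a cumulative edit_count from the last revision as the email describes.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def page_activity(revisions, now, window=timedelta(days=30)):
    """Aggregate per-page activity in one pass over unordered revisions."""
    stats = defaultdict(
        lambda: {
            "edit_count": 0,
            "recent_edits": 0,
            "first_timestamp": None,
            "last_timestamp": None,
        }
    )
    cutoff = now - window
    for rev in revisions:
        # Namespace 1 is the talk page of the article with the same title.
        s = stats[(rev["page_namespace"], rev["page_title"])]
        ts = rev["timestamp"]
        s["edit_count"] += 1
        if ts >= cutoff:
            s["recent_edits"] += 1
        # Compare explicitly, because input order is not guaranteed.
        if s["first_timestamp"] is None or ts < s["first_timestamp"]:
            s["first_timestamp"] = ts
        if s["last_timestamp"] is None or ts > s["last_timestamp"]:
            s["last_timestamp"] = ts
    return stats


# Made-up sample revisions, deliberately out of order.
sample_revisions = [
    {"page_namespace": 0, "page_title": "Barcelona", "timestamp": datetime(2020, 5, 20)},
    {"page_namespace": 0, "page_title": "Barcelona", "timestamp": datetime(2020, 1, 1)},
    {"page_namespace": 1, "page_title": "Barcelona", "timestamp": datetime(2020, 3, 1)},
    {"page_namespace": 0, "page_title": "Barcelona", "timestamp": datetime(2020, 6, 1)},
]
activity = page_activity(sample_revisions, now=datetime(2020, 6, 4))
article = activity[(0, "Barcelona")]
talk = activity[(1, "Barcelona")]
print(article["edit_count"], article["recent_edits"], talk["edit_count"])
```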

Event Timeline

We would need a heuristic for data size, as pages might have a large number of distinct users. If we implement users-per-page, we should probably implement pages-per-user too :)

One thing we suggested in the past to solve this was having two alternative dumps: one ordered by user_id, the other by page_id.
Not sure if this would cover all the use cases you had in mind.

What Dan is suggesting is the number of editors who intervened in an article. That's very useful.

However, a dataset ordered by user_id would also be very useful for understanding community health and personal preferences, because we would be able to analyze editors one by one without keeping large amounts of data in RAM or decompressing the dump to reorder it, which is not ideal.
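As a rough illustration of why the ordering matters: if the input is already sorted by user_id, each editor can be processed as a contiguous group and then discarded, so memory use is bounded by the largest single editor rather than by the whole dump. The sketch below assumes a hypothetical iterable of records with `user_id` and `page_id` keys; it is not tied to the actual dump format.

```python
from itertools import groupby
from operator import itemgetter


def per_user_summaries(edits_sorted_by_user):
    """Yield one summary per editor from a stream sorted by user_id.

    groupby only groups consecutive records, so this is correct only if
    the input really is sorted (or at least grouped) by user_id.
    """
    for user_id, user_edits in groupby(
        edits_sorted_by_user, key=itemgetter("user_id")
    ):
        pages = set()
        n_edits = 0
        for edit in user_edits:
            pages.add(edit["page_id"])
            n_edits += 1
        # Only this editor's pages are held in memory at this point.
        yield {
            "user_id": user_id,
            "edit_count": n_edits,
            "unique_pages": len(pages),
        }


# Made-up sample stream, already sorted by user_id.
sample_stream = [
    {"user_id": 10, "page_id": 1},
    {"user_id": 10, "page_id": 1},
    {"user_id": 10, "page_id": 2},
    {"user_id": 11, "page_id": 1},
]
summaries = list(per_user_summaries(iter(sample_stream)))
print(summaries)
```

With an unsorted dump, the same analysis needs either a per-user dictionary covering every editor at once or an external sort first, which is the overhead described above.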

So, a) counting the number of editors (by type) who intervened in an article is fantastic, and b) having this dataset reordered by user_id would also be marvelous.

For research and analytic projects on diversity and community health (I am involved in two), this would really facilitate many processes. In fact, I want to publicly appreciate that this dataset came out, since it makes the job much easier than having to read the XML of the revision dump or massively download data from the replicas.

So thanks again!

Milimetric triaged this task as Medium priority. Jun 4 2020, 3:58 PM
Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

We went over this in our prioritization and it's a bit more complicated. We'll focus on making this dataset available in a queryable fashion on our public APIs, and put this task in our backlog with lower priority. For your planning purposes, that means we probably won't get to it over the next year.

Oh, I see. But these seem to be two different things. The API will increase the use of some of the data contained in this dataset, but making it queryable does not mean that it will become easy to obtain the number of editors. Obtaining the number of editors for all pages would require querying the whole dataset, which could be costly, especially for English Wikipedia, or when doing it repeatedly as I have to do for all languages.

I understand that the low priority also applies to creating a second version of the dataset sorted by user_id.

In any case, I'll be looking forward to any changes and improvements. Thank you.

The second version with a user_id sorting is something we're still considering. But to be clear, the queryable version would come with a cluster powerful enough for you to do operations such as count unique editors per page, even on big wikis. We're glad you brought this to our attention and we're taking the requirement as part of that planning. It's just that we think taking longer to make a more multi-purpose solution is more valuable for more users.

I totally understand. I would only ask you to consider releasing files (datasets or dumps of any kind) in addition to the queryable version: because I need the data for all 300 languages, I would prefer not to query this much. Perhaps, if the cluster is powerful enough, it could do this once and leave the data there? In any case, it is just a suggestion; you take into account many more factors and the interests of other users, which may or may not coincide with mine.
Thank you.

(Sent by email in reply to the message from Milimetric <no-reply@phabricator.wikimedia.org> of Fri., 5 June 2020, 17:28, which appears in full as the previous comment. Task: https://phabricator.wikimedia.org/T254234)
