Page MenuHomePhabricator

identify a reliable user field for use with the revision, page...maybe also the actor table
Closed, ResolvedPublic

Description

Is the revision_table's rev_user field reliable? Who is a good person to connect with about this?

Context & use-case: I am working to create a query to identify the number of editors that edited each article within a predefined list of articles, in a particular wiki, using replicated databases on MariaDB.

After looking at the following tables,
https://www.mediawiki.org/wiki/Manual:Page_table
https://www.mediawiki.org/wiki/Manual:Revision_table
https://www.mediawiki.org/wiki/Manual:Revision_actor_temp_table
https://www.mediawiki.org/wiki/Manual:Actor_table
https://www.mediawiki.org/wiki/Manual:Logging_table
https://www.mediawiki.org/wiki/Manual:User_table

it appears that the best tables to use for this task are the page_table and revision_table. Within the 'revision_table`, I'm interested in the rev_user field.
As I see that the revision table is undergoing, or will undergo, further changes given T161671, T188327, T215466, T167246 and also given issues with rev_usernoted in T112384, I wonder about whether I should be using rev_user after all and whether I should wait to use rev_actor or actor_id when those are ready.

Any feedback and clarity are appreciated. Thank you!

Event Timeline

Who is a good person to connect with about this?

Platform Engineering probably. Once we pick it up (which I did already), we'll move it around on our workboards (which I did).

Is the revision_table's rev_user field reliable?

The revision table's rev_user field is deprecated, is always 0 for new revisions, and will be removed in the future (see T161671). So, no, it's not reliable.

Context & use-case: I am working to create a query to identify the number of editors that edited each article within a predefined list of articles, in a particular wiki, using replicated databases on MariaDB.

In that case you may be interested in looking at T231598: Compose Count Queries. CPT very recently defined such queries as part of work on MediaWiki REST API endpoints (specifically T231590: Implement GET Edit Count).

Your answer depends on whether you want to count IP edits by distinct IPs (and imported edits not attributed to a local user) as distinct editors, or only count registered editors. In the former case you could join revision to revision_actor_temp and count distinct values for revactor_actor for each rev_page, as in the "total number of unique editors" query in T231598#5465711. In the latter case you'd join revision to revision_actor_temp to actor and count distinct non-null actor_user, similar to the query incorrectly labeled "anonedits: total number of anonymous edits" in T231598#5465711 (that query actually counts unique unregistered editors; you'd change the actor_user IS NULL to actor_user IS NOT NULL).

Note in either case you may want to take into account rev_deleted to avoid leaking information as to how many distinct users were involved in revision-deleted edits, as shown in the queries on T231598#5465711 linked in the previous paragraph.

At some point in the future the revision_actor_temp table will go away, and revision_actor_temp.revactor_actor will become revision.rev_actor. That's tracked in T215466, and specifically that would need to be done in your queries once step 4 is complete and before step 6 is started. We don't have a timeframe for when that will happen.

If you have any more questions, feel free to reply here. (:

Thank you for the information, this is helpful.