
Investigation: How can we identify which editors are semi-experienced editors?
Closed, Resolved (Public)

Description

As an organizer, I would like to know if an editor is a semi-experienced editor (i.e., an editor who has more than X edits).

Background: For the event invitation project, we are interested in giving organizers (who have the organizer right) the option to use a tool to invite editors to their events. Ideally, these editors should be productive (i.e., they have made a certain number of unreverted edits) and interested in the topics/articles of the event (i.e., they have made significant contributions to the articles in the event). This ticket is meant to investigate whether we can generate a list of potential editors to invite to an event.

Acceptance Criteria:

  • Read findings from T346961 as a first step
  • If we have a list of usernames:
    • How do we show only editors who've been productive lately?
      • For instance, this might be defined as editors who have at least 500 unreverted edits on Wikipedia (globally), AND at least X edits in the last N months.
  • Share the following information:
    • Summary of findings in response to the questions above, including:
      • Can we do this?
      • How would we do this?
    • Summary of any major technical issues or concerns to consider
    • Summary of any dependencies or blockers to do this work
  • Investigate and provide a working prototype for generating this list of editors

Timebox: 7 days

Event Timeline

ifried updated the task description.
ifried renamed this task from "Investigation: How can we identify which editors are semi-experienced editors?" to "Investigation: How can we identify which editors are semi-experienced and recently active editors?". Nov 2 2023, 1:52 PM
Daimona renamed this task from "Investigation: How can we identify which editors are semi-experienced and recently active editors?" to "Investigation: How can we identify which editors are semi-experienced editors?". Nov 2 2023, 1:56 PM
Daimona updated the task description.
Daimona updated the task description.

There are two main criteria proposed in this task, and I've looked at them separately. First, for:

at least X edits in the last N months

I checked out core's Special:ActiveUsers (ActiveUsersPager.php) to see if there's something potentially relevant to us. In short, there isn't much. I learned about the 'activeusers' cache in the querycachetwo table and its problems (T221449). Also, that list uses a fixed definition of "active", which is "at least 1 recentchange entry in the last X (30 by default) days". We might be able to copy parts of the activeusers query done in RecentChangesUpdateJob::updateActiveUsers, but the query itself is relatively simple; also, it will depend on our actual definition of "active".

Note that the query is not very fast (but not terribly slow either), and in fact, it's executed in an asynchronous job. In our case, performance may or may not be the same, depending on how the "filter by user" part will be done. At any rate, we should be able to use the (rc_actor, rc_timestamp) index on recentchanges.

Finally, note that we wouldn't be able to use recentchanges at all if the value of N is larger than the recentchanges max age (90 days). In that case, we would have to query revision instead, which is slightly different.
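To make the "at least X edits in the last N days" check more concrete, here is a rough sketch of what such a query could look like. It is not the query from RecentChangesUpdateJob; it runs against a toy in-memory SQLite table that keeps only the rc_actor and rc_timestamp columns, and the function name `recently_active`, the index name, and the sample data are all made up for illustration. A real query would also need to decide which kinds of recentchanges rows count as edits.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# MediaWiki stores timestamps as 14-character strings (YYYYMMDDHHMMSS),
# so plain string comparison orders them chronologically.
MW_TS = "%Y%m%d%H%M%S"


def recently_active(conn, actor_ids, min_edits, days, now):
    """Return the actor IDs (hypothetical helper) that have at least
    `min_edits` rows in recentchanges within the last `days` days."""
    if not actor_ids:
        return set()
    cutoff = (now - timedelta(days=days)).strftime(MW_TS)
    placeholders = ",".join("?" for _ in actor_ids)
    rows = conn.execute(
        "SELECT rc_actor FROM recentchanges "
        f"WHERE rc_actor IN ({placeholders}) AND rc_timestamp >= ? "
        "GROUP BY rc_actor HAVING COUNT(*) >= ?",
        [*actor_ids, cutoff, min_edits],
    ).fetchall()
    return {row[0] for row in rows}


# Toy stand-in for the real table: only the two columns used above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recentchanges (rc_actor INTEGER, rc_timestamp TEXT)")
conn.execute(
    "CREATE INDEX rc_actor_timestamp_idx ON recentchanges (rc_actor, rc_timestamp)"
)

now = datetime(2023, 11, 2, tzinfo=timezone.utc)


def ts(days_ago):
    return (now - timedelta(days=days_ago)).strftime(MW_TS)


conn.executemany(
    "INSERT INTO recentchanges VALUES (?, ?)",
    [
        (1, ts(1)), (1, ts(5)), (1, ts(20)),  # actor 1: three recent edits
        (2, ts(2)), (2, ts(100)),             # actor 2: one recent, one old
        (3, ts(200)),                         # actor 3: nothing recent
    ],
)

print(recently_active(conn, [1, 2, 3], min_edits=2, days=30, now=now))
```

The list of candidate users is passed in as an IN (...) filter here, which is only one possible way of doing the "filter by user" part; a join against a candidate table would be another.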


Then we have the following criterion:

500 unreverted edits on Wikipedia (globally)

Interestingly, every word in the above phrase is very relevant:

  • edits: this implies we'd only be checking edits, and not other log actions (such as page moves, file uploads, account creations). This doesn't really change much, and I'm only pointing it out for completeness.
  • globally: this is where things start to get messy. "global" things in MW generally involve poor performance (e.g., running a query separately for each wiki), or subpar code (CentralAuth), or both. The specifics here depend on what we want exactly, but for simplicity, and given the "edits", let's assume we're looking for a global edit count. CentralAuth provides a ::getGlobalEditCount() method; it used to be really slow and expensive (T300075), but this should no longer be the case since the edit count is now tracked explicitly in a DB table. Note that MW core does not provide an abstraction for the global edit count, so we'd have to use CentralAuth directly; we would fall back to the local edit count if CA is not installed.
  • Wikipedia: IIRC, the wording here is intentional, and we'd be looking at Wikipedias only. The immediate consequence of this is that we'd no longer be able to use the fast version of getGlobalEditCount; we would have to go back to the slow wiki-by-wiki approach, by first filtering the list of attached wikis to remove anything but Wikipedia sites. Another consequence is that we would need a fallback for non-WMF installations, but we could use the normal global edit count then. There's also the question of how to distinguish WP sites. This seems to be doable with MediaWikiServices::getInstance()->getSiteStore()->getSites() and then getGroup() on the individual sites, cross-referencing the results with the return value of queryAttached(). I came across ContentTranslation's isPotentialTranslator, which needs the edit count on Wikipedias only (this one does URL parsing instead of using SiteStore), but that code looks terribly expensive. queryAttached itself is expensive, and was the cause of the previous slowness of getGlobalEditCount. See also T167731 for another example.
  • unreverted: this is where it gets from "really slow" to "supermassively slow". I'm not aware of, and could not find, any abstraction in MW core to filter something by unreverted edits; not even for a single wiki, user, or page. I looked for ways to filter a list of edits to only include reverted edits; AFAICS, the only way to do that is to check the change tags associated with a revision, and see if the list includes any of ChangeTags::REVERT_TAGS. But that means that a user's edit count can no longer be retrieved from its dedicated storage (global_edit_count.gec_count for the unfiltered global count, or user.user_editcount for the local edit count also used by queryAttached()). Instead, the edit count would need to be computed on the fly by querying revision; separately for each wiki.
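To show why the last two points rule out the stored edit counts, here is a small sketch of the wiki-by-wiki computation in plain Python. Everything in it is a stand-in: `site_groups` plays the role of the SiteStore group lookup (the group name "wikipedia" is an assumption), `attached_wikis` plays the role of the queryAttached() result, the tag names are placeholders rather than the actual values of ChangeTags::REVERT_TAGS, and the in-memory revision lists replace what would be one revision query per wiki.

```python
from dataclasses import dataclass

# Placeholder tag names for illustration only; the real list would come
# from ChangeTags::REVERT_TAGS.
EXCLUDED_TAGS = frozenset({"revert-tag-a", "revert-tag-b"})


@dataclass(frozen=True)
class Revision:
    user: str
    tags: frozenset = frozenset()


def wikipedia_unreverted_count(user, attached_wikis, site_groups,
                               revisions_by_wiki, excluded_tags=EXCLUDED_TAGS):
    """Count a user's edits on Wikipedia sites only, skipping any revision
    that carries one of the excluded tags (hypothetical helper)."""
    total = 0
    for wiki in attached_wikis:
        # Drop anything that is not a Wikipedia (e.g. Commons, Wikidata).
        if site_groups.get(wiki) != "wikipedia":
            continue
        # In reality this inner loop is one query on revision per wiki,
        # which is the expensive part.
        for rev in revisions_by_wiki.get(wiki, []):
            if rev.user == user and not (rev.tags & excluded_tags):
                total += 1
    return total


site_groups = {
    "enwiki": "wikipedia",
    "itwiki": "wikipedia",
    "commonswiki": "commons",
}
revisions_by_wiki = {
    "enwiki": [
        Revision("Alice"),
        Revision("Alice", frozenset({"revert-tag-a"})),  # excluded by tag
        Revision("Bob"),
    ],
    "itwiki": [Revision("Alice")],
    "commonswiki": [Revision("Alice"), Revision("Alice")],  # not a Wikipedia
}

alice_count = wikipedia_unreverted_count(
    "Alice", ["enwiki", "itwiki", "commonswiki"], site_groups, revisions_by_wiki
)
print(alice_count)
```

The cost grows with the number of attached wikis and with the number of revisions per user, which is why a precomputed counter such as global_edit_count.gec_count cannot be reused once either filter is applied.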

All in all, I think it depends on how important each individual criterion is, and what trade-offs we're willing to make. Also, I don't yet know how this will relate to the other investigations (T350182, T350385). But overall, I think we should be looking at an asynchronous kind of approach, even if we restrict ourselves to the most basic criteria.

We are now generating Invitation Lists following this work along with other research, so I am marking this work as Done.