Review top "non-bot" contributors in 2019
Closed, Resolved (Public)


We want to get a sense of how prevalent unidentified bots are among our editors.

Event Timeline

The first sheet lists the top 100 "non-bot" editors across all wikis in 2019 under our current bot definition: SIZE(event_user_is_bot_by_historical) = 0 AND SIZE(event_user_is_bot_by) = 0.
Among these 100 editors, 2% made over 10M edits, 6% made over 5M edits, and 30% made over 1M edits.
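
The threshold percentages above can be reproduced from the sheet with a small helper. This is a minimal sketch, assuming the per-editor edit counts have been exported as a plain list. The function name `share_above` and the sample counts are hypothetical and are not the real 2019 data.

```python
def share_above(edit_counts, threshold):
    """Return the percentage of editors whose edit count exceeds threshold."""
    if not edit_counts:
        return 0.0
    over = sum(1 for n in edit_counts if n > threshold)
    return 100.0 * over / len(edit_counts)


# Hypothetical sample standing in for the top-100 sheet (not real data).
sample_counts = [12_000_000, 6_500_000, 2_000_000, 900_000, 150_000]

for threshold in (10_000_000, 5_000_000, 1_000_000):
    pct = share_above(sample_counts, threshold)
    print(f"over {threshold:,} edits: {pct:.0f}% of editors")
```

The buckets are cumulative, so an editor with over 10M edits is also counted in the 5M and 1M buckets, which matches how the 2% / 6% / 30% figures read.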

The second sheet lists the top 100 "non-bot" editors using the bot definition in the edits_hourly table. It shows that some bots, such as Tuanminhbot (over 9M edits) and CheWikibot (over 1M edits), are counted as non-bot editors under this definition.

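SELECT
    event_user_text AS user_name,
    COUNT(*) AS edits
FROM mediawiki_history
WHERE
    snapshot = "2020-01" AND
    event_timestamp BETWEEN "2019-01-01" AND "2020-01-01" AND
    -- count revision creations (edits) only
    event_entity = "revision" AND
    event_type = "create" AND
    -- exclude users flagged as bots, currently or historically
    size(event_user_is_bot_by) = 0 AND size(event_user_is_bot_by_historical) = 0
GROUP BY event_user_text
-- keep only editors with at least 100,000 edits in the period
HAVING edits >= 100000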

To get a better sense of the impact (since everyone in the top 100 has more than 100,000 edits in a year), it would be helpful to know what proportion of total "non-bot" edits those top 100 represent.

Notebook for ref.

The top 100 non-bot editors of 2019 made 125,640,683 edits, which account for 42.79% of total non-bot edits.

Excluding Wikidata edits, the top 100 non-bot editors made 18,905,560 edits in 2019, which account for 12.63% of total non-bot, non-Wikidata edits. 70% of the top 100 "non-bot" contributors made most of their edits on Wikidata.
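
For reference, the arithmetic behind these two percentages can be checked with a short sketch. The four input figures are the ones quoted above. Everything derived from them is approximate, because the percentages are rounded. The subtraction assumes that both figures describe the same set of 100 editors, which the comment does not state explicitly.

```python
# Figures quoted in the comment above.
TOP100_ALL = 125_640_683           # top-100 "non-bot" edits, all wikis
TOP100_NON_WIKIDATA = 18_905_560   # top-100 "non-bot" edits, Wikidata excluded
SHARE_ALL = 42.79                  # percent of all "non-bot" edits
SHARE_NON_WIKIDATA = 12.63         # percent of "non-bot" non-Wikidata edits


def implied_total(part, share_percent):
    """Back out the denominator from a part and its (rounded) percentage."""
    return part / (share_percent / 100.0)


total_all = implied_total(TOP100_ALL, SHARE_ALL)
total_non_wikidata = implied_total(TOP100_NON_WIKIDATA, SHARE_NON_WIKIDATA)

# Assumption: the same 100 editors underlie both figures, so the
# difference is their Wikidata edit count.
top100_wikidata = TOP100_ALL - TOP100_NON_WIKIDATA
wikidata_share_of_top100 = 100.0 * top100_wikidata / TOP100_ALL

print(f"implied total non-bot edits: ~{total_all:,.0f}")
print(f"implied total non-bot, non-Wikidata edits: ~{total_non_wikidata:,.0f}")
print(f"Wikidata share of top-100 edits: ~{wikidata_share_of_top100:.1f}%")
```

Under that assumption, the large gap between the two shares is driven mostly by Wikidata, consistent with most of the top 100 editing mainly there.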

@JKatzWMF It looks like unidentified bots have a large impact on our edits, particularly when Wikidata editors & edits are included. Even when Wikidata is excluded, the top 100 "non-bot" editors make 12.63% of "non-bot" edits.

I think there's a strong case for investigating bots more fully and reconsidering our definitions, but I propose we table this for next FY.

Please review and let me know if we can resolve this ticket. I'll create a separate ticket for follow-up, to go in our backlog for now.

@kzimmerman and @cchen Thank you for this incredible summary!! 12% is huge. Before deciding whether or not to table this, can someone on the team run through this list of the top 30 and decide if they are bots or not? Judging from the first username with 10M edits, more than half of the edits (or 6% of total) are unidentified bot edits. If this is the only bot on the list, then the short-term solution is simply registering this as a bot. But if there are more in that top 30, then we might need something more. Does that sound fair? Here is a sheet where I started; it's probably an hour of work for someone familiar with edit histories.
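
The "more than half" and "6% of total" estimates above can be sanity-checked with a rough sketch. It assumes the top account made exactly 10M edits (the thread only says "10M" and "over 10M") and that all of them fall outside Wikidata. Neither assumption is confirmed in this thread.

```python
# Assumption: a floor of 10M edits for the top account, all non-Wikidata.
TOP_EDITOR_EDITS = 10_000_000
# Figures quoted earlier in the thread.
TOP100_NON_WIKIDATA = 18_905_560   # top-100 "non-bot" non-Wikidata edits
SHARE_NON_WIKIDATA = 12.63         # their percent of all such edits

# Share of the top-100 non-Wikidata edits made by this one account.
share_of_top100 = 100.0 * TOP_EDITOR_EDITS / TOP100_NON_WIKIDATA

# Scale by the top-100 share to get the account's share of all
# "non-bot" non-Wikidata edits.
share_of_total = share_of_top100 * SHARE_NON_WIKIDATA / 100.0

print(f"share of top-100 non-Wikidata edits: {share_of_top100:.1f}%")
print(f"share of all non-bot, non-Wikidata edits: {share_of_total:.1f}%")
```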

@JKatzWMF Connie has a lot on her plate at the moment and I'm trying to cut down her list & that of others on the team so they're not context switching as much. This sounds like a separate ask and I want to limit scope creep. Is there a team that might have more bandwidth and more experience reviewing edit histories?

@kzimmerman Just closing the loop here publicly. Kate and I resolved this offline the week of Feb 17th.

Using Connie's notebook, and looking up individuals, I was able to quickly ascertain that with the exception of one unidentified bot making a huge number of edits, most of the users on the list are real people making semi-automated edits via various bulk tools.

There is a meta issue here about how we categorize semi-automated work, but for the time being, it is clear that these are users and not unidentified bots. We can feel relatively assured that our current bot detection methods are finding bots that might be influencing our high-level metrics.