Research what anti-vandalism tools are most commonly used
Closed, Resolved, Public

Description

Request

The Problem

The ReviewStream feed is designed to encourage adoption of newcomer-friendly edit review features in popular antivandalism tools—while improving edit review overall. The new feed will have no effect, however, if it isn't adopted by these programs, so we want to work with the community to provide designs and technical assistance where necessary. But which tools should we focus our attention on? Which will have the biggest effect on newcomers and on the edit-review process generally? Where do we find the biggest bang for our buck?

To know where to put our efforts, it will be extremely helpful to know what tools are most popular and/or most productive. Because time is short and we know the edit research team is busy with a big project, we are happy to confer on ways we can whittle this job down so that we get enough data to make decisions without having this be a huge effort.

One is that we don't need to look at all possible tools. It's our belief that the following are probably the most relevant:

  • Huggle
  • RTRC
  • STiki
  • LiveRC
  • Any others?

Furthermore, we don't need exact figures; ballpark numbers will do, even if it turns out we can't get apples-to-apples comparisons.

What Figures Do We Want?

I suppose the most useful feature would be the number of edits (reverts, talk page messages left, Thank-yous, and so on) completed during, say, a given month. Failing that, the number of users per tool?

The complication is that we'd ideally like to know this across the eight wikis that are in the initial target group. These are:

  • English Wikipedia
  • Persian Wikipedia
  • Dutch Wikipedia
  • Polish Wikipedia
  • Portuguese Wikipedia
  • Russian Wikipedia
  • Turkish Wikipedia
  • Wikidata

Deadline

This information would be most useful if we could get it by the beginning of the new year—say, the first week in January.

Response

Searching for additional tools

The hardest part of this request is knowing which tools we want to quantify the impact of; this depends more on local knowledge than on analytic skill.

Seven of the sites on the list are Wikipedias, so I went through the Wikipedia gadget report down to the 2 000 user mark. I didn't find any new ones used for edit review. I also checked the list of default gadgets on Wikipedias and didn't find any there either.

For the English Wikipedia, I reviewed the main list of counter-vandalism tools and checked views of their wiki pages for tools that looked like they could conceivably be active (I excluded Twinkle because it doesn't provide its own patrolling interface). Only Huggle, STiki, and Igloo had significant pageviews. The list of site gadgets didn't include anything related to patrolling.

For Wikidata, I reviewed its lists of external tools and of gadgets and didn't find anything used to patrol new edits. When I looked for things tagged 'wikidata' in the tools directory, I found one thing: the reCH tool.

Huggle

Huggle is definitely used at the English Wikipedia. Comparing views to its wiki pages across languages, it's worth checking its use on the Russian, Portuguese, Persian, Turkish, and Dutch Wikipedias too.

It's possible to check the number of Huggle installations by checking for pages like User:<user>/huggle3.css, but that's very time-consuming and is less useful than knowing the number of edits made.

Huggle uses a different edit comment string on each Wikipedia: [[WP:HG|HG]] on English, [[ВП:HG|HG]] on Russian, [[WP:H|Huggle]] on Portuguese, [[وپ:هاگ|هاگ]] on Persian, and [[Project:Huggle|HG]] on Turkish. Dutch seems to have no Huggle activity.

Searching for these strings shows that, over the past 30 days, it was used to make the following:

  • English Wikipedia: 60 405 edits
  • Russian Wikipedia: 184 edits
  • Portuguese Wikipedia: 5 591 edits
  • Persian Wikipedia: 38 edits
  • Turkish Wikipedia: 133 edits
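
As a rough illustration of the counting approach, here is a minimal sketch that runs the same kind of substring match against a toy table. It assumes an in-memory SQLite database standing in for a wiki's recentchanges table; the helper name `count_marked_edits` and the sample rows are invented for the example, and SQLite's `instr()` replaces the `like` match used in the real queries. Only the marker strings come from the text above.

```python
import sqlite3

# Huggle edit-comment markers per wiki, copied from the text above.
HUGGLE_MARKERS = {
    "enwiki": "[[WP:HG|HG]]",
    "ruwiki": "[[ВП:HG|HG]]",
    "ptwiki": "[[WP:H|Huggle]]",
    "fawiki": "[[وپ:هاگ|هاگ]]",
    "trwiki": "[[Project:Huggle|HG]]",
}


def count_marked_edits(conn, marker):
    """Count recentchanges rows whose comment contains the marker (hypothetical helper)."""
    # instr() performs a literal, case-sensitive substring match in SQLite.
    row = conn.execute(
        "SELECT COUNT(*) FROM recentchanges WHERE instr(rc_comment, ?) > 0",
        (marker,),
    ).fetchone()
    return row[0]


# Toy stand-in for one wiki's recentchanges table; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recentchanges (rc_id INTEGER PRIMARY KEY, rc_comment TEXT)"
)
conn.executemany(
    "INSERT INTO recentchanges (rc_comment) VALUES (?)",
    [
        ("Reverted edits by Example ([[WP:HG|HG]])",),
        ("fixed typo",),
        ("Warning left on talk page ([[WP:HG|HG]])",),
    ],
)

toy_count = count_marked_edits(conn, HUGGLE_MARKERS["enwiki"])
print(toy_count)
```

Passing the marker as a bound parameter avoids having to quote the brackets and pipe characters by hand for each wiki.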

RTRC

Combining data from the gadget report and the script link tool, RTRC has been installed by a significant number of users at the following wikis (note that many may be inactive):

  • English Wikipedia: 5 246 users
  • Persian Wikipedia: 481 users
  • Turkish Wikipedia: 254 users
  • Dutch Wikipedia: 144 users
  • Polish Wikipedia: 88 users

RTRC doesn't use any edit tags or comment strings to mark edits, so there's no way to count them.

STiki

STiki is based at the English Wikipedia, where it includes [[WP:STiki|STiki]] in its edit comments. STiki has a page at the Portuguese Wikipedia, but that gets very few views and the tool wasn't mentioned in any edit comments during the past 30 days, so it's clearly not actively used there.

At the English Wikipedia, over the past 30 days, 52 different editors used it to make 23 447 edits.

LiveRC

LiveRC is based at the French Wikipedia (where it has an edit tag), but that's not in the group we're interested in. According to the gadget report, it has been installed by a number of users at the Polish Wikipedia (1 004 users) and Persian Wikipedia (541 users). Note that many of these users may be inactive.

As far as I can tell, neither of these Wikipedias has an edit tag or edit comment string which can be used to identify LiveRC edits.

reCH

reCH, which is a Wikidata-specific tool, tags its edits with OAuth CID: 408 (there's another tag for a previous version, but it wasn't used during the past 30 days). However, it's used for many things other than edit review, so to find reverts you also have to look for edits with the string Undid in the comment. It can be used to patrol changes, but there's no way to attribute particular patrols to the tool.

Over the past 30 days, it was used to make 72 reverts.

Igloo

Igloo seems to be available on the English Wikipedia only, where it includes [[Wikipedia:Igloo|GLOO]] in its edit comments. Over the past 30 days, it was used to make 74 edits.

Queries used

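-- Huggle edits on the English Wikipedia. There is no timestamp filter: the count
-- covers every row held in recentchanges, which the text above treats as the past 30 days.
select count(*)
from enwiki.recentchanges
where
rc_comment like "%[[WP:HG|HG]]%";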

count(*)
60405

---

select count(*)
from ruwiki.recentchanges
where
rc_comment like "%[[ВП:HG|HG]]%";

count(*)
184

---

select count(*)
from ptwiki.recentchanges
where
rc_comment like "%[[WP:H|Huggle]]%";

count(*)
5591

---

select count(*)
from fawiki.recentchanges
where
rc_comment like "%[[وپ:هاگ|هاگ]]%";

count(*)
38

---

select count(*)
from trwiki.recentchanges
where
rc_comment like "%[[Project:Huggle|HG]]%";

count(*)
133

---

select count(distinct rc_user)
from enwiki.recentchanges
where rc_comment like "%[[WP:STiki|STiki]]%";

count(distinct rc_user)
52

---

select count(*)
from enwiki.recentchanges
where rc_comment like "%[[WP:STiki|STiki]]%";

count(*)
23447

---

select count(*)
from wikidatawiki.recentchanges
inner join wikidatawiki.change_tag
on rc_id = ct_rc_id
where
ct_tag = "OAuth CID: 408" and
rc_comment like "%Undid%";

count(*)
72

---

select count(*)
from enwiki.recentchanges
where rc_comment like "%[[Wikipedia:Igloo|GLOO]]%";

count(*)
74

Event Timeline

https://meta.wikimedia.org/wiki/Gadgets/wikipedia may give you a very good overview of which tools are mainly installed (which supposes they are used).

| Tool | Installations | Notes |
| RTRC | 5997 | pt (620), pl (47), nl (15), en (4210), fa (380) |
| LiveRC | 4355 | 4 different installations |
| DynamicRC | 358 | Spanish only |

Neil will certainly have more data.

Interesting data! I didn't know of these reports. Additionally, there is also tracking of user-script installations (where a user places an import script command in their personal common.js or global.js page, which doesn't show up as gadget usage).

This is especially common for users who want to enable a gadget globally at once for all wikis they use. They typically install the gadget via their global.js page on Meta-Wiki. It's also common for users on wikis where the gadget is not pre-installed by a local sysop.

https://tools.wmflabs.org/usage/

| Tool | Uses | Notes |
| RTRC | 866 | meta.wikimedia.org (310), nl.wikipedia.org (142), en.wikipedia.org (98) |
| RTRC-dev | 34 | Beta version |

@leila, someone said you might have done some work in this area. Please see the task description. Do you have any stats on vandalism tools? Thanks!

@jmatazzoni I don't have stats specific to vandalism tools. The closest work I've been involved in is the research on hoax detection on Wikipedia: https://meta.wikimedia.org/wiki/Research:Understanding_hoax_articles_on_English_Wikipedia

@Neil_P._Quinn_WMF, we are moving forward to create designs for various antivandalism tools, based a bit on hunches and hearsay. Do you think it might be possible to get some data for this request during the first two weeks of January?

JJMC89 subscribed.

User:<user>/huggle3.css could be helpful. It includes the last version used, and the page history will indicate when that user upgraded.

@JJMC89, thanks for the suggestion! So that page is generated for every Huggle user?

@Neil_P._Quinn_WMF Yes, it is generated automatically by Huggle. @Petrb might be able to give you other ideas to track usage.

I suppose the most useful feature would be the number of edits

Note that this is not a measure of "productivity". Doing a lot of edits with high error rate means you're actually increasing human work down the line.

True, but we're trying to find out which tools are most used, not which are most productive.

Right. If we do this correctly, we'll be making these tools more productive.

nshahquinn-wmf renamed this task from "Research what anti-vandalism tools are most used/most productive?" to "Research what anti-vandalism tools are most commonly used". Feb 4 2017, 2:50 AM
nshahquinn-wmf updated the task description.

Okay, I think I've come up with as much data as I reasonably can for this. @jmatazzoni, I'll leave the task open for you to track digesting the results. Feel free to ask here if there's anything unclear.

If you want to further contextualize these numbers, I'd suggest the following:

  • For numbers of edits, you can compare it to the overall edit rate on the wiki, which you can get by counting entries from the wiki's recentchanges table where rc_source is mw.new or mw.edit.
  • For numbers of gadget installs, you can find the users who've made those installs by looking at the user_properties tables (the up_property value will start with gadget-) and then see how many of those users have been editing recently by joining with the editor month dataset available on the MariaDB analytics replica.
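
The two suggestions above could be sketched roughly as follows. This is a toy version under stated assumptions: an in-memory SQLite database with invented rows, a simplified `recentchanges` and `user_properties` schema, and a hypothetical `active_editors` table standing in for the editor month dataset (whose real schema is not described here). The function names and the `gadget-RTRC` property value are also illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE recentchanges (
        rc_id INTEGER PRIMARY KEY, rc_source TEXT, rc_comment TEXT
    );
    CREATE TABLE user_properties (
        up_user INTEGER, up_property TEXT, up_value TEXT
    );
    -- Hypothetical stand-in for the editor month dataset:
    -- one row per user who edited recently.
    CREATE TABLE active_editors (user_id INTEGER PRIMARY KEY);
    """
)

# Invented sample rows.
conn.executemany(
    "INSERT INTO recentchanges (rc_source, rc_comment) VALUES (?, ?)",
    [
        ("mw.edit", "Reverted edits ([[WP:HG|HG]])"),
        ("mw.edit", "copyedit"),
        ("mw.new", "created page"),
        ("mw.log", "protected page"),  # log entry, not an edit
        ("mw.edit", "Warning ([[WP:HG|HG]])"),
    ],
)
conn.executemany(
    "INSERT INTO user_properties VALUES (?, ?, ?)",
    [(1, "gadget-RTRC", "1"), (2, "gadget-RTRC", "1"), (3, "gadget-other", "1")],
)
conn.executemany("INSERT INTO active_editors VALUES (?)", [(1,), (3,)])


def tool_edit_share(conn, marker):
    """Return (tool edits, all edits), counting only mw.edit and mw.new rows."""
    total = conn.execute(
        "SELECT COUNT(*) FROM recentchanges "
        "WHERE rc_source IN ('mw.edit', 'mw.new')"
    ).fetchone()[0]
    tool = conn.execute(
        "SELECT COUNT(*) FROM recentchanges "
        "WHERE rc_source IN ('mw.edit', 'mw.new') AND instr(rc_comment, ?) > 0",
        (marker,),
    ).fetchone()[0]
    return tool, total


def active_installs(conn, gadget_property):
    """Count users with the gadget property set who also appear as active editors."""
    return conn.execute(
        "SELECT COUNT(DISTINCT up_user) FROM user_properties "
        "JOIN active_editors ON user_id = up_user "
        "WHERE up_property = ?",
        (gadget_property,),
    ).fetchone()[0]


tool, total = tool_edit_share(conn, "[[WP:HG|HG]]")
print(f"{tool} of {total} edits carry the marker")
print(active_installs(conn, "gadget-RTRC"), "active users with the gadget installed")
```

Filtering on rc_source keeps log entries out of the denominator, so the share is computed against edits and page creations only.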

I'd be happy to guide somebody from the Collaboration team in doing that.

cc @Petrb since he might be interested in the Huggle statistics :)

One other comment that I missed: the lack of consistent data on this seems like a good demonstration of the value of bots and tools using hashtags in their edit comments as in T123636 (for background, see this blog post).

@DarTar, I remember telling you about a year ago that I didn't think hashtags would be very useful to my work. Turns out I was wrong :)

@Slaporte might also be interested.

Ha, glad to hear that. I know there's renewed interest in hashtags; for example, programs like The-Wikipedia-Library are using them extensively to evaluate the volume and quality of edits driven by new campaigns.

Let me know if there's anything I can do to help bump this up the priority list.

Based on @Neil_P._Quinn_WMF's numbers above (thanks Neil, I know it wasn't easy!), here's a summary of what appear to be the salient results:

Popularity of antivandalism tools on current ORES wikis

| | English | Persian | Turkish | Polish | Portuguese |
| Huggle | 60K edits | | | | 6K edits |
| RTRC | 5K users | 500 users | 250 users | 88 users | 144 users |
| STiki | 20K edits (52 users) | | | | |
| LiveRC | | 500 users | | 1000 users | |

reCH and Igloo appear to be negligible, so I didn't list them.

Notes

RTRC: @Krinkle or @Neil_P._Quinn_WMF: Neil's figures show over 5K users on en.wiki, but the Gadget Usage page says 500 users have enabled it (and 21 are active in the last month). Any thoughts about the discrepancy? (That number does seem like a lot...)

STiki: Here is where we truly grasp the difficulty of comparing installations to edits/mo, or of judging based on installations at all. I'm assuming that Huggle on en.wiki is the big fish, with 60K edits/mo. But STiki's 20K edits (a sizable number) were done by only 50 people. So, for example, are 144 installations on Dutch Wikipedia, with its 2 million articles, a lot or a little?
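
To put the STiki figure in per-editor terms, here is a quick back-of-the-envelope calculation using the exact counts reported earlier (23 447 edits by 52 editors over 30 days). The variable names are illustrative, and the result is a plain average that says nothing about how edits are distributed across those editors.

```python
# Figures from the STiki section above: edits and distinct editors over 30 days.
stiki_edits = 23_447
stiki_editors = 52
window_days = 30

# Simple average; a few heavy users could account for most of the edits.
edits_per_editor = stiki_edits / stiki_editors
edits_per_editor_per_day = edits_per_editor / window_days

print(f"{edits_per_editor:.0f} edits per editor over {window_days} days")
print(f"{edits_per_editor_per_day:.0f} edits per editor per day")
```

No comparable figure can be computed for Huggle from the data here, because only its edit count was collected, not the number of distinct editors.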

Bottom Line

The goal here was to get a sense of where we'll get the biggest bang for the buck from adding ERI features to anti-vandalism tools. Our analysis is imperfect, as Neil will be the first to acknowledge, but it's the best info we have. Based on what I'm seeing, my sense is that the top-priority targets are:

  1. Huggle (this graph also provides some evidence here)
  2. RTRC
  3. LiveRC

If anyone has a different analysis or more data, please speak up.

That page you linked to looks like it covers usage on that wiki only (i.e. mediawiki.org). If you look at the gadget report for all Wikipedias (search for "RTRC"), you'll find the larger number I cited for enwiki.

So I checked into this, and I found out that the numbers given in the gadget reports (so all the "users" numbers in your table) are the number of users with the gadget installed who made at least one edit in the past 30 days. So that removes my first concern about the numbers—that they include a lot of editors who are totally inactive.

However, we still don't know how many of those active users actually used the tool in the past month. It could be a pretty small number, since there's no pressure to uninstall a gadget if you installed it but didn't use it. I installed RTRC for this project and I'm pretty sure I never removed it :)

I don't think there's any data that bears on this, unfortunately. I would say that it's better to compare installs with active users than with articles (so the Portuguese Wikipedia had about 7 700 editors who edited at all during the past month, of which about 140 had RTRC installed), but beyond that I think we've reached the end of the data.
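
The comparison suggested above (installs relative to active editors rather than articles) works out roughly as follows. This is a small sketch using the approximate figures from the preceding paragraph; `install_rate` is a hypothetical helper, not part of any existing tool.

```python
def install_rate(installs, active_editors):
    """Fraction of recently active editors who have the gadget installed."""
    if active_editors <= 0:
        raise ValueError("active_editors must be positive")
    return installs / active_editors


# Approximate figures quoted above: about 140 installs among about 7 700 active editors.
rate = install_rate(140, 7700)
print(f"{rate:.1%} of active editors have the gadget installed")
```

As noted above, this is an upper bound on actual use, since having the gadget installed does not mean it was used during the month.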

This seems like a very reasonable interpretation of the data to me.

@Neil_P._Quinn_WMF notes:

I found out that the numbers given in the gadget reports (so all the "users" numbers in your table) are the number of users with the gadget installed who made at least one edit in the past 30 days.

Point of clarification: by "made at least one edit" here do you mean a) one edit using the relevant tool, or b) one edit in this wiki by any means, or c) one edit in any wiki by any means? From your subsequent remarks, I'm guessing b?

Yep, b is what I meant. Sorry for the confusion!