[REQUEST] Research into usefulness of edits made by IP editors
Closed, Resolved (Public)

Description

What's requested:

  • How many edits are made by IP editors on our projects?
    • Would be nice to have this broken down by project (Wikipedias, wikisources etc)
  • How useful are IP edits on our projects?
    • Some ways to gauge this: how many IP edits are reverted (good/bad faith)? How many of these IPs are eventually blocked?
    • What is the difference (if any) between IP behaviors on small vs large projects? For example, do IP editors account for a larger share of vandalism on smaller projects because those projects have less capacity to deal with it?

Why it's requested: This data will inform the decision on whether to pilot disallowing logged-out editing on one or more projects. The idea came up as a request on the IP masking project, as an alternative to IP masking that some groups find more acceptable.

When it's requested: The sooner the better. 1-2 weeks, ideally.

Other helpful information:

Event Timeline

@nettrom_WMF Created this based on our conversation this morning. Feel free to update if I missed something.

kzimmerman triaged this task as Medium priority.
kzimmerman moved this task from Triage to Next Up on the Product-Analytics board.

@Niharika : I've started working on the first part of this, and am wondering if there's a particular place or format you'd want for the results? Maybe there's a wiki page somewhere that could contain the tables as wiki-tables, so it's easy to refer to in discussions?

As an example, here's what the overall project statistics look like, using data from 1 Sept 2018 to 1 Sept 2019:

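| database_group | N Total edits | N IP edits | IP proportion (%) | N Registered edits | Registered proportion (%) | N Bot edits | Bot proportion (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| commons | 45468537 | 405513 | 0.89 | 45063023 | 99.11 | 10689905 | 23.51 |
| mediawiki | 520103 | 35347 | 6.8 | 484752 | 93.2 | 151010 | 29.03 |
| meta | 995589 | 36919 | 3.71 | 958670 | 96.29 | 346857 | 34.84 |
| wikibooks | 403284 | 73749 | 18.29 | 329446 | 81.69 | 12832 | 3.18 |
| wikidata | 268614001 | 1004242 | 0.37 | 267609759 | 99.63 | 137813239 | 51.31 |
| wikinews | 745663 | 8199 | 1.1 | 737464 | 98.9 | 524353 | 70.32 |
| wikipedia | 184107056 | 23172689 | 12.59 | 160920885 | 87.41 | 56838105 | 30.87 |
| wikiquote | 508330 | 99573 | 19.59 | 408757 | 80.41 | 73168 | 14.39 |
| wikisource | 3569858 | 40803 | 1.14 | 3529054 | 98.86 | 807745 | 22.63 |
| wikiversity | 301932 | 25466 | 8.43 | 276451 | 91.56 | 16213 | 5.37 |
| wikivoyage | 543516 | 38542 | 7.09 | 504971 | 92.91 | 41605 | 7.65 |
| wiktionary | 13020120 | 462768 | 3.55 | 12557340 | 96.45 | 8458506 | 64.96 |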
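For anyone checking the figures, the proportion columns appear to be each count divided by the total number of edits, expressed as a percentage with two decimals. Here is a minimal Python sketch of that arithmetic using the wikiquote counts from the table. The `proportion` helper is a name made up for this example, and the formula is an assumption inferred from the numbers rather than taken from the analysis notebooks.

```python
def proportion(part: int, total: int) -> float:
    """Return part/total as a percentage rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * part / total, 2)


# Counts for the wikiquote group, copied from the table above.
total_edits = 508330
ip_edits = 99573
registered_edits = 408757
bot_edits = 73168

ip_pct = proportion(ip_edits, total_edits)
registered_pct = proportion(registered_edits, total_edits)
bot_pct = proportion(bot_edits, total_edits)

print(ip_pct, registered_pct, bot_pct)
```

Since the IP and registered proportions already sum to roughly 100, the bot column looks like it overlaps with the registered column instead of forming a third disjoint group, but that is an inference from the numbers.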

I'm next planning on splitting this up by group and into data for each wiki within each group, which I'll start on Monday.

Looking good, @nettrom_WMF! Good idea. I set up a page on Meta. You can make a section on it and put the data there.

@Niharika : I've updated the page on meta with stats split up by project group (wikisource, wikibooks, etc), and for each language within that group. In addition to monthly averages for number of IP edits, I've also added min/max percentages of IP contributions for each of the 12 months in the dataset, so it's possible to see to what degree the proportion of IP edits varies across a year. I think that should be sufficient to answer the first question in this task.
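As a rough illustration of how the monthly summaries could be derived, the sketch below computes the average number of IP edits per month and the min/max monthly IP percentage for each wiki. The function name, the input layout, and the numbers for `examplewiki` are all made up for this example; it is not the code from the analysis notebooks.

```python
from collections import defaultdict


def summarize_monthly_ip_share(rows):
    """Summarize monthly IP activity per wiki.

    rows: iterable of (wiki, month, ip_edits, total_edits) tuples.
    Returns {wiki: {"avg_ip_edits", "min_pct", "max_pct"}}.
    """
    by_wiki = defaultdict(list)
    for wiki, _month, ip_edits, total_edits in rows:
        if total_edits > 0:  # skip months without any edits
            by_wiki[wiki].append((ip_edits, 100 * ip_edits / total_edits))

    summary = {}
    for wiki, months in by_wiki.items():
        counts = [count for count, _ in months]
        shares = [share for _, share in months]
        summary[wiki] = {
            "avg_ip_edits": sum(counts) / len(counts),
            "min_pct": round(min(shares), 2),
            "max_pct": round(max(shares), 2),
        }
    return summary


# Made-up numbers for a hypothetical wiki, three months only.
example_rows = [
    ("examplewiki", "2018-09", 120, 1000),
    ("examplewiki", "2018-10", 90, 1200),
    ("examplewiki", "2018-11", 200, 1000),
]
summary = summarize_monthly_ip_share(example_rows)
print(summary["examplewiki"])
```

A wide gap between the minimum and maximum percentage would suggest that a single yearly proportion hides seasonal variation or one-off spikes.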

I'm planning on tackling the other question next. For that, I'm thinking about selecting a few projects (one large, one medium, one smaller) and looking at a couple of months of data. I'm unsure whether it'll be possible to identify blocks other than those that target a specific IP, as the usage of range blocks might differ from wiki to wiki and I'm unsure how feasible it is to match IPs against range blocks. I'd also like to use wikis where ORES is available so we can use it to learn whether reverts were reasonable.

Let me know if there are questions or concerns about any of this.

Here's another update: I looked into matching IPs against blocks, and I can do this across the entire dataset in the Data Lake instead of using the replicated databases. This means that I should be able to identify the proportion of IP edits that were subsequently blocked for all wikis that we have data for. Will continue working on that next week.
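For reference, checking one IP against a range block comes down to a containment test against a CIDR range. Below is a minimal sketch using Python's standard `ipaddress` module. The `is_blocked` helper and the block targets are hypothetical (the addresses come from the reserved documentation ranges), and this is not the Data Lake query used for the analysis.

```python
import ipaddress


def is_blocked(ip, blocked_targets):
    """Return True if ip matches any block target.

    blocked_targets may mix single addresses ("192.0.2.7") and
    CIDR ranges ("192.0.2.0/24"), for IPv4 or IPv6.
    """
    addr = ipaddress.ip_address(ip)
    for target in blocked_targets:
        # strict=False accepts ranges written with host bits set.
        network = ipaddress.ip_network(target, strict=False)
        if addr.version == network.version and addr in network:
            return True
    return False


# Hypothetical block list: one IPv4 range, one single IP, one IPv6 range.
blocks = ["192.0.2.0/24", "198.51.100.23", "2001:db8::/32"]

print(is_blocked("192.0.2.55", blocks))    # inside the IPv4 range
print(is_blocked("198.51.100.23", blocks)) # exact single-IP block
print(is_blocked("203.0.113.9", blocks))   # not covered by any block
```

A real analysis would also have to compare the edit timestamp with the block's start and expiry, which this sketch leaves out.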

Awesome, thanks @nettrom_WMF! That would be great.

@Niharika : I've updated the Research page on Meta with data on reverts, revdeletions, and subsequent blocks across all projects.

I've considered following up the revert analysis by using ORES predictions to understand whether IP edits appear to have warranted being reverted. As far as I understand this has already been done on English Wikipedia by a research team at the University of Washington, so it would mainly be to see whether those results carry across to other Wikipedias.

Another potential follow-up analysis is to look at edit sessions rather than single edits. As the numbers stand, it's easy to interpret them incorrectly because a revert-block scenario is likely caused by multiple edits, meaning the "% reverted blocked" doesn't mean "% IPs blocked". For example, once an IP is blocked it's perhaps lower cost to go through their edit history to identify edits to revert than it is to go through 10 individual IP edits.
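To make the session idea concrete, here is a minimal sketch that groups one IP's edit timestamps into sessions, starting a new session whenever the gap since the previous edit exceeds a cutoff. The one-hour cutoff is an assumption chosen for illustration and is not specified in this task; the function name and timestamps are also made up.

```python
from datetime import datetime, timedelta


def group_sessions(timestamps, cutoff=timedelta(hours=1)):
    """Group edit timestamps into sessions split by inactivity gaps.

    A new session starts when the gap since the previous edit
    is longer than cutoff (one hour here, an assumed value).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= cutoff:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Made-up edit times for a single hypothetical IP.
edits = [
    datetime(2019, 3, 1, 10, 0),
    datetime(2019, 3, 1, 10, 20),
    datetime(2019, 3, 1, 10, 50),
    datetime(2019, 3, 1, 14, 0),
]
sessions = group_sessions(edits)
print([len(session) for session in sessions])
```

Counting reverts and blocks per session instead of per edit would keep a single burst of ten reverted edits from being weighted the same as ten separate incidents.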

At this point, I see both of these as "nice to have" analyses as I'm not sure they would provide any meaningful input to product decisions. Let me know if they do. And please also let me know what questions you might have about the data.

Thanks @nettrom_WMF. I'll go over the research results and set up a meeting to follow up with you. I'd appreciate your input on the product decision, as someone who has gathered data about this and also worked in close collaboration with researchers thinking about this problem.

@Niharika : Is any additional work needed on this task? If not, I was thinking that it can be closed as resolved.

Yes, we can close this. For posterity, the next step @nettrom_WMF and I discussed was to find 1-2 small wikis that are willing to experiment with selectively disabling IP edits, to gauge the effects of that on the wiki over some time period. This was also posted on wiki.

As I was looking up the code for this task to inform T302941, I noticed that there's no reference to it on this task. For future reference, notebooks are here: https://github.com/wikimedia-research/AHT-IP-edits-2019