
Make ConfirmEdit's addurl smarter by checking whether similar URLs are already used on the wiki
Open, Low, Public

Description

From https://lists.wikimedia.org/pipermail/wikitech-l/2014-December/079767.html ("[Wikitech-l] Our CAPTCHA is very unfriendly" by Robert Rohde):

I suspect we could weed out a lot of spammy link behavior by designing an external link classifier that used knowledge of what external links are frequently included and what external links are frequently removed to generate automatic good / suspect / bad ratings for new external links (or domains). Good links (e.g. NYTimes, CNN) might be automatically allowed for all users, suspect links (e.g. unknown or rarely used domains) might be automatically allowed for established users and challenged with captchas or other tools for new users / IPs, and bad links (i.e. those repeatedly spammed and removed) could be automatically detected and blocked.
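
As a rough illustration of the rating Robert describes, here is a minimal Python sketch (not MediaWiki code) that classifies a domain from hypothetical add/removal counts; the function name and thresholds are invented for illustration, not tuned against real data:

```
def rate_domain(times_added, times_removed, min_observations=20):
    """Classify an external-link domain as 'good', 'suspect' or 'bad'
    based on how often links to it were added and later removed.
    Thresholds are illustrative only."""
    if times_added < min_observations:
        # Too little history to judge: challenge new users, allow established ones.
        return "suspect"
    removal_rate = times_removed / times_added
    if removal_rate >= 0.8:
        return "bad"      # repeatedly spammed and cleaned up
    if removal_rate <= 0.1:
        return "good"     # widely used and rarely reverted
    return "suspect"

print(rate_domain(500, 15))   # 'good': widely used, rarely removed
print(rate_domain(40, 38))    # 'bad': almost every addition was reverted
print(rate_domain(3, 0))      # 'suspect': too few observations to judge
```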

Event Timeline

MZMcBride raised the priority of this task to Needs Triage.
MZMcBride updated the task description.
MZMcBride added a project: SpamBlacklist.
MZMcBride changed Security from none to None.
MZMcBride subscribed.

This may be a duplicate, but in any case, I think it's a great idea. We already index external link domains, so establishing a threshold (for example, if the same domain name is already used on fifty pages on the wiki, don't trigger a CAPTCHA) should be fairly simple.
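
For example, a minimal sketch of that check, with an in-memory dictionary standing in for the externallinks data (the data structure, helper name, and the threshold of fifty are purely illustrative):

```
# Hypothetical stand-in for the externallinks table: domain -> set of page IDs.
# On a real wiki this would be a COUNT(DISTINCT el_from) style query instead.
EXTERNAL_LINKS_BY_DOMAIN = {
    "nytimes.com": set(range(1, 3000)),
    "example-new-site.tld": {42},
}

def should_skip_captcha(domain, threshold=50):
    """Skip the addurl CAPTCHA when the domain already appears on at least
    `threshold` distinct pages (fifty, per the suggestion above)."""
    return len(EXTERNAL_LINKS_BY_DOMAIN.get(domain, set())) >= threshold

print(should_skip_captcha("nytimes.com"))           # True: widely used already
print(should_skip_captcha("example-new-site.tld"))  # False: still challenge
```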

MZMcBride renamed this task from "Make SpamBlacklist smarter by checking whether similar URLs are already used on the wiki" to "Make ConfirmEdit's addurl smarter by checking whether similar URLs are already used on the wiki". Dec 10 2014, 3:30 AM
MZMcBride removed a project: SpamBlacklist.

This task may be too tied to ConfirmEdit and may be too broad to be actionable. I'm equivocating. :-(

MediaWiki and its commonly used anti-spam extensions currently take an all-or-none approach to domains. That's probably the broader issue here.

There was talk of implementing logging for when people hit CAPTCHAs. I'm not sure where that ended up, but it would probably be helpful to have in order to examine when CAPTCHAs appear and whether we can mitigate them.

As for blocking users or disallowing links, that's AbuseFilter territory, I imagine.

Changing the focus of new external links from precise URL matches to domain matches would be good. However, the domain is not indexed in https://www.mediawiki.org/wiki/Manual:Externallinks_table, so a new field might need to be added to make this scale well.

I believe the 'new link CAPTCHA' is triggered on any link that doesn't yet appear in the database. I think many sites would be happy to switch to 'new domain', allowing links to new webpages on existing domains to be added to the wiki without a CAPTCHA.
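
To make the domain-match idea concrete, here is a small sketch of deriving a per-domain key from a URL, loosely modelled on the reversed-host convention MediaWiki uses for link indexing; the exact format and any new column name are assumptions, not the real externallinks schema:

```
from urllib.parse import urlparse

def domain_index_key(url):
    """Reduce a URL to a reversed-host key such as 'org.wikipedia.en.' so
    that all subdomains of a site sort together and can be matched with a
    prefix query. The format is illustrative only."""
    host = urlparse(url).hostname or ""
    return ".".join(reversed(host.split("."))) + "."

print(domain_index_key("https://en.wikipedia.org/wiki/Spam"))
# -> 'org.wikipedia.en.'
print(domain_index_key("http://sub.example-spam.biz/buy-now"))
# -> 'biz.example-spam.sub.'
```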

It would also be useful to add a datestamp to the external links table if it is going to be used to whitelist other URLs, so that if a new domain gets through the spam prevention measures, it takes a few days before that domain becomes automatically 'acceptable' to the other spam prevention measures. (Sorry if I am suggesting functionality that already exists somewhere.)
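
A minimal sketch of that grace period, assuming a first-seen timestamp were recorded per domain (the timestamp field and the three-day window are hypothetical, not an existing setting):

```
from datetime import datetime, timedelta, timezone

# Illustrative grace period; not an existing MediaWiki or ConfirmEdit setting.
GRACE_PERIOD = timedelta(days=3)

def domain_is_established(first_seen, now):
    """Only treat a domain as 'already in use' once it has survived on the
    wiki longer than the grace period, so a freshly spammed domain does not
    immediately whitelist further links to itself."""
    return now - first_seen > GRACE_PERIOD

now = datetime.now(timezone.utc)
print(domain_is_established(now - timedelta(hours=6), now))  # False: too new
print(domain_is_established(now - timedelta(days=30), now))  # True
```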

I've started running some simple tests to explore this a bit more. Is there a natural place to discuss link statistics and behavior? Somewhere on MediaWiki.org? I worry that putting too much detail here will be a distraction.

Nemo recently created https://www.mediawiki.org/wiki/Extension:ConfirmEdit/FancyCaptcha_experiments. In addition to mediawiki.org, there's wikitech.wikimedia.org and meta.wikimedia.org. But any of them is fine and you can pick any page title you want.

To get the details right, I'd like to verify my understanding of how the CAPTCHA functions with regard to URLs on Wikimedia sites.

I believe the "addurl" captcha mode is triggered each time an anon or new user (not autoconfirmed) adds a new link to any page. As I understand it, duplicating existing URLs on the same page doesn't trigger it, only links that are new to the page. However, no consideration is given to whether the same link may appear on other pages, only whether a link is new to the current page. Also, bot flagged accounts are exempt.

There is a capacity in the code to limit "addurl" triggers to certain namespaces, but I don't believe that Wikimedia generally uses this restriction.

I believe our current autoconfirmed threshold on most wikis is 4 days and zero edits, though enwiki and a dozen other wikis also require a minimum number of edits (ranging from 10 to 50) in addition to the 4 days. We also have a generic rate limit for edits (not related to adding links) that I believe is 8 edits/min for anons and new editors.

Does that description look right?

I'd take this suggestion a step further and propose that the 'rating' also be based on the external sites' contents.

FYI, there is now a new domain index on the external links table.
I don't think that is particularly useful, as domains likely get listed a lot in the process of spam fighting as well.