
Add captcha for IPs which are generating much monotonous traffic
Open, Needs Triage, Public

Description

Hello, as an idea for how to combat T232992, I read about the suggestion to add a captcha for the three affected articles.

My idea is to recognize when an IP makes a lot of calls to only a few articles. For that, a formula calculates a number (e.g., naively, call_number_of_last_hour divided by called_lemmata_of_last_hour).
If this number is over, let's say, 50 (50 calls to one article, or 100 calls to two articles), a captcha is shown before the page is served. If it is solved, the call counts and the captcha is not displayed for the next X calls.
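Roughly what that trigger could look like (a minimal sketch in Python; the data structure, threshold, and function names are made up for illustration and are not existing MediaWiki code):

```lang=python
from collections import defaultdict

THRESHOLD = 50  # calls per distinct article before a captcha is required

# Hypothetical per-IP log: recent_views[ip] is the list of article titles
# requested by that IP during the last hour.
recent_views = defaultdict(list)

def needs_captcha(ip: str) -> bool:
    """Return True if the IP's traffic looks monotonous enough to challenge."""
    calls_last_hour = len(recent_views[ip])
    articles_last_hour = len(set(recent_views[ip])) or 1
    # e.g. 50 calls to 1 article, or 100 calls to 2 articles, both give a ratio of 50
    return calls_last_hour / articles_last_hour >= THRESHOLD
```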

Event Timeline

Der_Keks created this task. Sun, Nov 3, 9:36 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript. Sun, Nov 3, 9:36 PM
Der_Keks renamed this task from "Add captcha for IPs which are generating much monotype traffic" to "Add captcha for IPs which are generating much monotonous traffic". Sun, Nov 3, 9:37 PM

@Der_Keks: How and where to display a captcha for an API call executed via a script outside of a browser?

The first solution would be to allow access only via an API token, but that's not what Wikipedia stands for. So I would only count human HTTP requests; the API would be exempted from both the most-viewed counter and the captcha.

@Der_Keks: What are "human HTTP requests"? How do you recognize whether a request sent from a device is "manual"/human or automated?

So far I don't see anything technically feasible in this task.

Der_Keks added a comment. Edited Mon, Nov 4, 3:39 PM

@Aklapper maybe we don't mean the same thing by "API".

For me, an API is https://en.wikipedia.org/w/api.php
If this route is used, then it's used by a robot == robot action.
If https://en.wikipedia.org/wiki/Main_Page is used, then it's usually requested by a user == human action.

Based on this theory we could simply say that only https://en.wikipedia.org/wiki/.* is counted for most viewed and restricted through a captcha as proposed.

But maybe you mean that a robot could request https://en.wikipedia.org/wiki/Main_Page and after the 50th try it would run into the captcha. For that we could simply add a GET parameter like "robot=1" to disable both the captcha AND the counting towards most viewed.
Example:

https://en.wikipedia.org/wiki/Main_Page -> with captcha restriction and most-viewed indexing
https://en.wikipedia.org/wiki/Main_Page?robot=1 -> without captcha but also without most-viewed indexing
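A rough sketch of how the request handling could branch on such a parameter, reusing the needs_captcha() sketch above (the request object and the helper functions are hypothetical names, not actual MediaWiki code):

```lang=python
def handle_page_view(request, ip: str, title: str):
    if request.args.get("robot") == "1":
        # Declared bot traffic: no captcha, but also not counted for most viewed.
        return render_article(title)

    if needs_captcha(ip) and not captcha_recently_solved(ip):
        # Challenge the caller before serving the page.
        return show_captcha(next_url=request.url)

    record_most_viewed_hit(title)  # only human-looking traffic is counted
    return render_article(title)
```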

So the only change is that bot operators would need to add the robot parameter. And if I understand correctly, the goal is to only show human requests in most viewed, right?

Der_Keks added a subscriber: Superbass. Edited Mon, Nov 4, 7:45 PM

See also T232992#5632739. It is important!

For me, an API is https://en.wikipedia.org/w/api.php
If this route is used, then it's used by a robot == robot action.
If https://en.wikipedia.org/wiki/Main_Page is used, then it's usually requested by a user == human action.

@Der_Keks: What makes you think so? I might sometimes manually query the API. I might sometimes run a CLI script to wget or curl the rendered HTML page, hence no user interface that would allow me to enter a captcha anywhere. No "robot" involved, but maybe a manually run script. :)

Yeah, the magic word for me is "sometimes". Of course a few techies will still love reading articles via the API, but what we want to show the other users on the main page is not those requests. We want to show which articles interest society. So I don't think that's an argument, and we can neglect this <0.00001% case (not counting the requests via the API).
You can still run curl or wget, but you'll be limited to 50 requests (if you request the same lemma again and again). But why would you do that within one hour as a "normal" user? And suppose you do have such a case: then it's not a "normal" user case, and if you want to do more than 50 requests nobody will stop you if you use ?robot=true, but then it's a non-human case, and we want to show which articles interest society.

So to sum up, your wget-more-than-50-times cases and human-issued API requests are not important for the goal we want to reach: to show society which articles are interesting. On the contrary: it improves the most-viewed index by filtering out, say, 200 scripted wget requests that would corrupt the "what interests society" picture.

An additional idea: if the captcha isn't entered correctly, it can be skipped and the article still served, just without counting the request.
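A small sketch of that "skip on failed captcha" idea, continuing the hypothetical helpers from the earlier sketches (none of these names are real MediaWiki functions):

```lang=python
def after_captcha(ip: str, title: str, captcha_passed: bool):
    if captcha_passed:
        grant_captcha_free_calls(ip)    # e.g. skip the captcha for the next X calls
        record_most_viewed_hit(title)   # a solved captcha makes the view count
    # On failure the article can still be rendered; the request simply
    # does not count towards the most-viewed index.
    return render_article(title)
```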