
Add captcha for IPs which are generating much monotonous traffic
Closed, Declined · Public

Description

Hello, as an idea for how to combat T232992, I read a suggestion to add a captcha for the 3 affected articles.

My idea is to recognize when an IP makes a lot of calls to only a few articles. For that, a formula calculates a number (e.g., naively defined as call_number_of_last_hour divided by called_lemmata_of_last_hour).
If this number is over, let's say, 50 (50 calls to one article, or 100 calls to two articles), a captcha is shown before the call. If it is solved successfully, the call counts and the captcha is not displayed for the next X calls.
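
A minimal sketch of that heuristic (hypothetical, in-memory; the tracker, the exemption count X, and all names are assumptions, not existing MediaWiki code):

```python
# Tracks per-IP page calls over the last hour and decides whether to show a captcha.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # "last hour"
THRESHOLD = 50          # calls-per-distinct-article ratio that triggers the captcha
EXEMPT_CALLS = 20       # X: calls allowed after a solved captcha (assumed value)

calls = defaultdict(deque)            # ip -> deque of (timestamp, article)
exempt_remaining = defaultdict(int)   # ip -> calls left before the captcha may reappear

def needs_captcha(ip: str, article: str) -> bool:
    now = time.time()
    log = calls[ip]
    log.append((now, article))
    # Drop entries older than one hour.
    while log and now - log[0][0] > WINDOW_SECONDS:
        log.popleft()

    if exempt_remaining[ip] > 0:
        exempt_remaining[ip] -= 1
        return False

    call_number_of_last_hour = len(log)
    called_lemmata_of_last_hour = len({a for _, a in log})
    ratio = call_number_of_last_hour / called_lemmata_of_last_hour
    return ratio > THRESHOLD

def captcha_solved(ip: str) -> None:
    # After a successful captcha, skip the check for the next X calls.
    exempt_remaining[ip] = EXEMPT_CALLS
```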

Event Timeline

Der_Keks renamed this task from Add captcha for IPs which are generating much monotype traffic to Add captcha for IPs which are generating much monotonous traffic. Nov 3 2019, 9:37 PM

@Der_Keks: How and where would a captcha be displayed for an API call executed via a script outside of a browser?

The first solution would be to allow access only via API token, but that's not what Wikipedia stands for, so I would just count human HTTP requests; the API would be exempted from the most-viewed counter and from the captcha.

@Der_Keks: What are "human HTTP requests"? How do you recognize whether a request sent from a device is "manual"/human or automated?

So far I don't see anything technically feasible in this task.

@Aklapper maybe we don't mean the same thing by "API".

For me, an API is https://en.wikipedia.org/w/api.php
If this route is used, then it's used by a robot == robot action
If https://en.wikipedia.org/wiki/Main_Page is requested, then it's usually requested by a user == human action

Based on this theory, we could simply say that only https://en.wikipedia.org/wiki/.* is counted for most viewed and restricted through a captcha, as proposed.

But maybe you mean that a robot could request https://en.wikipedia.org/wiki/Main_Page and after the 50th try it would run into the captcha. For that, we could simply add a GET parameter like "robot=1" to disable the captcha AND the counting towards most viewed.
Example:

https://en.wikipedia.org/wiki/Main_Page -> with captcha restriction and most-viewed indexing
https://en.wikipedia.org/wiki/Main_Page?robot=1 -> without captcha but also without most-viewed indexing

So the only change is that bot owners would need to add the robot argument. And if I'm right, the goal is to only show human requests in the most-viewed list, right?
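
A rough sketch of how that classification could look (hypothetical helper, not existing MediaWiki code; only the two example URLs above come from the comment):

```python
# api.php traffic and requests carrying "robot=1" are exempt from the captcha
# and from most-viewed counting; everything under /wiki/ counts as human-facing.
from urllib.parse import urlparse, parse_qs

def classify_request(url: str) -> dict:
    parsed = urlparse(url)
    is_api = parsed.path.startswith("/w/api.php")
    is_declared_robot = parse_qs(parsed.query).get("robot", ["0"])[0] == "1"
    human_facing = parsed.path.startswith("/wiki/") and not is_declared_robot
    return {
        "is_api": is_api,                        # api.php traffic: never captcha'd, never counted
        "captcha_eligible": human_facing,        # may be challenged after the threshold
        "count_for_most_viewed": human_facing,   # only human-facing views are counted
    }

# Examples from the comment above:
# classify_request("https://en.wikipedia.org/wiki/Main_Page")
#   -> captcha_eligible=True,  count_for_most_viewed=True
# classify_request("https://en.wikipedia.org/wiki/Main_Page?robot=1")
#   -> captcha_eligible=False, count_for_most_viewed=False
```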

For me, an API is https://en.wikipedia.org/w/api.php
If this route is used, then it's used by a robot == robot action
If https://en.wikipedia.org/wiki/Main_Page is requested, then it's usually requested by a user == human action

@Der_Keks: What makes you think so? I might sometimes manually query the API. I might sometimes run a CLI script to wget or curl the rendered HTML page - hence no user interface that would allow me to enter some captcha somewhere. No "robot" involved, but maybe a manually run script. :)

Yeah, the magic word for me is "sometimes". Of course a few techies will still love to read articles via the API, but what we want to show the other users on the main page is not those requests. We want to show which articles interest society. So I think that's not an argument and we can neglect this <0.00001% case (not counting the requests made via the API).
You can still run curl or wget, but you'll be limited to 50 requests (if you request the same lemma again and again). But why would you do that within one hour as a "normal" user? And suppose you do have such a case: then it's not a "normal" user case, and if you want to make more than 50 requests nobody will stop you if you use ?robot=true, but then it's a non-human case, and we want to show which articles interest society.

So to sum up, your wget-more-than-50-times cases and human-issued API requests are not important for the goal we want to reach: to show society which articles are interesting. On the contrary, the proposal improves the most-viewed index by filtering out 200 scripted wget requests that corrupt the "what interests society" picture.

An additional idea: if the captcha isn't entered correctly, the user can skip it, but then the request isn't counted.
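
A small sketch of that counting rule (hypothetical function and outcome values, purely illustrative):

```python
def record_view(captcha_outcome: str) -> bool:
    """captcha_outcome: 'not_required', 'solved', or 'skipped_or_failed'.
    Returns True if the view should count towards most viewed."""
    if captcha_outcome in ("not_required", "solved"):
        return True
    # Skipped or failed captcha: the page can still be served,
    # but the view does not count towards the most-viewed list.
    return False
```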

CC'ing MusikAnimal as I'd love to have more insight / opinions here. In my understanding this task should be declined in favor of T123442: Pageview API: Better filtering of bot traffic on top endpoints.

Der_Keks closed this task as Declined (edited). Nov 19 2019, 5:07 PM

@Aklapper Originally I wanted to wait until the current poll on dewiki is finished. The currently running survey on dewiki shows a clear aversion to the whole "beliebt" (popular) section in the app. They want to see the feature scrapped!
Instead of waiting for the result, and since you're pushing me to act, I'm closing the task as declined (by the community).

In my opinion, T123442 will not succeed against bot attacks. It has been open for almost 4 years, has seen no movement for a year, and needs much more work than my solution (which is also complicated).

@Aklapper Originally I wanted to wait until the current poll on dewiki is finished.

I dare say that does not look like the best survey design to me, but anyway. :) Furthermore, if you try to apply the most popular opinion by pleasing the most vocal people, you often end up with a wrong answer. (Disclaimer: I mostly wrote that linked page.)

Well, the most vocal people are the ones who have the power in a democracy :)

I believe pageviews go off of the web request logs (basically after the page is viewed), compiled once a day, so I don't know that we could conditionally show captchas based on this metric. I haven't checked yet, but it may also be that the pages listed at T232992 are getting distributed traffic that doesn't come from a single IP. We saw something similar for T158071.

I think T236121: Trending articles is showing pages that had fake traffic should be the focus for now. The mobile apps use a separate API for the top articles, and hopefully we can tweak that for the short-term. T123442 is the upstream bug about the pageviews data itself, which probably needs a more robust, long-term solution.