Copyvio detection tool cannot use Google search engine
Closed, Resolved (Public)

Description

Hello,

I'd like to report that https://tools.wmflabs.org/copyvios has issues with the Google search engine (exhausting the search limit perhaps?).

Maintainer: @Earwig

Tagging Community-Tech per T125459, which is also related to search engine problems and Copyvio detection.

Related: T194541

Event Timeline

Urbanecm created this task. May 1 2018, 7:38 PM
Restricted Application added a project: User-Urbanecm. May 1 2018, 7:38 PM
Restricted Application added a subscriber: Aklapper.
Urbanecm triaged this task as Unbreak Now! priority. May 1 2018, 7:38 PM

See the task's description.

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. May 1 2018, 7:38 PM
Urbanecm moved this task from Backlog to Watching on the User-Urbanecm board. May 1 2018, 7:40 PM
Niharika added a subscriber: Niharika. Edited May 1 2018, 7:44 PM

Edit: I thought this was for CopyPatrol. Copyvios does use the Google search engine.

Maybe you're thinking of CopyPatrol? :) When in "search mode", Copyvios does use Google: https://tools.wmflabs.org/copyvios/?lang=en&project=wikipedia&title=Hanksy&oldid=&action=search&use_engine=1&use_links=1&turnitin=0

This tool is maintained by Earwig. I'm not sure how much we can do?

@Niharika If the tool doesn't use Google, then there's a mistake in the UI itself. See the attached screenshot.

I corrected my comment a split second later. :) Sorry about that.

@MusikAnimal If "we" means the Wikimedia Foundation, you can have a look at the remaining credits/queries/whatever Google uses for limiting usage (as this service is paid for by WMF, according to T125459). If "we" means Community Tech, I have no idea either.

The quota limit is 10k queries per day (Pacific Time). We've reached our quota for the day. It's not very frequent as apparent from -

So about once a week, apparently. News to me, but sounds about right...

This is unfortunate, but there’s nothing we can do about it, afaik. 10k queries is at most 1250 articles checked a day, less than one a minute on average. It doesn’t allow for a very high tool usage rate.
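The arithmetic behind those numbers can be checked with a short sketch. The 10k quota and the 1250-article ceiling come from the comment above; the figure of 8 queries per article is only inferred by dividing one by the other and is not stated anywhere in this thread, and the function names are made up for illustration.

```python
# Figures quoted in the comment above.
DAILY_QUERY_QUOTA = 10_000      # Google queries allowed per day
MAX_ARTICLES_PER_DAY = 1_250    # upper bound on articles checked per day
MINUTES_PER_DAY = 24 * 60


def queries_per_article(quota=DAILY_QUERY_QUOTA, articles=MAX_ARTICLES_PER_DAY):
    """Minimum queries per article implied by the two quoted figures (an inference)."""
    return quota / articles


def articles_per_minute(articles=MAX_ARTICLES_PER_DAY, minutes=MINUTES_PER_DAY):
    """Average number of articles that can be checked per minute at the ceiling."""
    return articles / minutes


if __name__ == "__main__":
    print(f"queries per article: {queries_per_article():.1f}")
    print(f"articles per minute: {articles_per_minute():.3f}")
```

At the ceiling this works out to a bit under 0.87 articles per minute, which matches the "less than one a minute" estimate.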

Wish Google gave a better error message. Suppose we can add our own.

MusikAnimal lowered the priority of this task from Unbreak Now! to High. May 1 2018, 10:55 PM

I'm going to lower this to high priority because we're definitely not going to have this addressed by tomorrow (Pacific Time), and by then you'll be able to use the search feature again. I am told we are going to look into increasing the quota.

Unfortunately, we are already at the maximum allowed quota for Google API queries, so there's no easy solution. I've mentioned this to Dan Foy to see if we can get Google to help us work around it somehow.

If this helps, it seems like there was a spike between approximately 11:30 pm and 1:30 am, when ~1.5 requests were made per minute.

Maybe it will help to do some sort of throttling on Earwig's tool - for example to not allow more than one request per minute from an IP or something to that effect.
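A rough sketch of what such a per-client throttle could look like, assuming the tool had some stable client identifier to key on. The class name, the `client_key` parameter and the in-memory dictionary are all hypothetical; nothing here is taken from the actual tool.

```python
import time


class PerClientThrottle:
    """Allow at most one request per client within a minimum interval."""

    def __init__(self, min_interval_seconds=60.0, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock            # injectable so the logic can be tested
        self._last_allowed = {}       # client_key -> time of last allowed request

    def allow(self, client_key):
        """Return True and record the request if the client may proceed now."""
        now = self.clock()
        last = self._last_allowed.get(client_key)
        if last is not None and now - last < self.min_interval:
            return False              # too soon since this client's last request
        self._last_allowed[client_key] = now
        return True


if __name__ == "__main__":
    throttle = PerClientThrottle(min_interval_seconds=60.0)
    print(throttle.allow("client-a"))   # first request from this client
    print(throttle.allow("client-a"))   # immediate repeat from the same client
```

The dictionary grows with the number of distinct clients, so a real deployment would need some expiry of old entries, and the whole approach depends on having a trustworthy identifier in the first place.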

Earwig added a comment. May 2 2018, 3:14 AM

I don't have access to request IPs on Toolforge. Other methods of tracking are creepy/error-prone (or maybe even disallowed?), and I don't want logging in to be required, so it's difficult.

That said, we can certainly try more intelligent throttling if we bake the 10,000 request limit into the tool: for example, we can have a stronger throttle the more requests happen a day, etc, to make sure the quota is spread out. However, it probably won't be very fair, and it's still going to result in unresponsiveness at the user's end.
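One way the quota-aware idea could be sketched: compare how much of the daily quota has been used against how much of the day has passed, and only start delaying requests once usage runs ahead of an even schedule. This is a minimal illustration under assumed names (`throttle_delay`, `used_today`, `seconds_elapsed`), not how the tool actually works, and an even spread is just one possible policy.

```python
SECONDS_PER_DAY = 86_400


def throttle_delay(used_today, seconds_elapsed, daily_quota=10_000):
    """Return the seconds to wait before the next query, or None if the quota is spent.

    used_today      -- queries already made since the quota last reset
    seconds_elapsed -- seconds since the quota last reset (0 <= value <= 86400)
    """
    remaining_queries = daily_quota - used_today
    if remaining_queries <= 0:
        return None  # nothing left until the next reset

    # Queries an evenly paced day would have used by now.
    expected_used = daily_quota * seconds_elapsed / SECONDS_PER_DAY
    if used_today <= expected_used:
        return 0.0   # on or behind schedule: no throttling needed

    # Ahead of schedule: spread the remaining queries over the remaining time.
    remaining_seconds = SECONDS_PER_DAY - seconds_elapsed
    return remaining_seconds / remaining_queries


if __name__ == "__main__":
    # Halfway through the day with 9,000 of 10,000 queries already used.
    print(throttle_delay(used_today=9_000, seconds_elapsed=43_200))
```

As noted in the comment above, this kind of pacing does not distinguish between users, so a heavy user still slows everyone else down once the schedule is exceeded.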

I have no expectations of Google being particularly generous here, but if they decide to raise it, that would be an excellent solution.

@Earwig_alt Why not require logins? As it stands right now, a bot could exhaust that limit in a few hours. That would be a significant loss, given how extensively this tool is used by our community members. CopyPatrol requires login for similar reasons and we've not had any complaints about that.

Maybe throttling on the proxy side is somehow possible.

MusikAnimal updated the task description. May 14 2018, 1:11 AM
Urbanecm updated the task description. May 18 2018, 7:51 PM
Urbanecm closed this task as Resolved. May 22 2018, 10:05 PM
Vvjjkkii renamed this task from Copyvio detection tool cannot use Google search engine to oudaaaaaaa. Jul 1 2018, 1:12 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from oudaaaaaaa to Copyvio detection tool cannot use Google search engine. Jul 2 2018, 1:56 PM
CommunityTechBot closed this task as Resolved.
CommunityTechBot claimed this task.
CommunityTechBot updated the task description.
CommunityTechBot edited projects, added Tools; removed Hashtags.
CommunityTechBot added a subscriber: Aklapper.