
Investigation: Can we find a new search API for CorenSearchBot and Copyvio Detector tool?
Closed, Resolved · Public · 3 Estimated Story Points

Authored By
kaldari
Feb 1 2016, 10:44 PM

Description

The Yahoo BOSS API is being discontinued at the end of March. That API is currently used by CorenSearchBot and Copyvio Detector tool. Are there any other search engines that would provide discounted or gratis API access for these tools? Are there contacts at Google, Yahoo, or Yandex that we could talk to?

Event Timeline

If our usage remained at 300,000 queries per month [...]

IIRC these were the numbers for my tool alone. Coren's bot is also very important, and we'd be sharing usage. I can do some work on reducing the queries per check and doing more intelligent query extraction, but that's very very difficult without a large corpus that'll let me test what generates good hits and what doesn't. I don't see how building such a corpus (let alone running tests on it) is feasible, so whatever I end up doing will likely degrade result quality.
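
To illustrate the kind of query reduction being discussed, here is a rough sketch (not the copyvio detector's actual code; the chunking and scoring heuristics are assumptions) of searching only the most distinctive sentence-sized chunks of an article:

```python
# Illustrative sketch only -- not the copyvio detector's actual code.
# Idea: instead of searching every chunk of the article, rank chunks by
# how "distinctive" they look and only search the top few.
import re

def extract_chunks(text, max_words=12):
    """Split article text into sentence-sized chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [" ".join(s.split()[:max_words]) for s in sentences if len(s.split()) >= 5]

def score_chunk(chunk):
    """Crude heuristic: chunks with more distinct words make better queries."""
    return len(set(chunk.lower().split()))

def select_queries(text, budget=4):
    """Pick at most `budget` chunks to send to the search engine."""
    chunks = extract_chunks(text)
    return sorted(chunks, key=score_chunk, reverse=True)[:budget]
```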

Some community members have expressed opposition to us spending any money on a commercial 3rd party search service.

This is very reasonable and it's always bothered me. But until we get a working free or non-commercial alternative, I'd rather have this tool than nothing. Besides, we've been doing it with Yahoo for years.

How does it look to you guys?

Well...

e. Prohibitions on Content. Unless expressly permitted by the content owner or by applicable law, you will not, and will not permit your end users or others acting on your behalf to, do the following with content returned from the APIs:

  1. Scrape, build databases, or otherwise create permanent copies of such content, or keep cached copies longer than permitted by the cache header;
  2. Copy, translate, modify, create a derivative work of, sell, lease, lend, convey, distribute, publicly display, or sublicense to any third party;

I'm not sure.

In the meantime, do either of you have any ideas for what to do until July?

Faroo and Entireweb exist, but they look pretty suspicious.

FWIW, the Yandex Terms seem a lot more flexible than Microsoft's [...]

I'm not sure. Seems like a violation is a violation.

If we are going down that road, I'll look into scraping Google. But it'll take a while.

It looks like the copyvio tool is returning 403 errors now. Is this related to the issue with Bing that is mentioned on the tool's page, which links to this ticket?

"Update (2016-04-20): The search tool will likely go down very soon due to issues with Bing and the lack of a suitable replacement." - it then provides a link here.

Yes. Bing was shut off at the end of the month.

Could we keep some kind of statistic how many copyvios were reported, deleted etc. with and without the tool being available? This might be useful later on...

That would be useful.

FYI, we have another meeting with Microsoft tomorrow.

@kaldari Are we still considering google? Was there a budget problem with using their service?

-Toby

AFC reviewers are now being criticised for missing copyvios. Earwig's copyvio tool is part of our standard workflow.

The meeting with Microsoft was a mixed bag. They basically said that they couldn't offer us any modifications or exemptions from their current terms of service, but there's a chance that our use case isn't actually in violation of their Terms. With that in mind, we discussed some specific parts of the Terms:

  • "If Microsoft includes advertising in Bing results, you will not remove, modify, or interfere with the display or viewing of this advertising." – They told us that this wouldn't be an issue since they will never actually include advertising in Bing search results from the API.
  • "You may, on a non-exclusive, non-transferable basis, use the Services to (i) query the Bing API in response to users’ individual intentionally-initiated internet search queries on Properties; and (ii) display Bing results in a search-like experience on the Properties in response to such queries." – Apparently CorenSearchBot would not be in compliance with this (since it is automatic rather than user-initiated and is not a "search-like experience"), but Earwig's Copyvio Detector would actually be OK.
  • "You will not, and will not permit your users or other third parties to: (c) edit, modify, translate, filter, remove, obscure, truncate, or add to or change the order of, or replace the text, images, or other content of Bing results;" – This is the one that is the most likely to be problematic. They said that they would have to consult their lawyers about whether or not our use case would be in compliance. They said that if it isn't, perhaps we could modify our interface to provide an optional view to the full search results for each of the chunks of the article that were searched.

I told them that we need to get this resolved quickly, and they said they would do their best. We have another meeting scheduled for next Monday.

Re point #2, can we argue that CSB is user-initiated on the principle that a user submitting an article implicitly triggers a check? Maybe bury it in an edit notice when you create a page?

Edit: Oh, "search-like experience". Hrm...

@Tnegrin: Yes, we are still considering Google. If nothing pans out with Microsoft on Monday, it seems like it will be the only option still available. And yes, we'll need to talk about funding for it.

How about an impromptu (is that the right word?) fundraising banner (read 'edit notice') and all money generated goes directly to this...😋

That would be a good idea, but it would have to be unobtrusive. Massive banners asking people to give money would be annoying. A sitenotice could be used too.

It's not the money per se - it's just a budget formality of putting the money in the right bucket.

Am I the only one that thinks this is a really bad idea? Advertising our copyvio detection problem any more than it already is with a large scale notice is pretty much the definition of BEANS. The WMF can fund this without asking for additional money.

@Majora, @Josve05a and @tom29739: Don't worry, the Fundraising team has its own schedule, and they're not going to put up banners for a specific line item in the middle of the year. :) Folks are still working on where the money will come from, but we won't need to host an emergency bake sale.

Agreed, we can handle this without panic.

But I like bake sales :D The whole community gets together and tries to scam passersby into buying gross cakes.

Ok, joking aside, I was just thinking that as a last resort, in case nothing pans out anywhere, we (as a community, not the WMF) could host our own "banners" (not saying that the money would go specifically to copyvio tools (BEANS), but e.g. to community-driven projects/tools) and then give all the money to (e.g.) Earwig to pay Google himself.

But, as I said, it was just a strange idea; nothing really serious.

I can imagine you could get into trouble for misrepresenting what you were doing.

Here's what came out of our meeting with Microsoft today. (For those of you just tuning in, Microsoft says that they can't give us an exemption from their Terms of Use, but they want to help us figure out a way to get our tools in conformance with their Terms.)

We mainly discussed the "edit, modify, translate, filter, remove, obscure, truncate, or add to or change the order of" the results clause, since that one is the most problematic. They basically said that it was up to us what part of each result we want to display (i.e. we can show only the title or URL; we don't have to show the snippet, sublinks, etc.), but we have to somehow display or make available all of the search results. I asked if these search results could be hidden behind an additional UI step (such as clicking on a "View full search results" tab or button) and they said that would probably be OK as long as that UI component was apparent to the user (i.e. we were legitimately offering the search results to the user rather than hiding it as a link in 4pt font-size or white text). They said that even if no one ever actually clicked that button or tab, as long as we were offering it we would be in compliance with the Terms. They also asked if they could see a mock-up or wireframe of any implementation of this so that they could run it by their lawyers to double-check.

Also, they reiterated that the searches had to be "user initiated", which would seem to rule out a tool like CorenSearchBot unless it was moved to some kind of Tool Labs interface and generated dynamically (which would be very inefficient).

@Earwig: What do you think of Microsoft's proposal? If it sounds amenable, the Community Tech team could probably help with the actual implementation if needed. This sounds like it might be a way forward for your tool at least (which would save over $10K per year compared with Google and avoid the potential of hitting Google's 10,000 query per day limit).

@Ricordisamoa As a service, it seems fairly limited. Maybe in the future? Is there a timeframe?

@kaldari Technically we already show all search results, but some are sorted to the bottom of the results table if they have a really low match % or are in the exclusions database.

So, is reordering the results and mixing them with links-in-the-page the issue here? We could add a JS button to disable that, in theory...

...But I don't like it. Keep in mind we have an API that would also have to be shut down if the "user-initiated" clause needs to be met. And Coren is still screwed, so we need a solution for that.

Would it still cost $10k if only Coren were using Google?

I don't think the copyvios tool actually takes advantage of any Labs-specific features (IOW, the DB replicas). It might be cheaper for everyone if I self-host it and do some sketchy stuff on my end—like scraping Bing directly—so the Labs folks aren't held responsible.

I've got Yandex up and running for now. I set up a proxy on a personal server, since I can't use the Labs one due to the IP restriction.

It seems to work, though results aren't that good.

I actually have a question. Regarding Bing's terms of service, could we argue that CorenSearchBot is user-initiated, since the act of creating a page, which is entirely user-driven, is what causes CorenSearchBot to run the search?

We could, for example, make the searching an automatic part of actually creating a page, and then claim that it was being initiated manually by the user creating the page. Add a disclaimer to that effect somewhere.

@Compassionate727 It's funny, I asked nearly the exact same question...

Re point #2, can we argue that CSB is user-initiated on the principle that a user submitting an article implicitly triggers a check? Maybe bury it in an edit notice when you create a page?

But even if we successfully argue that, it's still not a "search-like experience", is it?

My snarky comment on the matter is to make it a search-like experience for the bot.

When I started to think about what my serious comment would be, I had a thought about my snarky one. Could we change things so that CSB uses a tool (or program) similar to yours that provides a search-like experience? After all, your tool is a computer program too, and it qualifies as a search-like experience. If you make CSB do the same thing as your tool, then it would be the same experience, and therefore a search-like experience. It doesn't say that the computer has to have the same experience that a user does; they don't have the same part in the matter.

Also, nobody answered my question as to how much Google would cost if only Coren's bot were using it.

And finally, would it be possible to have a bot that simply checks for the things that NP patrollers do when looking for copyright violations, like a lack of inline citations (especially in plots), promotional language, and other things? It's not perfect, but it might be possible. Certainly, ClueBot NG's alarming accuracy gives us hope.

@Compassionate727: If only Coren's bot were using Google, I think it would be significantly cheaper, so we should investigate that option.

@Earwig: In order for us to meet the Bing Terms, we would have to provide some optional interface that showed the full search results for each chunk of the article that was searched (in the original order and without filtering via the exclusions database). So something like a "raw results" view. This interface would probably be completely useless for any purpose other than meeting the Terms of Use. What do you think about the possibility of implementing such a workaround? Would it be worth it for the improvement over Yandex?

@kaldari Probably—it's not a big deal to implement—but what about the API?

@kaldari I've also been taking a look at how valuable the various search results are in practice, and with Google I am able to coalesce some of the searches together using operators, reducing the query count by 50% at the cost of a slight increase in false negative results*. If budget is the sole remaining issue, that would lighten the load quite a bit.

  • [edit] compared to Google; last I checked the result quality of Google vs Yahoo was noticeably in favor of the former.
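
For illustration, here is a sketch of the coalescing idea described above, pairing quoted phrases with Google's OR operator (this is not the bot's actual implementation, and the pairing size is an arbitrary choice):

```python
# Illustrative sketch: combine quoted phrase searches pairwise with OR,
# halving the query count at the cost of some false negatives.
def coalesce_queries(phrases, per_query=2):
    """['foo bar', 'baz qux'] -> ['"foo bar" OR "baz qux"']"""
    queries = []
    for i in range(0, len(phrases), per_query):
        group = phrases[i:i + per_query]
        queries.append(" OR ".join('"%s"' % p for p in group))
    return queries
```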

@Ricordisamoa As a service, it seems fairly limited. Maybe in the future? Is there a timeframe?

https://about.commonsearch.org/roadmap

@kaldari @Earwig As a consumer of both tools, I can say there have been times where looking at the raw search results would have been useful, to know what was searched and what wasn't from the article text. So if that helps meet Bing's TOS, it wouldn't be wasted effort from my POV.

@Earwig: It looks like everyone thinks the Bing workaround is a good idea. Would you want to build such an interface or would you like for Community Tech to build it?

Is anyone gonna answer my question first?

@Earwig: Honestly I have no idea if Microsoft would consider your API to be a violation of their Terms. I guess I'll have to ask them about that.

I sent an email to our contacts at Microsoft asking about whether or not the API for the copyvio tool would affect our compliance with the Terms of Use.

Reply from Microsoft: "We don’t allow sub-syndication of our API, so unfortunately the Wikipedia API you described would not meet the TOU requirements." I guess it's back to Google then.

How important is this API? Do we have any usage statistics? @Earwig

I've made a request for interim funding for using Google's Search API. Hopefully I'll have an API key available by next week.

@eranroz @Earwig What's an estimate of how many queries Eranbot and copyvios send in a day? With Google, we'll have a 10,000 query/day limit.

This question was asked and answered above for me. I don't think Eranbot uses anything besides Turnitin; did you mean CSB? At the moment, it looks like usage has dropped from the previous estimate, perhaps because people are less satisfied with the current quality of results. Ballpark is between 1,000 and 4,000 per day.

Yeah, I've pretty much stopped using the tool for now. Hopefully you get something useful up soon.

Eranbot uses only Turnitin.
Based on the last few days, it does around 1,400 queries per day.
Notes:

  • Turnitin is a service for catching copyright violations, unlike Google/Yahoo/Bing, which are general-purpose web search engines. This means that we can post the whole added text to be processed in Turnitin (while with general web search I guess we search for randomly sampled sentences).
  • The search is done for each diff/edit with a small delay. If we make this delay longer, we can aggregate multiple edits to the same article into one search (a rough sketch follows below). Since Turnitin kindly provided us with many credits, we aren't really limited in the number of queries.
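
A rough sketch of that aggregation idea: hold edits briefly and flush one combined check per article. The submission itself is left as a placeholder; this is not Eranbot's actual code and does not show Turnitin's real API.

```python
# Illustrative sketch only -- not Eranbot's actual code.
import time
from collections import defaultdict

HOLD_SECONDS = 30 * 60          # how long to hold edits before checking
pending = defaultdict(list)     # article title -> [(timestamp, added_text), ...]

def record_edit(title, added_text):
    pending[title].append((time.time(), added_text))

def flush_due_articles():
    """Combine all held edits per article into a single submission."""
    now = time.time()
    for title, edits in list(pending.items()):
        if now - edits[0][0] >= HOLD_SECONDS:
            combined = "\n".join(text for _, text in edits)
            submit_check(title, combined)   # placeholder, not Turnitin's real API
            del pending[title]

def submit_check(title, text):
    raise NotImplementedError("placeholder for the real Turnitin submission")
```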

@eranroz Oh, good -- so we don't have to worry about Eranbot, at least. Thanks!

WMF Legal is reviewing the Google API terms of service.

The copyvio text has been deleted so I can't really investigate this.

WMF Legal has approved using the Google API. I'll see about getting this set up next week.

OK, I have a Google API key, but it's currently only good for testing (it's limited to 100 queries per day). I'm working with Finance to set up a billing account so that we can increase it to 10,000 queries per day.

Documentation for the API can be found at https://developers.google.com/custom-search/json-api/v1/using_rest.
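
For reference, a minimal request against that API might look like the sketch below. It uses only the documented key, cx, q, and num parameters; the key and search engine ID values are placeholders, and the proxy setup discussed next is not included.

```python
# Minimal sketch of a Custom Search JSON API request (placeholder credentials).
import requests

API_KEY = "YOUR_API_KEY"              # from the Google API Console
ENGINE_ID = "YOUR_SEARCH_ENGINE_ID"   # the "cx" of the custom search engine

def google_search(query, num=5):
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": query, "num": num},
        timeout=10,
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]
```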

@bd808: The API key is restricted to a set of IP addresses, but I can add as many IP addresses as I like. We'll probably want to set up a single-IP proxy like we did for Yandex so that we don't have to constantly maintain this list.

Hopefully that will be easy. Can you open a new task to track that?

OK, billing is set up. I set the daily quota for the API to the maximum (10,000 queries). However, there is also another quota I didn't know about. Apparently the API is restricted to 100 queries per 100 seconds for each end user. I hope that isn't going to be a problem for Copyvio Detector (@Earwig).
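
For illustration, a client-side throttle respecting both limits could look roughly like the sketch below (not part of the deployed tools; resetting the daily counter when the quota window rolls over is left out):

```python
# Illustrative throttle for the quotas above: 10,000 queries/day overall
# and 100 queries per 100 seconds per end user. Not the tools' actual code.
import time
from collections import defaultdict, deque

DAILY_LIMIT = 10000
WINDOW_LIMIT = 100       # per end user
WINDOW_SECONDS = 100

daily_count = 0
user_windows = defaultdict(deque)    # user -> timestamps of recent queries

def allow_query(user):
    """Return True if a query for `user` fits within both quotas."""
    global daily_count
    now = time.time()
    window = user_windows[user]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if daily_count >= DAILY_LIMIT or len(window) >= WINDOW_LIMIT:
        return False
    window.append(now)
    daily_count += 1                 # reset elsewhere when the daily quota rolls over
    return True
```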

This is ready to be used now. Anyone that needs the Google API key, please ping me on IRC or here.

API documentation:
https://developers.google.com/custom-search/json-api/v1/using_rest#making_a_request

Proxy documentation:
https://wikitech.wikimedia.org/wiki/Nova_Resource:Google-api-proxy

Google works, but unfortunately, it seems we are having some issues with the results themselves.

Example from one of my test cases: the sentence "Some other kinds of replacements are known, but impossible." taken from my website.

Regular search finds it fine (https://www.google.com/#q=%22Some+other+kinds+of+replacements+are+known%2C+but+impossible.%22), but the equivalent API search finds no results.

Even for more common terms, I'm getting strange output. "earwig" doesn't even give me Wikipedia in the top five, but three results from a pest control website and two youtube videos...

There doesn't seem to be much in the way of API documentation on result quality, so I can't see what we're doing wrong here.

http://stackoverflow.com/a/20182732/1189399

Each Google Search API instance is connected with a custom Google Search engine. The search engine that I created for the API is set to search the entire web. It was also restricted to the schema.org page type "CreativeWork", which includes articles, blogs, books, datasets, webpages, and websites according to http://schema.org/docs/full.html. I just removed all filtering from the search engine. See if that makes any difference.

It seems to at least be returning Wikipedia articles in the results now.

Yes, it looks good now. Cheers.

@Earwig: So as you know Google puts a maximum quota on each API instance of 10,000 queries per day. Yesterday we hit 8,281 queries. As people discover that the copyvio detector is back up and running, I'm worried we're going to exceed the quota and it's going to start breaking on people. If you have any ideas for reducing the number of queries, now would be a good time to implement them. You had mentioned a couple of ideas before: talking to the Germans about using the API less, and maybe changing how the articles are divided into chunks. Also MusikAnimal mentioned that maybe we could remove the "copyvios" link from the G12 speedy deletion template on en.wiki if we needed to reduce traffic to the tool, but I would prefer we try other options first.

There is redundancy between the bots. Can we not rerun stuff that Eranbot has already flagged?

James

Actually, I was super confused and it looks like the Copyvios link in the G12 speedy template doesn't do a full search, but a raw comparison of the article with the source. So it doesn't use the Google API, correct?

There are two links:

"This article may meet Wikipedia's criteria for speedy deletion as a copyright infringement (Copyvios report)" and the "(Duplication Detector report · Copyvios report)" one after each URL. The first just does a regular check, but the second pair do comparisons. We can probably kill the first link as long as a URL is provided in the template, since it's not usually necessary.

API usage has declined significantly since yesterday afternoon. It fell off around 5 p.m. Pacific time and has remained low since then. If the current rate remains steady, I don't think we'll be in danger of hitting the daily quota.

Hopefully that's not related to AuthManager going live on the group1 wikis around that time (2016-06-08T22:26Z).

@kaldari According to my logs, (human) tool usage has remained normal, but API usage completely stopped after Jun 8 at ~22:45 UTC — does this match with your info? If so, it would indicate that the German API users are responsible for the high usage rate. I don't know why they would suddenly stop using it, though, so we can't assume anything.

I also noticed they are using the API with noskip=True, which instructs the tool to always make as many queries as possible even if it finds a suspected hit, which they really shouldn't be doing in an automated process.

The last logged hit from CSB on the WP:SCV page was 01:21, 8 June 2016 (UTC). So either the bot is not running, or is not hitting the API properly/successfully. This may account for the dropoff in API hits?

I also noticed they are using the API with noskip=True, which instructs the tool to always make as many queries as possible even if it finds a suspected hit, which they really shouldn't be doing in an automated process.

Sounds like we need to talk to them and let them know that the search API is a limited resource. Especially so if we plan on using it for Plagiabot in the not-so-distant future.

API usage yesterday was only 1,186 queries :)

@kaldari Did the German bots really use that much of it?

API usage yesterday was only 1,186 queries :)

I think this is because Coren's bot is not running, or its API queries are being rejected.

@Compassionate727 For example, I know that MerlBot is using it. It automatically checks new articles.

@Crow: That's weird. Coren's bot was running just a couple days ago. No idea why it would have stopped.