
Indexing of https://www.wikidata.org in the Yandex Search Engine
Closed, ResolvedPublic

Description

Hello!

I’m a representative of the leading search engine in Russia, Yandex LLC (http://www.yandex.ru). Yandex is the most popular and traffic-rich web property on the Russian Internet, with more than 24 million unique users weekly.

We think that the content of the site https://www.wikidata.org is very important and would be very useful for the users of our search engine, so we would like to index as many of its pages as possible. Could you please tell us how many requests per second our crawler can make without being blocked? We would like to make at least 10 RPS, and more if possible.

All of our bots use distinctive User-Agents. You can see the list here:
https://yandex.ru/support/webmaster/robot-workings/check-yandex-robots.html?lang=en

If you tell us the conditions under which a crawler will not cause overload while indexing sites on your servers, we will try to make the necessary adjustments. Downloading the Wikidata dumps might not help in this situation, as we need to crawl the pages as a user sees them.
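A minimal sketch of the kind of throttled crawl described above, assuming an illustrative 10 RPS budget and a placeholder User-Agent string (both are stand-ins, not agreed limits):

```python
# Hypothetical throttled fetch loop: spaces requests so the crawler never
# exceeds a fixed rate and identifies itself with a distinct User-Agent.
# TARGET_RPS and USER_AGENT are illustrative values, not confirmed limits.
import time
import requests

USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info; contact@example.com)"
TARGET_RPS = 10
MIN_INTERVAL = 1.0 / TARGET_RPS

def fetch(urls):
    session = requests.Session()
    session.headers["User-Agent"] = USER_AGENT
    last = 0.0
    for url in urls:
        # Sleep just long enough to stay at or below TARGET_RPS.
        wait = MIN_INTERVAL - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        resp = session.get(url, timeout=30)
        yield url, resp.status_code, resp.text
```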

Event Timeline

What exactly about the dump does not work for you?

@PlatonSchukin: Hi and thanks for reaching out! There is https://www.mediawiki.org/wiki/API:Etiquette#Request_limit which does not mention any limits; however, digging into commits I found https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/241643/ which says 50 requests per second (someone who's more into API stuff, please correct me if I misinterpret that commit). You may want to ask on the API mailing list if you don't get a reply here in the next few days.
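As a hedged illustration of the API:Etiquette guidance (serial requests plus the maxlag parameter), a request loop might look like the sketch below; the entity id and contact address are placeholders:

```python
# Sketch of the maxlag convention recommended on API:Etiquette: send maxlag=5
# and back off when the API reports replication lag instead of hammering it.
import time
import requests

API = "https://www.wikidata.org/w/api.php"
HEADERS = {"User-Agent": "ExampleBot/1.0 (contact@example.com)"}

def get_entity(entity_id):
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "format": "json",
        "maxlag": 5,  # ask the servers to refuse the request when they are lagged
    }
    while True:
        resp = requests.get(API, params=params, headers=HEADERS, timeout=30)
        data = resp.json()
        if data.get("error", {}).get("code") == "maxlag":
            # Servers are lagged; wait and retry, honouring Retry-After if present.
            time.sleep(int(resp.headers.get("Retry-After", 5)))
            continue
        return data

# Example usage: fetch the English label of Q42.
print(get_entity("Q42")["entities"]["Q42"]["labels"]["en"]["value"])
```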

@Ghuron: Indeed, static dumps exist for wikidatawiki, but if the API is what's preferred, so be it, I'd say. :)

Anomie subscribed.

This has nothing to do with the API itself, it's a question about use of the API. So I'm going to remove MediaWiki-Action-API. It may be that the best way to index wikidata.org isn't to use the API at all, but to use dumps and/or other feeds.

Since Traffic would probably be the ones instituting a block if there were too many requests, that is probably a good place to start. They might also be more familiar with how other search engines do indexing of our sites.

Downloading the Wikidata dumps might not help in this situation, as we need to crawl the pages as a user sees them.

Noting that they want to crawl the user-facing pages (which will, in many cases, hopefully already be cached).
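For the dumps-and-feeds route mentioned above, a sketch of streaming the Wikidata JSON entity dump could look like the following; it assumes the conventional latest-all.json.gz location and the one-entity-per-line layout, so verify the current path on dumps.wikimedia.org before relying on it:

```python
# Sketch of consuming the Wikidata JSON entity dump instead of crawling pages.
# Assumes one JSON entity per line inside a top-level JSON array.
import gzip
import json
import urllib.request

DUMP_URL = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz"

def iter_entities(url=DUMP_URL):
    with urllib.request.urlopen(url) as raw, \
         gzip.open(raw, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue  # skip blanks and the surrounding array brackets
            yield json.loads(line)

# Example usage: print the id and English label of the first entity.
for entity in iter_entities():
    print(entity["id"], entity.get("labels", {}).get("en", {}).get("value"))
    break
```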

jbond triaged this task as Medium priority.Mar 5 2019, 1:09 PM
BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!

BCornwall claimed this task.
BCornwall subscribed.

This ticket is quite old and a few answers have been given:

There is no hard and fast limit on read requests, but be considerate and try not to take a site down. Most system administrators reserve the right to unceremoniously block you if you do endanger the stability of their site.

The rate limiting as implemented can also be observed here.
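As a sketch of staying on the right side of that rate limiting, a client could back off whenever the servers answer with HTTP 429 and honour any Retry-After header; the specific status codes and headers used by the WMF edge are an assumption here:

```python
# Hedged sketch: retry with back-off when the server signals rate limiting
# via HTTP 429 (Too Many Requests) and an optional Retry-After header.
import time
import requests

def polite_get(url, session=None, max_retries=5):
    session = session or requests.Session()
    session.headers.setdefault("User-Agent", "ExampleBot/1.0 (contact@example.com)")
    for attempt in range(max_retries):
        resp = session.get(url, timeout=30)
        if resp.status_code == 429:
            # Honour the server's advertised back-off, falling back to an
            # exponential delay when no Retry-After header is present.
            delay = float(resp.headers.get("Retry-After", 2 ** attempt))
            time.sleep(delay)
            continue
        return resp
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")
```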

For any further questions, I advise reaching out to WMF via the mailing lists, as that's a more appropriate place to discuss non-actionable items. Thanks!