
Google over-counts en.wikipedia.org pages
Closed, Declined (Public)

Description

Author: info

Description:
If you search site:en.wikipedia.org for a term that occurs on lots of pages, you
get 153,000,000 hits.

This seems out of proportion to the 1,000,000 English Wikipedia pages. It
implies Google's spider is hitting the site too many times, which can affect
performance and distort Google's search results.

The site is doing the right thing with its fine use of robots.txt and
noindex,nofollow meta tags on "Edit this page" and "history".

In my experience Google over-counting can happen with endless addition or
modification of query parameters, or other code that tacks on extra URL cruft.
To track it down you either have to scan the server access logs for unexpected
GET requests from Google's spider, or get a Google Search Appliance in-house to
provide more info.
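
As a rough illustration of the log-scanning approach, here is a minimal sketch in Python. It assumes the access log uses the Apache/Nginx "combined" format and that the crawler identifies itself with "Googlebot" in the User-Agent string (which can be spoofed); the function name and regular expression are illustrative, not part of any existing tool. It groups Googlebot GET requests by path and counts how many distinct query strings were requested for each, so paths with an unexpectedly large number of variants stand out.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumption: Apache/Nginx "combined" log format. Adjust for other formats.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_url_variants(lines):
    """Map each path to the number of distinct query strings Googlebot fetched."""
    variants = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or match.group("method") != "GET":
            continue  # skip unparseable lines and non-GET requests
        if "Googlebot" not in match.group("agent"):
            continue  # only requests claiming to be Google's crawler
        parts = urlsplit(match.group("url"))
        variants[parts.path].add(parts.query)
    return {path: len(queries) for path, queries in variants.items()}
```

Sorting the result by count (for example `sorted(result.items(), key=lambda kv: kv[1], reverse=True)`) puts the paths with the most query-parameter variants at the top, which is where unexpected URL cruft would most likely show up.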

Cheers, just letting you know. I apologize for wasting your time if this is
expected behavior.

(I filed bug 5707 that terms in the footer of all pages like 'privacy' should
not be indexed.)


Version: unspecified
Severity: normal
URL: http://www.google.com/search?hl=en&q=site%3Aen.wikipedia.org%20privacy

Details

Reference
bz5708

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 21 2014, 9:11 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz5708.
bzimport added a subscriber: Unknown Object (MLST).

robchur wrote:

There are a lot more than 1,000,000 pages on the English language Wikipedia.

jeluf wrote:

Google uses sitemaps to index our site, not spiders.

The numbers Google shows are rough guesses and by no means accurate.