
CentralNotice contents appearing in Google search snippets
Closed, ResolvedPublic

Event Timeline

Pcoombe raised the priority of this task to High.
Pcoombe updated the task description. (Show Details)
Pcoombe changed Security from none to None.
Pcoombe subscribed.

Update: all the changes have been pushed from fr-tech, and we're waiting to get re-indexed by google.

In T76743#819639, @atgo wrote:

Update: all the changes have been pushed from fr-tech, and we're waiting to get re-indexed by google.

Mainly, AFAIK:

https://gerrit.wikimedia.org/r/177587 was replaced by a core change and then reverted

Excerpt from #wikimedia-operations

ori> AndyRussG, awight: https://support.google.com/webmasters/answer/66355?hl=en
ori> "Cloaking refers to the practice of presenting different content or URLs to human users and search engines. Cloaking is considered a violation of Google’s Webmaster Guidelines because it provides our users with different results than they expected."
ori> be sure to coordinate, because it sounds like we have multiple fronts here
+K4-713> ori: Definitely many conversations happening at the same time right now. As we're still waiting to hear back from their people, I don't think we want to roll anything back before we actually talk to them.

Copying some comments by @matmarex and @Umherirrender.

Umherirrender said:

See also T36593
I did not like this, because there are so many crawlers on the web which may not use the same comment to ignore content. But maybe for the moment this is a way to go.

Bartosz Dziewoński said:

Same here, I do not like this either. I understand the reasoning (although having a corresponding bug report would be nice…), but I still don't like it. We haven't had to do this before, why do we have to now?
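For context, the "comment to ignore content" mentioned above refers to crawler-specific marker comments wrapped around the notice markup. Google's googleon/googleoff markers are the usual example of that convention, though whether the reverted patch used exactly these is an assumption here, and, as both comments point out, not every crawler honours them. A minimal sketch:

```
<!-- Hypothetical markup: crawlers that honour these markers skip the wrapped
     text when indexing; many other crawlers ignore the markers entirely. -->
<!--googleoff: all-->
<div id="siteNotice">
  <!-- CentralNotice banner is injected here -->
</div>
<!--googleon: all-->
```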

Awight said:

Sorry, I should have commented here that we reverted this change in Ia522064df15b959e22e81abe2c607687ec2a29cb; the new implementation is Id6bf279e590409b3464c363ee16442f4eb0dc186.
Rumor is that Google deployed a new JavaScript engine in the past week; apparently it is capable of rendering our banners.
Another theory is that we changed the way we serve banners, and the new endpoint was not disallowed in metawiki/robots.txt.

https://gerrit.wikimedia.org/r/#/c/177611/

What's the status here, @atgo?

For the record, this is what Adam (FR-Tech) pushed out yesterday:

Stop banners showing up in search results
* {{gerrit|177598}} Make spiders ignore BannerLoader and RecordImpression (sketched in the robots.txt excerpt below)
* {{gerrit|177589}} Prevent Google indexing of the SiteNotice div
* <s>{{gerrit|177669}}</s> (deployed and reverted)
* {{gerrit|177672}}

641cda3e9a8cbe7eefed75f0679057a32e1db0a8 Don't use UNIX timestamp for wgNoticeOldCookieApocalypse config
37473be4b4ca309c1f465e60b95c6f5c51199109 Deprecate old GeoIP HEAD thing
329443f9b3562bc776c6382a09e4552125d21152 No need to quantize throttling any more.
87a5f77199e5b1101a8b5f9de6ad615ef4eaafc2 Move subselects into the main pager query
33ed46b89bd2c1735ff26a85fa004f0c2d2dd79b Simplify Campaign editor banner list
5b4a0802f1cd7046df19cd3d018a86599c7da3de Small fixups
a12c95e830cc8998f0d1aba2339f6611efb36697 Revert "Don't insert banner for bots"

and separately

* 177702 revert admin optimization
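To make the BannerLoader / RecordImpression change above concrete: it amounts to Disallow rules in robots.txt so crawlers stop fetching the banner-serving endpoints, and the banner text can no longer end up rendered into article snippets. The paths below are assumptions for illustration, not the exact deployed rules:

```
# Illustrative robots.txt excerpt (assumed paths, not the deployed rules)
User-agent: *
# Keep crawlers away from the banner-serving endpoints so banner text
# cannot be rendered into, and attributed to, article pages.
Disallow: /wiki/Special:BannerLoader
Disallow: /wiki/Special:RecordImpression
Disallow: /w/index.php?title=Special:BannerLoader
Disallow: /w/index.php?title=Special:RecordImpression
```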

tl;dr fr-tech pushed the necessary changes yesterday and we are waiting for google to tell us if there's something else we need to do. From there, they'll still need to re-index the affected pages.

Longer:

  • All the changes we need to make on our end were made yesterday. The most critical change was one to robots.txt, which will tell the google crawler not to load banner content.
  • Google's cached version of robots.txt [1] has not yet been updated to include the critical changes we made yesterday, which are present in the version that is live now [2] (search for "BannerLoader"). Until google updates their cached version of our robots.txt such that it contains the change we made yesterday, the crawler will continue to index articles with the banner text in them.
  • A quick search suggests very strongly that there isn't anything we can do to force the new version of robots.txt to be used by the google crawler until the cached version expires (a quick check against the live file is sketched after the links below). There are some suggestions that resubmitting a sitemap through Google Webmaster Tools probably makes that go faster, but if we have any of those things set up for English Wikipedia, we definitely haven't heard about it.
  • It is completely unclear to us at this point if our friends at Google are going to manually intervene, and if so, what they are going to do exactly. Presumably, it would start with making sure that the changes we have already made to robots.txt are being used.

[1] - http://webcache.googleusercontent.com/search?q=cache:Uf5lKyfNmTAJ:en.wikipedia.org/robots.txt+&cd=1&hl=en&ct=clnk&gl=us
[2] - http://en.wikipedia.org/robots.txt
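As a quick sanity check on the live file [2] (as opposed to Google's cached copy [1]), the rules can be run through a robots.txt parser. A minimal sketch, with the banner endpoint paths assumed for illustration:

```python
# Check what the *live* robots.txt allows; this says nothing about
# Google's cached copy or when it will be refreshed.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()  # fetch and parse the live file

# Paths assumed for illustration; substitute the real banner endpoints.
for path in ("/wiki/Special:BannerLoader", "/wiki/Special:RecordImpression"):
    allowed = rp.can_fetch("*", "https://en.wikipedia.org" + path)
    print(path, "->", "allowed" if allowed else "disallowed")
```

If those come back disallowed, the remaining wait is purely on Google refreshing its cached robots.txt and re-crawling.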

In T76743#821738, @atgo wrote:
  • A quick search suggests very strongly that there isn't anything we can do to force the new version of robots.txt to be used by the google crawler until the cached version expires. There are some suggestions that resubmitting a sitemap through Google Webmaster Tools probably makes that go faster, but if we have any of those things set up for English Wikipedia, we definitely haven't heard about it.

There are a few ops people who have access to Google Webmaster Tools, IIRC.

It's now returning 8,710,000 results for the "DEAR WIKIPEDIA READERS: You're probably busy, so we'll get right to it. This week we ask our readers to help us. This week we ask our readers to protect our site:en.wikipedia.org" search.

The search results are a little strange. Taking the pages returned for the above search and trying to find them again via Google using the page title returns a mix of results: some pages still return the "DEAR WIKIPEDIA READERS..." excerpt, while others return a sensible, expected excerpt. So it's not just a specific search for the fundraising banner text that returns results featuring the banner text.

Is that an indication that Google is gradually updating our results, or is the increasing page count something to be concerned about?

We have been able to confirm that our fixes (deployed Thursday) solved the issue on our end and that Google picked up the updated robots.txt file on Dec 4. Google is now re-crawling, but this takes time. We've asked contacts at Google to see if there's any way to accelerate.

Searching for 'site:wikipedia.org "Dear Wikipedia readers"' now returns 663,000 results, down from 936,000 yesterday and 1,190,000 on Friday. Woo!

Also, from @Eloquence on wikimedia-l: "Please note that Google uses a distributed index, and depending where you are geographically, and where Google sends you based on server load, you will get inconsistent results from query to query. See this paper for a bit more detail on how these index inconsistencies manifest:

http://cseweb.ucsd.edu/~snoeren/papers/bobble-pam14.pdf"

Just another update - we're down to 318,000 results from the search 'site:wikipedia.org "Dear Wikipedia readers"'. It's been steadily declining since the fix went out a week ago as Google re-indexes.

I'll close this task once the number has gone to 0, hopefully early next week.

In T76743#843129, @atgo wrote:

I'll close this task once the number has gone to 0, hopefully early next week.

@atgo: Any news?

atgo claimed this task.

It looks like this is taken care of and our fix was effective.

It's still possible to get some results as the changes propagate to all of the google servers, but we're confident this will die off shortly.