
Spike: Monitor deployment rolling back our "googleoff" tag
Closed, Resolved | Public | 1 Estimated Story Points

Description

There's some interest in rolling back the change we made during the 2014 Fundraising drive, which prevents Google from indexing CentralNotice banner contents. I haven't been able to find a bug tracker entry for this incident, but it was a big deal at the time. Google's cache of our site was slowly taken over as articles were crawled, and IIRC, at the peak there were at least several hundred thousand bogus pages. We took several approaches to fixing the problem, and I'm not certain which ones were eventually effective. The bad pages took a month to slowly clear out as they were replaced with good content.

One approach we took was to add <googleoff/googleon> tags around the CentralNotice div, on the off chance that this enterprise search technique would also apply to Google's web search. Another approach was to contact a FoaF at Google and ask them to set up a manual trim regex.
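For readers unfamiliar with the technique, here is a minimal sketch of what wrapping a banner container in googleoff/googleon markers looks like. It assumes the HTML-comment form documented for the Google Search Appliance (`<!--googleoff: index-->` / `<!--googleon: index-->`); the task does not say which directive the original patch used, and the `wrap_googleoff` helper and the `centralNotice` div id are illustrative assumptions, not the actual patch.

```python
# Comment markers in the form documented for the Google Search Appliance.
# Assumption: the "index" directive; the task does not state which one was used.
GOOGLEOFF = "<!--googleoff: index-->"
GOOGLEON = "<!--googleon: index-->"


def wrap_googleoff(html_fragment: str) -> str:
    """Surround an HTML fragment with googleoff/googleon comment markers."""
    return f"{GOOGLEOFF}{html_fragment}{GOOGLEON}"


# Hypothetical banner container; the real markup may differ.
banner = '<div id="centralNotice"></div>'
wrapped = wrap_googleoff(banner)
print(wrapped)
```

Because the markers are plain HTML comments, browsers ignore them and readers see the same page; whether Google's web crawler honors them is exactly the open question in this task.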

The argument for keeping the googleoff/on change is simply that reverting it has a high likelihood of causing the same glitch again, and our search results will go crazy.

As best as I can represent it, the argument for reverting the change is that Google might be penalizing us for "cloaking" our pages by showing different content to the indexing bot, vs regular readers. Evidence so far is http://webmasters.stackexchange.com/questions/54735/can-you-use-googleon-and-googleoff-comments-to-prevent-googlebot-from-indexing-p (and earlier thread http://webmasters.stackexchange.com/questions/16390/preventing-robots-from-crawling-specific-part-of-a-page ).

Patch in question is https://gerrit.wikimedia.org/r/#/c/293911/ , please don't deploy until a fr-tech member is available to monitor Google results for 48 hours.

Event Timeline

Change 293911 had a related patch set uploaded (by Awight):
Revert "Prevent Google indexing of the CentralNotice div"

https://gerrit.wikimedia.org/r/293911

Andrew brought up another candidate for what actually stopped the bad indexing: we disallowed Special:BannerLoader in robots.txt.

This could also be considered cloaking, however.
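As a rough illustration of the robots.txt approach, the sketch below uses Python's standard `urllib.robotparser` to check whether a crawler would be allowed to fetch a banner-loader URL. The rule path `/wiki/Special:BannerLoader`, the example URLs, and the `is_crawlable` helper are assumptions for illustration; the task only says that Special:BannerLoader was disallowed, not how the rule was written.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: the task only says Special:BannerLoader was disallowed,
# not the exact path that appears in the production robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wiki/Special:BannerLoader
"""


def is_crawlable(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the sample robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Example URLs are made up for the sketch.
print(is_crawlable("https://en.wikipedia.org/wiki/Special:BannerLoader?banner=Example"))
print(is_crawlable("https://en.wikipedia.org/wiki/Cat"))
```

A check like this only shows what a well-behaved crawler is permitted to request; it says nothing about what Google has already cached.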

In T137761, @awight wrote:
As best as I can represent it, the argument for reverting the change is that Google might be penalizing us for "cloaking" our pages

It would be weird for Google to implement a feature and then penalize web-masters for using it. My rationale was simpler: voodoo code (code which is believed to do something but doesn't do anything) is worse than useless, because it's misleading. It's hard to reason about how Google will interact with our site, so it's important not to make it even harder by multiplying entities and variables beyond necessity.

Change 293911 merged by jenkins-bot:
Revert "Prevent Google indexing of the CentralNotice div"

https://gerrit.wikimedia.org/r/293911

Is this deploy slot doable: Thursday 2016-07-07 4-5 pm PDT, in the evening SWAT deploy?

The 48-hour monitor period runs into a weekend...

This was deployed on Wednesday, July 13. I think Fundraising banners have been up relatively consistently since then...? Here are a few searches on Google Israel. So far, looks good! I don't see any evidence of banners getting into search result summaries.

https://www.google.co.il/#q=%D7%97%D7%AA%D7%95%D7%9C+wikipedia
https://www.google.co.il/#q=pokemon+wikipedia
https://www.google.co.il/#q=%D7%99%D7%A9%D7%A8%D7%90%D7%9C+wikipedia
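The spot-check links above are percent-encoded queries of the form "<term> wikipedia". A small sketch of how such links could be generated for a repeatable monitoring pass follows; the `search_url` helper is hypothetical, and the URL shape simply mirrors the links in this comment.

```python
from urllib.parse import quote_plus, unquote_plus


def search_url(query: str, host: str = "www.google.co.il") -> str:
    """Build a search link in the same shape as the ones pasted above."""
    return f"https://{host}/#q={quote_plus(query)}"


# Decode one of the pasted queries, then rebuild the link from it.
decoded = unquote_plus("%D7%97%D7%AA%D7%95%D7%9C+wikipedia")
rebuilt = search_url(decoded)
print(rebuilt)
print(search_url("pokemon wikipedia"))
```

The results still have to be inspected by hand for banner text in the summaries; the script only saves retyping the queries.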

DStrine set the point value for this task to 1. (Jul 20 2016, 10:42 PM)

I ran the following search on Google Brazil, with no results. English WP is running Wiki Loves Rio there.

site:en.wikipedia.org cats "upload photos of the olympics"

So this seems safe to close! Yay!!!