
Spurious Amazon clicks / Banners on googleweblight.com
Closed, Resolved | Public | 2 Estimated Story Points

Description

In drupal.contribution_tracking I noticed a much higher number of visits to the Amazon form coming from our mobile banners than expected, with few leading to a completed donation. Strangely the same wasn't happening on iPad or desktop targeted banners.

Digging in, I found that most had the referrer googleweblight.com/. This appears to be a conversion service Google provide for slow connections (documentation). Visiting https://googleweblight.com/?lite_url=https://en.wikipedia.org does in fact bring up a broken banner, and clicking anywhere on it takes me to the Amazon form.

Without knowing anything about how Google is munging the code to break it so badly, I think the best option is to just hide banners viewed through this site for now. I can do this using CSS, but thought tech might want to investigate a better solution.

(erring on the side of caution by making this private, feel free to open it up if you think that's okay)

Event Timeline

Ooh, it's even worse. The conversion they do somehow makes the entire page link to the Amazon form, moving the link outside the CentralNotice div. I have emailed googleweblight@google.com for guidance, and cc'd fr-tech.

Pcoombe triaged this task as High priority. Dec 12 2016, 3:33 PM
Pcoombe added a subscriber: AndyRussG.

This is still a problem. Google haven't replied, and I've not been able to fix using CSS or javascript within the banner because of how badly the code gets mangled.

@AndyRussG Can we just fully disable CentralNotice when the user is on googleweblight.com? I think that's going to be the easiest solution.

DStrine subscribed.

I moved this back into our triage column. We'll look at it today.

@AndyRussG I tried adding the following js, but it didn't work. The banner still gets displayed.

if ( document.location.hostname === 'googleweblight.com' ) {
    mw.centralNotice.bannerData.hideResult = true;
}

if (!mw.centralNotice.bannerData.hideResult) {
    fundraisingBanner.showLargeBanner();
}

Another reason not to show them a banner: they're getting geolocated to the US, when it looks from x-forwarded-for headers like a large percentage of them are actually in other countries.

I think we're not running any of our JS in the browsers of end-users when they see pages munged like this. So, we have to block somehow when Google is scraping our content. We could do this server-side or client-side (i.e., in whatever browser or spider Google is using)... It's just that we need to know how to detect it, and if it's client-side, ensure the detection doesn't slow things down for everyone else.

From the doc:

If you do not want your pages to be transcoded, set the HTTP header "Cache-Control: no-transform" in your page response. If Googlebot sees this header, your page will not be transcoded.

Dunno if we'd want to block transformations for the entire site... We could easily add this header when we serve banners from Special:BannerLoader, though. I don't know if that would make Google not load the banner, or somehow munge it less... Or maybe it wouldn't make any difference at all.
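
To make the header idea concrete, here is a minimal sketch of adding no-transform only to Special:BannerLoader responses. It is an illustration, not MediaWiki code: the helper name addNoTransform, the example URL and the existing Cache-Control value are all assumptions, and whether the proxy would then skip or merely change its handling of the banner is still unknown.

```javascript
// Sketch only: addNoTransform is a hypothetical helper, not MediaWiki code.
// It returns a copy of the response headers with "no-transform" appended to
// Cache-Control, but only for Special:BannerLoader requests.
function addNoTransform(url, headers) {
  const result = Object.assign({}, headers);
  if (url.indexOf('Special:BannerLoader') === -1) {
    // Not a banner request: leave the headers as they were.
    return result;
  }
  const existing = result['Cache-Control'];
  if (!existing) {
    result['Cache-Control'] = 'no-transform';
  } else if (!/\bno-transform\b/i.test(existing)) {
    // Keep whatever directives were already set and append ours.
    result['Cache-Control'] = existing + ', no-transform';
  }
  return result;
}

// Illustrative URL and header value (assumptions, not taken from production).
console.log(addNoTransform(
  '/w/index.php?title=Special:BannerLoader',
  { 'Cache-Control': 'public, max-age=3600' }
));
```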

@Pcoombe Could you try hiding with a condition like this, maybe? Thanks!!!!


if ( navigator.userAgent.indexOf( 'googleweblight' ) !== -1 ) {
   // ...hide...
} else {
   // ...don't hide...
}

Also, I wouldn't be surprised if it takes some time for a change to appear... I don't know how often Google scrapes content for this (if that's what they do). (I tried viewing a test site on a VPS through googleweblight, but it showed a blank page, and I didn't see any requests from Google in the server log.)

In beacon/impression server logs for December 7, I found 2,652,987 requests (quite a lot!!! :( ) with the following user agent string:

Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19

We should probably take a more hardened route and add a Varnish rule to block this UA for Special:BannerLoader... What do you think?
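
As a rough illustration of the matching logic such a rule would need, here is a sketch run against the user-agent string logged above. The function names and the example URLs are hypothetical, and a real rule would be written in Varnish configuration rather than JavaScript.

```javascript
// Sketch only: names are hypothetical, and the real rule would live in
// Varnish configuration rather than in JavaScript.
const WEBLIGHT_TOKEN = 'googleweblight';

// True when the user-agent string carries the proxy's token.
function isGoogleWebLight(userAgent) {
  return typeof userAgent === 'string' && userAgent.indexOf(WEBLIGHT_TOKEN) !== -1;
}

// True when a request should be refused: it targets Special:BannerLoader
// and comes from the proxy. Other pages stay readable through the proxy.
function shouldBlockBanner(url, userAgent) {
  return url.indexOf('Special:BannerLoader') !== -1 && isGoogleWebLight(userAgent);
}

// User-agent string as seen in the beacon/impression logs.
const loggedUserAgent =
  'Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) ' +
  'AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) ' +
  'Chrome/38.0.1025.166 Mobile Safari/535.19';

// Illustrative URLs (assumptions, not copied from the logs).
console.log(shouldBlockBanner('/w/index.php?title=Special:BannerLoader', loggedUserAgent)); // true
console.log(shouldBlockBanner('/wiki/Barack_Obama', loggedUserAgent)); // false
```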

awight raised the priority of this task from High to "Unbreak Now!". Dec 13 2016, 7:04 PM

UBN'ing due to the huge number of people having a bad experience.

Okay, hiding based on userAgent appears to be working in a test banner on aa.wikibooks.org. I'll roll this out to our other banners and hopefully we'll start to see a drop in impressions/Amazon clicks.

@Pcoombe Fantastic, thanks!!! We are also working on a server-side Varnish-based option... But, I imagine the in-banner option will roll out faster, so I think going ahead with that is great :)

> I tried viewing a test site on a VPS through googleweblight, but it showed a blank page, and I didn't see any requests from google in the server log

Correction: I was using the wrong URL. Now, I am seeing a request in the server log for every view through the service...

Hmmm... I added the same code

if ( navigator.userAgent.indexOf( 'googleweblight' ) !== -1 ) {
   mw.centralNotice.bannerData.hideResult = true;
}

to the mobile banners currently up in Big English about an hour ago. While this worked on aa.wikibooks.org, it doesn't seem to be making a difference yet at https://googleweblight.com/?lite_url=https://en.m.wikipedia.org. Faking the googleweblight user agent for my browser at https://en.m.wikipedia.org does hide the banner, so it doesn't look like an issue with caching on our end.

Pcoombe changed the visibility from "Custom Policy" to "Public (No Login Required)". Dec 13 2016, 10:37 PM

Confirmed that the userAgent seen by JS is the same as that sent in the header: https://googleweblight.com/?lite_url=http://ejegg.com/mediawiki/blight.html

UserAgent is: Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19

Change 327043 merged by jenkins-bot:
Add googleweblight to JS blacklist

https://gerrit.wikimedia.org/r/327043

Good news, I'm no longer seeing a banner at https://googleweblight.com/?lite_url=https://en.m.wikipedia.org, and the number of entries in contribution_tracking with referrer='googleweblight.com/' has dropped hugely today!

It was pointed out in IRC that certain test links like https://googleweblight.com/?lite_url=https://en.wikipedia.org?country=CA still show a banner. For some reason it seems adding that causes weblight to load the desktop site and banners, where I hadn't put in the useragent check. Will add to those now.

Thanks! We're rolling out the patch to treat weblight as a non-JS browser in an hour and 15 minutes, so we shouldn't need any special code in banners after that.

Change 327250 had a related patch set uploaded (by Ejegg):
Add googleweblight to JS blacklist

https://gerrit.wikimedia.org/r/327250

Change 327250 abandoned by Ejegg:
Add googleweblight to JS blacklist

https://gerrit.wikimedia.org/r/327250

Change 327251 had a related patch set uploaded (by Ejegg):
Add googleweblight to JS blacklist

https://gerrit.wikimedia.org/r/327251

Change 327252 had a related patch set uploaded (by Ejegg):
Add googleweblight to JS blacklist

https://gerrit.wikimedia.org/r/327252

Change 327251 merged by jenkins-bot:
Add googleweblight to JS blacklist

https://gerrit.wikimedia.org/r/327251

Change 327252 merged by jenkins-bot:
Add googleweblight to JS blacklist

https://gerrit.wikimedia.org/r/327252

Mentioned in SAL (#wikimedia-operations) [2016-12-14T19:32:45Z] <thcipriani@tin> Synchronized php-1.29.0-wmf.6/resources/src/startup.js: SWAT: [[gerrit:327252|Add googleweblight to JS blacklist]] T152602 (duration: 00m 41s)

Mentioned in SAL (#wikimedia-operations) [2016-12-14T19:34:20Z] <thcipriani@tin> Synchronized php-1.29.0-wmf.5/resources/src/startup.js: SWAT: [[gerrit:327251|Add googleweblight to JS blacklist]] T152602 (duration: 00m 39s)

Just checking the impression data in Hive... It looks like the in-banner JS fix didn't work, but the core patch did. It's still a little early to confirm the latter. I'll check again in the morning when we have more data.

Regarding the in-banner JS, I would have expected to see impressions with statusCode 5 if the banner had been successfully hidden by that method. @Pcoombe, could you please point me to some versions of banners that included that code? (I think it's not in the banners that are up now? Or maybe I missed something?)

Thanks!!!

> It looks like the in-banner JS fix didn't work, but the core patch did.

Another possibility is that it did work, but it's not setting status for beacon/impression as expected when it hides the banner.

Strangely, I'm seeing a number of waitimps statuses (2.3) mostly below 400 imp/hr, but with two spikes, up to 10600 imp/hr and 8800 imp/hr, at 15 hrs and 19 hrs today, respectively. So, somehow these bots must be storing data between "views". Bizarre! Maybe they store cookies or localstorage values "on behalf" of end users served?

Hmmm, the numbers are down substantially (about 20% of where we were yesterday) but there are still a lot of impressions coming in via this proxy...

| date-hour | BANNER_CANCELED.waitimps | BANNER_SHOWN | Total |
| --- | --- | --- | --- |
| 2016-12-13 00:00 | 350 | 33599 | 33949 |
| 2016-12-13 01:00 | 360 | 51822 | 52182 |
| 2016-12-13 02:00 | 297 | 69610 | 69907 |
| 2016-12-13 03:00 | 253 | 99310 | 99563 |
| 2016-12-13 04:00 | 171 | 113814 | 113985 |
| 2016-12-13 05:00 | 191 | 122873 | 123064 |
| 2016-12-13 06:00 | 169 | 132774 | 132943 |
| 2016-12-13 07:00 | 146 | 134047 | 134193 |
| 2016-12-13 08:00 | 179 | 145211 | 145390 |
| 2016-12-13 09:00 | 202 | 130465 | 130667 |
| 2016-12-13 10:00 | 224 | 136299 | 136523 |
| 2016-12-13 11:00 | 322 | 132398 | 132720 |
| 2016-12-13 12:00 | 337 | 137555 | 137892 |
| 2016-12-13 13:00 | 411 | 174932 | 175343 |
| 2016-12-13 14:00 | 466 | 187772 | 188238 |
| 2016-12-13 15:00 | 416 | 167831 | 168247 |
| 2016-12-13 16:00 | 481 | 155898 | 156379 |
| 2016-12-13 17:00 | 398 | 137099 | 137497 |
| 2016-12-13 18:00 | 369 | 96092 | 96461 |
| 2016-12-13 19:00 | 341 | 59325 | 59666 |
| 2016-12-13 20:00 | 286 | 32850 | 33136 |
| 2016-12-13 21:00 | 382 | 21737 | 22119 |
| 2016-12-13 22:00 | 358 | 19843 | 20201 |
| 2016-12-13 23:00 | 370 | 23232 | 23602 |
| 2016-12-14 00:00 | 318 | 33322 | 33640 |
| 2016-12-14 01:00 | 366 | 53420 | 53786 |
| 2016-12-14 02:00 | 290 | 70821 | 71111 |
| 2016-12-14 03:00 | 449 | 80140 | 80589 |
| 2016-12-14 04:00 | 668 | 87092 | 87760 |
| 2016-12-14 05:00 | 414 | 100702 | 101116 |
| 2016-12-14 06:00 | 257 | 113059 | 113316 |
| 2016-12-14 07:00 | 236 | 102252 | 102488 |
| 2016-12-14 08:00 | 336 | 106920 | 107256 |
| 2016-12-14 09:00 | 353 | 108254 | 108607 |
| 2016-12-14 10:00 | 350 | 110286 | 110636 |
| 2016-12-14 11:00 | 493 | 112475 | 112968 |
| 2016-12-14 12:00 | 548 | 124963 | 125511 |
| 2016-12-14 13:00 | 628 | 154582 | 155210 |
| 2016-12-14 14:00 | 722 | 175960 | 176682 |
| 2016-12-14 15:00 | 10644 | 155499 | 166143 |
| 2016-12-14 16:00 | 8239 | 193937 | 202176 |
| 2016-12-14 17:00 | 350 | 145407 | 145757 |
| 2016-12-14 18:00 | 387 | 104971 | 105358 |
| 2016-12-14 19:00 | 8792 | 15643 | 24435 |
| 2016-12-14 20:00 | 3707 | 6488 | 10195 |
| 2016-12-14 21:00 | 246 | 10924 | 11170 |
| 2016-12-14 22:00 | 216 | 7506 | 7722 |
| 2016-12-14 23:00 | 228 | 6514 | 6742 |
| 2016-12-15 00:00 | 92 | 5904 | 5996 |
| 2016-12-15 01:00 | 58 | 5091 | 5149 |
| 2016-12-15 02:00 | 44 | 5337 | 5381 |
| 2016-12-15 03:00 | 68 | 5191 | 5259 |
| 2016-12-15 04:00 | 40 | 2301 | 2341 |
| 2016-12-15 05:00 | 38 | 1293 | 1331 |
| 2016-12-15 06:00 | 13 | 1459 | 1472 |
| 2016-12-15 07:00 | 3 | 1303 | 1306 |
| 2016-12-15 08:00 | 0 | 1419 | 1419 |
| 2016-12-15 09:00 | 1 | 1731 | 1732 |
| 2016-12-15 10:00 | 0 | 2453 | 2453 |
| 2016-12-15 11:00 | 0 | 3151 | 3151 |
| 2016-12-15 12:00 | 0 | 3923 | 3923 |
| 2016-12-15 13:00 | 11 | 16487 | 16498 |
| 2016-12-15 14:00 | 28 | 30249 | 30277 |
| 2016-12-15 15:00 | 26 | 33644 | 33670 |
| 2016-12-15 16:00 | 7 | 36229 | 36236 |

BTW, I'm not seeing any banners on pages I visit through the proxy anymore...

Here are some of the points discussed in IRC about this issue:

  • A server-side strategy was laid out: a Varnish rule could be added for Special:BannerLoader to return a hard-coded string with a call to mw.centralNotice.handleBannerLoaderError() instead of any banner content for this client.
  • Blocking ChoiceData might have been a more elegant option, in terms of CentralNotice workflow, however that's normally sent to the client all bundled up with other ResourceLoader stuff, so it'd require splitting the Varnish cache on this UA for all RL content... not really feasible.
  • Since none of our JS is running in the end user's browser, it seemed more reasonable to just turn off Javascript for this client.
  • There's an easy way to do this, and we're already doing it for other similar clients (like Opera Mini).
  • Since the WMF is committed to providing a good reading experience on non-JS clients, and in any case, you can't do much besides read articles via googleweblight, this should be fine.
  • There was some discussion among developers on other teams about whether there might not be some advantages to allowing JS to run. One possible issue is Common.js, a wiki page with JS that often modifies the appearance of the wiki.
  • Also, the proxy is a black box from our perspective, so it's really hard to know what it'll do and what will work.
  • Still, we went ahead with blocking JS on googleweblight.
  • If this client-side solution doesn't work, or doesn't work well enough, we can go ahead with the Varnish method, too.
  • An issue with the proposed Varnish method is that it hardcodes a CN-specific JS call in a completely separate codebase. However, the alternative server-side approach, splitting the Varnish cache on this UA just for Special:BannerLoader, would be more complex.

The JS blacklist is working!

Till this morning, requests for https://googleweblight.com/?lite_url=https://en.wikipedia.org/wiki/Barack_Obama were coming back with all the images below the fold deleted (since our mobile site only loads them when the user scrolls down), and with the content div's class changed to "client-js" (which our JS does in browsers we deem worthy).

Now, requests to that URL come back with all images included, and with the content div's class "client-nojs", indicating that we're no longer running all the javascript on the proxy.
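
For readers unfamiliar with the mechanism, the effect described here can be sketched roughly as follows. This is a simplified illustration and not the actual startup.js: the function names and the two-entry blacklist (the two clients named in this task) are assumptions.

```javascript
// Simplified sketch, not the actual startup.js: the names and the
// two-entry blacklist below are assumptions for illustration.
const JS_BLACKLIST = /Opera Mini|googleweblight/;

// Clients on the blacklist are served as if they had no JavaScript.
function isCompatible(userAgent) {
  return !JS_BLACKLIST.test(userAgent);
}

// Class the page would carry for this client: the JS pipeline only swaps
// "client-nojs" for "client-js" when the client is deemed compatible.
function clientClass(userAgent) {
  return isCompatible(userAgent) ? 'client-js' : 'client-nojs';
}

const webLightUserAgent =
  'Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) ' +
  'AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) ' +
  'Chrome/38.0.1025.166 Mobile Safari/535.19';

console.log(clientClass(webLightUserAgent)); // client-nojs
```

Because the check runs before any other script is loaded, a blacklisted client never requests banner code at all, which is why no special handling is needed inside individual banners.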

@Pcoombe, we should be able to take all the special-purpose code out of the banners now.

Thanks for all the help everyone! The random contribution_tracking entries from this referrer are down to practically zero.

@AndyRussG Here's an example banner with the code: https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners/Edit/B1516_121315_en6C_mob_p1_lg_dsn_2016
I didn't set a hideReason since afaik those need to be added server-side too, and didn't know that any of the existing ones was suitable. Would that be an issue? Not that it matters much now anyway.


Thanks much, @Pcoombe! :) That code looks fine. :) It's hard to know what the proxy was doing and why we didn't see results with this approach. As others have said, it's a "black box" for us. Also, it seems they have their own internal caching layers.

(BTW, setting a reason is not required. However, it's also OK to use a new reason string for such cases, I think. The string will be sent back with beacon/impression, and the statusCode should be 5.0, that is, BANNER_LOADED_BUT_HIDDEN.other.)
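
A minimal sketch of what setting a reason alongside the hide flag might look like in banner code. hideResult is taken from the snippets earlier in this task; storing the reason as a hideReason property on the same object, and the 'googleweblight' reason string itself, are assumptions made for illustration.

```javascript
// Sketch only. hideResult appears in the banner code earlier in this task;
// the hideReason property and the 'googleweblight' reason string are
// assumptions for illustration.
function hideForWebLight(bannerData, userAgent) {
  if (userAgent.indexOf('googleweblight') !== -1) {
    bannerData.hideResult = true;
    bannerData.hideReason = 'googleweblight';
  }
  return bannerData;
}

// A plain object stands in for mw.centralNotice.bannerData so the sketch
// runs outside a wiki page.
const bannerData = hideForWebLight({}, 'Mozilla/5.0 (KHTML, like Gecko; googleweblight)');
console.log(bannerData);
```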

Looking at beacon/impression data from December 18 (a few days after the fix was deployed, campaign still up) shows exactly 0 impressions from this proxy! :) So, I'm closing the task... Thanks much, all!! :)

Ejegg set the point value for this task to 2. Feb 17 2017, 6:48 PM