Investigation: workaround for ad blockers on Pageviews Analysis
Closed, Resolved | Public | 3 Estimated Story Points

Description

See T126947 for context. Apparently the popular ad blockers consider /pageviews, /viewcounts, and other similar routes to be serving ads. This includes the WMF pageviews API itself, so we'll have to "shadow" usage of it in order to get past the ad blockers. Ad blockers are very popular, and it appears some wikis are apprehensive about linking to Pageviews Analysis for this very reason.

This is especially a problem for us as the route of the application itself is /pageviews. This means the assets (CSS/JS) are also blocked by default. A known workaround is to:

  • Use a middleman API service with an acceptable route to make requests to the pageviews API for us
  • Also move our assets to a different route
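To make the problem concrete, here is a rough sketch of how a path-based block rule would treat different routes. It assumes plain substring matching on the URL path, which is simpler than real filter-list syntax, and the fragments listed are only the routes named in this description, not the actual rules. The asset and middleman URLs are made up for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical fragments taken from the routes named in this task.
# These are NOT the real filter-list rules, only an illustration.
BLOCKED_PATH_FRAGMENTS = ("/pageviews", "/viewcounts")


def is_probably_blocked(url: str) -> bool:
    """Return True if the URL path contains any blocked fragment.

    Assumption: naive substring matching on the path only.
    """
    path = urlsplit(url).path
    return any(fragment in path for fragment in BLOCKED_PATH_FRAGMENTS)


# The real API route contains /pageviews, so it would be blocked.
api_url = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/"
    "per-article/en.wikipedia/all-access/user/Cat/daily/2016021300/2016030300"
)
# Hypothetical asset path under the tool's own /pageviews route.
asset_url = "https://tools.wmflabs.org/pageviews/application.js"
# Hypothetical middleman route with a neutral name.
middleman_url = "https://tools.wmflabs.org/some-proxy/per-article/en.wikipedia"

for url in (api_url, asset_url, middleman_url):
    print(is_probably_blocked(url), url)
```

Under these assumptions only the middleman route gets through, which is why both the API calls and the assets would need to move.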

I did some quick testing by adding a pageviews API wrapper to one of my Ruby apps. This works, but I worry about the overhead of using Ruby, and my app in particular does not have good caching at the moment. If it did, I could say with some confidence that this wouldn't be much slower than hitting the actual API directly. The assets for Pageviews Analysis could be served from the same place our API wrapper lives.

So my proposal is we look into creating a lightweight PHP app that only acts as an API wrapper, and serves our static assets. This way we could use the default lighttpd server Tool Labs offers, and the assets at least will load just as fast as they do now. For the API wrapper, we'll need some talented PHP developers to put it together, perhaps naming it pv. So the route we'd hit via AJAX from Pageviews Analysis would mimic the actual API and send back the same response, something like:
https://tools.wmflabs.org/pv/per-article/en.wikipedia/all-access/user/Cat/daily/2016021300/2016030300
as opposed to:
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Cat/daily/2016021300/2016030300
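The mapping between the two URLs is mechanical, so the core of the wrapper could be as small as the sketch below. It is written in Python only for illustration (the proposal is for PHP), and the function name and the choice to drop any query string are assumptions, not a settled design.

```python
from urllib.parse import urlsplit

# Upstream base taken from the example URL above.
UPSTREAM_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"
# Wrapper route as proposed above ("pv"); the name is not final.
WRAPPER_PREFIX = "/pv/"


def upstream_url(wrapper_url: str) -> str:
    """Translate a wrapper URL into the matching pageviews API URL.

    Assumption: everything after /pv/ is passed through unchanged,
    and any query string is ignored.
    """
    path = urlsplit(wrapper_url).path
    if not path.startswith(WRAPPER_PREFIX):
        raise ValueError(f"not a wrapper route: {path}")
    return f"{UPSTREAM_BASE}/{path[len(WRAPPER_PREFIX):]}"


print(upstream_url(
    "https://tools.wmflabs.org/pv/per-article/en.wikipedia/"
    "all-access/user/Cat/daily/2016021300/2016030300"
))
```

The wrapper would then fetch that URL and send the response body back as-is.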

Does this seem worthwhile? How difficult would it be to implement the API wrapper? Are there any alternative solutions you can think of?

Possible steps:

  • Move the tool to a different URL?
  • Talk to uBlock about taking this off the list
  • Show a notice to people who can't see part of the page, encouraging them to update their ad blocker

Event Timeline

@Nemo_bis, @MusikAnimal: So far, I have not been able to reproduce this. I'm running AdBlock Plus 2.7.2 in Firefox. I'm subscribed to the "Adblock Warning Removal List" and the "EasyList".

@kaldari same here, using AdBlock Plus on Chrome and Firefox. I believe @Niharika said she was able to repro with AdBlock (maybe not Plus)?

I am able to repro with uBlock. The list that it appears to use by default is here, where I see /pageviews along with similar analytics keywords blocked.

Most people are going to see the notice shown and disable ad blockers accordingly, but this is probably the most common complaint we have with Pageviews Analysis. If a workaround is easy enough, and performant, I think we should do it.

I can reproduce with Adblock Plus 1.10.2 on Safari. I should upgrade my adblocker. :)

It looks like uBlock is using the old Adblock Plus 1.1 list. Maybe we could convince them to upgrade to a newer list.

DannyH triaged this task as Medium priority. (Mar 11 2016, 6:25 PM)
DannyH updated the task description.
DannyH set the point value for this task to 3.
DannyH moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.

Actually, the specific list that is triggering the block is the EasyPrivacy list:
https://easylist-downloads.adblockplus.org/easyprivacy.txt

It looks like several other services are also using this list.

Yeah, and they've blocked other things like "views" and "counter", etc. I'm thinking adding the PHP service on a different URL is our best bet. We don't need to change the URL of our tool; users are able to browse to it, only the assets won't load. That's easily resolved by placing them wherever our PHP service will live.

Anyone confident they could create something like this? At least initially it should be simple... reroute requests to the actual pageviews API and send back the results. We're caching on the clientside, so within the same session subsequent requests won't be any slower than they are now. Given how speedy our current layer of PHP is running, the additional layer of the API rerouting probably won't be too noticeable, I think.

Hmm. I'm a bit skeptical of setting up a middle-man API. Once we add the support for redirects, we're potentially talking about the following process:

  • Ask MediaWiki API for redirects
  • Ask PHP service for stats (after first API request completes)
  • Have PHP service ask stats API for actual stats
  • Return actual stats to PageView Analysis

I just don't imagine that's going to feel very snappy. I have no idea what a viable alternative would be though.

I'm not exactly clear if the tool or the RESTBase endpoint is the problem being solved here, but for either/both why not just rename? If killing the old name is too hard you should at least be able to introduce an alias. "pageviewstats" or similar names aren't on the block list. Generally picking a name that is commonly used by ad services or tracking beacons and then hoping to get it whitelisted for one domain is a losing battle.

The problem is with any analytics-like path, it seems. So it affects both the RESTBase endpoint and our CSS/JS assets, which are nested under a /pageviews route. See T126947 for more. Maybe they'd be willing to create an alias, I shall ask! That would solve our problem, as our assets can safely live within another tool without any impact on performance.

I tested a middleman service on my Ruby app and it didn't seem terribly slow. That was on my local machine; the unicorn server I use on Tool Labs will be faster. I can set it up and get /pageviews-test running on it so we can see if this is worthwhile, knowing a PHP implementation would likely be faster.

Give http://tools.wmflabs.org/pageviews-test a whirl, which is now using a middleman service at /musikanimal/pv. It seems the Ruby app is adding around a 10-20ms overhead, compared to hitting the RESTBase endpoint directly. It's difficult to definitively say, since the RESTBase API response time is quite variable. E.g. sometimes the Ruby app may appear to respond faster, but obviously that's coincidental. I've also set the caching headers to be the same, so the client will cache subsequent requests just as the actual API does.
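For reference, passing the caching headers through could look roughly like this. It is a Python sketch rather than the Ruby code actually running, and both the list of relayed headers and the sample header values are assumptions for illustration.

```python
# Headers the middleman copies from the upstream response.
# Assumption: this particular set is illustrative, not what the app uses.
RELAYED_HEADERS = ("cache-control", "etag", "content-type")


def relay_headers(upstream_headers: dict[str, str]) -> dict[str, str]:
    """Pick the headers to copy onto the middleman's own response.

    Header names are compared case-insensitively; anything not in
    RELAYED_HEADERS (cookies, server banners, ...) is dropped.
    """
    lowered = {name.lower(): value for name, value in upstream_headers.items()}
    return {name: lowered[name] for name in RELAYED_HEADERS if name in lowered}


# Hypothetical upstream response headers.
sample = {
    "Cache-Control": "max-age=3600",
    "Content-Type": "application/json",
    "Set-Cookie": "session=abc",
}
print(relay_headers(sample))
```

With the same Cache-Control value relayed, the browser treats a repeated request to the middleman the same way it would treat one to the real API.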

So with, say, 10 pages * 50 redirects (the most extreme of extreme edge cases), that's 500 requests at up to 20ms of overhead each, so we're adding up to 10 seconds. Not so great, but the people looking for that kind of data are probably willing to wait for it. Most people just click the link on wiki to see stats on a single page, and they won't notice anything.
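The worst-case estimate can be reproduced with a few lines. This assumes one middleman request per redirect, that the requests run one after another, and that the 10-20ms overhead measured above applies to each of them.

```python
def added_latency_seconds(pages: int, redirects_per_page: int,
                          overhead_ms_per_request: float) -> float:
    """Total extra time added by the middleman, in seconds.

    Assumption: one request per redirect, issued sequentially.
    """
    total_requests = pages * redirects_per_page
    return total_requests * overhead_ms_per_request / 1000


print(added_latency_seconds(10, 50, 20))  # upper bound: 10.0
print(added_latency_seconds(10, 50, 10))  # lower bound: 5.0
print(added_latency_seconds(1, 1, 20))    # single page: 0.02
```

If the requests are issued in parallel instead, the real figure would be lower than this.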

All things considered I'm convinced this idea of a middleman service is viable and worthwhile. We'll need to do something better than this Ruby app, though. It runs fine now, but with production traffic I am certain it will die at some point =P

Let's hold off for the moment. The services team is considering renaming the route and setting up a redirect for /pageviews. This would solve everything!

I took a shot at setting up a reverse proxy using lighttpd alone, but it doesn't support accessing an HTTPS backend server. There is also a bug around putting a rewrite and a reverse proxy together in the version of lighttpd that we have.

Writing a thin proxy in PHP probably isn't going to help much, as PHP doesn't have a way to pool and reuse existing HTTPS sessions, and setting up a new session for every request is going to be a non-trivial overhead in this type of reverse proxy. I really think the best thing to do would be to add an alias (or rename outright) at the RESTBase layer itself.
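To show what "reusing HTTPS sessions" means in practice, here is a minimal sketch in Python (not PHP) using the standard library's `http.client`. The `ConnectionCache` class and the User-Agent string are made up for illustration, and the sketch is not thread-safe.

```python
import http.client


class ConnectionCache:
    """Keep one open HTTPS connection per host and reuse it.

    Hypothetical helper for illustration only; not thread-safe.
    """

    def __init__(self) -> None:
        self._connections: dict[str, http.client.HTTPSConnection] = {}

    def get(self, host: str) -> http.client.HTTPSConnection:
        conn = self._connections.get(host)
        if conn is None:
            # No network traffic yet; the TLS handshake happens on first use.
            conn = http.client.HTTPSConnection(host, timeout=10)
            self._connections[host] = conn
        return conn

    def fetch(self, host: str, path: str) -> tuple[int, bytes]:
        conn = self.get(host)
        try:
            # Placeholder User-Agent; a real service would identify itself.
            conn.request("GET", path, headers={"User-Agent": "pv-proxy-sketch"})
            response = conn.getresponse()
            # Reading the full body leaves the connection ready for reuse.
            return response.status, response.read()
        except (http.client.HTTPException, OSError):
            conn.close()  # the next request will reconnect
            raise


cache = ConnectionCache()
print(cache.get("wikimedia.org") is cache.get("wikimedia.org"))  # True
```

Each call to `fetch` for the same host goes over the already-open connection when the server keeps it alive, so the TLS handshake is paid once rather than per request.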

Have you considered asking for a whitelisting at https://forums.lanik.us/ (the EasyList forums)?

I got them to whitelist tools.wmflabs.org: https://hg.adblockplus.org/easylist/rev/a45b2b79abc8

They ignored the part about the API though. Sounds like Analytics might be changing that URL anyway, but we'll see.

In the meantime, we should probably improve the notice that people see when the content is blocked, i.e. telling people to unsubscribe from EasyPrivacy.

We do tell them to whitelist /pageviews, which I think is preferable, given they may not want to remove all of EasyPrivacy. What do you think?

If we want to do that, we should give them explicit instructions to go about it, at least for the major Ad Blockers.

No opposition there. The two major ones I believe are AdBlock Plus and uBlock. Note we can also use inline styling e.g. style='font-weight:bold; color: red', since the CSS files under /pageviews may have been blocked from loading.

Dan requested that the EasyPrivacy list whitelist the API on April 2nd. No response yet.

Should we close this task? I think we've done all we can do. We've added the notice for users who are using an ad blocker. Meanwhile, with tools.wmflabs.org now whitelisted by EasyPrivacy, the app should load for anyone running the most recent version of AdBlock Plus and uBlock. At least for me, both ad blockers updated themselves. @Niharika you said a while back you were using an older version of AdBlock Plus. Did it automatically update for you too?

kaldari claimed this task.
kaldari edited projects, added Community-Tech-Sprint; removed Community-Tech.
kaldari moved this task from In Development to Q1 2018-19 on the Community-Tech-Sprint board.