Spamy - User-like pages distort our pageview metrics (they return 200 when they should return 404)
Closed, ResolvedPublic

Description

In theory we only record pageviews for requests that come back with a 200 or 304 status code. See: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java
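As a rough illustration of the rule described above, and not the actual PageviewDefinition logic (which applies additional criteria), a status-code check might look like the following minimal Python sketch. The names `PAGEVIEW_STATUS_CODES` and `has_pageview_status` are hypothetical and exist only for this example.

```python
# Hypothetical sketch: covers only the status-code criterion mentioned above.
# The real pageview definition applies further criteria that are not shown here.
PAGEVIEW_STATUS_CODES = {"200", "304"}


def has_pageview_status(http_status: str) -> bool:
    """Return True if the HTTP status is one that can count as a pageview."""
    return http_status.strip() in PAGEVIEW_STATUS_CODES


if __name__ == "__main__":
    # A 404 fails the check, so it should never reach pageview_hourly.
    for status in ("200", "304", "404"):
        print(status, has_pageview_status(status))
```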

In pageview_hourly I see a bunch of requests that seem to be pageviews:

select * from pageview_hourly where page_title like "%google-api%" and year=2016 and day=15 and month=08 and hour=20;
Returns:

en.wikipedia default User:GoogleAnalitycsRoman6/google-api desktop NULL user external South America BR Brazil Pernambuco Jaboatao dos Guararapes {"browser_major":"50","os_family":"Windows 7","device_family":"Other","os_major":"-","browser_family":"Chrome","wmf_app_version":"-","os_minor":"-"}

But none of those show up as pageviews in the webrequest table:

select * from webrequest where uri_path like "%google-api%" and year=2016 and day=15 and month=08 and hour=20 and is_pageview=true limit 10;

This returns no results, as records with uri "User:GoogleAnalitycsRoman6/google-api" are 404s.

For example:

cp1067.eqiad.wmnet /wiki/User:GoogleAnalitycsRoman/google-api text/html; charset=UTF-8 https://www.facebook.com/xti.php?some-data Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36 en-US,en;q=0.8,pt-BR;q=0.6,pt;q=0.4 ns=2;WMF-Last-Access=15-Aug-2016;https=1
Webrequest: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest

Pageview hourly: https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly

So we might have a bug that is moving those records into pageview_hourly (the table from which the pageview API is loaded) when it shouldn't.

Event Timeline

@Nuria: The page title is not only extracted from uri_path, but sometimes from uri_query.
For pageviews you can access the extracted title in the pageview_info map:

select * from wmf.webrequest where pageview_info['page_title'] like "%google-api%" and year=2016 and day=15 and month=08 and hour=20 and is_pageview=true limit 10;

This query returns some rows.
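To illustrate why a filter on uri_path alone can miss these requests, here is a minimal Python sketch of pulling a title from either the path or the query string. It is not the refinery code: `extract_title` is a hypothetical helper, and the `/wiki/` prefix, the `title` parameter, and the order in which they are checked are assumptions based on the URLs quoted in this task. No MediaWiki title normalization is attempted.

```python
from typing import Optional
from urllib.parse import parse_qs, unquote

WIKI_PREFIX = "/wiki/"  # assumed path prefix for article-style URLs


def extract_title(uri_path: str, uri_query: str) -> Optional[str]:
    """Return a page title from the path or, failing that, from ?title=..."""
    # Case 1: /wiki/Some_Title style URLs carry the title in the path.
    if uri_path.startswith(WIKI_PREFIX) and len(uri_path) > len(WIKI_PREFIX):
        return unquote(uri_path[len(WIKI_PREFIX):])
    # Case 2: /w/index.php?title=Some_Title&action=... carries it in the query.
    params = parse_qs(uri_query.lstrip("?"))
    if "title" in params:
        return params["title"][0]
    return None


if __name__ == "__main__":
    print(extract_title("/wiki/User:GoogleAnalitycsRoman/google-api", ""))
    print(extract_title(
        "/w/index.php",
        "?title=User:GoogleAnalitycsRoman/google-api&action=history",
    ))
```

Under these assumptions the second request has a uri_path of `/w/index.php`, so a `uri_path like "%google-api%"` filter would not match it even though the extracted title does.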

FYI @JMinor (since apps might be surfacing non page views) - 404.php has been the most read article for quite some time.

@greg I'm a little concerned that 404s are being hit so frequently - worth investigating in a separate bug?

I'm not terribly concerned offhand, but if there is a noticeable change in frequency then yeah, definitely.

@JAllemandou: you are right, these are indeed pageviews for pages like: https://en.wikipedia.org/w/index.php?title=User%20:GoogleAnalitycsRoman6/google-api&action=history (which existed on Sep-22nd).

I think this is a cat-and-mouse game of these pages being created by some tool and scraped over and over; it sounds like captchas are not working.

It is confusing because all this spam traffic is hitting similar pages, some of which have already been deleted and thus are 404s.

So, this is not a bug in the pageview code; rather, our spam filters for account creation are clearly not working. Would @greg know who owns this?

Nuria renamed this task from Pageview hourly stores records that are not really pageviews and those end up on top endpoint? to Spamy -User pages that should not be allowed to be created are hit by bots and distort our pageview metrics. Sep 23 2016, 6:53 PM
Nuria edited projects, added Analytics; removed Analytics-Kanban.

Untagging Kanban and moving to radar.

So, this is not a bug in the pageview code; rather, our spam filters for account creation are clearly not working. Would @greg know who owns this?

I don't understand, User:GoogleAnalitycsRoman1/google-api has never existed (the user or the page) yet it and other nonexistent pages show up in /top with tens of thousands of pageviews. These should be 404s, and hence not recorded by the pageview code, no?

Nuria renamed this task from Spamy -User pages that should not be allowed to be created are hit by bots and distort our pageview metrics to Spamy - User pages that should not be allowed to be created are hit by bots and distort our pageview metrics. Sep 23 2016, 7:01 PM

I don't understand, User:GoogleAnalitycsRoman1/google-api has never existed

The pageview API reports on requests that (besides some additional criteria) return a 200. This page: https://en.wikipedia.org/w/index.php?title=User%20:GoogleAnalitycsRoman6/google-api&action=history does return a 200 and thus is considered a pageview in the MediaWiki world.

Why are we going off of action=history? That will (evidently) always return a 200. If you go off of the subject page itself you correctly get a 404.

Nuria renamed this task from Spamy - User pages that should not be allowed to be created are hit by bots and distort our pageview metrics to Spamy - User-like pages that should not be allowed to be created are hit by bots and distort our pageview metrics (return 200). Sep 23 2016, 7:11 PM

Sorry, I think I understand... so you count action=history (and possibly other actions) as a pageview in addition to viewing the actual subpage. Apologies, as I clearly don't know how the code works, but if possible I think we should restrict recording pageviews to the subject page only. Consumers of the data are, I believe, interested in readership, and no content is visible from action=history.

Why are we going off of action=history? That will (evidently) always return a 200.

It is not evident (to me) why that would always return a 200, given that, as you mention, the page doesn't exist. Now, I do not know much about MediaWiki internals. In this case it looks like we are being used as a keep-alive mechanism of some sort, by someone building URLs to Wikipedia that are known to return a 200. It seems a waste of our resources to send responses for these.

Note that the processing pipeline for requests does not check whether pages exist; we just report on Varnish HTTP codes plus other criteria for what constitutes a pageview (for example, hits to pages that were deleted a few minutes later are still pageviews).

In this case there are probably workarounds that can be implemented to alleviate this problem, but (it seems to me) the root cause is that these URLs should not return 200. It might be that I am missing some context here.

Nuria renamed this task from Spamy - User-like pages that should not be allowed to be created are hit by bots and distort our pageview metrics (return 200) to Spamy - User-like pages that should not be allowed to be created are hit by bots and distort our pageview metrics (they return 200). Sep 23 2016, 7:24 PM

In this case there are probably workarounds that can be implemented to alleviate this problem, but (it seems to me) the root cause is that these URLs should not return 200. It might be that I am missing some context here.

I would have to agree :) I would not expect the revision history, or any similar action, of a nonexistent page to return a 200.

OK, this is a long-standing issue in MediaWiki regarding these "fake 200s": https://phabricator.wikimedia.org/T26144

Nuria renamed this task from Spamy - User-like pages that should not be allowed to be created are hit by bots and distort our pageview metrics (they return 200) to Spamy - User-like pages distort our pageview metrics (they return 200 when they should return 404). Sep 26 2016, 3:03 PM
Nuria claimed this task.
Nuria edited projects, added Analytics-Kanban; removed Analytics.
Nuria moved this task from Next Up to In Code Review on the Analytics-Kanban board.

Confirming that pages such as this one: https://en.wikipedia.org/?title=User:GoogleAnalitycsRoman/google-api&action=history now return 404s.

They should no longer be present on top endpoints or appear as pageviews.

To wrap this up, here is an example of pageviews that we are not serving: daily request counts in September, by country, for titles like "GoogleAnalitycsRoman/some". The total number of requests in September for this type of "fake page" was about 6.4 million (6,387,122). cc @JAllemandou

2016-09-30 4328 Colombia
2016-09-30 231651 Brazil
2016-09-30 876 Venezuela
2016-09-30 4804 Argentina
2016-09-30 11070 United States
2016-09-29 1134 Venezuela
2016-09-29 1485 United States
2016-09-29 255743 Brazil
2016-09-29 9119 Colombia
2016-09-29 1062 Argentina
2016-09-28 4246 Venezuela
2016-09-28 543 Colombia
2016-09-28 3755 Argentina
2016-09-28 31325 United States
2016-09-28 228761 Brazil
2016-09-27 34279 United States
2016-09-27 136 Venezuela
2016-09-27 1464 Colombia
2016-09-27 69 Mexico
2016-09-27 262966 Brazil
2016-09-26 296 Portugal
2016-09-26 81 Mexico
2016-09-26 277011 Brazil
2016-09-26 699 Argentina
2016-09-26 10062 United States
2016-09-26 1741 Colombia
2016-09-25 43289 Brazil
2016-09-25 2864 United States
2016-09-25 28 Colombia
2016-09-24 69206 Brazil
2016-09-23 273302 Brazil
2016-09-23 1 Poland
2016-09-23 3388 Colombia
2016-09-23 1 United Kingdom
2016-09-23 1 Mozambique
2016-09-23 5 Germany
2016-09-23 1634 Argentina
2016-09-23 2 Spain
2016-09-23 17938 United States
2016-09-23 1219 Mexico
2016-09-22 115 United States
2016-09-22 74 Colombia
2016-09-22 3377 Argentina
2016-09-22 233522 Brazil
2016-09-21 244918 Brazil
2016-09-21 121 United States
2016-09-21 283 Colombia
2016-09-21 809 Argentina
2016-09-20 824 Argentina
2016-09-20 33 United States
2016-09-20 229513 Brazil
2016-09-20 45 Venezuela
2016-09-19 1956 Colombia
2016-09-19 53 Argentina
2016-09-19 253108 Brazil
2016-09-19 52 United States
2016-09-18 2518 Argentina
2016-09-18 26603 United States
2016-09-18 68 Colombia
2016-09-18 36188 Brazil
2016-09-17 85035 Brazil
2016-09-16 711 United States
2016-09-16 38 Italy
2016-09-16 276470 Brazil
2016-09-15 2 Mexico
2016-09-15 2172 Argentina
2016-09-15 370 United States
2016-09-15 269379 Brazil
2016-09-14 3140 Argentina
2016-09-14 4856 United States
2016-09-14 226062 Brazil
2016-09-14 1289 Colombia
2016-09-13 2811 United States
2016-09-13 260833 Brazil
2016-09-13 4547 Argentina
2016-09-13 111 Colombia
2016-09-12 243246 Brazil
2016-09-12 17987 United States
2016-09-12 2941 Argentina
2016-09-11 139 United States
2016-09-11 2133 Argentina
2016-09-11 56 Switzerland
2016-09-11 79092 Brazil
2016-09-11 226 Colombia
2016-09-10 1762 Colombia
2016-09-10 91069 Brazil
2016-09-10 450 Mexico
2016-09-09 271533 Brazil
2016-09-09 4 Portugal
2016-09-09 138 United States
2016-09-08 1438 Argentina
2016-09-08 328 Switzerland
2016-09-08 311075 Brazil
2016-09-08 695 United States
2016-09-07 277 United States
2016-09-07 1738 Mexico
2016-09-07 392 Argentina
2016-09-07 60685 Brazil
2016-09-07 6 Colombia
2016-09-06 252 Colombia
2016-09-06 290737 Brazil
2016-09-06 6121 United States
2016-09-06 2964 Argentina
2016-09-06 492 Switzerland
2016-09-05 128 Colombia
2016-09-05 297621 Brazil
2016-09-05 2617 Argentina
2016-09-05 823 Mexico
2016-09-04 46683 Brazil
2016-09-04 337 Colombia
2016-09-04 2111 Argentina
2016-09-03 475 Colombia
2016-09-03 2067 Argentina
2016-09-03 63957 Brazil
2016-09-02 227 Colombia
2016-09-02 42 United States
2016-09-02 135 South Africa
2016-09-02 774 Argentina
2016-09-02 317746 Brazil
2016-09-02 1 Italy
2016-09-01 294358 Brazil
2016-09-01 23 Colombia
2016-09-01 772 United States
2016-09-01 4659 Argentina
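
For anyone who wants to re-aggregate the listing above, a minimal Python sketch of summing the counts per country follows. It assumes each row has the form "date count country" separated by whitespace, as shown above; `totals_by_country` is a hypothetical helper, and `SAMPLE_ROWS` holds only the rows for 2016-09-30 and 2016-09-29, not the whole month.

```python
from collections import defaultdict

# Rows copied from the listing above (2016-09-30 and 2016-09-29 only).
SAMPLE_ROWS = """\
2016-09-30 4328 Colombia
2016-09-30 231651 Brazil
2016-09-30 876 Venezuela
2016-09-30 4804 Argentina
2016-09-30 11070 United States
2016-09-29 1134 Venezuela
2016-09-29 1485 United States
2016-09-29 255743 Brazil
2016-09-29 9119 Colombia
2016-09-29 1062 Argentina
"""


def totals_by_country(text: str) -> dict:
    """Sum the request counts per country from 'date count country' rows."""
    totals = defaultdict(int)
    for line in text.splitlines():
        if not line.strip():
            continue
        # maxsplit=2 keeps multi-word country names such as "United States".
        _date, count, country = line.split(maxsplit=2)
        totals[country.strip()] += int(count)
    return dict(totals)


if __name__ == "__main__":
    ranked = sorted(totals_by_country(SAMPLE_ROWS).items(), key=lambda kv: -kv[1])
    for country, total in ranked:
        print(f"{country}\t{total}")
```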