In theory we only record pageviews for requests that come back with a 200 or 304 status code. See: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java
On pageview_hourly I see a bunch of requests that seem to be counted as pageviews:
select * from pageview_hourly where page_title like "%google-api%" and year=2016 and day=15 and month=08 and hour=20;
Returns:
en.wikipedia default User:GoogleAnalitycsRoman6/google-api desktop NULL user external South America BR Brazil Pernambuco Jaboatao dos Guararapes {"browser_major":"50","os_family":"Windows 7","device_family":"Other","os_major":"-","browser_family":"Chrome","wmf_app_version":"-","os_minor":"-"}
But none of those requests are flagged as pageviews in the webrequest table:
select * from webrequest where uri_path like "%google-api%" and year=2016 and day=15 and month=08 and hour=20 and is_pageview=true limit 10;
This returns no results, as records with URI path "User:GoogleAnalitycsRoman6/google-api" are 404s.
For example:
cp1067.eqiad.wmnet /wiki/User:GoogleAnalitycsRoman/google-api text/html; charset=UTF-8 https://www.facebook.com/xti.php?some-data Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36 en-US,en;q=0.8,pt-BR;q=0.6,pt;q=0.4 ns=2;WMF-Last-Access=15-Aug-2016;https=1
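To confirm the status codes for that hour directly, a query along these lines might help. It is only a sketch and assumes the webrequest table exposes an http_status column (as listed on the Webrequest page linked below). It has not been run against the cluster.

```sql
-- Sketch: break down matching requests by HTTP status and pageview flag.
-- Assumes an http_status column exists on webrequest.
select http_status, is_pageview, count(*) as requests
from webrequest
where uri_path like "%google-api%"
  and year=2016 and month=08 and day=15 and hour=20
group by http_status, is_pageview;
```

If every row comes back with http_status 404 and is_pageview=false, the webrequest side is behaving as the pageview definition says it should.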
Webrequest: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
Pageview hourly: https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly
So we might have a bug that is moving those records into pageview_hourly (the table from which the pageview API is loaded) when they should not be there.