In theory we only record pageviews for requests that come back with a 200 or 304 HTTP status code. See: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java
In pageview_hourly I see a bunch of requests that seem to be counted as pageviews:
select * from pageview_hourly where page_title like "%google-api%" and year=2016 and day=15 and month=08 and hour=20;
Returns:
en.wikipedia default User:GoogleAnalitycsRoman6/google-api desktop NULL user external South America BR Brazil Pernambuco Jaboatao dos Guararapes {"browser_major":"50","os_family":"Windows 7","device_family":"Other","os_major":"-","browser_family":"Chrome","wmf_app_version":"-","os_minor":"-"}
But none of the matching requests in the webrequest table are flagged as pageviews:
select * from webrequest where uri_path like "%google-api%" and year=2016 and day=15 and month=08 and hour=20 and is_pageview=true limit 10;
This returns no results, as records with URI "User:GoogleAnalitycsRoman6/google-api" are 404s.
For example:
cp1067.eqiad.wmnet 750389096 2016-08-15T20:17:32 1.12057E-4 179.183.182.151 hit 404 5639 GET en.wikipedia.org /wiki/User:GoogleAnalitycsRoman/google-api text/html; charset=UTF-8 https://www.facebook.com/xti.php?some-data Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36 en-US,en;q=0.8,pt-BR;q=0.6,pt;q=0.4 ns=2;WMF-Last-Access=15-Aug-2016;https=1 - false 0.0.16 179.183.182.151 {"city":"Feira de Santana","country_code":"BR","longitude":"-38.95","postal_code":"Unknown","timezone":"America/Bahia","subdivision":"Bahia","continent":"South America","latitude":"-12.25","country":"Brazil"} cp1053 miss, cp1067 hit/47 {"browser_major":"49","os_family":"Windows XP","os_major":"-","device_family":"Other","browser_family":"Chrome","os_minor":"-","wmf_app_version":"-"} {"WMF-Last-Access":"15-Aug-2016","https":"1","ns":"2"} 2016-08-15 20:17:32 desktop user false external {"project_class":"wikipedia","project":"en","qualifiers":null,"tld":"org"} NULL NULL text 2016 8 15 20
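One way to double-check this for the whole hour, rather than from a single sample row, would be to group the matching webrequest rows by status code and pageview flag. This is only a sketch: the http_status column name and the webrequest_source='text' partition filter are assumptions inferred from the sample row above, not taken from the queries in this report.

```sql
-- Sketch only: http_status and webrequest_source are assumed names;
-- adjust them to the actual webrequest schema before running.
select http_status, is_pageview, count(*) as requests
from webrequest
where webrequest_source='text'
  and uri_path like "%google-api%"
  and year=2016 and month=8 and day=15 and hour=20
group by http_status, is_pageview;
```

If every row comes back with a 404 status and is_pageview=false, then the rows in pageview_hourly cannot have come from the is_pageview flag on these records.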
Webrequest: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
Pageview hourly: https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly
So we might have a bug that is moving those records into pageview_hourly (which is the table from which the Pageview API is loaded) when they shouldn't be there.