scope: deliver pageviews with geocoded data.
Need to suppress data where there is not enough traffic, similar to what Erik Z does on stats.
Privacy implications must be very carefully considered here. The current reports created by Erik Z. take a lot of care to add fuzziness where it's needed.
I think we need an example of what the report looks like. Are pageviews per page? Per project? Per language?
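To make the suppression idea concrete, here is a minimal sketch of threshold-based suppression. It is not Erik Z's method, which this thread does not describe. The `MIN_VIEWS` threshold, the function name, the (country, project) bucketing, and the sample rows are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical threshold: the thread does not say what counts as
# "not enough traffic", so 100 is an arbitrary placeholder.
MIN_VIEWS = 100


def suppress_low_traffic(rows, min_views=MIN_VIEWS):
    """Sum pageviews per (country, project) and drop buckets below min_views."""
    totals = Counter()
    for country, project, views in rows:
        totals[(country, project)] += views
    # Only buckets with enough traffic are kept for publication.
    return {key: n for key, n in totals.items() if n >= min_views}


# Invented sample rows: (country, project, pageviews).
sample = [
    ("US", "en.wikipedia", 5000),
    ("US", "en.wikipedia", 2500),
    ("IS", "is.wikipedia", 40),
    ("IS", "is.wikipedia", 30),
]
published = suppress_low_traffic(sample)
```

Dropping a bucket outright is the bluntest option; rounding or adding noise to small counts would be alternatives, and which one fits is a privacy decision rather than a coding one.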
Per T90499, for Zero, we would need a table with the following columns: language, subdomain (nothing|m|zero), site (wikipedia|...), country, count, bandwidth:
date language subdomain site page_views content_size
2001-01-15 en m wikipedia.org 1000000 50000000
...
The last piece, content size, is the total user-downloaded traffic, including bits and multimedia. GeoTagging those requests probably relates to T89177 - tagging all traffic with zero= tag.
I am not sure we will be able to provide content size. Doesn't seem like we would.
Please note that this format does not include country data, it includes "language" and that data is already available in the regular dumps:
date language subdomain site page_views
2001-01-15 en m wikipedia.org 1000000
@Nuria, why not? Content size should be fairly easy to obtain once tagging is enabled on all traffic - you simply run a summing group-by query without filtering by is_page_view, and join the result with the counting, page-view-filtered query. Might need to polish the syntax here and add the breakdown by subdomain, etc.
select sizes.date, sizes.geo, counts.cnt, sizes.size
from (select date, geo, sum(content) size from quests group by date, geo) sizes
left outer join (select date, geo, count(*) cnt from quests where is_page_view group by date, geo) counts
on counts.date = sizes.date and counts.geo = sizes.geo
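As a sanity check on the join described above, here is a small sketch that runs the same shape of query against an in-memory SQLite table. SQLite is only a stand-in, since the thread does not name the actual query engine. The table and column names are taken from the query above, the rows are invented, and the `coalesce` is an addition so that groups with no pageviews show 0 instead of null.

```python
import sqlite3

# In-memory stand-in for the request table; schema is assumed from the query above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table quests (date text, geo text, is_page_view integer, content integer)"
)
# Invented sample rows: US has two pageviews plus one non-pageview request,
# FR has traffic but no pageviews.
conn.executemany(
    "insert into quests values (?, ?, ?, ?)",
    [
        ("2001-01-15", "US", 1, 1000),
        ("2001-01-15", "US", 0, 4000),
        ("2001-01-15", "US", 1, 500),
        ("2001-01-15", "FR", 0, 300),
    ],
)

query = """
select sizes.date, sizes.geo, coalesce(counts.cnt, 0) as cnt, sizes.size
from (select date, geo, sum(content) as size
      from quests group by date, geo) sizes
left outer join
     (select date, geo, count(*) as cnt
      from quests where is_page_view group by date, geo) counts
  on counts.date = sizes.date and counts.geo = sizes.geo
order by sizes.geo
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The size subquery sits on the left side of the join because it covers all traffic, so every (date, geo) group that has any requests appears in the result even when it has no pageviews.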
Well, I did not know we had any plans to tag all request data; this is the first time I have heard about it. Once that is done, it should theoretically be possible to get content size, if these types of queries are performant enough.