Page MenuHomePhabricator

Aggregate and save the Google Translate Pageview count temporarily
Closed, ResolvedPublic

Description

As the webrequest data from November is going to be purged soon, we want to aggregate and save the Google Translate Pageview count temporarily in @chelsyx 's database. This should be a one-off solution, and this data is going to be transformed and moved to production database until we figure out a long term solution.

The query we're going to use is based on @dr0ptp4kt 's trend query:

select
    count(1) as count,
    year,
    month,
    day,
    http_method,
    http_status,
    uri_host,
    agent_type,
    access_method,
    parse_url(referer, 'HOST') as referer_host,
    referer_class,
    parse_url(referer, 'QUERY') like '%client=srp%' as client_srp,
    regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])hl=([^&]*)', 2) as home_language,
    regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])sl=([^&]*)', 2) as source_language,
    regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2) as to_language,
    regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])rurl=([^&]*)', 2) as rurl_param,
    geocoded_data['continent'] AS continent,
    geocoded_data['country_code'] AS country_code,
    geocoded_data['country'] AS country
from wmf.webrequest
where
    year = 2019
    and month = 2
    and is_pageview
    and x_analytics_map['translationengine'] = 'GT'
group by
    year, month, day,
    http_method,
    http_status,
    uri_host,
    agent_type,
    access_method,
    parse_url(referer, 'HOST'),
    referer_class,
    parse_url(referer, 'QUERY') like '%client=srp%',
    regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])hl=([^&]*)', 2),
    regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])sl=([^&]*)', 2),
    regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])tl=([^&]*)', 2),
    regexp_extract(parse_url(referer, 'QUERY'), '(^|[&?])rurl=([^&]*)', 2),
    geocoded_data['continent'],
    geocoded_data['country_code'],
    geocoded_data['country']
ORDER BY
    year, month, day
-- Limit enforced by hive strict mapreduce setting.
-- 1000000000 == NO LIMIT !
LIMIT 1000000000
;

Event Timeline

chelsyx triaged this task as Medium priority.Feb 2 2019, 6:06 AM
chelsyx created this task.
chelsyx moved this task from Triage to Doing on the Product-Analytics board.

This is done. The aggregated data from Nov 2018 to Jan 2019 is now in chelsyx.toledo_pageviews on Hadoop cluster.

Will it be possible later to backfill/update/extend either virtualpageview_hourly or pageview_hourly with data derived from this table? (cf. T212414#4864672)

It seems that some standard fields that are common to these tables and inform our core metrics, e.g. country_code or project, are not yet included here.

Will it be possible later to backfill/update/extend either virtualpageview_hourly or pageview_hourly with data derived from this table? (cf. T212414#4864672)

virtualpageview_hourly doesn't seem suitable in this case. As to pageview_hourly, it would be awesome if we can add a field for all the keys in x_analytics_map. I will open another ticket for this and let's discuss about it there.

It seems that some standard fields that are common to these tables and inform our core metrics, e.g. country_code or project, are not yet included here.

Thanks for the suggestion! I've added referer_class, access_method, continent, country_code and country to the table, and backfill the data since Dec 3 2018.

@dr0ptp4kt About the difference between home_language and to_language, I did some investigation and found that home_language controls the translation engine language (google translate toolbar, etc), while to_language is the translation target languages. Only when to_language is missing, pages will be translated to home_language.