
Understand traffic to Hindi Wikipedia in Madhya Pradesh during awareness campaign
Closed, ResolvedPublic

Description

During the Hindi awareness campaign in Madhya Pradesh (T193292), Hindi Wikipedia was promoted to nearly 4M video viewers across the state.

Initial analysis covered Hindi across India and all languages in Madhya Pradesh, but did not include a specific analysis of Hindi Wikipedia traffic in the state. For a complete report on the campaign, we need to understand traffic during and following the campaign.

The initial report was created by an external analyst:

Acceptance Criteria

  • Charts showing traffic to Hindi Wikipedia in Madhya Pradesh during and following the campaign
  • Estimate of earned unique devices and pageviews, if possible
  • Analysis of any change in baseline views

Event Timeline

atgo triaged this task as High priority.Sep 13 2018, 7:44 PM
atgo created this task.
Restricted Application changed the subtype of this task from "Deadline" to "Task". Sep 13 2018, 7:44 PM
Tbayer moved this task from Triage to Backlog on the Product-Analytics board.Sep 13 2018, 8:28 PM

Hello! Is there an ETA for this analysis?

Hi @atgo ! It's unclear to me what questions you're asking here:

1. You're not asking us to re-do the whole analysis, correct? Specifically, the problematic parts are sections 6 & 7 of the external analyst's report (pageviews and unique devices), right?
2. Do you want to know the unique devices and pageviews of Hindi Wikipedia from Madhya Pradesh, before and after the campaign (whether or not users clicked through from the campaign)?
3. Anything else?

Hi @chelsyx,

I am not expecting the whole analysis to be redone. I am looking specifically for parts 6 & 7, but targeted to Hindi and Madhya Pradesh (a specific state in India).

Yes, I would like to know unique devices and PVs for the region/language before, during, and after the campaign. The best case would be an analysis of earned traffic in the form of 1) earned UDs/PVs during the campaign and 2) change in baseline.

chelsyx added a comment (edited). Oct 15 2018, 9:58 PM

Thanks for the quick response @atgo!

Using existing data in our database, we can get you the PVs from Madhya Pradesh by language, but we can't do that for UDs. Currently the WMF doesn't count unique devices by sub-region -- we only count them at the country level.
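
For illustration, a query along roughly these lines could pull daily pageviews by language for the state (a sketch only; it assumes wmf.pageview_hourly exposes country_code, subdivision, project, agent_type and view_count fields, and that the geocoder labels the state as 'Madhya Pradesh'):

SELECT year, month, day, project, SUM(view_count) AS pageviews
FROM wmf.pageview_hourly
WHERE country_code = 'IN'
  AND subdivision = 'Madhya Pradesh'
  AND agent_type = 'user'        -- exclude spider/automated traffic
  AND year = 2018
  AND month BETWEEN 3 AND 5
GROUP BY year, month, day, project;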

Also, I'm not sure what you mean by "earned traffic". If that means traffic directed by the campaign (users clicking through the link in the campaign video), we can't do that either, since the data containing the referrer information (the webrequest table) has been purged. The external analyst provided the PVs directed by the campaign in part 5 of the report, but it seems you want to see the numbers for a longer period.

Thoughts?

@chelsyx

UDs by language are useful, but not really if we can't do it for the region.

Earned traffic would, ideally, include all earned traffic that we saw through referrers and an estimate of what came through as a result of the campaign, but not necessarily a direct click. I get it if we can't get that, but figured I'd ask :)

chelsyx moved this task from Backlog to Doing on the Product-Analytics board.

@atgo Got it! I will get you the PVs from Madhya Pradesh by language from March to May, and compare the numbers year-over-year.

Ok, thanks. Much appreciated

Nuria added a subscriber: Nuria.Oct 22 2018, 5:54 PM

Some ideas and improvements that can be made to the SELECTs (from the attached PDF):

Unique devices:

SELECT month, day, domain, uniques_estimate
FROM wmf.unique_devices_per_domain_daily
WHERE country_code = 'IN'
  AND domain LIKE '%.wikipedia.org'
  AND year = 2018
  AND month BETWEEN 3 AND 5;

In the Unique Devices dataset there are two measures: uniques_underestimate and uniques_offset. I would look into changes in each separately over several days. If there is a change, you will first see it in the "offset" number and later in the "underestimate". See: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Unique_Devices
Offset represents devices coming from 1-hit sessions; "underestimate" represents devices that have looked at Wikipedia before. A noticeable effect from the video will be visible in the "offset" count.
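
To follow that suggestion, the query above could be adjusted to pull the two measures side by side (again a sketch; the column names are taken from the wikitech page linked above and should be verified against the table):

SELECT year, month, day, domain, uniques_underestimate, uniques_offset
FROM wmf.unique_devices_per_domain_daily
WHERE country_code = 'IN'
  AND domain LIKE '%.wikipedia.org'   -- desktop and mobile domains are counted separately
  AND year = 2018
  AND month BETWEEN 3 AND 5;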

Pageviews:
The data for the queries in the report is sampled; the pageview data you will look at in pageview_hourly does not need to be sampled, as it is already aggregated per page. You can use a model to discount seasonality, but you can also do a cheap removal of seasonality by subtracting last year's pageviews from this year's pageviews, so that the "signal" for pageviews is clearer (a very unsophisticated take on seasonality 'removal').
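
A sketch of that cheap year-over-year subtraction, under the same assumptions about wmf.pageview_hourly as above (and assuming the project value for Hindi Wikipedia is 'hi.wikipedia'):

WITH daily AS (
  SELECT year, month, day, SUM(view_count) AS pageviews
  FROM wmf.pageview_hourly
  WHERE project = 'hi.wikipedia'
    AND country_code = 'IN'
    AND subdivision = 'Madhya Pradesh'
    AND agent_type = 'user'
    AND year IN (2017, 2018)
    AND month BETWEEN 3 AND 5
  GROUP BY year, month, day
)
SELECT cur.month, cur.day,
       cur.pageviews  AS pv_2018,
       prev.pageviews AS pv_2017,
       cur.pageviews - prev.pageviews AS yoy_difference   -- crude seasonality removal
FROM daily cur
JOIN daily prev
  ON cur.month = prev.month AND cur.day = prev.day
WHERE cur.year = 2018
  AND prev.year = 2017;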

Also, do we know which page the video was taking viewers to on Facebook? It would be worth it to look at pageviews for just that one page and plot what we see.

Thanks @Nuria !

Also, do we know which page the video was taking viewers to on Facebook? It would be worth it to look at pageviews for just that one page and plot what we see.

According to T185584, the links we used for the campaign are:

But as you know, without webrequest data, we can't get the wprov parameter and therefore the referred pageviews. I can look into the main page pageviews, though.
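
A sketch of such a main-page query, assuming the Hindi Wikipedia main page title is 'मुखपृष्ठ' and the same wmf.pageview_hourly fields as above:

SELECT year, month, day, SUM(view_count) AS main_page_views
FROM wmf.pageview_hourly
WHERE project = 'hi.wikipedia'
  AND page_title = 'मुखपृष्ठ'      -- assumed title of the Hindi Wikipedia main page
  AND country_code = 'IN'
  AND subdivision = 'Madhya Pradesh'
  AND agent_type = 'user'
  AND year = 2018
  AND month BETWEEN 3 AND 5
GROUP BY year, month, day;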

Hey @chelsyx just checking in on this - when do you expect to have the analysis complete?

Hi @atgo, since the online campaign was targeting Madhya Pradesh only while the TV campaign was broadcast nationwide, I have to deal with them separately. And given that I will be OOO this Thu & Fri (sorry, I forgot about it last week), I will probably finish the analysis next week.

Ok, thanks @chelsyx. Please let me know if there will be any other delays.

Offset represents devices coming from 1-hit sessions; "underestimate" represents devices that have looked at Wikipedia before. A noticeable effect from the video will be visible in the "offset" count.

Well, it's likely that the video also reached people who had used Wikipedia before and motivated some of them to return earlier than they might have otherwise. But yes, I agree that looking at the "offset" part separately could be quite interesting here (I had suggested this earlier myself at T185584#3951523 , as @atgo may recall).

But as you know, without webrequest data, we can't get the wprov parameter and therefore the referred pageviews.

To avoid confusion: the Facebook- and YouTube-referred pageviews were already calculated (via the wprov parameter) in part 5 of the external analyst's report. (Separately, I had also run a quick query myself, not separated by state, just to make sure we wouldn't lose the data before it expired; see T185584#4440086.)
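
For the record, while the webrequest data still existed, the referred pageviews could be counted with a query along these lines (a sketch only; the wprov value below is a placeholder for the actual campaign tags from T185584, and the field names -- is_pageview, uri_host, uri_query, geocoded_data -- are assumptions about wmf.webrequest that should be checked before reuse):

SELECT year, month, day, COUNT(1) AS campaign_pageviews
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND is_pageview
  AND year = 2018
  AND month BETWEEN 3 AND 5
  AND uri_host IN ('hi.wikipedia.org', 'hi.m.wikipedia.org')
  AND uri_query LIKE '%wprov=<campaign-tag>%'          -- placeholder; substitute the real wprov values
  AND geocoded_data['subdivision'] = 'Madhya Pradesh'
GROUP BY year, month, day;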

atgo moved this task from Backlog to Done on the New-Readers board.Nov 29 2018, 9:16 PM

Draft of the report: https://wikimedia-research.github.io/Audiences-New_Readers-Hindi_Video_Campagin-April_2018/

@chelsyx still needs to fix a few things in the report before closing this ticket, but we don't expect big changes.