Set up pageview counting for KaiOS app
Closed, ResolvedPublic

Description

The Inuka team is getting ready to deploy a Wikipedia app for KaiOS, and we need to make sure that its page views are counted properly.

Based on the pageview definition and the server-side filtering code, this is what needs to happen:

  • Analytics adds a regex matching the app's user agent to the filtering code

Event Timeline

@Nuria is the list of steps in the description complete?

Mmm, no, the pageview header was used for something else (discarding "previews" on some app functionality that is, I think, no longer alive).

Can't these pageviews be events sent to the eventlogging pipeline similar to page previews? That would be the most precise way to count them. They can use the same schema: https://meta.wikimedia.org/wiki/Schema:VirtualPageView

I'm sure the Inuka team will be happy to follow whatever the proper procedure is for getting them counted as pageviews, once they know what it is :)

But based on my read of the VirtualPageView schema and the refinement code, that's used to count previews. We want these counted as pageviews.

The user agent of the app must be changed to "WikipediaApp" (at the beginning of the string). Additionally, if you can send us what the URLs look like when requesting data, we can advise on the best approach to make them count as pageviews, since the pageview = 1 approach is out of date.

Here's one example of the requested link

https://en.wikipedia.org/beacon/event?%7B%22event%22%3A%7B%22user_id%22%3A%22b0d2a66324e6da7f759e%22%2C%22session_id%22%3A%22e2bcd8a41062f458fc60%22%2C%22pageview_token%22%3A%22569f63d0d6fb58d021e5%22%2C%22client_type%22%3A%22kaios-app%22%2C%22referring_domain%22%3Anull%2C%22load_dt%22%3A%222020-02-14T09%3A56%3A19.951Z%22%2C%22page_open_time%22%3A8681%2C%22page_visible_time%22%3A7005%2C%22section_count%22%3A12%2C%22opened_section_count%22%3A3%2C%22page_namespace%22%3A0%2C%22is_main_page%22%3Afalse%2C%22is_search_page%22%3Afalse%7D%2C%22revision%22%3A19739286%2C%22schema%22%3A%22InukaPageView%22%2C%22webHost%22%3A%22en.wikipedia.org%22%2C%22wiki%22%3A%22enwiki%22%7D;
// decoded form
{
    "event": {
        "user_id":"b0d2a66324e6da7f759e",
        "session_id":"e2bcd8a41062f458fc60",
        "pageview_token":"569f63d0d6fb58d021e5",
        "client_type":"kaios-app",
        "referring_domain":null,
        "load_dt":"2020-02-14T09:56:19.951Z",
        "page_open_time":8681,
        "page_visible_time":7005,
        "section_count":12,
        "opened_section_count":3,
        "page_namespace":0,
        "is_main_page":false,
        "is_search_page":false
    },
    "revision":19739286,
    "schema":"InukaPageView",
    "webHost":"en.wikipedia.org",
    "wiki":"enwiki"
}
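
For anyone who wants to inspect such a beacon URL by hand, here is a minimal Python sketch of the decoding step. The function name and the shortened sample URL are made up for illustration; it only assumes the payload is percent-encoded JSON placed directly in the query string, as in the example above.

```python
import json
from urllib.parse import unquote, urlsplit


def decode_beacon_event(url: str) -> dict:
    """Return the JSON payload carried in the query string of a beacon URL."""
    query = urlsplit(url).query
    # Drop a stray trailing ";" such as one left over from copy-paste.
    return json.loads(unquote(query).rstrip(";"))


# Shortened, hypothetical payload with the same overall shape as the real one.
example = (
    "https://en.wikipedia.org/beacon/event?"
    "%7B%22event%22%3A%7B%22client_type%22%3A%22kaios-app%22%7D%2C"
    "%22schema%22%3A%22InukaPageView%22%2C%22wiki%22%3A%22enwiki%22%7D"
)
payload = decode_beacon_event(example)
print(payload["schema"], payload["event"]["client_type"])
```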

This is an EventLogging event that will be persisted in the InukaPageView table in the events database. Nothing additional needs to be done for that to happen. We can set up ingestion of those events into Druid, similar to how it is done now for virtualpageviews.

@hueitan, @Nuria , sorry for the confusing name—the InukaPageView data stream is meant for a separate, time-limited analysis, not for representing KaiOS app readership in the main pageview dataset. There's a good chance we will be deactivating InukaPageView after some months even if we continue supporting the app.

It seems logical that for consistency, we should set up basic pageview counting the same way the Android and iOS apps do. I tried to summarize how to do that in the description, but it sounds like I'm out of date, so please tell me more 😊

It seems logical that for consistency, we should set up basic pageview counting the same way the Android and iOS apps do

We need to know the URLs the application is hitting and what they look like.

Ahh, okay. Based on a quick check of the code, it looks like the app is fetching pages from the Page Content Service, but @hueitan will know all the details 🙂

Yes, here are some examples

  1. Fetching the summary
https://en.wikipedia.org/api/rest_v1/page/summary/Domesticated
  2. Fetching the article
https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Domestication
  3. Fetching the article media
https://en.wikipedia.org/api/rest_v1/page/media/Domestication
  4. Search results for "cat"
https://en.wikipedia.org/w/api.php?format=json&formatversion=2&origin=*&action=query&prop=description|pageimages|pageprops&piprop=thumbnail&pilimit=15&ppprop=displaytitle&generator=prefixsearch&redirects=true&pithumbsize=64&gpslimit=15&gpsnamespace=0&gpssearch=cat
  5. Suggested articles for the article "Cat"
https://en.wikipedia.org/w/api.php?format=json&formatversion=2&origin=*&action=query&prop=pageimages|description&piprop=thumbnail&pithumbsize=160&pilimit=3&generator=search&gsrsearch=morelike:Cat&gsrnamespace=0&gsrlimit=3&gsrqiprofile=classic_noboostlinks&uselang=content

These keys are dynamic, based on the article language:

webHost - en.wikipedia.org fr.wikipedia.org hi.wikipedia.org
wiki - enwiki frwiki hiwiki
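
As a rough illustration of how these two keys vary, the sketch below derives them from a language code. The helper name is hypothetical, and the pattern is only inferred from the three examples listed (en, fr, hi); language codes with other shapes may not follow it.

```python
def beacon_fields(lang: str) -> dict:
    """Build the language-dependent keys, assuming the en/fr/hi pattern holds."""
    return {
        "webHost": f"{lang}.wikipedia.org",
        "wiki": f"{lang}wiki",
    }


for lang in ("en", "fr", "hi"):
    print(beacon_fields(lang))
```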

@Nuria
Regarding the base URL en.wikipedia.org/beacon/event, should we

  1. vary it based on the article language, for example fr.wikipedia.org/beacon/event for French?
  2. use the mobile version, en.m.wikipedia.org/beacon/event?

Regarding the base url en.wikipedia.org/beacon/event, do we

Those beacon pings are not counted as pageviews, so for this ticket it does not matter.

Ah, right; this probably relates to T242358 (our EventLogging data collection) instead. Is that right, @hueitan?

Yes

The UA key or request header is not yet confirmed.

In order to count pageviews the UA of the app needs to be "wikipediaApp/<some_version>"

Unfortunately, it turns out that it's not possible for a KaiOS app to modify the user agent (T242358#5928187, T242358#5936851).

Given this, what would you suggest we do? Perhaps we could use the X-Analytics header to indicate that the request is coming from an app and provide the version?

SBisson moved this task from Backlog to Dev on the Inuka-Team (Kanban) board.
SBisson added a subscriber: eamedina.

I don't know what's going on but I can set the UA on the API calls (not sendBeacon though)

See https://github.com/wikimedia/wikipedia-kaios/pull/171

@hueitan , @eamedina please test carefully on your devices

@Nuria (and other analytics folks) I've been making API calls from my device with the updated UA, can you already see some data coming in?

The user agent of the app must be changed to "WikipediaApp" (at the beginning of the string). Additionally, if you can send us what the URLs look like when requesting data, we can advise on the best approach to make them count as pageviews, since the pageview = 1 approach is out of date.

I've updated the PR above to have the app id at the beginning of the UA.

SBisson moved this task from Code Review to Dev on the Inuka-Team (Kanban) board.

The UA part is merged.

We're ready for the next part:

Analytics adds a regex matching the app's user agent to the filtering code

There are several things:

  • Also, there are different types of requests fetching content: "summary", "metadata" and "media" (per the comment above). Is the intent to count all of these as pageviews? To be clear, that is not done for other requests at this time.

I propose that we track these pageviews as events that are joined with the pageview pipeline later.

  1. An EventLogging event is sent on page transition according to a schema that defines a pageview (the schema can be the same as the one for virtualpageview but with a different name, simply "pageview")
  2. The Analytics team does the work of joining this event pageview stream to pageview_hourly

Does 1) seem feasible? If so, my team can take care of 2).

Sorry, I meant to submit my comment yesterday and just realized I never did. Please be so kind as to take a look, @SBisson @nshahquinn-wmf.

There are several things:

  • also there are different types of requests: "summary" , "meta data" and "media" fetching content (per comment above), is the intent to count all these as a pageview? to be clear that is not done now for other requests at this time

For this app, a call to the mobile-sections endpoint maps to a pageview. metadata and media provide enhancements to the article view. And summary is for a page preview.

I propose that we track these pageviews as events that are joined with the pageview pipeline later.

  1. an eventlogging event is sent on page transition according to a schema that defines a pageview (actually the schema can be the same one than the one for virtualpageview but with a different name, rather, simply, "pageview")

This is not really an option for us. It wouldn't easily fit into our current schedule but more importantly, we are worried we are making too many network requests already. We are targeting low-end devices on networks that are not always reliable.

The calls to mobile-sections are unavoidable so if they can be used to track pageviews that's ideal.

  2. Analytics team does the work of joining this event pageview stream to pageview_hourly

Does 1) seem feasible? If so my team can take care of #2

I see, we will go ahead on our end and:

  1. we will count as pageviews the https://en.wikipedia.org/api/rest_v1/page/mobile-sections/<some> requests
  2. requests will look for UA: <blah> WikipediaApp/1.0

Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.

I see, we will go ahead in our end and:

  1. we will count as pageviews the https://en.wikipedia.org/api/rest_v1/page/mobile-sections/<some> requests

Thank you!

  2. requests will look for UA: <blah> WikipediaApp/1.0

Just to be clear, the merged code sets the user agent as WikipediaApp/1.0 <blah>, based on comments from Dan and from Francisco that it should go at the beginning. If this is incorrect, please let us know.
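
To make the placement question concrete, here is a small Python sketch of a check that expects the app id at the very start of the user agent. It is not the refinery code: the regex, the function name, and the sample user-agent strings are assumptions for illustration, and the real filter may well accept the token elsewhere in the string.

```python
import re
from typing import Optional

# Assumed shape: "WikipediaApp/<version>" at the very start of the user agent.
APP_UA = re.compile(r"^WikipediaApp/(?P<version>\S+)")


def app_version(user_agent: str) -> Optional[str]:
    """Return the app version if the UA starts with the app id, else None."""
    match = APP_UA.match(user_agent)
    return match.group("version") if match else None


# Hypothetical user-agent strings, not captured from a real device.
leading = "WikipediaApp/1.0 Mozilla/5.0 (Mobile; rv:48.0) Gecko/48.0 Firefox/48.0 KAIOS/2.5"
trailing = "Mozilla/5.0 (Mobile; rv:48.0) Gecko/48.0 Firefox/48.0 KAIOS/2.5 WikipediaApp/1.0"
print(app_version(leading))
print(app_version(trailing))
```

With this stricter reading, only the first string is recognized, which is why it matters that both sides agree on where the app id goes.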

Change 580161 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery/source@master] [WIP] Detect pageviews as requested by KaiOS

https://gerrit.wikimedia.org/r/580161

Added a comment here for the Inuka folks, do ping me on (I don't know what the hell we're doing for chat these days) somewhere :) to chat about it or just feel free to add a review:

https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/580161/

@Milimetric I would like to discuss here if you don't mind.

From the patch in gerrit:

Android devices are accessing the /page/mobile-sections/ endpoint too, how are you planning on finding your own pageviews in here? Via the user agent? Have you checked whether that gets detected properly by UA Parser?

If the UA can be used that's great. I did update UA Core a while back to track KaiOS but I don't know how well it's doing. @nshahquinn-wmf what do you think?

@Milimetric the distinction is no different from the one that exists now between iOS and Android: in pageview_hourly, the only way to tell those records apart is the UA. See, for example, the different UAs used by iOS and Android:

{browser_major=-, os_family=iOS, os_minor=4, os_major=11, wmf_app_version=5.7.2.1267, device_family=Other, browser_family=Other}
{browser_major=-, os_family=iOS, os_minor=1, os_major=12, wmf_app_version=5.7.1.1259, device_family=Other, browser_family=Other}
{browser_major=4, os_family=Android, os_minor=2, os_major=4, wmf_app_version=2.5.190-r-2017-02-24, device_family=Generic Smartphone, browser_family=Android}
{browser_major=5, os_family=Android, os_minor=0, os_major=5, wmf_app_version=2.7.234-r-2018-05-30, device_family=Generic Smartphone, browser_family=Android}

Android devices are accessing the /page/mobile-sections/ endpoint too

@Milimetric will your patch cause us to start classifying these Android app webrequests as pageviews? I just want to make sure we're not interfering with the Android app's pageview definition.

If the UA can be used that's great. I did update UA Core a while back to track KaiOS but I don't know how well it's doing. @nshahquinn-wmf what do you think?

The KaiOS useragent parsing seems to be working fine; after we deployed it, 99.5% of the Firefox OS views became KaiOS views:
(note that the overall decline is separate—T242853)

[Screenshot: Screen Shot 2020-03-18 at 20.15.57.png]

The only other thing is ensuring that the app field gets properly set from the useragent so we can distinguish KaiOS web and KaiOS app traffic. The best thing to do there is probably to merge @Milimetric's patch and make sure we start seeing a few KaiOS app pageviews from the team's testing devices.

@Milimetric will your patch cause us to start classifying these Android app webrequests as pageviews?

It would not. Android pageviews are classified as such because the OS is parsed as being "Android", which is not happening in the KaiOS case.

Change 580999 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] Use PageviewDefinition changes: KaiOS & wikimania

https://gerrit.wikimedia.org/r/580999

Change 580999 merged by Milimetric:
[analytics/refinery@master] Use PageviewDefinition changes: KaiOS & wikimania

https://gerrit.wikimedia.org/r/580999

Change 580161 merged by jenkins-bot:
[analytics/refinery/source@master] Detect pageviews as requested by KaiOS

https://gerrit.wikimedia.org/r/580161

@Milimetric will your patch cause us to start classifying these Android app webrequests as pageviews? I just want to make sure we're not interfering with the Android app's pageview definition.

Just to clarify a bit what @Nuria said:

I see a very small number of requests from Android user agents to /api/rest_v1/page/mobile-sections/. These are currently NOT tagged as pageviews, because the current pageview definition only tags requests like /api/rest_v1/page/mobile-sections-lead/. With the change, this small number of requests will become pageviews, but they will not affect the KaiOS numbers because the user agent doesn't have KaiOS anywhere in it. Over an hour I saw 7 such requests, so we figured it's not a big deal and merged it. It'd be interesting to understand why this happens at all.
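
The difference between the two path rules can be sketched in Python as below. Both regexes are assumptions written only to mirror the description above; they are not copied from the refinery source, and user-agent checks are deliberately left out.

```python
import re
from urllib.parse import urlsplit

# Both patterns are assumptions for illustration, not the refinery's regexes.
BEFORE_CHANGE = re.compile(r"^/api/rest_v1/page/mobile-sections-lead/")
AFTER_CHANGE = re.compile(r"^/api/rest_v1/page/mobile-sections(?:-lead)?/")


def is_pageview_path(url: str, pattern: re.Pattern) -> bool:
    """Check only the URL path; user-agent checks are out of scope here."""
    return pattern.match(urlsplit(url).path) is not None


article = "https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Domestication"
summary = "https://en.wikipedia.org/api/rest_v1/page/summary/Domesticated"
print(is_pageview_path(article, BEFORE_CHANGE))  # False
print(is_pageview_path(article, AFTER_CHANGE))   # True
print(is_pageview_path(summary, AFTER_CHANGE))   # False
```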

@Milimetric thank you for getting the new refinement code merged! Has it been deployed? If not, do you have any idea when it will be?

Yep, it was deployed on Wednesday, I moved the task to Done (usually it would be in "In Code Review" or "Ready To Deploy" until it's actually running in production). So the latest refine should be using the new logic, are you seeing otherwise?

(Moving to QA on the Inuka board to better represent the current status)

@SBisson can I move this to done, or does something need to be checked here?

There is an extremely small number of pageviews per day, but some are appearing:

presto> select user_agent_map from wmf.pageview_hourly where user_agent_map['os_family'] not in ('iOS', 'Android') and user_agent_map['wmf_app_version']!='-' and access_method='mobile app' and day=21 and year=2020 and month=3 limit 10;

user_agent_map

{browser_major=48, os_family=KaiOS, os_minor=5, os_major=2, wmf_app_version=1.0.0, device_family=Nokia 2720, browser_family=Firefox Mobile}
{browser_major=48, os_family=KaiOS, os_minor=5, os_major=2, wmf_app_version=1.0.0, device_family=Nokia 2720, browser_family=Firefox Mobile}
{browser_major=48, os_family=KaiOS, os_minor=5, os_major=2, wmf_app_version=1.0.0, device_family=Nokia 2720, browser_family=Firefox Mobile}
(3 rows)

Closing on our end

Thanks for checking, @Nuria! I see them too now.

date        KaiOS_app_version  pageviews
2020-03-18  1.0.0              21
2020-03-19  1.0.0              13
2020-03-20  1.0.0              26
2020-03-21  1.0.0              3
2020-03-23  1.0.0              52

However, I want to check a few potential issues I noticed while looking at the details:

  • All the values of page_id and namespace_id are null. This may be intentional since we already have project and page_title, though.
  • All the values of referer_class are null. This makes sense since I'm sure the app isn't setting a referrer on its requests. However, in reality all of the views from the app will be internally referred: KaiOS doesn't provide any way for an app to handle URLs so all activity will come from navigation within the app. Is there a way we can set them all to have "internal" in referer_class? Perhaps the app can send a dummy internal referer like "https://www.wikipedia.org/" with each request?

All the values of page_id and namespace_id are null. This may be intentional since we already have project and page_title, though.

Those are not automatically filled in; they will also be empty for other requests that hit the APIs rather than the MediaWiki engine. See the docs on this here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest

Is there a way we can set them all to have "internal" in referer_class?

Not for this pipeline, this is one of the reasons why these pageviews are better modeled as events.

@Nuria, thanks for the details!

Do you see any problem with us always sending https://www.wikipedia.org in the referer header to ensure the referrer class is set as internal?

Do you see any problem with us always sending https://www.wikipedia.org in the referer header to ensure the referrer class is set as internal?

Seems quite an odd setup for a mobile app, to be honest, as that page was never visited. Given that all KaiOS pageviews with access_type "mobile" come from the app, does that referrer give you any information?

When we're just analyzing the app's traffic, no. But I'm thinking about broader analyses. For example, what we set matters to someone breaking down Indian mobile pageviews by referrer class, like I did for T242853. Obviously, we could send the web address of the actual referring page, but that takes extra work to implement and doesn't have any impact on what gets saved in pageview_hourly.

Since I haven't heard about any problems with this approach (even if it is a bit hacky), we're planning to implement it (T249216).
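
For readers unfamiliar with why a dummy referer would help, here is a rough Python sketch of the idea. It is not the actual refinement logic: the function name, the host rule, and the "external" label are assumptions for illustration. The thread itself only establishes that a missing referer yields a null class and that the goal is to get "internal".

```python
from typing import Optional
from urllib.parse import urlsplit


def referer_class(referer: Optional[str]) -> Optional[str]:
    """Classify a referer header under a simplified, assumed rule set."""
    if not referer:
        # No referer header: matches the null values observed for the app.
        return None
    host = urlsplit(referer).hostname or ""
    # Assumption: any wikipedia.org host counts as internal.
    if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
        return "internal"
    # Hypothetical catch-all label for everything else.
    return "external"


print(referer_class(None))
print(referer_class("https://www.wikipedia.org"))
```

Under these assumptions, sending the fixed https://www.wikipedia.org referer with every request would move the app's views from null to "internal".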