Invalid pageview data for iOS app
Closed, Resolved | Public | 8 Estimated Story Points

Description

It seems that since around the time of the 5.0 launch, pageviews from the iOS app are no longer being recorded correctly:

iOS app pageviews (June 2015-April 2016).png (246×602 px, 16 KB)

Data source:

hive (default)> SELECT year, month, day, CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0")) as date, SUM(view_count) 
FROM wmf.pageview_hourly 
WHERE year > 0 AND access_method = 'mobile app' 
 AND user_agent_map['os_family'] = 'iOS' AND agent_type = 'user' 
GROUP BY year, month, day ORDER BY year, month, day LIMIT 1000;

See also T130432 (an issue which was caused by the app no longer sending unique device IDs in most cases - however, one would expect that it still sends correct user agents, i.e. that the same issue does not occur here)

Event Timeline

And it does not just affect pageviews tagged as iOS in user_agent_map, but also pageviews tagged as app views in access_method overall:

Mobile app pageviews (2015-12-22..2016-04-04).png (296×478 px, 16 KB)

Data source:

SELECT year, month, day, CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0")) as date,
  SUM(IF(access_method = 'mobile app', view_count, null)) AS Apps,
  SUM(IF(access_method = 'desktop', view_count, null)) AS Desktop,
  SUM(IF(access_method = 'mobile web', view_count, null)) AS MobileWeb
FROM wmf.projectview_hourly
WHERE year>0 AND agent_type = 'user'
GROUP BY year, month, day ORDER BY year, month, day LIMIT 1000;

@JMinor and I did a test today, and it appears that the app's requests aren't even showing up in the webrequest table.

Data source:

SELECT * FROM wmf.webrequest WHERE year = 2016 AND month = 4 AND day = 5 AND hour = 21 AND ip = '198.73.209.5' AND page_id = 3145229;

(That's the page ID of https://en.wikipedia.org/wiki/David_Herter , which Josh accessed from the app over the office wifi - IP 198.73.209.5 - around that time. But the only results from that query are two separate views I made on desktop with Chrome.)

Thanks @Tbayer . I watched the request traffic in Charles proxy, opted out in the app. The request is made to the standard endpoint, and I didn't see anything obviously wrong with the request URL or the headers.

I'm bumping this to our current version and setting it as high priority. Just FYI: if this has to be fixed on the client (i.e. app) side, the fix will not ship until 5.0.3 is released (currently targeting end of April).

@Tbayer, @JMinor:
For a webrequest to be counted as a Pageview from an app on iOS, it needs:

  • user_agent contains "WikipediaApp"
  • contentType contains "application/json"
  • Either
      • Tagged as pageview in the x_analytics header (pageview=1)
  • OR
      • The path part of the URL contains "api.php"
      • The user agent contains "iPhone"
      • The query part of the URL contains "sections=all"

My guess would be that in the new iOS app, either the query part doesn't contain "sections=all" anymore, or the endpoint is not api.php anymore (or both).
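The criteria listed above can be sketched as a small predicate. This is only an illustration under stated assumptions: the class and method names are hypothetical (this is not the actual refinery code), the three conditions of the second alternative are assumed to be required together, and plain substring checks stand in for whatever matching the real pageview definition uses. The query string in the example is illustrative as well.

```java
// Hypothetical sketch of the iOS app pageview criteria described in this thread.
// Names and structure are illustrative; this is not the refinery implementation.
final class IosAppPageviewSketch {

    // Returns true when s is non-null and contains needle.
    static boolean contains(String s, String needle) {
        return s != null && s.contains(needle);
    }

    // Assumption: the api.php / iPhone / sections=all conditions must all hold together.
    static boolean isIosAppPageview(String userAgent, String contentType,
                                    String xAnalytics, String uriPath, String uriQuery) {
        boolean tagged = contains(xAnalytics, "pageview=1");
        boolean legacyApiRequest = contains(uriPath, "api.php")
                && contains(userAgent, "iPhone")
                && contains(uriQuery, "sections=all");
        return contains(userAgent, "WikipediaApp")
                && contains(contentType, "application/json")
                && (tagged || legacyApiRequest);
    }

    public static void main(String[] args) {
        String path = "/w/api.php";
        String query = "?action=mobileview&sections=all"; // illustrative query string
        // Pre-5.0 user agent quoted later in this thread: counted.
        System.out.println(isIosAppPageview(
                "WikipediaApp/4.1.5.141 (iPhone OS 9.3; Phone)",
                "application/json", "", path, query));
        // 5.0.2 user agent quoted later in this thread: rejected, "WikipediaApp" is missing.
        System.out.println(isIosAppPageview(
                "Wikipedia/5.0.2 (iPhone; iOS 9.3.1; Scale/3.00)",
                "application/json", "", path, query));
    }
}
```

Under these assumptions, a request fails the check as soon as any one of the mandatory conditions is missing, even if everything else about the request is unchanged.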

@Mhurd see @JAllemandou's comment above for things to look for in our requests.

@Tbayer @JAllemandou heya I just confirmed the code responsible for the user agent regression :) I'll try to get this patched up so the next release restores "WikipediaApp" to the user agent.

Presently it's setting the user agent like this:

Wikipedia/5.0.2 (iPhone; iOS 9.3.1; Scale/3.00)

Whereas before it set it like this:

WikipediaApp/4.1.5.141 (iPhone OS 9.3; Phone)

Can you work with the Wikipedia/5.0.x string in the meantime to backfill the missing 5.0 data?

To clarify, until we release the 5.0.3 update with the fixed WikipediaApp user agent, you'd have to check for Wikipedia/5.0.0, Wikipedia/5.0.1 and Wikipedia/5.0.2 since all of these user agent variants are in the wild.

@Mhurd: Awesome, thanks for resolving the mystery!

The cleanest solution would be to update the pageview definition and backfill the pageview_hourly and projectview_hourly tables. @JAllemandou, is that possible? Even if I myself manage to reconstruct the metrics I need right now (for the Reading team's quarterly review next Tuesday etc.), all other users that rely e.g. on the total Wikimedia pageview numbers, the pageview API etc. are getting faulty data right now too, most of them probably without being aware of it.

@Tbayer additionally you may be able to search the user agent string for "iOS" and filter out any user agent strings that contain "Webkit" or "Safari".
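That heuristic could look roughly like the following sketch. The class and method names are hypothetical, and the exclusion checks are made case-insensitive on the assumption that browser user agents usually spell the token as "AppleWebKit" with a capital K, which a case-sensitive search for "Webkit" would miss. The browser user agent in the example is illustrative, not taken from the logs.

```java
import java.util.Locale;

// Hypothetical sketch of the suggested heuristic: keep user agents that
// mention "iOS" but look like neither WebKit-based browsers nor Safari.
final class IosUserAgentHeuristic {

    static boolean looksLikeIosApp(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        // Lower-case once so "AppleWebKit", "Webkit" and "webkit" are all caught.
        String lower = userAgent.toLowerCase(Locale.ROOT);
        return userAgent.contains("iOS")
                && !lower.contains("webkit")
                && !lower.contains("safari");
    }

    public static void main(String[] args) {
        // App user agent quoted in this thread.
        System.out.println(looksLikeIosApp(
                "Wikipedia/5.0.2 (iPhone; iOS 9.3.1; Scale/3.00)"));
        // Illustrative browser user agent: contains "iOS" (inside "CriOS") but is filtered out.
        System.out.println(looksLikeIosApp(
                "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3 like Mac OS X) AppleWebKit/601.1 "
                + "(KHTML, like Gecko) CriOS/49.0 Mobile/13E230 Safari/601.1"));
    }
}
```

A filter like this is broader than matching the exact "Wikipedia/5.0.x" strings, so it would need checking against real data before being relied on.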

@JAllemandou @madhuvishy

The fix on the client will take a bit to develop and release. Is there any way we can reprocess the existing logs to recover this data using the user agent strings mentioned above?

This would help not just our team, but also others who use this data to have an accurate picture until the fix is out.

@Tbayer: We do not have the ability to backfill pageviews only for iOS, and rerunning pageviews for all clients and all projects for the last month due to a bug in iOS does not seem the best use of our resources. I am not even sure it is possible to do in a timely manner.

It is unfortunate that pageviews will be incomplete for iOS, but we can make sure to describe issues with the webrequest dataset here:
https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#Changes_and_known_problems_since_2015-03-04

Let's add the strings to the user agent match and we will be counting pageviews going forward. @Mhurd: code is in gerrit not github, we can help with changes so we can get those done as fast as possible.

> @Tbayer: We do not have the ability to backfill pageviews only for iOS

I'm not sure what was meant by "only for iOS"? The idea would be to add all rows to the pageview_hourly and projectview_hourly tables that are missing there currently because of this bug.

> and rerunning pageviews for all clients and all projects for the last month due to a bug in iOS does not seem the best use of our resources. I am not even sure it is possible to do in a timely manner.

Even if it is not completed in time for the Reading team's quarterly review (April 12), for which we need various numbers that are currently not available due to this bug, restoring the validity of the overall pageview data (which is affected too by not counting the vast majority of the app's views) would benefit any future analyses, both those specific to the app and those examining overall pageviews.

In the meantime, @JAllemandou (thanks for the explanations above!), given Monte's findings, what would a query for webrequest look like that counts these missing pageviews (with "Wikipedia/5.0..." user agents) in the same way as the pageview_hourly query in the task description?

> It is unfortunate that pageviews will be incomplete for iOS, but we can make sure to describe issues with the webrequest dataset here:
> https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#Changes_and_known_problems_since_2015-03-04
>
> Let's add the strings to the user agent match and we will be counting pageviews going forward. @Mhurd: code is in gerrit not github, we can help with changes so we can get those done as fast as possible.

> Even if it is not completed in time for the Reading team's quarterly review (April 12), for which we need various numbers that are currently not available due to this bug, restoring the validity of the overall pageview data (which is affected too by not counting the vast majority of the app's views) would benefit any future analyses, both those specific to the app and those examining overall pageviews.

I understand. What we are saying is that on our end we have to balance the amount of work and resources involved in the recomputation against the impact on the dataset. In this case the effect on the overall dataset is very easy to quantify and document: pageviews for iOS affect only enwiki on mobile, they do not affect other projects, and because of their small size they also do not affect top article counts. The overall count is affected of course, but to a small extent.

Computation is expensive and doing one-off counting on the cluster due to a bug in the mobile apps is not the best way to proceed in my opinion, more so when you are going to have to calculate numbers by hand for your upcoming report.

This comment was removed by Nuria.

@Nuria

> code is in gerrit not github

Oh sorry, I am aware. I just used that link for convenience :)

@Nuria

Could we just change it from this...

private static final Pattern appAgentPattern = Pattern.compile(
    "WikipediaApp"
);

...to this...

private static final Pattern appAgentPattern = Pattern.compile(
    "WikipediaApp|Wikipedia/5\\.0\\.[0-2] "
);

Assuming "Wikipedia/5.0.x " didn't start appearing as a user agent until the time of our initial 5.0 release.
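To illustrate what the proposed pattern would and would not match, here is a minimal, self-contained sketch. The class name and the helper `isAppUserAgent` are hypothetical, and applying the pattern with `find()` (a substring search) is an assumption about how it is used. In a Java string literal the dots have to be written as `\\.` to match literally, and the slash needs no escaping.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the proposed app user agent pattern.
final class AppAgentPatternSketch {

    // Dots are escaped so that "5.0.1" matches only literally; the trailing
    // space keeps e.g. "Wikipedia/5.0.10" from matching.
    private static final Pattern APP_AGENT_PATTERN = Pattern.compile(
        "WikipediaApp|Wikipedia/5\\.0\\.[0-2] "
    );

    // Assumption: the pattern is applied as a substring search via find().
    static boolean isAppUserAgent(String userAgent) {
        return userAgent != null && APP_AGENT_PATTERN.matcher(userAgent).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "WikipediaApp/4.1.5.141 (iPhone OS 9.3; Phone)",   // pre-5.0 format
            "Wikipedia/5.0.2 (iPhone; iOS 9.3.1; Scale/3.00)", // 5.0.x format
            "Wikipedia/5.1.0 (iPhone; iOS 9.3.1; Scale/3.00)"  // made-up version, not matched
        };
        for (String ua : samples) {
            System.out.println(isAppUserAgent(ua) + "  " + ua);
        }
    }
}
```

Listing the three affected versions explicitly keeps the special case narrow: any later release is only matched if it goes back to the "WikipediaApp" prefix.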

> I understand. What we are saying is that on our end we have to balance the amount of work and resources involved in the recomputation against the impact on the dataset.

I'm not sure I understand why this is so costly, and I have no way to engage in a discussion of that. We're changing a single regex for refinery code. Is the issue that re-running the pageviews can only be done across all requests and so all Wikimedia data would have to be reprocessed?

We're really so impoverished as an organization that re-running log processing for, as you say, a relatively small amount of data is computationally "too expensive"?

> In this case the effect on the overall dataset is very easy to quantify and document: pageviews for iOS affect only enwiki on mobile,

This is not accurate. The app supports all wikipedias, not just enwiki. Though your larger assertion still stands. My follow-up question then, is why bother to even track app usage at all?

The 5.0.0, 5.0.1 and 5.0.2 user agent strings also contain iOS and Scale if that helps:

Wikipedia/5.0.2 (iPhone; iOS 9.3.1; Scale/3.00)

> Even if it is not completed in time for the Reading team's quarterly review (April 12), for which we need various numbers that are currently not available due to this bug, restoring the validity of the overall pageview data (which is affected too by not counting the vast majority of the app's views) would benefit any future analyses, both those specific to the app and those examining overall pageviews.

> I understand. What we are saying is that on our end we have to balance the amount of work and resources involved in the recomputation against the impact on the dataset. In this case the effect on the overall dataset is very easy to quantify and document: pageviews for iOS affect only enwiki on mobile, they do not affect other projects

This assessment is incorrect, as Josh pointed out.

> and because of their small size they also do not affect top article counts. The overall count is affected of course, but to a small extent.

> Computation is expensive and doing one-off counting on the cluster due to a bug in the mobile apps is not the best way to proceed in my opinion, more so when you are going to have to calculate numbers by hand for your upcoming report.

Not sure I understand your reasoning here. Are you stating that it is impossible for capacity reasons to run a webrequest query as described in my previous comment? Regarding "computation", it would select as many rows as one would need to select for that backfill (I just checked that the app views normally all have agent_type = 'user').

> I'm not sure I understand why this is so costly, and I have no way to engage in a discussion of that. We're changing a single regex for refinery code. Is the issue that re-running the pageviews can only be done across all requests and so all Wikimedia data would have to be reprocessed?

Sorry, we are talking about different things. The UA has been updated, see https://gerrit.wikimedia.org/r/#/c/282388/ . We merged these changes this morning, so as of our next deploy (likely Monday) computations will take the new UA into account.

> Is the issue that re-running the pageviews can only be done across all requests and so all Wikimedia data would have to be reprocessed?

Yes, pageviews would need to reprocess all data to re-compute iOS in the absence of us running one-off custom code for this scenario.

> This is not accurate. The app supports all wikipedias, not just enwiki.

Sorry, my mistake.

> My follow-up question then, is why bother to even track app usage at all?

I think tracking app usage is a sound thing to do. But see my prior comment: in the absence of us running special code for this (which has its own cost), we need to re-process all data for all clients, which in some instances we simply cannot do if the period of time is long enough.

@Nuria @JAllemandou

> UA has been updated, see: https://gerrit.wikimedia.org/r/#/c/282388/ we merged these changes this morning

Does this need to be updated as well?
https://phabricator.wikimedia.org/diffusion/ANRS/browse/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java;bd89d4e20b98d708c341dea03acb8c1eb039674a$89

To this perhaps?

private static final Pattern appAgentPattern = Pattern.compile(
    "WikipediaApp|Wikipedia/5\\.0\\.[0-2] "
);

Apologies if not - I'm not too familiar with this codebase.

@Mhurd : indeed! Thanks for catching that.

Change 282439 had a related patch set uploaded (by Nuria):
Add pageview definition special case for iOs App

https://gerrit.wikimedia.org/r/282439

PS, for reference: If I'm not mistaken, the query used for extracting pageview rows from webrequest into pageview_hourly is at https://github.com/wikimedia/analytics-refinery/blob/master/oozie/pageview/hourly/pageview_hourly.hql ; which is what one would need to emulate entirely (for backfilling) or partly (for calculating the missing numbers).

A discussion on IRC has mostly answered my question above ("Are you stating that it is impossible for capacity reasons to run a webrequest query as described in ..."): I now understand that Nuria's concerns about computational resources are not about running such a query per se, but rather about the insertion of the resulting rows into the pageview_hourly (and projectview_hourly) tables.

Fix for the app itself so the next release will report the former "WikipediaApp/" style user agent:
https://github.com/wikimedia/wikipedia-ios/pull/652/

Change 282439 merged by Joal:
Add pageview definition special case for iOs App

https://gerrit.wikimedia.org/r/282439

JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.
JAllemandou moved this task from In Progress to In Code Review on the Analytics-Kanban board.
JAllemandou moved this task from In Code Review to Ready to Deploy on the Analytics-Kanban board.

@Nuria A friendly reminder about your kind offer from Friday on IRC to paste a webrequest query to calculate the number of pageviews for the app (i.e. for the app versions that had the wrong user agent, for the timespan before the pageview definition update rolls out). See also my earlier question above to @JAllemandou about what such a query could look like, which remains unanswered. Our quarterly review is tomorrow and we need these numbers for that.

Thanks!

@Tbayer : You'll find extracted iOS pageview data from March 10th 15:00 UTC to April 11th 09:00 in joal.pageview_ios in hive :)
The last bit of extraction is currently happening, I'll ping you on IRC when it's ready to be used.

JAllemandou set the point value for this task to 8. (Apr 11 2016, 4:20 PM)

This can be verified by checking the headers using Charles proxy

@JMinor do we want TSG to verify this?

This task is shared across multiple teams (iOS, Analytics), but has a point value. It should probably be split into two tasks if it has a story points estimate.

@Fjalapeno I'm going to work with TSG on accessibility testing. Adding Charles testing on top of that might be a bit much for now.

I'm going to put this into PM sign-off and will test the app side ping myself, and will work with Tbayer to verify this is resolved on the db side.

To keep things together and spell out more clearly some of the progress on this task this week:

  • Many thanks again to @JAllemandou for making the missing iOS pageviews available in a separate table on Monday! I was able to splice that together with the main data just in time for the Reading team's quarterly review the following day (you will find several slides in the appendix that relied on this correction).
  • There is now a separate task for backfilling the data to correct the main pageviews tables directly: T132589 . It is still marked as open, but I noticed while querying projectview_hourly today that the data until April 11 has been changed retroactively since Monday, now matching the manually corrected numbers from earlier. Awesome! Please confirm when the backfilling is complete, so that one knows which parts of the data need to be corrected and which can be used directly as usual.

@Nuria, you marked this task as resolved. Does this apply to the iOS team's part of the work too?

> you marked this task as resolved. Does this apply to the iOS team's part of the work too?

It doesn't seem that there are any items pending. If there are, please re-open.

Just want to reiterate that since this task is pointed, it should have a separate task for the iOS board, or else it counts as 8 points against the iOS team's point budget.