
Re-process webrequests from 2020-05-18 so that page views from latest Wikipedia app releases are counted
Closed, Declined · Public


Once the pageview definition is updated in T256514 and deployed in T256515, we will need to re-count pageviews since any pages viewed by users on latest releases of Android & iOS apps (which use the new /page/mobile-html endpoint) were not counted.

Event Timeline

In our sync with AE we did talk about the volume of data and computational + labor/time costs of such an undertaking. First and foremost our teams' priority is to get the updated definition deployed so those requests can start getting counted correctly.

We are potentially looking at not re-counting and taking the L there, but figuring out those pageview counts separately and annotating where needed.

mpopov moved this task from Tracking to Needs Investigation on the Product-Analytics board.
mpopov added a subscriber: SNowick_WMF.

need feedback from @SNowick_WMF's conversations with apps PMs

Given that after the changes done to the pageview definition for the apps the makeup of mobile pageviews has changed significantly (mobile app pageviews are lower in total, automated traffic is higher), I think we should not backfill. The definition we have now is quite imperfect and, I suspect (need to verify), buggy.

There's at least one bug with the updated definition and it's my fault. The iOS app is sending the wrong User-Agent on pageview requests. When reviewing the patch that updated the server-side definition, I verified the User-Agent matched, but only before the request was handed off to the system web view. The web view changes the User-Agent to its internal one, which is something along the lines of:
Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148
or on the iPad:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko)
And not the app default of:
WikipediaApp/ (iOS 12.4; Phone)
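
To illustrate why the web view requests would fall outside the definition, here is a minimal Python sketch. It assumes the server-side check amounts to a prefix match on `WikipediaApp/`; the real pageview definition may use a different pattern, and `is_app_user_agent` is a hypothetical helper name used only for this example.

```python
import re

# Assumption: the app is recognized by a User-Agent that starts with
# "WikipediaApp/". The real server-side definition may differ.
APP_UA_PATTERN = re.compile(r"^WikipediaApp/")


def is_app_user_agent(user_agent: str) -> bool:
    """Return True if the User-Agent looks like the Wikipedia app default."""
    return bool(APP_UA_PATTERN.match(user_agent))


# User-Agent strings quoted in this comment.
app_default = "WikipediaApp/ (iOS 12.4; Phone)"
web_view_iphone = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
)
web_view_ipad = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko)"
)

for ua in (app_default, web_view_iphone, web_view_ipad):
    print(is_app_user_agent(ua), ua)
```

Under that assumption, only the app default matches, so requests issued by the web view would not be attributed to the app.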

The other issue with both apps (and this was an issue with the prior definition as well) is that the same request is made when saving an article for viewing offline. So if a user logs into a new device and syncs their saved articles, every download of one of those articles is a "pageview". This would be fixed by adding the X-Analytics: pageview=1 header to only the requests where the user is actually viewing a page.
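
As a rough sketch of how the header could separate real views from offline saves, the Python below checks for `pageview=1` in `X-Analytics`. It assumes the header value is a semicolon-separated list of `key=value` pairs; `parse_x_analytics` and `is_explicit_pageview` are hypothetical names for illustration, not the actual server-side code.

```python
def parse_x_analytics(header: str) -> dict:
    """Parse 'k1=v1;k2=v2' into a dict.

    Assumption: pairs are separated by ';' and items without '=' are ignored.
    """
    pairs = {}
    for item in header.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep and key:
            pairs[key] = value
    return pairs


def is_explicit_pageview(headers: dict) -> bool:
    """True only when the client tagged the request with pageview=1."""
    x_analytics = headers.get("X-Analytics", "")
    return parse_x_analytics(x_analytics).get("pageview") == "1"


# A request made while the user is reading would carry the header;
# a request made while saving for offline viewing would not.
viewing = {"X-Analytics": "pageview=1"}
saving_offline = {}

print(is_explicit_pageview(viewing))
print(is_explicit_pageview(saving_offline))
```

With this approach both kinds of request can keep hitting the same endpoint, and only the tagged ones are counted.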

I agree with @Nuria, as stated on other tickets, that it'd be preferable to send an event for pageviews, so hopefully we can institute that as the long-term fix. Short term, we can push an update to the iOS app that fixes the User-Agent issue (T257389), and an update to both apps that adds the X-Analytics header (T256507).

This comment was removed by Nuria.

@JoeWalsh I think we might have a few more problems:

The number of agents doing > 2000 pageviews a day (which was about the maximum number of pageviews per day we had before) is about 200. All of this data gets marked as 'automated', so it is not counted as user pageviews, but before it simply did not exist. I think this indicates a bug in the app that did not exist before.

Traffic looks like the following for about 20,000 secs for 1 actor/device:

|2020-07-05 13:25:48|null     ||/api/rest_v1/page/mobile-html/Wilh._Wilhelmsen|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 13:25:49|null     ||/api/rest_v1/page/mobile-html/Wilh._Wilhelmsen|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 13:25:50|null     ||/api/rest_v1/page/mobile-html/Wilh._Wilhelmsen|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 13:25:51|null     ||/api/rest_v1/page/mobile-html/Wilh._Wilhelmsen|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 13:25:52|null     ||/api/rest_v1/page/mobile-html/Wilh._Wilhelmsen|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 13:25:53|null     ||/api/rest_v1/page/mobile-html/Wilh._Wilhelmsen|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 13:25:54|null     ||/api/rest_v1/page/mobile-html/Wilh._Wilhelmsen|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 13:25:55|null     ||/api/rest_v1/page/mobile-html/Wilh._Wilhelmsen|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 13:25:56|null     ||/api/rest_v1/page/mobile-html/Wilh._Wilhelmsen|WikipediaApp/ (iOS 13.5.1; Phone)|

There are about 200 other devices for which traffic looks about the same (thousands of pageviews fired rapidly, one per sec). A couple of other examples:

|2020-07-05 18:40:16|null     ||/api/rest_v1/page/mobile-html/Harry_R._Truman|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 18:40:17|null     ||/api/rest_v1/page/mobile-html/Harry_R._Truman|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 18:40:19|null     ||/api/rest_v1/page/mobile-html/Harry_R._Truman|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 18:40:19|null     ||/api/rest_v1/page/mobile-html/Harry_R._Truman|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 18:40:21|null     ||/api/rest_v1/page/mobile-html/Harry_R._Truman|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 18:40:22|null     ||/api/rest_v1/page/mobile-html/Harry_R._Truman|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 18:40:23|null     ||/api/rest_v1/page/mobile-html/Harry_R._Truman|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 18:40:24|null     ||/api/rest_v1/page/mobile-html/Harry_R._Truman|WikipediaApp/ (iOS 13.5.1; Phone)|

|2020-07-05 00:16:14|null     ||/api/rest_v1/page/mobile-html/Sarah_Dunsworth-Nickerson|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 01:41:20|null     ||/api/rest_v1/page/mobile-html/Sarah_Dunsworth-Nickerson|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 01:41:21|null     ||/api/rest_v1/page/mobile-html/Sarah_Dunsworth-Nickerson|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 01:41:21|null     ||/api/rest_v1/page/mobile-html/Sarah_Dunsworth-Nickerson|WikipediaApp/ (iOS 13.5.1; Phone)|
|2020-07-05 01:41:22|null     ||/api/rest_v1/page/mobile-html/Sarah_Dunsworth-Nickerson|WikipediaApp/ (iOS 13.5.1; Phone)|
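
A minimal Python sketch of how such runs could be flagged follows. It assumes the rows for a single device are already sorted by time and reduced to `(timestamp, uri)` pairs; `longest_rapid_run` and the 2-second gap threshold are illustrative choices, not part of any existing pipeline.

```python
from datetime import datetime


def longest_rapid_run(rows, max_gap_seconds=2):
    """Find the longest run of back-to-back requests for the same URI.

    rows: (timestamp_str, uri) pairs for ONE device, sorted by time.
    A run continues while the URI is unchanged and each gap between
    consecutive requests is at most max_gap_seconds (an assumed threshold).
    Returns (uri, run_length); (None, 0) for empty input.
    """
    best = (None, 0)
    prev_ts, prev_uri, run = None, None, 0
    for ts_str, uri in rows:
        ts = datetime.strptime(ts_str, "%Y-%m-%d %H:%M:%S")
        if uri == prev_uri and (ts - prev_ts).total_seconds() <= max_gap_seconds:
            run += 1
        else:
            run = 1
        if run > best[1]:
            best = (uri, run)
        prev_ts, prev_uri = ts, uri
    return best


# Timestamps taken from the Harry_R._Truman example above.
uri = "/api/rest_v1/page/mobile-html/Harry_R._Truman"
rows = [
    ("2020-07-05 18:40:16", uri),
    ("2020-07-05 18:40:17", uri),
    ("2020-07-05 18:40:19", uri),
    ("2020-07-05 18:40:19", uri),
    ("2020-07-05 18:40:21", uri),
    ("2020-07-05 18:40:22", uri),
    ("2020-07-05 18:40:23", uri),
    ("2020-07-05 18:40:24", uri),
]
print(longest_rapid_run(rows))
```

On the eight rows shown, every gap is two seconds or less, so the whole excerpt forms a single run of length 8. A device whose longest run is in the thousands would match the pattern described here.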

These are the user agents that do more than 3000 pageviews per day:

|WikipediaApp/ (iOS 12.4; Phone)                                                         |
|WikipediaApp/ (iOS 12.3.1; Phone)                                                       |
|WikipediaApp/ (iOS 13.4.1; Phone)                                                       |
|WikipediaApp/ (iOS 13.5.1; Phone)                                                       |
|WikipediaApp/ (iOS 13.5.1; Tablet)                                                      |
|WikipediaApp/ (iOS 13.3; Phone)                                                         |
|WikipediaApp/ (iOS 13.5.1; Phone)                                                       |
|WikipediaApp/ (iOS 13.5.1; Tablet)                                                      |

For reference, the percentiles for pageviews before were p50: 2, p90: 9 and p99: 37. Now the p50 and p90 have not changed, while the p99 is about 47 views. This actually seems like a plausible change. The long tail, however (p99.9 and above), is completely skewed by the requests I am describing.
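
To show why a handful of runaway devices moves only the top of the distribution, here is a small Python sketch using the nearest-rank percentile method. The method choice is an assumption (the query that produced the numbers above is not shown), and `daily_counts` is made-up toy data, not real traffic.

```python
import math


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is less than or equal to it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up pageview counts for ten devices, for illustration only.
# One device is stuck in a request loop.
daily_counts = [1, 1, 2, 2, 2, 3, 5, 9, 40, 20000]

for p in (50, 90, 99):
    print(f"p{p}:", percentile_nearest_rank(daily_counts, p))
```

In this toy sample the looping device leaves p50 and p90 untouched and shows up only at p99, which mirrors the shape described above: stable lower percentiles with a badly skewed tail.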

Also (now a different problem): see the session below of someone with a large number of pageviews. Looking at how close the request times are to each other in some instances, I am guessing many of these might be preloads?

|2020-07-05 06:06:07|null     ||/api/rest_v1/page/mobile-html/V%C3%A9hicule_de_l%E2%80%99avant_blind%C3%A9                                                       |
|2020-07-05 06:06:10|null     ||/api/rest_v1/page/mobile-html/Pugwash_Conferences_on_Science_and_World_Affairs                                                   |
|2020-07-05 06:06:11|null     ||/api/rest_v1/page/mobile-html/Havariekommando                                                                                    |
|2020-07-05 06:06:13|null     ||/api/rest_v1/page/mobile-html/B%C3%B6lkow_Bo_105                                                                                 |
|2020-07-05 06:06:31|null     ||/api/rest_v1/page/mobile-html/Tetrahydrocannabinol                                                                               |
|2020-07-05 06:06:33|null     ||/api/rest_v1/page/mobile-html/Hydrophon                                                                                          |
|2020-07-05 06:06:35|null     ||/api/rest_v1/page/mobile-html/Torpedoversuchsanstalt_Surendorf                                                                   |
|2020-07-05 06:07:55|null     ||/api/rest_v1/page/mobile-html/Shenyang_J-8                                                                                       |
|2020-07-05 06:16:08|null     ||/api/rest_v1/page/mobile-html/James_Bond                                                                                         |
|2020-07-05 06:16:17|null     ||/api/rest_v1/page/mobile-html/James_Bond_%E2%80%93_007_jagt_Dr._No                                                               |
|2020-07-05 06:20:09|null     ||/api/rest_v1/page/mobile-html/Bikini                                                                                             |
|2020-07-05 06:21:30|null     ||/api/rest_v1/page/mobile-html/Nansen-Pass                                                                                        |
|2020-07-05 06:22:39|null     ||/api/rest_v1/page/mobile-html/Schacht%C3%BCrke                                                                                   |
|2020-07-05 06:25:45|null     ||/api/rest_v1/page/mobile-html/Richard_Meinertzhagen                                                                              |
|2020-07-05 12:08:04|null     ||/api/rest_v1/page/mobile-html/Jean-Paul_Belmondo                                                                                 |
|2020-07-05 12:08:06|null     ||/api/rest_v1/page/mobile-html/Eurocopter_Tiger                                                                                   |
|2020-07-05 12:08:09|null     ||/api/rest_v1/page/mobile-html/Wehrtechnische_Dienststelle_f%C3%BCr_Schiffe_und_Marinewaffen%2C_Maritime_Technologie_und_Forschung|
|2020-07-05 12:08:11|null     ||/api/rest_v1/page/mobile-html/FN_FAL                                                                                             |
|2020-07-05 12:08:14|null     ||/api/rest_v1/page/mobile-html/Peter_Brotherhood                                                                                  |
|2020-07-05 12:08:16|null     ||/api/rest_v1/page/mobile-html/Hexengrund_(Torpedowaffenplatz)                                                                    |
|2020-07-05 12:08:18|null     ||/api/rest_v1/page/mobile-html/Schwertwal_(U-Boot)                                                                                |
|2020-07-05 12:08:19|null     ||/api/rest_v1/page/mobile-html/AEC_(Panzersp%C3%A4hwagen)                                                                         |
|2020-07-05 12:08:21|null     ||/api/rest_v1/page/mobile-html/M132_Armored_Flamethrower                                                                          |
|2020-07-05 12:08:22|null     ||/api/rest_v1/page/mobile-html/Warrior_(Panzer)                                                                                   |
|2020-07-05 12:08:24|null     ||/api/rest_v1/page/mobile-html/Numismatik                                                                                         |
|2020-07-05 12:08:25|null     ||/api/rest_v1/page/mobile-html/Friedrich_Kre%C3%9F_von_Kressenstein_(General_der_Artillerie)                                      |
|2020-07-05 12:08:26|null     ||/api/rest_v1/page/mobile-html/Hagana                                                                                             |
|2020-07-05 12:28:33|null     ||/api/rest_v1/page/mobile-html/Daimler_Dingo                                                                                      |
|2020-07-05 12:28:43|null     ||/api/rest_v1/page/mobile-html/Humber_(Panzersp%C3%A4hwagen)                                                                      |
|2020-07-05 12:30:05|null     ||/api/rest_v1/page/mobile-html/FV_432                                                                                             |
|2020-07-05 12:30:26|null     ||/api/rest_v1/page/mobile-html/FV603_Saracen                                                                                      |
|2020-07-05 15:27:35|null     ||/api/rest_v1/page/mobile-html/Humber_Pig                                                                                         |
|2020-07-05 15:27:38|null     ||/api/rest_v1/page/mobile-html/Magach                                                                                             |
|2020-07-05 15:27:40|null     ||/api/rest_v1/page/mobile-html/Saliromanie                                                                                        |
|2020-07-05 15:27:41|null     ||/api/rest_v1/page/mobile-html/Merkava                                                                                            |
|2020-07-05 15:27:45|null     ||/api/rest_v1/page/mobile-html/Boeing-Vertol_CH-46                                                                                |
|2020-07-05 15:27:47|null     ||/api/rest_v1/page/mobile-html/USS_Kearsarge_(LHD-3)                                                                              |
|2020-07-05 15:27:49|null     ||/api/rest_v1/page/mobile-html/Grumman_C-2                                                                                        |
|2020-07-05 15:34:30|null     ||/api/rest_v1/page/mobile-html/FV603_Saracen                                                                                      |
|2020-07-05 15:42:28|null     ||/api/rest_v1/page/mobile-html/Westland_Lynx                                                                                      |
|2020-07-05 15:46:12|null     ||/api/rest_v1/page/mobile-html/America-Klasse_(2012)                                                                              |
|2020-07-05 15:49:32|null     ||/api/rest_v1/page/mobile-html/Bell_AH-1                                                                                          |
|2020-07-05 15:50:55|null     ||/api/rest_v1/page/mobile-html/Marine_Expeditionary_Unit                                                                          |
|2020-07-05 17:06:58|null     ||/api/rest_v1/page/mobile-html/Lockheed_AH-56                                                                                     |
|2020-07-05 17:07:01|null     ||/api/rest_v1/page/mobile-html/US-Invasion_in_Grenada                                                                             |
|2020-07-05 17:07:05|null     ||/api/rest_v1/page/mobile-html/Eurocopter_Dauphin                                                                                 |
|2020-07-05 17:08:29|null     ||/api/rest_v1/page/mobile-html/Airbus_Helicopters_H155                                                                            |
|2020-07-05 17:15:11|null     ||/api/rest_v1/page/mobile-html/GTK_Boxer                                                                                          |
|2020-07-05 17:20:20|null     ||/api/rest_v1/page/mobile-html/High_Mobility_Multipurpose_Wheeled_Vehicle                                                         |
|2020-07-05 17:24:02|null     ||/api/rest_v1/page/mobile-html/EMT_Luna                                                                                           |
|2020-07-05 17:25:13|null     ||/api/rest_v1/page/mobile-html/Operation_Oqab                                                                                     |
|2020-07-05 17:27:05|null     ||/api/rest_v1/page/mobile-html/AMR_35                                                                                             |
|2020-07-05 17:55:18|null     ||/api/rest_v1/page/mobile-html/James_Bond_007_%E2%80%93_Feuerball                                                                 |
|2020-07-05 17:56:21|null     ||/api/rest_v1/page/mobile-html/James_Bond_007_%E2%80%93_Goldfinger                                                                |
|2020-07-05 18:26:16|null     ||/api/rest_v1/page/mobile-html/Ford_M151_MUTT                                                                                     |
|2020-07-05 18:26:49|null     ||/api/rest_v1/page/mobile-html/Lamborghini_LM002                                                                                  |
|2020-07-05 18:28:08|null     ||/api/rest_v1/page/mobile-html/Gama_Goat                                                                                          |
|2020-07-05 18:28:18|null     ||/api/rest_v1/page/mobile-html/M274_Mechanical_Mule                                                                               |
|2020-07-05 18:29:06|null     ||/api/rest_v1/page/mobile-html/Volvo_C303                                                                                         |
|2020-07-05 18:38:46|null     ||/api/rest_v1/page/mobile-html/Mowag_Eagle                                                                                        |
|2020-07-05 18:38:48|null     ||/api/rest_v1/page/mobile-html/Oshkosh_JLTV
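
One way to quantify the "close together" observation is to group a session's requests into bursts by inter-request gap. The Python sketch below does that; `burst_sizes` and the 5-second threshold are assumptions chosen for illustration, and a large burst only hints at preloading or bulk download rather than proving it.

```python
from datetime import datetime


def burst_sizes(timestamps, max_gap_seconds=5):
    """Group time-sorted request timestamps into bursts.

    A request joins the current burst when it arrives within
    max_gap_seconds (an assumed threshold) of the previous request;
    otherwise it starts a new burst. Returns the size of each burst.
    """
    sizes = []
    prev = None
    for ts_str in timestamps:
        ts = datetime.strptime(ts_str, "%Y-%m-%d %H:%M:%S")
        if prev is not None and (ts - prev).total_seconds() <= max_gap_seconds:
            sizes[-1] += 1
        else:
            sizes.append(1)
        prev = ts
    return sizes


# First eight timestamps of the session listed above.
session = [
    "2020-07-05 06:06:07",
    "2020-07-05 06:06:10",
    "2020-07-05 06:06:11",
    "2020-07-05 06:06:13",
    "2020-07-05 06:06:31",
    "2020-07-05 06:06:33",
    "2020-07-05 06:06:35",
    "2020-07-05 06:07:55",
]
print(burst_sizes(session))
```

For these eight requests the grouping is a burst of 4, a burst of 3, and a single request. Several different articles fetched within a few seconds of each other are unlikely to be read one by one.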

Will post some more updates tomorrow.

Hi @Nuria

Thank you for flagging these issues before re-processing the data and for providing the additional context and evidence.

> There are about 200 other devices for which traffic looks about the same (thousands of pageviews fired rapidly, one per sec)

This definitely is a bug if a single device is requesting the same page in rapid succession. The titles all having a . or a - seemed like a clue, but I wasn’t able to reproduce the issue by saving the same articles for offline viewing. There could be some other confounding factor. Is there any other information about the requests, like the response code?

> Also (now a different problem): see the session below of someone with a large number of pageviews. Looking at how close the request times are to each other in some instances, I am guessing many of these might be preloads?

With the articles being different, this looks like the expected behavior of the offline reading feature. If a user logs in on a new device and has a synced reading list, all of the articles on that list will be saved for offline viewing. If this is happening more often than before, it’s possible there’s a bug causing more users to have to re-login and re-download their articles than usual.

The root cause in both cases is the issue I mentioned before - the same request counted by the pageview definition is used for saving articles for offline viewing. This was an issue with the previous pageview definition and previous versions of the app as well. If we do need to re-process this data, there would need to be a determination of whether we keep the same approach as before that mistakenly counts saving for offline or try to truly only count pageviews.

I believe I have reproduction steps:

  1. It's via the Explore Feed: on a "random article of the day" which has interesting characters in the title (not parentheses, but so far most other special characters have the problem).
  2. Do NOT tap the “save” button in the explore feed. That works fine.
  3. Instead, tap into the actual article. Then tap “Save” in the article's toolbar. Commas and slashes are now escaped, and presumably so are periods. (But there weren’t any articles in En or Test in the last month - as far as Explore's random went back - with periods/dashes to confirm this. Parentheses do not get escaped, and work as expected.)

To show this, put breakpoints in MWKSavedPageList.m - in both toggleSavedPageForURL and toggleSavedPageForKey. You're looking for a URL or Key (respectively) variable with escaped characters. Notably, tapping “save” in the feed for a random article uses toggleSavedPageForKey, but saving in the article uses toggleSavedPageForURL - which might be why we get different behavior based on the route.

My process, in case anyone wants to see me show my work:
@JoeWalsh mentioned that if you escape periods in fetchOrCreateArticleWithURL, you get the bad behavior. So I worked backwards from there. fetchOrCreateArticleWithURL is called from a few places: 

  • MWKSavedPageList, within toggleSavedPage (in objc, toggleSavedPageWithURL). In turn, that's called within -
    • ArticleVC’s toolbar - this works as expected on an article w/ a period that I tested
    • ArticleViewController+LinkPreviewing (peek and pop) - this works as expected on an article w/ a period that I tested
    • PlacesViewController - this works as expected on an article w/ a period that I tested
    • FeaturedArticleWidget - this works as expected on an article w/ a dash that I tested
  • WMFRandomContentSource
    • Digging into this one... and that's when we discovered the repro steps.
  • WMFFeedContentSource
  • WMFOnThisDayContentSource


I looked at the requests in more detail and the only thing that looked a bit strange was the content types. This is the distribution of content types and request counts for one hour:

|text/html; charset=utf-8; profile=""       |261337  |
|application/json; charset=utf-8            |15289   |
|-                                          |6176    |
|application/json; charset=utf-8; profile=""|17506   |
|text/html; charset=UTF-8                   |3104    |

> the same request counted by the pageview definition is used for saving articles for offline viewing

I see. As we discussed, this issue would be solved once we use events for pageviews.

Ignore my repro steps, they do not actually work.

Let me dig out which URLs are requested most frequently (in the thousands) to see if we can find a pattern.

Looking at the list below, it seems to me we have a bug related to URLs with punctuation marks. Here is the list of URLs with the number of pageviews across all users in 1 day:

|/api/rest_v1/page/mobile-html/Wilh._Wilhelmsen                                                 |20616   |
|/wiki/Wikipedia%3AHauptseite                                                                   |19410   |
|/api/rest_v1/page/mobile-html/Milano%2C_Texas                                                  |19024   |
|/api/rest_v1/page/mobile-html/Harry_R._Truman                                                  |16064   |
|/api/rest_v1/page/mobile-html/Alexander_Hamilton                                               |15771   |
|/api/rest_v1/page/mobile-html/David_B._Goldstein_(geneticist)                                  |15483   |
|/api/rest_v1/page/mobile-html/Seung-Hui_Cho                                                    |15106   |
|/api/rest_v1/page/mobile-html/Hamilton_%28musical%29                                           |14136   |
|/api/rest_v1/page/mobile-html/The_Portage_to_San_Cristobal_of_A.H.                             |14136   |
|/api/rest_v1/page/mobile-html/P._T._Barnum                                                     |13843   |
|/api/rest_v1/page/mobile-html/Robert_Downey_Jr.                                                |12300   |
|/api/rest_v1/page/mobile-html/Gerald_R._Ford-class_aircraft_carrier                            |10941   |
|/api/rest_v1/page/mobile-html/History_of_Norwich_City_F.C.                                     |9852    |
|/api/rest_v1/page/mobile-html/Ryan_W._Ferguson                                                 |9406    |
|/api/rest_v1/page/mobile-html/J._Paul_Getty                                                    |8966    |
|/api/rest_v1/page/mobile-html/James_F._D._Lanier_Residence                                     |8555    |
|/api/rest_v1/page/mobile-html/Porsche_Boxster%2FCayman                                         |8384    |
|/api/rest_v1/page/mobile-html/Willy_Wonka_%26_the_Chocolate_Factory                            |8266    |
|/api/rest_v1/page/mobile-html/Emtricitabine%2Ftenofovir                                        |8121    |
|/api/rest_v1/page/mobile-html/Barclay_James_Harvest%2FDiskografie                              |8104    |
|/api/rest_v1/page/mobile-html/O._J._Simpson                                                    |7834    |
|/api/rest_v1/page/mobile-html/Antiochos_III._Meg%C3%A1s                                        |7534    |
|/api/rest_v1/page/mobile-html/J._Jonah_Jameson                                                 |7513    |
|/api/rest_v1/page/mobile-html/Fiat_S.p.A.                                                      |7271    |
|/api/rest_v1/page/mobile-html/Thomas_R._Marshall                                               |7025    |
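
To check the punctuation hypothesis against a list like this, the titles can be percent-decoded and scanned for non-alphanumeric characters. The Python sketch below is one way to do that; `title_punctuation` is a hypothetical helper, and treating everything other than letters, digits, and underscores as "punctuation" is an assumption.

```python
from urllib.parse import unquote

PREFIX = "/api/rest_v1/page/mobile-html/"


def title_punctuation(uri: str) -> set:
    """Return the set of punctuation characters in a mobile-html title.

    The title is percent-decoded first, so "%2C" is reported as ",".
    Assumption: letters, digits, and "_" are plain; anything else counts.
    URIs outside the mobile-html endpoint yield an empty set.
    """
    if not uri.startswith(PREFIX):
        return set()
    title = unquote(uri[len(PREFIX):])
    return {ch for ch in title if not (ch.isalnum() or ch == "_")}


# A few URIs from the list above.
for uri in (
    "/api/rest_v1/page/mobile-html/Wilh._Wilhelmsen",
    "/api/rest_v1/page/mobile-html/Milano%2C_Texas",
    "/api/rest_v1/page/mobile-html/Porsche_Boxster%2FCayman",
    "/api/rest_v1/page/mobile-html/Alexander_Hamilton",
):
    print(sorted(title_punctuation(uri)), uri)
```

Running this over the full list would show which titles contain periods, commas, slashes, or dashes and which, like `Alexander_Hamilton`, contain none and would need a different explanation.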

That makes sense, thanks so much for pulling the data! We'll be able to use this to ensure we have a complete fix for the problem.

fdans moved this task from Incoming to Data Quality on the Analytics board.

Declining this task:

  • Too many data issues arose when trying to track/understand pageviews in the app (during the period when they were no longer flagged appropriately to be tracked as part of our overall pageviews counts)
  • Extensive work by multiple engineers & data scientists would be required to backfill

Instead, we have decided to make clear annotations in pageviews_hourly about apps interactions not being counted. More information should be available in T256804.