
Investigate surprising rise in mobile page views for wikidata
Closed, Resolved · Public

Description

Motivation
Since December 2018, Wikidata has been seeing an unprecedented number of mobile page visits (see this link: https://stats.wikimedia.org/v2/#/wikidata.org/reading/total-page-views/normal|bar|2-Year|access~mobile-web), and it is not quite clear why.

The data claims to exclude bots and to show only human page views, but so far the only possible explanation we could think of is that this may have to do with Google's switch to indexing the mobile rather than the desktop version of a website.

Acceptance Criteria

  • Try to find out whether the web requests table can give us further hints or explanations as to why the mobile page view data has doubled in the past 4 months.
  • Maybe look into a sample of data as well

Event Timeline

Lea_WMDE updated the task description. Apr 16 2019, 1:04 PM

@Lea_WMDE @RazShuty

0. First of all, I started by replicating the Wikistats data (https://stats.wikimedia.org/v2/#/wikidata.org/reading/total-page-views/normal|bar|2-Year|access~mobile-web) that you have shared in the ticket description, to ensure that we are looking at the same data. I have used the Projectview hourly Hive table and the replication is perfect.

1. My first approach was to look into the referer_class field; however, that did not prove to be a fortunate choice, because (as you might see from the following chart) the share of pageviews with referers marked as none (a value described as "null, empty or '-'" in the Pageview hourly docs, but not mentioned in the Projectview hourly documentation) is huge:

2. My next take was the country field of the Pageview hourly table. The hypothesis was that even if the filters applied in the processing of the WMF web request datasets somehow fail to recognize the Googlebot (Smartphone) user agent, then, provided there were many requests on its behalf and the requests could be geolocated via their IPs, they should originate in the USA (honestly, I have no idea whether this hypothesis makes sense, because I do not know how Google distributes the servers that run its bots). I focused on the three months in which we actually see the beginning of the increase in mobile pageviews for Wikidata: November 2018, December 2018, and January 2019 (cf. Wikistats, https://stats.wikimedia.org/v2/#/wikidata.org/reading/total-page-views/normal|bar|2-Year|access~mobile-web). For each month, I calculated the day-over-day difference in pageviews from each country, and then selected the twenty largest such increases in pageviews on Wikidata. They are visualized in the following three charts, and you can see that excessive increases in pageviews from the USA happen only in January 2019, while the increase in Wikidata pageviews in general begins in November 2018. So, if my hypothesis is correct, then it is probably not Googlebot (Smartphone) that is responsible for the increase:
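The day-over-day comparison described in step 2 can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used for the analysis; the function name largest_daily_increases and the sample counts are hypothetical, and the sketch assumes every country has a row for every day (gaps are not handled).

```python
from collections import defaultdict

def largest_daily_increases(rows, top_n=20):
    """rows: iterable of (country, day, pageviews), day as an ISO date string.
    Returns the top_n (country, day, increase) tuples, largest increase first."""
    by_country = defaultdict(dict)
    for country, day, views in rows:
        by_country[country][day] = by_country[country].get(day, 0) + views

    increases = []
    for country, daily in by_country.items():
        days = sorted(daily)  # ISO date strings sort chronologically
        for prev, cur in zip(days, days[1:]):
            increases.append((country, cur, daily[cur] - daily[prev]))

    increases.sort(key=lambda t: t[2], reverse=True)
    return increases[:top_n]

# Hypothetical sample counts, for illustration only
sample = [
    ("US", "2019-01-01", 1000), ("US", "2019-01-02", 5000), ("US", "2019-01-03", 5200),
    ("DE", "2019-01-01", 800), ("DE", "2019-01-02", 900), ("DE", "2019-01-03", 700),
]
top = largest_daily_increases(sample, top_n=2)
print(top)
```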

3. Finally, our bot filtering procedures (take a look at the regex pattern that we use) should work for Googlebot (Smartphone), whose full user-agent string contains "bot" (in line with the WMF User-Agent policy) and can be found on this page.
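To illustrate why a pattern-based filter would catch this user agent, here is a small sketch. The pattern SIMPLE_BOT_PATTERN is a deliberately simplified, hypothetical stand-in and not the actual regex used in the WMF pipeline; the Googlebot string is the one that appears in the wmf.webrequest sample further down in this task.

```python
import re

# Hypothetical, simplified pattern (NOT the WMF regex): flags any user agent
# that contains "bot", "spider" or "crawler", case-insensitively.
SIMPLE_BOT_PATTERN = re.compile(r"bot|spider|crawler", re.IGNORECASE)

def looks_like_spider(user_agent):
    """Return True if the user agent matches the simplified bot pattern."""
    return SIMPLE_BOT_PATTERN.search(user_agent) is not None

googlebot_smartphone = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile "
    "Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
firefox_desktop = "Mozilla/5.0 (Windows NT 10; WOW64; rv:61.0) Gecko/20100101 Firefox/61.0"

print(looks_like_spider(googlebot_smartphone), looks_like_spider(firefox_desktop))
```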

4. I do not think that we could discover more from the Webrequest table than what we see from the refined dataset that I was using here: I could only dive into its user_agent field to discover the same things that Analytics-EventLogging already knows.

5. At this point, and assuming that our recognition of spiders does the job, I don't see how we could explain the increase in the mobile pageviews for Wikidata from the datasets that we have at our disposal. The hypothesis that it's Googlebot (Smartphone) does not seem likely.

@Lea_WMDE @RazShuty

It's definitely not Googlebot (Smartphone), I've checked the wmf.webrequest for a sample:

# - wmf.webrequest dataset: parse user_agent
df = sqlContext.sql('SELECT year, month, day, hour, user_agent, agent_type, is_pageview FROM wmf.webrequest \
                        WHERE (year = 2019 AND month = 3 AND day = 10 AND hour = 1 AND \
                        normalized_host.project_family = "wikidata" AND is_pageview = True)')

df.cache()
df.head(10)

results in:

[Row(year=2019, month=3, day=10, hour=1, user_agent='Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', agent_type='spider', is_pageview=True),
 Row(year=2019, month=3, day=10, hour=1, user_agent='Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', agent_type='spider', is_pageview=True),
 Row(year=2019, month=3, day=10, hour=1, user_agent='Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', agent_type='spider', is_pageview=True),
 Row(year=2019, month=3, day=10, hour=1, user_agent='Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', agent_type='spider', is_pageview=True),
 Row(year=2019, month=3, day=10, hour=1, user_agent='Mozilla/5.0 (Windows NT 10; WOW64; rv:61.0) Gecko/20100101 Firefox/61.0', agent_type='user', is_pageview=True),
 Row(year=2019, month=3, day=10, hour=1, user_agent='Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', agent_type='spider', is_pageview=True),
 Row(year=2019, month=3, day=10, hour=1, user_agent='Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', agent_type='spider', is_pageview=True),
 Row(year=2019, month=3, day=10, hour=1, user_agent='Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', agent_type='spider', is_pageview=True),
 Row(year=2019, month=3, day=10, hour=1, user_agent='Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', agent_type='spider', is_pageview=True),
 Row(year=2019, month=3, day=10, hour=1, user_agent='Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', agent_type='spider', is_pageview=True)]

so Googlebot (Smartphone) is recognized as a spider indeed.

GoranSMilovanovic added a comment. Edited May 2 2019, 1:14 AM

@Lea_WMDE @RazShuty

I have inspected all mobile pageviews of Wikidata for April 2019.

April 2019 is quite representative of the phenomenon that we are investigating: since the sudden increase in Wikidata mobile pageviews, only January 2019 has seen more of them than April 2019. And since the data in wmf.webrequest are purged after 90 days, April 2019 seems to be the best sample we have.

By looking at the distribution of pageviews per user_agent field, I cannot see anything strange:

The chart shows the top 40 user_agent values ranked by how many pageviews (the count variable, on the x-axis) were generated by the respective requests. These top 40 account for 1,816,990 out of 14,703,421 (approx. 12.36%) of the mobile pageviews of Wikidata in April 2019.
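For reference, the share of the top-N user agents can be computed along these lines. This is only a sketch: the helper top_n_share and the toy counts are hypothetical, while the last line re-derives the percentage from the two totals quoted above.

```python
from collections import Counter

def top_n_share(counts, n):
    """counts: mapping user_agent -> pageviews.
    Returns (pageviews of the n largest agents, all pageviews, share in percent)."""
    top_total = sum(c for _, c in Counter(counts).most_common(n))
    grand_total = sum(counts.values())
    return top_total, grand_total, 100.0 * top_total / grand_total

# Hypothetical toy counts, for illustration only
toy = {"ua_a": 50, "ua_b": 30, "ua_c": 15, "ua_d": 5}
toy_top, toy_total, toy_pct = top_n_share(toy, 2)

# Re-deriving the percentage from the totals quoted in the comment
reported_pct = round(100.0 * 1816990 / 14703421, 2)
print(toy_top, toy_total, toy_pct, reported_pct)
```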

Checking this dataset for consistency (again, the origin is the wmf.webrequest table) against Wikistats (https://stats.wikimedia.org/v2/#/wikidata.org/reading/total-page-views/normal|bar|1-Year|access~mobile-web), we find that they match perfectly; so the technical procedures are in place.

Based on what I see in our datasets, I would say this growth is natural. However, let me remind you that my understanding of what can happen in the user_agent field is certainly imperfect. My conclusion is drawn from what I see in the dataset prima facie.

@Lea_WMDE Please advise on the next steps. I can repeat the analysis for March 2019 and February 2019 if deemed necessary.

@GoranSMilovanovic thanks for all the insights! One thing that came to my mind is the following: If the growth is natural, would it be a valid assumption to assume that the editing behavior for mobile also increased? I don't think I can find out through the stats tool whether the number of edits on mobile increased. Would it be possible for you to find out whether the number of edits on mobile increased similarly (percentage-wise) to desktop in the past year?

@Lea_WMDE

If the growth is natural, would it be a valid assumption to assume that the editing behavior for mobile also increased?

Inspecting this now.

GoranSMilovanovic added a comment. Edited May 8 2019, 11:32 AM

@Lea_WMDE @RazShuty

Unfortunately, our edits data currently do not encompass any fields that would allow us to separate edits made from mobile vs. desktop.

The approximate workaround could be the following:

  • access the wmf.webrequest table;
  • collect the following fields:
    • is_pageview (True/False)
    • access_method (mobile app/mobile web/desktop)
  • filter with WHERE normalized_host.project = 'Wikidata' AND access_method = 'mobile web'

and then calculate:

(total requests on Wikidata) - (requests on Wikidata that were recognized as pageviews)

hoping that the difference between the total number of requests and the number of requests that were categorized as pageviews approximates the number of edits.
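As a rough sketch of the arithmetic only (plain Python over already-fetched rows, not a query against wmf.webrequest): the function name non_pageview_requests and the sample rows are hypothetical. Note that the difference counts every request not classified as a pageview, whatever its purpose.

```python
def non_pageview_requests(rows, access_method="mobile web"):
    """rows: iterable of dicts with 'access_method' and 'is_pageview' keys.
    Returns (total requests) - (requests recognized as pageviews)
    for the given access method."""
    total = 0
    pageviews = 0
    for row in rows:
        if row["access_method"] != access_method:
            continue
        total += 1
        if row["is_pageview"]:
            pageviews += 1
    return total - pageviews

# Hypothetical sample rows, for illustration only
sample_rows = [
    {"access_method": "mobile web", "is_pageview": True},
    {"access_method": "mobile web", "is_pageview": False},
    {"access_method": "mobile web", "is_pageview": False},
    {"access_method": "desktop", "is_pageview": False},
]
print(non_pageview_requests(sample_rows))
```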

However, I do not know how accurate this method of estimating the number of edits would be. Please advise.

GoranSMilovanovic added a comment. Edited May 8 2019, 12:01 PM

@Lea_WMDE @RazShuty

Another possibility would be to parse the X-Analytics field of the wmf.webrequest table and look into the values of the mf-m key:

If set, then the value b indicates that the user is opted into the beta mode (of the mobile site) (mf-m=b), the value amc indicates Advanced Mobile Contributions (mf-m=amc), and b,amc indicates both (mf-m=b,amc). See MobileContext.php.
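A minimal sketch of reading the mf-m key could look like the following. It assumes the X-Analytics value is a semicolon-separated list of key=value pairs; the helper names parse_x_analytics and mobile_modes, as well as the sample header strings, are hypothetical.

```python
def parse_x_analytics(header):
    """Split an X-Analytics value into a dict.
    Assumption: pairs are separated by ';' and written as key=value."""
    pairs = {}
    for item in header.split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs

def mobile_modes(header):
    """Return the set of modes found under the mf-m key, e.g. {'b', 'amc'}."""
    value = parse_x_analytics(header).get("mf-m", "")
    return {mode for mode in value.split(",") if mode}

# Hypothetical header values, for illustration only
print(mobile_modes("ns=0;page_id=42;mf-m=b,amc"))
print(mobile_modes("ns=0;page_id=42"))
```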

If someone can help me decide whether Advanced Mobile Contributions (amc), which seem to be tracked in this field of the wmf.webrequest table, indicate mobile user edits, then we have our data.

[EDIT] Upon reading more about the Advanced Mobile Contributions feature, it now does not seem that we could hope to track our data with it.

@Lea_WMDE Currently, I see no way to precisely separate the revisions made from mobile vs. desktop; my best candidate is the approximation mentioned in T220977#5166758.

@Milimetric Would you happen to know if there is a convenient method to differentiate between (a) edits made from mobile vs. (b) edits made from desktop, in a particular project (say, Wikidata)? Thank you.

@GoranSMilovanovic it seems really surprising to me that we cannot distinguish how many mobile edits were made on Wikidata. @JAllemandou, do you maybe have an idea how to do that?

GoranSMilovanovic added a comment. Edited May 14 2019, 11:30 AM

@Lea_WMDE from Analytics/Data Lake/Edits Wikitech documentation page:

When we import, we grab all the data available from all tables except the revision table, for which we filter by where rev_timestamp <= <<snapshot-date>>

In other words, everything we have on user edits in the Data Lake is a copy of what we have in the traditional (SQL) MediaWiki revision tables per project (see: schema), and we have no field that determines whether a revision was made from mobile or desktop there.

So, I would be surprised to learn that we can distinguish mobile from desktop edits in any project, not just Wikidata.

Hi @Lea_WMDE and @GoranSMilovanovic - I think your problem is solved in this month's snapshot with the revision_tags field of mediawiki_history:

spark.sql("""
SELECT
    substr(event_timestamp, 0, 4) as year,
    array_contains(revision_tags, 'mobile edit') as mobile,
    array_contains(revision_tags, 'mobile app edit')  as mobile_app,
    count(1) as c
FROM wmf.mediawiki_history
WHERE snapshot = '2019-04'
    AND wiki_db = 'wikidatawiki'
    AND event_entity = 'revision'
GROUP BY
    substr(event_timestamp, 0, 4),
    array_contains(revision_tags, 'mobile edit'),
    array_contains(revision_tags, 'mobile app edit')
ORDER BY year, mobile, mobile_app desc
""").show(100, false)

+----+------+----------+---------+                                              
|year|mobile|mobile_app|c        |
+----+------+----------+---------+
|2004|null  |null      |146      |
|2005|null  |null      |495      |
|2006|null  |null      |1838     |
|2007|null  |null      |2814     |
|2008|null  |null      |2384     |
|2009|null  |null      |2175     |
|2010|null  |null      |1650     |
|2011|null  |null      |1354     |
|2012|null  |null      |2912961  |
|2012|false |false     |4        |
|2013|null  |null      |94142292 |
|2013|false |false     |181133   |
|2014|null  |null      |69236941 |
|2014|false |true      |2        |
|2014|false |false     |18174243 |
|2014|true  |false     |51       |
|2015|null  |null      |76088107 |
|2015|false |true      |586      |
|2015|false |false     |26269493 |
|2015|true  |false     |4058     |
|2016|null  |null      |82178134 |
|2016|false |false     |53308675 |
|2016|true  |true      |618      |
|2016|true  |false     |24248    |
|2017|null  |null      |109041593|
|2017|false |false     |83147234 |
|2017|true  |true      |114906   |
|2017|true  |false     |49836    |
|2018|null  |null      |141536855|
|2018|false |false     |67149958 |
|2018|true  |true      |186065   |
|2018|true  |false     |71822    |
|2019|null  |null      |55814156 |
|2019|false |false     |49994060 |
|2019|true  |true      |85968    |
|2019|true  |false     |23867    |
+----+------+----------+---------+
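
To read the table at a glance, the tagged rows can be summed per year, for instance as below. The counts are copied from the output above; rows with null tags are left out because they cannot be classified either way, and the 2019 figures cover only part of the year. The function name tagged_mobile_edits_per_year is hypothetical.

```python
from collections import defaultdict

# (year, mobile, mobile_app, count), copied from the query output above
# (rows with null tags omitted)
ROWS = [
    (2014, False, True, 2), (2014, False, False, 18174243), (2014, True, False, 51),
    (2015, False, True, 586), (2015, False, False, 26269493), (2015, True, False, 4058),
    (2016, False, False, 53308675), (2016, True, True, 618), (2016, True, False, 24248),
    (2017, False, False, 83147234), (2017, True, True, 114906), (2017, True, False, 49836),
    (2018, False, False, 67149958), (2018, True, True, 186065), (2018, True, False, 71822),
    (2019, False, False, 49994060), (2019, True, True, 85968), (2019, True, False, 23867),
]

def tagged_mobile_edits_per_year(rows):
    """Sum revisions carrying the 'mobile edit' or 'mobile app edit' tag, per year."""
    totals = defaultdict(int)
    for year, mobile, mobile_app, count in rows:
        if mobile or mobile_app:
            totals[year] += count
    return dict(totals)

mobile_edits = tagged_mobile_edits_per_year(ROWS)
for year in sorted(mobile_edits):
    print(year, mobile_edits[year])
```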

GoranSMilovanovic added a comment. Edited May 14 2019, 4:36 PM

@JAllemandou You're the man, thank you, I see now that revision_tags is a new field since the 2019-04 (April 2019) snapshot of mediawiki_history:

revision_tags array<string> In revision events: Tags associated to the revision

However, the description in the documentation "...Tags associated to the revision" does not translate to "edits made from mobile, mobile app, desktop, etc" amazingly well :)

@Lea_WMDE I'm on it.

@Lea_WMDE Here's what was happening with the mobile edits since the beginning of the year. Note: the last data point is May 2019, it's incomplete of course.

I've run this as a test only; I will now run a query to gather both 2018 and 2019 data.

@Lea_WMDE Yes, we do have a more or less steady increase in mobile edits on Wikidata:

@GoranSMilovanovic that's great! If we compare the percentage increases (I don't know how this is done best statistically, but I'm sure you do :) ), do we get a similar trend between mobile edits and mobile page views?
And is it possible for us to roughly group this data by where people come from and if we see a significant increase for a specific region of the world?

GoranSMilovanovic added a comment. Edited May 16 2019, 12:16 PM

@Lea_WMDE Here we go:

  • the following chart shows mobile edits vs. mobile pageviews separately for users and spiders;
  • what we can learn from this chart is that the growth is certainly natural, given that the spiders have made a minimal number of edits and contributed to pageviews much less than our users did;
  • the pattern of user mobile edits (left panel, blue line) seems similar to the pattern of user mobile pageviews (right panel, blue line), but the subsequent analysis has uncovered that they are in fact unrelated.

Now, the following chart shows us the monthly percent change in the growth of edits and pageviews, for users only:

  • To help understand it, the percentages in the chart are: (count in the current month - count in the previous month)/count in the current month*100.
  • Now we see that the growth in user mobile pageviews and the growth in the user mobile edits are not really correlated.
  • To test whether the two time series (i.e. % growth of pageviews and % growth of edits) are correlated, both were first differenced (lag = 1), and then both Pearson's and Spearman's correlation coefficients were computed; neither reached statistical significance (which in effect means that, from the viewpoint of classical statistics, we have no evidence that a correlation exists).
  • Caveat: we have only six (6) observations, so the sample is questionable.
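The steps in the list above can be sketched in plain Python as follows. This is an illustration only, not the code used for the analysis: the function names are hypothetical, percent_change follows the formula exactly as stated above (dividing by the current month's count, whereas the more common definition divides by the previous month's), and only Pearson's coefficient is shown, without a significance test.

```python
import math

def percent_change(counts):
    """Monthly percent change as stated above:
    (current - previous) / current * 100."""
    return [100.0 * (cur - prev) / cur for prev, cur in zip(counts, counts[1:])]

def difference(series, lag=1):
    """Difference a series: x[t] - x[t - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical monthly counts, for illustration only
pageviews = [100, 200, 250, 500, 400, 800, 1000]
edits = [10, 12, 11, 15, 14, 20, 19]

r = pearson(difference(percent_change(pageviews)),
            difference(percent_change(edits)))
print(r)
```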

And is it possible for us to roughly group this data by where people come from and if we see a significant increase for a specific region of the world?

I can certainly do that for the pageviews (there are handy fields in the Projectview hourly Hive table like continent and country_code), but I am not sure if I have the data to do it for revisions. @JAllemandou ?

@GoranSMilovanovic great, thanks! We definitely know more now :)

A lot trickier :)
We have the wmf_raw.mediawiki_private_cu_changes table in Hive, allowing us to compute geo-editors (editors-by-country, aggregated). This table only contains 3 months of data for PII removal reasons. It's probably not enough for what you're after, but I have nothing better (see https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/geoeditors/monthly/insert_geoeditors_monthly_data.hql for an example).
I've just created T223444 to submit the general idea of having geo-editors stats split by desktop/mobile.

@JAllemandou Thanks for the feedback!

@Lea_WMDE Given the current situation with the geo-localized edits (see T220977#5186818), do you want me to proceed with the per continent analysis for pageviews now, or shall we wait until we can encompass both the pageviews and revisions in a single report?

To clear up what Joseph said, we're never going to have more than 90 days of geolocated edits for privacy reasons. We do have two aggregated datasets that go back more than a year:

https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Geoeditors#Aggregated_data

@Lea_WMDE Do we have any additional requirements here or shall we resolve the ticket?

Lea_WMDE closed this task as Resolved. Jul 3 2019, 11:05 AM