Wikipedia.org Portal Dashboard: investigate recent spike in pageviews
Closed, Resolved (Public)

Description

There has been a recent spike/decrease in pageviews for the Wikipedia.org portal page over the last week or so - let's check it out and see if there is anything going on.

[Screenshot: Screen Shot 2016-09-20 at 2.47.24 PM.png]

Live version: http://discovery.wmflabs.org/portal/#pageview_tab

Event Timeline

debt triaged this task as High priority. (Sep 20 2016, 8:52 PM)
debt updated the task description.

Write-up: https://github.com/wikimedia-research/Discovery-Research-Portal/tree/master/Analyses/Pageviews%20Rise%20(2016-10-03)

I'm sorry I didn't notice the problem until yesterday. Our portal pageview data is problematic:

  • Android pageviews account for 70-80% of total pageviews, which is too much...
  • We don't have any portal pageviews whose access method is mobile web, and there are only 141 pageviews whose access method is mobile app during the 60-day period

I doubt that the problem is the result of our query, but I have no idea what might have gone wrong...

query <- paste("USE wmf;
                SELECT
                  client_ip,
                  COUNT(1) AS pageviews
                FROM webrequest",
                wmf::date_clause(date)$date_clause, 
               " AND uri_host RLIKE('^(www\\.)?wikipedia.org/*$')
                 AND INSTR(uri_path, 'search-redirect.php') = 0
                 AND content_type RLIKE('^text/html')
                 AND webrequest_source = 'text'
                 AND NOT (referer RLIKE('^http://localhost'))
                 AND agent_type = 'user'
                 AND referer_class != 'unknown'
                 AND http_status IN('200', '304')
                GROUP BY client_ip;")

@mpopov Do you have any suggestions?
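
For anyone checking the uri_host filter in the query above, here is a minimal Python sketch of what that pattern accepts. It assumes the pattern reaches the regex engine as `^(www\.)?wikipedia.org/*$` and that Python's `re` behaves like Hive's RLIKE for this simple case; `is_portal_host` is a hypothetical helper, not part of the wmf package.

```python
import re

# Pattern from the query's uri_host filter (assumed form after string unescaping).
# The dot in "wikipedia.org" is unescaped, exactly as in the query.
PORTAL_HOST = re.compile(r"^(www\.)?wikipedia.org/*$")


def is_portal_host(uri_host: str) -> bool:
    """Return True if the host would pass the uri_host filter (hypothetical helper)."""
    return PORTAL_HOST.search(uri_host) is not None


hosts = [
    "wikipedia.org",
    "www.wikipedia.org",
    "en.wikipedia.org",
    "en.m.wikipedia.org",
    "m.wikipedia.org",
]
matches = {host: is_portal_host(host) for host in hosts}
# Only the first two hosts match; language subdomains and mobile hosts do not.
```

Because the dot is unescaped it matches any character, so a string like `wikipediaXorg` would also pass; that is unlikely to matter for real hostnames but is worth knowing.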

  • Android pageviews account for 70-80% of total pageviews, which is too much...
  • We don't have any portal pageviews whose access method is mobile web, and there are only 141 pageviews whose access method is mobile app during the 60-day period

I've answered this in IRC. Here's a transcript for future reference:

2:48 PM <bearloga> access_method uses uri host to determine if the request is made to desktop or mobile web. varnish etc. use a set of rules to determine whether to serve the user en.wikipedia.org or en.m.wikipedia.org. unfortunately wikipedia.org is exempt from that. in fact, m.wikipedia.org redirects to en.m.wikipedia.org because somebody a long time ago made that bad decision. so even if you go to wikipedia.org on your iphone, that will still be an access_method='desktop' request
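
To illustrate the point in the transcript, here is a rough Python sketch of a host-based rule. It is an assumption about the general shape of the logic (an "m" label in the subdomain means mobile web), not the actual access_method implementation, and `access_method_from_host` is a made-up name.

```python
def access_method_from_host(uri_host: str) -> str:
    """Hypothetical stand-in for a host-based access_method rule.

    Assumption: a request counts as mobile web only when the hostname
    carries an "m" label before the project domain, as in en.m.wikipedia.org.
    """
    labels = uri_host.lower().split(".")
    subdomain_labels = labels[:-2]  # drop the "wikipedia" and "org" labels
    return "mobile web" if "m" in subdomain_labels else "desktop"


examples = {
    host: access_method_from_host(host)
    for host in ["en.wikipedia.org", "en.m.wikipedia.org", "wikipedia.org", "www.wikipedia.org"]
}
# The portal hosts never carry an "m" label, so they are always labeled "desktop",
# regardless of the device that made the request.
```

Under a rule like this, a phone visiting wikipedia.org is indistinguishable from a desktop browser by hostname alone, which would explain the missing mobile web pageviews.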

Looking into it right now. Running some queries; will update as I learn stuff.

From #wikimedia-mobile:

10:34 AM <bearloga> I have a question for folks who know about Android. I'm investigating why we get a disproportionate volume of pageviews to Wikipedia.org portal page from Android. the UA with the most pageviews is "Dalvik/2.1.0 (Linux; U; Android 6.0.1; SM-G900F Build/MMB29M)" which confuses me because the wiki page for Dalvik states it was discontinued and ART is the only
10:34 AM <bearloga> runtime as of Android 5. Am I missing something?
10:35 AM <niedzielski> bearloga: dalvik was the old vm
10:36 AM <bearloga> any ideas why I would see Dalvik in use on Android 5 and Android 6 devices? :/
10:37 AM <niedzielski> bearloga: it looks like SM-G900F is a samsung galaxy s5. one website claims it has Android 4.4.2 installed. the dalvik to art change was relatively recent
10:38 AM <niedzielski> bearloga: when samsung upgrades a device, from one major version of android to another, they frequently leave out lots of stuff. i'm not sure if dalvik would be included in that or not
10:38 AM <niedzielski> bearloga: so the os would report Android 5, Android 6, etc but the underlying implementation may be elderly
10:38 AM <mdholloway> i think the UA may still use 'Dalvik' even for ARM versions but I'm not 100% on that
10:39 AM <niedzielski> bearloga: just a guess. if someone has an s5 checking the vm should be easy
10:40 AM <mdholloway> *ART
10:44 AM <bearloga> can anyone with an android device check? are there dev tools that'd let somebody check if ART is reported as Dalvik 2.1.0 on Android 5+?
10:45 AM <niedzielski> dbrant bearND: do either of you happen to have an s5?
10:48 AM <mdholloway> bearloga: i'm looking into it. contra my own suggestion, the UA for chrome on Nougat (Android 7) doesn't contain 'dalvik'. But I'll check out what the system webview is reporting.
10:48 AM <mdholloway> no s5 here unfortunately
10:48 AM <bearloga> also! there appears to be a single android phone (or at least a speedtest/bot that fakes that UA) responsible for 8K pageviews on a single day (a Trooper X55, to be exact). and a few more devices with 4K PVs. can anyone think of anything in android that would be hitting www.wikipedia.org/ so much in one day?
10:49 AM <bearloga> mdholloway: thanks for looking into it
10:50 AM <mdholloway> bearloga: ha, i had a suspicion a bot may be involved... :) we only use 'https://wikipedia.org' as a placeholder return URL in our login/createaccount requests but that shouldn't result in any pageviews. we never actually access the portal in the app.
10:51 AM <mdholloway> bearloga: it might be possible if a clever user messed with the dev settings enough but certainly not something that should happen in normal usage
11:20 AM <mdholloway> bearloga: niedzielski: ok, it looks like android has its own separate system UA that it will send in some or all cases in addition to any application (e.g., browser) UA
11:20 AM <mdholloway> see, for example: https://stackoverflow.com/questions/23804278/browser-sending-dalvik-as-user-agent
11:20 AM <mdholloway> and the system UA will always begin with "Dalvik/", even for ART versions
11:21 AM <mdholloway> see https://github.com/android/platform_frameworks_base/blob/master/core/java/com/android/internal/os/RuntimeInit.java#L193-L221
11:21 AM <mdholloway> (current AOSP master branch)
11:21 AM <mdholloway> still doesn't explain the portal weirdness, but at least clears up the dalvik issue i guess :)
11:22 AM <niedzielski> mdholloway: cool i wonder if this is still true now that the webview is updated by the play store or if that's independent
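
Since the system UA keeps the "Dalvik/" prefix even on ART, the Android version and device model are the useful parts of the string. Below is a small Python sketch that pulls them out of a UA shaped like the one quoted in the transcript. The regex is an assumption based on that single example, and `parse_system_ua` is a hypothetical helper.

```python
import re
from typing import Optional

# Assumed layout, based on the one UA quoted above:
# Dalvik/<vm> (Linux; U; Android <version>; <model> Build/<build>)
DALVIK_UA = re.compile(
    r"^Dalvik/(?P<vm>[\d.]+) \(Linux; U; Android (?P<android>[\d.]+); "
    r"(?P<model>.+?) Build/(?P<build>[^)]+)\)$"
)


def parse_system_ua(user_agent: str) -> Optional[dict]:
    """Split an Android system UA into its parts, or return None if it does not fit."""
    match = DALVIK_UA.match(user_agent)
    return match.groupdict() if match else None


ua = "Dalvik/2.1.0 (Linux; U; Android 6.0.1; SM-G900F Build/MMB29M)"
parsed = parse_system_ua(ua)
# parsed["android"] is "6.0.1" and parsed["model"] is "SM-G900F"
```

UAs that do not follow this layout (browser UAs, for instance) simply return None, so the helper can be used to separate system UAs from application UAs.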

Here's the Hive query for finding that IP & user agent:

SELECT
  client_ip, user_agent, COUNT(1) AS pageviews
FROM wmf.webrequest
WHERE
  webrequest_source = 'text'
  AND year = 2016 AND month = 10 AND day = 11
  AND content_type RLIKE('^text/html')
  AND http_status IN('200', '304')
  AND uri_host RLIKE('^(www\\.)?wikipedia.org$') AND uri_path = '/' AND uri_query = ''
  AND NOT (referer RLIKE('^http://localhost'))
  AND agent_type = 'user'
  AND referer_class != 'unknown'
  AND REGEXP_REPLACE(CONCAT(user_agent_map['os_family'], ' ', user_agent_map['os_major']), ' -', '') IN('Android 4', 'Android 5', 'Android 6')
GROUP BY client_ip, user_agent
ORDER BY pageviews DESC
LIMIT 1000;
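
For readers without Hive access, the grouping logic of the query above can be sketched in Python on toy records. The records, the field names `os_family` and `os_major` (standing in for the user_agent_map entries), and the helpers are all assumptions for illustration; the IPs are documentation addresses, not real data.

```python
from collections import Counter


def os_label(os_family: str, os_major: str) -> str:
    """Mirror REGEXP_REPLACE(CONCAT(os_family, ' ', os_major), ' -', '')."""
    return f"{os_family} {os_major}".replace(" -", "")


def top_clients(requests, limit=1000):
    """Count pageviews per (client_ip, user_agent) for Android 4, 5 and 6."""
    wanted = {"Android 4", "Android 5", "Android 6"}
    counts = Counter(
        (r["client_ip"], r["user_agent"])
        for r in requests
        if os_label(r["os_family"], r["os_major"]) in wanted
    )
    return counts.most_common(limit)  # sorted by pageviews, descending


UA = "Dalvik/2.1.0 (Linux; U; Android 6.0.1; SM-G900F Build/MMB29M)"
# Made-up sample records
toy_requests = (
    [{"client_ip": "192.0.2.1", "user_agent": UA, "os_family": "Android", "os_major": "6"}] * 3
    + [{"client_ip": "198.51.100.7", "user_agent": "other-ua", "os_family": "Android", "os_major": "5"}]
    + [{"client_ip": "198.51.100.8", "user_agent": "ios-ua", "os_family": "iOS", "os_major": "10"}]
    + [{"client_ip": "198.51.100.9", "user_agent": "unknown-ua", "os_family": "Android", "os_major": "-"}]
)
top = top_clients(toy_requests)
```

Note that a missing major version ("-") collapses to plain "Android" and is therefore excluded by the IN list, just as in the query.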

Now I'm gonna take a look at what happens on our side when multiple user agents are sent by Android devices... *sigh* Thanks, Google!

I tried to see if the issue is android devices sending multiple UAs (one for browser, one for system) and nope:

SELECT
  wikipedia_portal_pageviews.client_ip AS client_ip,
  COUNT(DISTINCT(wikipedia_portal_pageviews.user_agent)) AS n_unique_user_agents,
  COUNT(DISTINCT CASE WHEN INSTR(wikipedia_portal_pageviews.user_agent, 'Dalvik') = 0 THEN wikipedia_portal_pageviews.user_agent END) AS n_nondalvik_user_agents,
  COUNT(1) AS n_total_pageviews
FROM
(
  SELECT
    client_ip, user_agent
  FROM wmf.webrequest
  WHERE
    webrequest_source = 'text'
    AND year = 2016 AND month = 10 AND day = 11
    AND content_type RLIKE('^text/html')
    AND http_status IN('200', '304')
    AND uri_host RLIKE('^(www\\.)?wikipedia.org$') AND uri_path = '/' AND uri_query = ''
    AND NOT (referer RLIKE('^http://localhost'))
    AND agent_type = 'user'
    AND referer_class != 'unknown'
) wikipedia_portal_pageviews
RIGHT JOIN
(
  SELECT
    client_ip
  FROM wmf.webrequest
  WHERE
    webrequest_source = 'text'
    AND year = 2016 AND month = 10 AND day = 11
    AND content_type RLIKE('^text/html')
    AND http_status IN('200', '304')
    AND uri_host RLIKE('^(www\\.)?wikipedia.org$') AND uri_path = '/' AND uri_query = ''
    AND NOT (referer RLIKE('^http://localhost'))
    AND agent_type = 'user'
    AND referer_class != 'unknown'
    AND INSTR(user_agent, 'Dalvik') > 0
  LIMIT 100
) dalvik_ips
ON dalvik_ips.client_ip = wikipedia_portal_pageviews.client_ip
GROUP BY
  wikipedia_portal_pageviews.client_ip;

IP address | Number of unique UAs | Number of non-Dalvik UAs | Total Pageviews
199.XX.XX.XX127421847628
172.XX.XX.XXX112724876420
172.XX.XX.X930217168763
104.XXX.XXX.XXX7512813558
138.XX.XXX.X642049689
109.XX.XXX.XX370170
............
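
The per-IP summary that the query above is aiming for can be sketched in Python as follows. This is an illustration on made-up (client_ip, user_agent) pairs with documentation IPs, not a reproduction of the numbers in the table, and `summarize_dalvik_ips` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_dalvik_ips(requests):
    """For each IP with at least one Dalvik UA, count unique UAs,
    unique non-Dalvik UAs, and total pageviews."""
    user_agents = defaultdict(set)
    totals = defaultdict(int)
    for client_ip, user_agent in requests:
        user_agents[client_ip].add(user_agent)
        totals[client_ip] += 1
    return {
        ip: {
            "n_unique_user_agents": len(uas),
            "n_nondalvik_user_agents": sum("Dalvik" not in ua for ua in uas),
            "n_total_pageviews": totals[ip],
        }
        for ip, uas in user_agents.items()
        if any("Dalvik" in ua for ua in uas)  # keep only IPs seen with a Dalvik UA
    }


# Made-up sample requests
toy_requests = [
    ("192.0.2.1", "Dalvik/2.1.0 (Linux; U; Android 6.0.1; SM-G900F Build/MMB29M)"),
    ("192.0.2.1", "Dalvik/2.1.0 (Linux; U; Android 6.0.1; SM-G900F Build/MMB29M)"),
    ("192.0.2.1", "Mozilla/5.0 (Linux; Android 6.0.1) Chrome/53.0"),
    ("198.51.100.7", "Mozilla/5.0 (Windows NT 10.0) Firefox/49.0"),
]
summary = summarize_dalvik_ips(toy_requests)
```

An IP where a Dalvik UA and a browser UA show up in roughly equal numbers would point to one device sending two UAs; many distinct UAs per IP points instead to many devices behind one address.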

I've looked up the approx geo coords for one of them, and it could be a building in Atlanta full of a bunch of gov't agencies. Not sure why everyone there would have specifically an Android phone and why each device is going to https://wikipedia.org hundreds of times a day.

@debt: I don't know how much more we can do. We might have to stop this task here because there's no clear point where we can say "this is completely done" and Chelsy & I have collectively sunk so much time into this problem/question.

debt moved this task from In progress to Done on the Discovery-Analysis (Current work) board.

Very interesting...I wonder if the spike could be *somehow* traced to a bunch of techie dudes getting new phones and all of them are pre-loaded with a 'golden' copy of their agency's software that has wikipedia.org as the default home page.

But, alas, I don't think we'll be able to figure that out.

Thanks for all the deep digging, @chelsyx and @mpopov! :)