There has been a recent spike/decrease in pageviews for the Wikipedia.org portal page over the last week or so - let's check it out and see if there is anything going on.
Live version: http://discovery.wmflabs.org/portal/#pageview_tab
I'm sorry I didn't realize the problem until yesterday. Our portal pageview data is problematic:
- We don't have any portal pageviews whose access method is mobile web, and there are only 141 pageviews whose access method is mobile app during the 60-day period.
I've answered this in IRC. Here's a transcript for future reference:
2:48 PM <bearloga> access_method uses uri host to determine if the request is made to desktop or mobile web. varnish etc. use a set of rules to determine whether to serve the user en.wikipedia.org or en.m.wikipedia.org. unfortunately wikipedia.org is exempt from that. in fact, m.wikipedia.org redirects to en.m.wikipedia.org because somebody a long time ago made that bad decision. so even if you go to wikipedia.org on your iphone, that will still be an access_method='desktop' request
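To make the point in the transcript concrete, here is a minimal Python sketch of a host-based classifier. It is a simplification written for illustration, not the actual pipeline code: the function name `classify_access_method`, the `WikipediaApp` user-agent check, and the mobile-subdomain pattern are all assumptions.

```python
import re

# Assumed, simplified rules: app requests are recognized by user agent,
# mobile web by an "m." label in the host, and everything else is desktop.
MOBILE_HOST = re.compile(r"(^|\.)m\.")
APP_UA = re.compile(r"WikipediaApp")

def classify_access_method(uri_host: str, user_agent: str) -> str:
    """Label a request using only its host and user agent (hypothetical helper)."""
    if APP_UA.search(user_agent):
        return "mobile app"
    if MOBILE_HOST.search(uri_host):
        return "mobile web"
    return "desktop"

# Example UA string made up for this sketch.
iphone_ua = "Mozilla/5.0 (iPhone; CPU iPhone OS 10_0 like Mac OS X)"
print(classify_access_method("en.m.wikipedia.org", iphone_ua))
print(classify_access_method("www.wikipedia.org", iphone_ua))
```

Under these assumed rules the same phone is labelled mobile web on en.m.wikipedia.org but desktop on www.wikipedia.org, because the portal host has no mobile subdomain.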
I doubt the problem comes from our query, but I have no idea what else might be wrong...
query <- paste(
  "USE wmf; SELECT client_ip, COUNT(1) AS pageviews FROM webrequest",
  wmf::date_clause(date)$date_clause,
  "AND uri_host RLIKE('^(www\\.)?wikipedia.org/*$')",
  "AND INSTR(uri_path, 'search-redirect.php') = 0",
  "AND content_type RLIKE('^text/html')",
  "AND webrequest_source = 'text'",
  "AND NOT (referer RLIKE('^http://localhost'))",
  "AND agent_type = 'user'",
  "AND referer_class != 'unknown'",
  "AND http_status IN('200', '304')",
  "GROUP BY client_ip;"
)
@mpopov Do you have any suggestions?
Looking into it right now. Running some queries; will update as I learn stuff.
From #wikimedia-mobile:
10:34 AM <bearloga> I have a question for folks who know about Android. I'm investigating why we get a disproportionate volume of pageviews to Wikipedia.org portal page from Android. the UA with the most pageviews is "Dalvik/2.1.0 (Linux; U; Android 6.0.1; SM-G900F Build/MMB29M)" which confuses me because the wiki page for Dalvik states it was discontinued and ART is the only
10:34 AM <bearloga> runtime as of Android 5. Am I missing something?
10:35 AM <niedzielski> bearloga: dalvik was the old vm
10:36 AM <bearloga> any ideas why I would see Dalvik in use on Android 5 and Android 6 devices? :/
10:37 AM <niedzielski> bearloga: it looks like SM-G900F is a samsung galaxy s5. one website claims it has Android 4.4.2 installed. the dalvik to art change was relatively recent
10:38 AM <niedzielski> bearloga: when samsung upgrades a device, from one major version of android to another, they frequently leave out lots of stuff. i'm not sure if dalvik would be included in that or not
10:38 AM <niedzielski> bearloga: so the os would report Android 5, Android 6, etc but the underlying implementation may be elderly
10:38 AM <mdholloway> i think the UA may still use 'Dalvik' even for ARM versions but I'm not 100% on that
10:39 AM <niedzielski> bearloga: just a guess. if someone has an s5 checking the vm should be easy
10:40 AM <mdholloway> *ART
10:44 AM <bearloga> can anyone with an android device check? are there dev tools that'd let somebody check if ART is reported as Dalvik 2.1.0 on Android 5+?
10:45 AM <niedzielski> dbrant bearND: do either of you happen to have an s5?
10:48 AM <mdholloway> bearloga: i'm looking into it. contra my own suggestion, the UA for chrome on Nougat (Android 7) doesn't contain 'dalvik'. But I'll check out what the system webview is reporting.
10:48 AM <mdholloway> no s5 here unfortunately
10:48 AM <bearloga> also! there appears to be a single android phone (or at least a speedtest/bot that fakes that UA) responsible for 8K pageviews on a single day (a Trooper X55, to be exact). and a few more devices with 4K PVs. can anyone think of anything in android that would be hitting www.wikipedia.org/ so much in one day?
10:49 AM <bearloga> mdholloway: thanks for looking into it
10:50 AM <mdholloway> bearloga: ha, i had a suspicion a bot may be involved... :) we only use 'https://wikipedia.org' as a placeholder return URL in our login/createaccount requests but that shouldn't result in any pageviews. we never actually access the portal in the app.
10:51 AM <mdholloway> bearloga: it might be possible if a clever user messed with the dev settings enough but certainly not something that should happen in normal usage
11:20 AM <mdholloway> bearloga: niedzielski: ok, it looks like android has its own separate system UA that it will send in some or all cases in addition to any application (e.g., browser) UA
11:20 AM <mdholloway> see, for example: https://stackoverflow.com/questions/23804278/browser-sending-dalvik-as-user-agent
11:20 AM <mdholloway> and the system UA will always begin with "Dalvik/", even for ART versions
11:21 AM <mdholloway> see https://github.com/android/platform_frameworks_base/blob/master/core/java/com/android/internal/os/RuntimeInit.java#L193-L221
11:21 AM <mdholloway> (current AOSP master branch)
11:21 AM <mdholloway> still doesn't explain the portal weirdness, but at least clears up the dalvik issue i guess :)
11:22 AM <niedzielski> mdholloway: cool i wonder if this is still true now that the webview is updated by the play store or if that's independent
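Since the system UA starts with "Dalvik/" even on ART devices, the prefix says nothing about the runtime, but the rest of the string still carries the Android release and device model. Below is a minimal Python sketch that pulls those fields out. The pattern is inferred from the single example UA quoted in the transcript, so treat it as an assumption; other devices may format the string differently. The name `parse_dalvik_ua` is hypothetical.

```python
import re

# Pattern inferred from the example in the transcript:
# "Dalvik/2.1.0 (Linux; U; Android 6.0.1; SM-G900F Build/MMB29M)".
DALVIK_UA = re.compile(
    r"^Dalvik/(?P<vm>[\d.]+) \(Linux; U; Android (?P<android>[\d.]+)"
    r"(?:; (?P<model>.+?) Build/(?P<build>[^)]+))?\)$"
)

def parse_dalvik_ua(user_agent):
    """Return the UA's fields as a dict, or None if it is not a Dalvik-style UA."""
    match = DALVIK_UA.match(user_agent)
    return match.groupdict() if match else None

ua = "Dalvik/2.1.0 (Linux; U; Android 6.0.1; SM-G900F Build/MMB29M)"
print(parse_dalvik_ua(ua))
```

Grouping pageviews by the parsed `android` and `model` fields, rather than by the raw string, is one way to check whether the traffic comes from a handful of device types.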
Here's the Hive query for finding that IP & user agent:
SELECT client_ip, user_agent, COUNT(1) AS pageviews
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2016 AND month = 10 AND day = 11
  AND content_type RLIKE('^text/html')
  AND http_status IN('200', '304')
  AND uri_host RLIKE('^(www\\.)?wikipedia.org$')
  AND uri_path = '/'
  AND uri_query = ''
  AND NOT (referer RLIKE('^http://localhost'))
  AND agent_type = 'user'
  AND referer_class != 'unknown'
  AND REGEXP_REPLACE(CONCAT(user_agent_map['os_family'], ' ', user_agent_map['os_major']), ' -', '') IN('Android 4', 'Android 5', 'Android 6')
GROUP BY client_ip, user_agent
ORDER BY pageviews DESC
LIMIT 1000;
Now I'm gonna take a look at what happens on our side when multiple user agents are sent by Android devices... *sigh* Thanks, Google!
I tried to see if the issue is Android devices sending multiple UAs (one for the browser, one for the system), and nope:
SELECT
  wikipedia_portal_pageviews.client_ip AS client_ip,
  COUNT(DISTINCT(wikipedia_portal_pageviews.user_agent)) AS n_unique_user_agents,
  SUM(IF(INSTR(DISTINCT(wikipedia_portal_pageviews.user_agent), 'Dalvik') > 0, 0, 1)) AS n_nondalvik_user_agents,
  COUNT(1) AS n_total_pageviews
FROM (
  SELECT client_ip, user_agent
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2016 AND month = 10 AND day = 11
    AND content_type RLIKE('^text/html')
    AND http_status IN('200', '304')
    AND uri_host RLIKE('^(www\\.)?wikipedia.org$')
    AND uri_path = '/'
    AND uri_query = ''
    AND NOT (referer RLIKE('^http://localhost'))
    AND agent_type = 'user'
    AND referer_class != 'unknown'
) wikipedia_portal_pageviews
RIGHT JOIN (
  SELECT client_ip
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2016 AND month = 10 AND day = 11
    AND content_type RLIKE('^text/html')
    AND http_status IN('200', '304')
    AND uri_host RLIKE('^(www\\.)?wikipedia.org$')
    AND uri_path = '/'
    AND uri_query = ''
    AND NOT (referer RLIKE('^http://localhost'))
    AND agent_type = 'user'
    AND referer_class != 'unknown'
    AND INSTR(user_agent, 'Dalvik') > 0
  LIMIT 100
) dalvik_ips
  ON dalvik_ips.client_ip = wikipedia_portal_pageviews.client_ip
GROUP BY wikipedia_portal_pageviews.client_ip;
| IP address | Number of unique UAs | Number of non-Dalvik UAs | Total pageviews |
| --- | --- | --- | --- |
| 199.XX.XX.XX | 1274 | 218 | 47628 |
| 172.XX.XX.XXX | 1127 | 248 | 76420 |
| 172.XX.XX.X | 930 | 217 | 168763 |
| 104.XXX.XXX.XXX | 751 | 28 | 13558 |
| 138.XX.XXX.X | 642 | 0 | 49689 |
| 109.XX.XXX.XX | 37 | 0 | 170 |
| ... | ... | ... | ... |
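For readers who want to see what the three count columns mean, here is a minimal Python sketch of the same per-IP aggregation on made-up rows. It illustrates the intended logic and is not the Hive query itself; the function name `summarize_by_ip`, the sample IPs (from the documentation range), and the sample UA strings are hypothetical.

```python
from collections import defaultdict

def summarize_by_ip(requests):
    """requests: iterable of (client_ip, user_agent) pairs, one per pageview."""
    agents = defaultdict(set)      # distinct user agents seen per IP
    pageviews = defaultdict(int)   # total pageviews per IP
    for ip, ua in requests:
        agents[ip].add(ua)
        pageviews[ip] += 1
    return {
        ip: {
            "n_unique_user_agents": len(uas),
            "n_nondalvik_user_agents": sum("Dalvik" not in ua for ua in uas),
            "n_total_pageviews": pageviews[ip],
        }
        for ip, uas in agents.items()
    }

# Made-up sample rows, not real data.
sample = [
    ("192.0.2.1", "Dalvik/2.1.0 (Linux; U; Android 6.0.1; ExamplePhone Build/XYZ)"),
    ("192.0.2.1", "Dalvik/2.1.0 (Linux; U; Android 6.0.1; ExamplePhone Build/XYZ)"),
    ("192.0.2.1", "Mozilla/5.0 (Linux; Android 6.0.1) ExampleBrowser/1.0"),
    ("192.0.2.2", "Dalvik/2.1.0 (Linux; U; Android 5.1; OtherPhone Build/ABC)"),
]
print(summarize_by_ip(sample))
```

If each device simply sent one browser UA plus one system UA, an IP would show roughly two unique UAs. Hundreds of unique UAs per IP, as in the table, point instead to many devices (or something imitating them) behind each address.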
I've looked up the approx geo coords for one of those IPs, and it could be a building in Atlanta full of a bunch of gov't agencies. Not sure why everyone there would specifically have an Android phone, or why each device would be going to https://wikipedia.org hundreds of times a day.
@debt: I don't know how much more we can do. We might have to stop this task here because there's no clear point where we can say "this is completely done" and Chelsy & I have collectively sunk so much time into this problem/question.
Very interesting... I wonder if the spike could be *somehow* traced to a bunch of techie dudes getting new phones, all pre-loaded with a 'golden' copy of their agency's software that has wikipedia.org as the default home page.
But, alas, I don't think we'll be able to figure that out.