Analysis on traffic through the HTTPS transition
Closed, Resolved · Public

Description

Do a traffic analysis for the Catalan, Chinese, English, Hebrew, Italian, and Uyghur versions of Wikipedia. Lila is asking for a traffic report. The Chinese and Uyghur Wikipedia transition was on Tuesday. Catalan, Chinese, Hebrew, and Italian transitioned yesterday, and English at 2AM today. We also need to know the traffic impact on English Wikipedia for requests geolocated to China.

kevinator updated the task description.
kevinator raised the priority of this task from to Needs Triage.
kevinator assigned this task to ellery.
Restricted Application added a subscriber: Aklapper. Jun 15 2015, 4:17 AM
kevinator added a comment. Edited Jun 15 2015, 4:18 AM

Timeline provided by @BBlack:
Languages:
The Chinese languages are Chinese (zh) and Uyghur (ug)
The HTTPS-Beta languages are Catalan (ca), Hebrew (he), Greek (el), Italian (it).

Timeline:

2015-06-09 21:23 UTC: transitioned Chinese languages, all projects
2015-06-11 14:00 UTC: transitioned HTTPS-Beta language Wikipedias
2015-06-11 14:30 UTC: transitioned HTTPS-Beta language Mobile Wikipedias
2015-06-12 08:43 UTC: Routing Incident: First evidence (others noticing, not us)
2015-06-12 09:00 UTC: Routing Incident: Level3 fallout in full effect, many notice
2015-06-12 09:00 UTC: Start transition of English Wikipedia, including Mobile

During this 40-minute window for English, we first redirected 10% of clients, then 50%, then 100%.

2015-06-12 09:40 UTC: End transition of English Wikipedia, including Mobile
2015-06-12 10:40 UTC: Routing Incident: Largely resolved, some smaller trailing effects
2015-06-12 13:00 UTC: Public blog announcement
2015-06-12 13:30 UTC: transitioned all other projects (e.g. Wikiversity, Wikibooks, etc.) for English + Beta languages

@JAllemandou can you point Ellery to the correct tables to use in Hive for this?

kevinator triaged this task as Unbreak Now! priority. Jun 15 2015, 4:27 AM
kevinator set Security to None.

@ellery depending on how you need to handle Chinese languages, you should go for either:

  • project pre-aggregated table (no dialect/language_variant, preproduction mode, very small data)
    • hive: joal.pageview_hourly
    • hdfs parquet files: /user/joal/pageview/hourly/year=2015/month=6/day=X/hour=X
  • pageview pre-aggregated table (dialect/language_variant available, production mode, medium-small data)
    • hive: wmf.pageview_hourly
    • hdfs parquet files: /wmf/data/wmf/pageview/hourly/year=2015/month=6/day=X/hour=X

Let me know if you want me to spend some time with you :)
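For reading the parquet files directly, the partition layout listed above can be sketched as a small path builder. This is only an illustration: the base directories are copied from the comment above, while the function name and the assumption that day and hour are unpadded integers (like month=6) are not confirmed anywhere in this task.

```python
from datetime import datetime

# Base directories copied from the comment above; the rest is illustrative.
BASE_DIRS = {
    "joal.pageview_hourly": "/user/joal/pageview/hourly",
    "wmf.pageview_hourly": "/wmf/data/wmf/pageview/hourly",
}


def partition_path(table: str, ts: datetime) -> str:
    """Return the hourly partition directory for `table` at time `ts`.

    Assumption: partition values are unpadded integers (month=6, not
    month=06), following the pattern shown in the comment above.
    """
    base = BASE_DIRS[table]
    return f"{base}/year={ts.year}/month={ts.month}/day={ts.day}/hour={ts.hour}"


# Hour in which the enwiki transition started (2015-06-12 09:00 UTC).
print(partition_path("wmf.pageview_hourly", datetime(2015, 6, 12, 9)))
```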

@JAllemandou The table joal.pageview_hourly would be perfect if it had an http/https dimension! Also, have you got the joal.pageview_hourly table in a Pentaho cube somewhere?

I imagine you are after the https info.
Unfortunately it's not included in any pre-aggregated table.

I haven't loaded this data into Pentaho either.

ellery added a comment. Edited Jun 15 2015, 8:15 PM

Ok, I just fired off a query to get https status as well. I am running over the logs for June (sampling 1 out of 64 buckets). This query is estimated to finish in 3 days.

SELECT year,
       month,
       day,
       hour,
       uri_host,
       geocoded_data['country'] AS country,
       access_method,
       agent_type,
       x_analytics_map['https'] AS https,
       http_status,
       COUNT(*) AS n
FROM wmf.webrequest TABLESAMPLE(BUCKET 1 OUT OF 64 ON rand())
WHERE year = 2015
  AND month = 6
  AND webrequest_source IN ('mobile', 'text')
  AND is_pageview = 1
  AND uri_host RLIKE '(ca|en|zh|it|ug|he)\\.(m\\.)?wikipedia'
GROUP BY uri_host, geocoded_data['country'], x_analytics_map['https'], http_status,
         access_method, agent_type, year, month, day, hour;
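A rough sketch of how the sampled counts from a query like this might be post-processed, in Python. It mirrors the host filter and scales counts by the sampling rate to estimate totals. The function names, the toy rows, and the assumption that the https flag arrives as the string "1" are all illustrative and not taken from the actual analysis.

```python
import re
from collections import defaultdict

# Same pattern as the RLIKE filter; Hive's '\\.' is a literal dot.
HOST_RE = re.compile(r"(ca|en|zh|it|ug|he)\.(m\.)?wikipedia")
SAMPLE_RATE = 64  # the query keeps roughly 1 bucket out of 64


def host_matches(uri_host: str) -> bool:
    # RLIKE succeeds on a partial match, so re.search is the analogue.
    return HOST_RE.search(uri_host) is not None


def https_share(rows):
    """rows: iterable of (uri_host, https_flag, n), with n a sampled count.

    Returns {uri_host: (estimated_total, https_fraction)}.
    Assumption: https_flag is the string "1" for HTTPS requests.
    """
    totals = defaultdict(int)
    secure = defaultdict(int)
    for uri_host, https_flag, n in rows:
        if not host_matches(uri_host):
            continue
        totals[uri_host] += n
        if https_flag == "1":
            secure[uri_host] += n
    return {
        host: (totals[host] * SAMPLE_RATE, secure[host] / totals[host])
        for host in totals
        if totals[host] > 0
    }


# Made-up rows for illustration only, not real traffic numbers.
toy_rows = [
    ("en.wikipedia.org", "1", 30),
    ("en.wikipedia.org", None, 10),
    ("bug.wikipedia.org", "1", 5),
    ("de.wikipedia.org", "1", 7),
]
print(https_share(toy_rows))
```

With these toy rows, en.wikipedia.org comes out at an estimated 2560 pageviews with an HTTPS share of 0.75, and de.wikipedia.org is filtered out. Because the pattern is not anchored, a host such as bug.wikipedia.org also passes the filter (it ends in "ug."), so it may be worth excluding such hosts afterwards if they show up in the results.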

As mentioned above, the query will still run for a few days. But here are some preliminary results from a query I kicked off on Friday.

Summary: We see a drop in pageviews to zhwiki from Chinese desktop users and US bots. All beta language projects seem unaffected. The data ends too soon after the enwiki transition to evaluate the change for enwiki.

https://github.com/ewulczyn/wmf/blob/master/https_transition/https_transition.ipynb

The transition is still ongoing. New timeline events from today (assume for any language mentioned, it's for all projects that language has):

All times UTC, and +/- 5 mins:
2015-06-15 20:15 - Wikidata and Roots/www (see below)
2015-06-15 20:25 - de
2015-06-15 21:00 - Commons
2015-06-15 21:25 - fr, ja
2015-06-15 23:15 - Reverted Commons (for now, due to T102566)
2015-06-16 00:00 - bg, cs, eo, fi, id, nl, no, pl, pt, sv, th, tr

Roots/www means any of our primary domains without a language prefix, including mobile, as well as www in place of the language prefix. e.g. http://wikipedia.org, http://www.wikiversity.org, http://m.wikibooks.org, etc. Mostly these are language-selector pages or language-detecting redirects.

faidon added a subscriber: faidon. Jun 16 2015, 5:21 PM
ellery added a comment. Edited Jun 18 2015, 12:11 AM

For a set of updated graphs that include the protocol dimensions and the effects on enwiki see:

https://github.com/ewulczyn/wmf/blob/master/https_transition/https_transition.ipynb

ellery moved this task from Staged to Paused on the Research-and-Data board. Jun 24 2015, 9:22 PM

I have updated the graphs in https://github.com/ewulczyn/wmf/blob/master/https_transition/https_transition.ipynb.

Iran shows a severe persistent drop in pageview rates for enwiki. China has a less severe but still persistent drop.
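One simple way to put a number on a drop like this is to compare mean pageview rates before and after the transition. The sketch below is a minimal illustration, not the method used in the notebook: the function name and toy series are hypothetical, and only the cutoff timestamp comes from the timeline in this task.

```python
from datetime import datetime
from statistics import mean

# End of the enwiki transition, from the timeline in this task (UTC).
ENWIKI_TRANSITION_END = datetime(2015, 6, 12, 9, 40)


def relative_change(series, cutoff):
    """series: iterable of (timestamp, pageviews).

    Returns (mean_after - mean_before) / mean_before, so -0.4 means a
    40% drop in the mean rate after the cutoff.
    """
    points = list(series)
    before = [v for t, v in points if t < cutoff]
    after = [v for t, v in points if t >= cutoff]
    if not before or not after:
        raise ValueError("need observations on both sides of the cutoff")
    base = mean(before)
    return (mean(after) - base) / base


# Made-up daily counts for illustration only, not real traffic numbers.
toy_series = [
    (datetime(2015, 6, 10), 100),
    (datetime(2015, 6, 11), 100),
    (datetime(2015, 6, 13), 60),
    (datetime(2015, 6, 14), 60),
]
print(relative_change(toy_series, ENWIKI_TRANSITION_END))
```

A plain before/after mean ignores weekly seasonality and unrelated events such as the routing incident on 2015-06-12, so a comparison like this is only a first look. Calling a drop persistent would also need a window long enough to rule out a temporary dip.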

@ellery thanks for the great work. Can you add a short section with the main takeaways at the top of the nb? Other than country-specific data, given that bot traffic historically accounted for up to half of PVs from the US, this will result in a major drop in the legacy PVs and we'll need to communicate this clearly.

@kevinator is Yana responsible for presenting the results internally?

ellery moved this task from Paused to Done on the Research-and-Data board. Jul 30 2015, 10:14 PM
Tbayer added a subscriber: Tbayer. Jul 31 2015, 8:30 PM
kevinator moved this task from Next Up to Radar on the Analytics-Kanban board. Aug 1 2015, 12:41 AM
Milimetric moved this task from Incoming to Radar on the Analytics-Backlog board.

This task has had "Unbreak Now!" priority for three months now, which means it "needs to be fixed immediately, setting anything else aside."

@ellery: What is the status of this task? Is the priority still correct?

@Aklapper this task is complete.

Aklapper closed this task as Resolved. Nov 30 2015, 11:08 AM

@ellery: Is there a reason to not change the task status to "Resolved" via the "Action > Change Status" dropdown above the "Comments" box, so the task won't show up under the list of open tasks anymore?

I'll just do this and resolve this task, hoping I understood your previous comment correctly. Please reopen if I'm wrong.