
Create Autonomous Systems ranking based on RUM data
Open, HighPublic

Description

Such a ranking should be fair to the networks/ISPs and compensate for confounding factors, such as the fact that ISPs offering cheaper plans tend to have users with cheaper, lower-performing devices.

We need to come up with a score that isolates the ISP's share of responsibility in the performance mix, excluding our own and the device's.

Event Timeline

Gilles created this task.Nov 19 2018, 4:01 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 19 2018, 4:01 PM
Gilles triaged this task as Normal priority.Nov 19 2018, 4:01 PM
Gilles added a comment.EditedNov 19 2018, 4:15 PM

Given the data we have, I think we can use:

MaxMind to map IPs to ASNs

I'll start with France as an initial case study, as it's the market I know best and will be able to check how ASNs map to real ISPs, see what happens with MVNOs and also compare the results to the ISPs' existing reputation.

Without any change to data retention, a "live" model will therefore be limited to 90 days of data. I think that's fine, as a monthly or quarterly update of the ranking would be great. We certainly don't need to update it more frequently.

MaxMind to map IPs to location

This should allow us to approximate the physical distance between the client and the nearest Wikimedia datacenter, which, combined with the speed of light, gives us the absolute minimum latency achievable in a perfect world. We can subtract that (times the minimum number of roundtrips) from the effective TTFB, to compensate for our own responsibility for where we've physically located our edge caches.

Ideally we would look at the distance to the DC the user was actually routed to, as the DNS georouting is also our own responsibility.
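
A minimal sketch of the distance compensation described above. The coordinates passed in are placeholders, the default of two roundtrips is an assumption (the comment does not say what the minimum is), and the speed of light in a vacuum is used even though light in fiber is slower:

```python
import math

# Speed of light in a vacuum, in km per millisecond (fiber is slower in practice).
SPEED_OF_LIGHT_KM_PER_MS = 299.792458


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points on Earth, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def min_round_trip_ms(distance_km):
    """Theoretical minimum time for one roundtrip over that distance."""
    return 2 * distance_km / SPEED_OF_LIGHT_KM_PER_MS


def distance_adjusted_ttfb(ttfb_ms, client, datacenters, round_trips=2):
    """Subtract the unavoidable propagation delay to the nearest datacenter.

    `client` and each entry of `datacenters` are (lat, lon) tuples.
    `round_trips` is an assumed value, not one taken from the task.
    """
    nearest_km = min(great_circle_km(*client, *dc) for dc in datacenters)
    return ttfb_ms - round_trips * min_round_trip_ms(nearest_km)
```

Passing the location of the datacenter the user was actually routed to, instead of the full list, would cover the georouting point made above.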

TTFB

In the form of responseStart - connectStart. This would limit our own client-side code's responsibility in potential slowdowns, by focusing only on the first roundtrips.

Effective connection type

The effective connection type ratios should be quite representative of the ISP's infrastructure.

Device memory

This should allow us to get a sense of how overloaded a particular device might be, which isn't the ISP's fault. If practical, it could be used to simply filter out devices where the available memory is low.

CPU benchmark

Either through blacklisting, bucketizing or factoring in percentiles into the score calculation, this should be the main way we normalize the device mix between ISPs, compensating for the cheap ISP => cheap underpowered device correlation.
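
One way the bucketizing option could look, as a rough sketch only. The bucket edges, the field names (`isp`, `cpu_score`, `ttfb`) and the choice to weight every bucket equally are all assumptions made for illustration, not something decided in this task:

```python
from collections import defaultdict
from statistics import mean, median


def cpu_bucket(score, edges):
    """Index of the first bucket whose upper edge is >= score."""
    for index, upper in enumerate(edges):
        if score <= upper:
            return index
    return len(edges)


def device_normalised_ttfb(samples, edges):
    """Median TTFB per CPU bucket, averaged with equal weight per bucket.

    Each sample is a dict with hypothetical keys "isp", "cpu_score", "ttfb".
    ISPs with no samples in one of the buckets are left out of the result.
    """
    per_isp = defaultdict(lambda: defaultdict(list))
    for sample in samples:
        bucket = cpu_bucket(sample["cpu_score"], edges)
        per_isp[sample["isp"]][bucket].append(sample["ttfb"])

    bucket_count = len(edges) + 1
    scores = {}
    for isp, buckets in per_isp.items():
        if len(buckets) < bucket_count:
            continue
        scores[isp] = mean([median(values) for values in buckets.values()])
    return scores
```

Because every bucket counts the same, an ISP whose customers skew towards slow devices is not penalised for that skew alone.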

Gilles moved this task from Inbox to Doing on the Performance-Team board.Nov 19 2018, 9:12 PM
Gilles updated the task description. (Show Details)Nov 20 2018, 4:46 PM
Gilles added a comment.EditedNov 21 2018, 12:02 PM

I've got initial results for France, looking at NavigationTiming data in November. Lower values are better (except for sample size, obviously).

The definition of "TTFB" here is responseStart - connectStart
And the definition of "PLT" is loadEventStart - responseStart
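
For reference, a small sketch of how these two definitions and the per-network medians could be computed. The sample layout (a dict with an `isp` key next to the Navigation Timing fields) is an assumption for illustration:

```python
from collections import defaultdict
from statistics import median


def ttfb(sample):
    """TTFB as defined here: responseStart - connectStart."""
    return sample["responseStart"] - sample["connectStart"]


def plt(sample):
    """PLT as defined here: loadEventStart - responseStart."""
    return sample["loadEventStart"] - sample["responseStart"]


def medians_by_isp(samples):
    """Group samples by ISP and return rows sorted by median TTFB (best first)."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["isp"]].append(sample)

    rows = []
    for isp, group in grouped.items():
        rows.append({
            "isp": isp,
            "median_ttfb": median(ttfb(s) for s in group),
            "median_plt": median(plt(s) for s in group),
            "sample_size": len(group),
        })
    return sorted(rows, key=lambda row: row["median_ttfb"])
```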

Desktop

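| ISP/Network | Median TTFB (ms) | Median PLT (ms) | Median CPU benchmark score | Navtiming sample size | CPU sample size |
| --- | --- | --- | --- | --- | --- |
| Renater | 94.0 | 732.5 | 102.0 | 7618 | 1759 |
| Bouygues Telecom | 145.0 | 933.0 | 140.5 | 9360 | 2280 |
| SFR | 150.0 | 912.0 | 133.0 | 18212 | 4298 |
| Free | 155.0 | 978.0 | 141.0 | 17414 | 4248 |
| Orange | 162.0 | 932.0 | 131.0 | 36213 | 8686 |
| Free Mobile | 248.0 | 1035.5 | 118.0 | 1208 | 274 |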

Mobile

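| ISP/Network | Median TTFB (ms) | Median PLT (ms) | Median CPU benchmark score | Navtiming sample size | CPU sample size |
| --- | --- | --- | --- | --- | --- |
| Free | 151.0 | 707.0 | 341.0 | 19371 | 5205 |
| Bouygues Telecom | 186.0 | 647.0 | 305.0 | 21515 | 5720 |
| Orange | 195.0 | 689.0 | 312.0 | 48236 | 12973 |
| SFR | 203.0 | 661.0 | 318.0 | 31701 | 8484 |
| Free Mobile | 246.0 | 740.0 | 318.0 | 10799 | 2807 |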

These results make sense for various reasons:

  • Renater is the super high speed fiber network between universities and research facilities. It makes absolute sense for it to be at the top of that ranking. It also explains why its users have significantly more powerful CPUs than those of actual ISPs. From personal experience when I was in university, it's always way ahead of general public ISPs in terms of performance. It was so far ahead of its time that when I was a student I destroyed my laptop's hard drive by downloading too much stuff. My PC's hardware was the bottleneck, basically.
  • Orange is the historical operator and has a legal obligation to cover remote areas with older infrastructure. In very rural areas Orange is often the only option, which explains why on Desktop they have slightly worse performance than their competitors.
  • For mobile, Free is an odd one, because it shows up twice. In reality, it shows up three times. For mobile internet access in France, operators are allowed to leverage the Orange network for a fixed price determined by the government. Free makes heavy use of this, and will offload a lot of their mobile traffic to Orange, by having their customers simply connect to Orange cell towers when there's no Free tower around. Additionally, they use a little-known SIM card feature that automatically enables authenticated login to WiFi hotspots that belong to subscribers of the fiber/DSL Free offering. This means that as you're walking the streets, your phone will automatically connect to the WiFi networks of random people who have Free at home. Combined with the same thing happening at your own home if you're also a Free subscriber for your home internet (this is very common, as they incentivise it with all-inclusive home+mobile packages), this is probably why "Free" shows up in the mobile ranking, which would suggest the home internet IP range. It's possible that they're also using IPs in all their ranges interchangeably, but I wouldn't be surprised if this difference is explained by the WiFi feature alone. In the end "Free Mobile" has far fewer hits than you might expect given their market share, because essentially Free installed very little cell tower capacity (probably close to the legal minimum) and relies heavily on Orange and WiFi. Again, from first-hand experience you *know* as a Free Mobile customer when you've connected to a Free Mobile cell tower, as it's overloaded and slow, and sometimes downright unusable during peak times. I'm not surprised at all that they're the worst in the ranking.

Using this local knowledge to explain the differences seen here confirms to me that the data is sound and that we can extract a ranking from it.

Gilles added a subscriber: Peter.EditedNov 21 2018, 12:25 PM

For Sweden we get:

Desktop

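| ISP/Network | Median TTFB (ms) | Median PLT (ms) | Median CPU benchmark score | Navtiming sample size | CPU sample size |
| --- | --- | --- | --- | --- | --- |
| Telenor Norge AS | 94.0 | 582.0 | 160.0 | 3279 | 21 |
| Bahnhof AB | 95.0 | 555.0 | 110.0 | 2037 | 11 |
| A3 Sverige AB | 101.0 | 605.0 | 113.0 | 1079 | 8 |
| Telia Company AB | 105.0 | 622.0 | 136.5 | 6357 | 26 |
| Bredband2 AB | 113.0 | 586.0 | 120.0 | 1309 | 3 |
| Com Hem AB | 115.0 | 614.0 | 174.0 | 3441 | 28 |
| TELE2 | 128.0 | 689.0 | 184.5 | 1256 | 4 |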

Mobile

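| ISP/Network | Median TTFB (ms) | Median PLT (ms) | Median CPU benchmark score | Navtiming sample size | CPU sample size |
| --- | --- | --- | --- | --- | --- |
| Telenor Norge AS | 113.0 | 388.5 | 228.0 | 3528 | 19 |
| Bahnhof AB | 116.0 | 351.0 | 312.0 | 2045 | 6 |
| A3 Sverige AB | 127.0 | 406.0 | 408.0 | 1251 | 3 |
| Bredband2 AB | 131.5 | 379.0 | 322.0 | 1160 | 8 |
| Com Hem AB | 134.0 | 381.0 | 185.5 | 4573 | 26 |
| Telia Company AB | 143.0 | 428.0 | 207.5 | 11300 | 34 |
| TELE2 | 221.0 | 532.0 | 262.0 | 3956 | 33 |
| Telenor Sverige AB | 235.0 | 486.0 | 827.0 | 2488 | 4 |
| Hi3G Access AB | 262.0 | 550.0 | 126.0 | 1882 | 5 |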

@Peter do these figures make sense? For instance Tele2 having much worse latency than competitors and their customers having slower phones?


And now, 'murica:

Desktop

| ISP/Network | Median TTFB (ms) | Median PLT (ms) | Median CPU benchmark score | Navtiming sample size | CPU sample size |
| --- | --- | --- | --- | --- | --- |
| Frontier Communications of America, Inc. | 96.0 | 702.0 | 80.0 | 7128 | 12 |
| Comcast Cable Communications, LLC | 97.0 | 635.0 | 115.0 | 82438 | 253 |
| Cablevision Systems Corp. | 103.0 | 638.0 | 134.5 | 10751 | 76 |
| Time Warner Cable Internet LLC | 112.0 | 661.0 | 123.5 | 44199 | 118 |
| MCI Communications Services, Inc. d/b/a Verizon Business | 112.0 | 627.0 | 114.0 | 28003 | 117 |
| AT&T Services, Inc. | 132.0 | 735.0 | 110.0 | 36504 | 79 |
| Cox Communications Inc. | 132.0 | 693.0 | 122.0 | 16916 | 25 |
| Charter Communications | 140.0 | 721.5 | 137.5 | 15864 | 20 |
| CenturyLink Communications, LLC | 146.0 | 783.0 | 104.0 | 11785 | 26 |

Mobile

| ISP/Network | Median TTFB (ms) | Median PLT (ms) | Median CPU benchmark score | Navtiming sample size | CPU sample size |
| --- | --- | --- | --- | --- | --- |
| Comcast Cable Communications, LLC | 105.0 | 354.0 | 186.0 | 111695 | 248 |
| Frontier Communications of America, Inc. | 106.0 | 418.0 | 174.0 | 9889 | 17 |
| MCI Communications Services, Inc. d/b/a Verizon Business | 117.0 | 346.0 | 203.0 | 33220 | 106 |
| Cablevision Systems Corp. | 117.0 | 370.0 | 200.0 | 12835 | 68 |
| Time Warner Cable Internet LLC | 132.0 | 403.0 | 165.0 | 62121 | 121 |
| Cox Communications Inc. | 147.0 | 458.0 | 260.0 | 22332 | 21 |
| Charter Communications | 157.0 | 469.0 | 191.0 | 24494 | 25 |
| CenturyLink Communications, LLC | 164.0 | 518.5 | 161.0 | 16338 | 23 |
| AT&T Services, Inc. | 165.0 | 476.0 | 214.0 | 54500 | 85 |
| BRIGHT HOUSE NETWORKS, LLC | 188.0 | 488.0 | 208.0 | 9059 | 15 |
| T-Mobile USA, Inc. | 221.0 | 632.0 | 205.5 | 52025 | 200 |
| Cellco Partnership DBA Verizon Wireless | 240.0 | 576.0 | 198.0 | 76614 | 89 |
| Sprint | 271.0 | 663.5 | 374.5 | 6392 | 6 |
| Sprint Personal Communications Systems | 271.0 | 694.0 | 259.0 | 19604 | 25 |
| AT&T Mobility LLC | 286.0 | 679.0 | 115.0 | 58872 | 94 |

@Imarlier @aaron does this ranking match your expectations of US ISPs?

aaron added a comment.Nov 22 2018, 8:22 AM

It looks sane, though I wonder why Comcast is so high in usage for mobile? Is that mostly from touchpad devices instead of smartphones?

Peter added a comment.Nov 22 2018, 8:38 AM

Interesting, thanks @Gilles for sharing! In Sweden the mobile situation is kind of like this: we have a main provider, Telia Company AB, that was previously owned by the public sector. What sets them apart from the rest is that they cover the whole of Sweden. For example, if you are up in the north of Sweden, out in the wilderness, that's the only provider that works, so I guess their p-high is higher than the rest, but it's because they are the only one that supports faraway places. Also (if I remember correctly) Hi3G only supports larger cities, so it's interesting that they don't have the best score.

One thing for mobile, I think Bahnhof AB is only a provider for desktop, so can it be that those values are from phones using wifi?

@aaron is Comcast only a home internet provider, not a mobile one? If that's the case, I think the simple explanation is WiFi usage of mobile phones (see more below about that).

@Peter That's what I expected with the historical operator (legal obligation to support remote areas), which will probably be the case in a lot of European countries. However, while probably not lucrative, there's no theoretical barrier to them providing fast, low-latency internet to the outer regions of Sweden :) They just have to lay more fiber. I.e. I think it's fair that they show up lower than others in the ranking, it highlights how rural areas are suffering internet connection quality inequity.

As for Bahnhof, they're definitely people accessing the mobile site on WiFi. I've only split by mobile site/desktop site, not by device type nor network type. Network type would be nice, but it's only supported by Chrome and Opera, which would force us to reduce the dataset only to those users. I think what we can do, though, since we don't have that local knowledge for every country, is record the network type (not the effective type) and use that information to filter out ASNs that correspond to Wifi connections. The issue of Femtocell will remain, but I doubt there's an ISP out there that gives femtocell by default and no WiFi option by default. So we'll probably be able to do something like "if > 80% of requests on that ASN range are WiFi, we can remove it from the mobile dataset".
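
The 80% rule suggested above could be sketched like this. The field names `asn` and `connection_type` are assumptions; `"wifi"` is one of the values defined for the NetworkInformation `type` attribute, and samples from browsers that don't report a type are simply ignored when computing the ratio:

```python
from collections import Counter


def mostly_wifi_asns(samples, threshold=0.8):
    """Return the ASNs where more than `threshold` of typed samples are WiFi."""
    typed = Counter()
    wifi = Counter()
    for sample in samples:
        connection_type = sample.get("connection_type")
        if connection_type is None:
            # Browser did not expose the connection type; no evidence either way.
            continue
        typed[sample["asn"]] += 1
        if connection_type == "wifi":
            wifi[sample["asn"]] += 1
    return {asn for asn, total in typed.items() if wifi[asn] / total > threshold}


def mobile_dataset(samples, threshold=0.8):
    """Drop every sample that belongs to a mostly-WiFi ASN."""
    excluded = mostly_wifi_asns(samples, threshold)
    return [sample for sample in samples if sample["asn"] not in excluded]
```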

I'll need to add connection type to NavigationTiming, we only collect effectiveConnectionType at the moment. There's a caveat to it, though, which is "The desktop implementation in Chrome 61 excludes connection type and max bandwidth except for ChromeOS.". It's still true in the current stable version of Desktop Chrome. Navigator.connection as a whole is unavailable on iOS Chrome. Which leaves Chrome on Android (+ less importantly, Firefox Mobile, Android Browser, Samsung Internet and Baidu Browser). I verified that it does work on Android Chrome.

Hopefully there are enough Android Chrome users on each ASN hit for the mobile site for us to make that determination. And if there aren't, it probably means that it's not a mobile provider.

Change 475297 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Collect NetworkInformation connection type

https://gerrit.wikimedia.org/r/475297

aaron added a comment.Nov 22 2018, 7:19 PM

@Gilles: Comcast only has cable infrastructure in terms of what the ISP provides itself. Customers with cable can also get XFinity Mobile (https://www.tomsguide.com/us/xfinity-mobile-faq,news-25223.html). That's basically just a bunch of Wi-Fi hotspots built off of Verizon. I don't know how many people are using that, and it seems new-ish. Also, the latency figures are quite low, which makes me doubt that it is XFinity Mobile; it's more likely regular wireless/xfinity.

I'm curious how often people bother activating wifi on their smartphone when it already has a data plan, browsing on their phone even though they presumably have a computer in range (if they are on their own wlan over Comcast Xfinity). Perhaps it's just more than I suspected.

Change 475297 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Collect NetworkInformation connection type

https://gerrit.wikimedia.org/r/475297

As @aaron noted, Comcast is only a home network provider, not mobile. I presume that the rest of that is people on their home wifi with a phone or tablet. That makes sense to me.

The first network in the mobile list that is actually a mobile provider is T-Mobile. Everything above that is a residential high-speed provider (generally cable).

The PLTs for the mobile side really show how much faster the mobile frontend is...

In order to allow future work on this based on the CPU benchmark results, I need to expand the scope of the CPU benchmark beyond the perception survey. I think a sub-sampling ratio of NavTiming samples would make sense.

Change 476021 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Run CPU benchmark for a portion of non-survey samples

https://gerrit.wikimedia.org/r/476021

Change 476021 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Run CPU benchmark for a portion of non-survey samples

https://gerrit.wikimedia.org/r/476021

Change 483377 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Set CPU benchmark sampling factor

https://gerrit.wikimedia.org/r/483377

Change 483377 merged by jenkins-bot:
[operations/mediawiki-config@master] Set CPU benchmark sampling factor

https://gerrit.wikimedia.org/r/483377

Mentioned in SAL (#wikimedia-operations) [2019-01-10T10:10:02Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T209857 Run CPU benchmark for a portion of navtiming pageloads (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2019-01-10T10:26:41Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T209857 Run CPU benchmark for a portion of navtiming pageloads (duration: 00m 52s)

Change 483382 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Increase CPU benchmark sampling factor

https://gerrit.wikimedia.org/r/483382

Change 483382 merged by jenkins-bot:
[operations/mediawiki-config@master] Increase CPU benchmark sampling factor

https://gerrit.wikimedia.org/r/483382

Mentioned in SAL (#wikimedia-operations) [2019-01-10T10:59:29Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T209857 Increase CPU benchmark sampling rate (duration: 00m 53s)

Initial results show that this might work, but I need to wait until the extra data has been collected before I can claim success on devising a fair ranking. Right now for January I only have 12-151 samples per ISP for the US for example (which is logical since pretty much only ruwiki was contributing to the dataset so far) and with that amount I'm seeing big variations in median transferSize. The amount of samples to generate the ranking needs to be large enough that the median transferSize is in the same ballpark for all ranked ISPs. If we don't get there with more data, it's a peculiar finding that users might look at bigger articles depending on which ISP they're subscribed to (could be a rural vs urban thing if that's the case).
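
A rough sketch of the "same ballpark" check on median transferSize. The 10% tolerance and the sample field names are placeholders chosen for illustration, not values taken from the actual reporting script:

```python
from collections import defaultdict
from statistics import median


def median_transfer_size_by_isp(samples):
    """Median transferSize per ISP; each sample has "isp" and "transferSize"."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["isp"]].append(sample["transferSize"])
    return {isp: median(sizes) for isp, sizes in grouped.items()}


def transfer_sizes_comparable(medians, tolerance=0.1):
    """True when every ISP's median is within `tolerance` of the median of medians."""
    reference = median(medians.values())
    return all(
        abs(value - reference) <= tolerance * reference
        for value in medians.values()
    )
```

A country would only be ranked once `transfer_sizes_comparable` holds for its ISPs.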

On staff IRC there was a discussion of rural vs urban. However, mobile is what's worth focusing on, and there's no mapping between IP addresses and city location for mobile. Without asking users to share their location (which we won't do), there's no way to assess whether they're in a rural or urban area. This means that we'll have to stick with national rankings, which shouldn't inform local decisions about which ISP is best, since that can vary greatly based on location. But if an ISP makes an effort to improve its service, that should surface in our updated rankings month-to-month or year-to-year.

Change 484977 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/docroot@master] Autonomous Systems performance report

https://gerrit.wikimedia.org/r/484977

Gilles added a subscriber: Nuria.EditedJan 17 2019, 8:57 AM

@Nuria my plan is to have the report generated by a monthly Python cron script on a stat machine, and have the resulting CSV then git-pushed to the performance/docroot repo (static site where the report will be viewable). Does that sound sane to you? Is there any precedent for doing something like this?

Gilles renamed this task from Create ISP ranking based on RUM data to Create Autonomous Systems ranking based on RUM data.Jan 17 2019, 8:59 AM
Gilles updated the task description. (Show Details)

Change 484994 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Add pycountry to analytics cluster

https://gerrit.wikimedia.org/r/484994

Gilles added a comment.EditedJan 18 2019, 9:21 AM

So I've figured out how I can have a bot push arbitrary file contents to a gerrit change on performance/docroot, but there is one big caveat: it's not creating a git commit per se, and thus it's not running the git hooks that get jekyll to run... I don't think it's reasonable to have node and all the potentially risky dependencies it's going to pull run on the stat machines merely to run jekyll and generate the change we need.

That being said we can still have a changeset generated by a bot monthly... elsewhere. On a labs machine, perhaps. This will ensure that we don't forget to do it manually once a month. But it can't be tied to the actual data generated by the stat machine.

Perhaps the reports should be hosted on https://analytics.wikimedia.org/datasets/ ? And I'll look into whether I can have Jekyll pull files from there.

The workflow would be:

  • Monthly cron job on a stat machine generates the report and publishes it on datasets
  • Monthly cron job on a labs machine generates and pushes the commit to update the static site, by pulling the data from datasets, running Jekyll locally, and pushing changes to Gerrit

Unrelated to this, this makes me think that we could maybe also use a Gerrit bot to automatically generate a commit that updates the recent blog posts on the perf site, if the Phabricator API has a webhook or something for when a blog post is published?

I've created a performance/autonomoussystems folder on /srv/published-datasets on stat1004, which should get picked up and published on https://analytics.wikimedia.org/datasets/ eventually.

As expected, here it is: https://analytics.wikimedia.org/datasets/performance/autonomoussystems/

This will be a fine place to publish the monthly report. I will put a fake one in there (full of zeros) to test the entire pipeline.

Change 484977 merged by jenkins-bot:
[performance/docroot@master] Autonomous Systems performance report

https://gerrit.wikimedia.org/r/484977

Change 486233 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/docroot@master] Use date of remote TSV file as report generation date

https://gerrit.wikimedia.org/r/486233

Change 486233 merged by jenkins-bot:
[performance/docroot@master] Use date of remote TSV file as report generation date

https://gerrit.wikimedia.org/r/486233

Gilles raised the priority of this task from Normal to High.Wed, Feb 27, 10:03 AM
Gilles added a comment.Mon, Mar 4, 3:11 PM

The reporting script is getting very close to being finalised and ready for review. However, generating the numbers for February, I notice that the only 2 countries where there's enough data for the average transferSize to be stable across ISPs are France and Russia, which have been getting more CPU benchmark runs thanks to the performance survey. Since we haven't had complaints in those countries, I think it's time to run the CPU benchmark more often across the board.

Change 494234 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Increase CPU benchmark sampling factor

https://gerrit.wikimedia.org/r/494234

Change 494234 merged by jenkins-bot:
[operations/mediawiki-config@master] Increase CPU benchmark sampling factor

https://gerrit.wikimedia.org/r/494234

Change 494250 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/asoranking@master] Initial commit

https://gerrit.wikimedia.org/r/494250

Change 494259 had a related patch set uploaded (by Gilles; owner: Gilles):
[integration/config@master] Add CI for new performance/asoranking repo

https://gerrit.wikimedia.org/r/494259

Change 494259 merged by jenkins-bot:
[integration/config@master] Add CI for new performance/asoranking repo

https://gerrit.wikimedia.org/r/494259

Gilles added a comment.Thu, Mar 7, 9:17 AM

@Nuria what do you recommend to deploy this on stat machines? scap?

Is there precedent in Puppet for a cron job that only needs to run on a single stat machine?

Nuria added a subscriber: Ottomata.Thu, Mar 7, 6:22 PM

@Gilles you can deploy to stat machines and later sync the output to the /datasets mount so it's available publicly. The cron can live under your user too; having it puppetized just ensures that things don't break if we move machines, but it's not strictly needed. cc @Ottomata here to double check my advice.

Gilles added a comment.Thu, Mar 7, 7:09 PM

I can live with the cron being under my user. The script is indeed writing to the datasets mount with the --publish option. Then the performance site pulls the TSV from there with a bot and publishes it in a more human-friendly form at https://performance.wikimedia.org/asreport/

Not '/datasets mount', but just the /srv/published-datasets directory.

https://wikitech.wikimedia.org/wiki/Analytics/Ad_hoc_datasets

...right?

Gilles added a comment.Fri, Mar 8, 7:59 AM

Yes, that's where I plan on sending these to. Specifically to /srv/published-datasets/performance/autonomoussystems/
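
For illustration only, a sketch of what a `--publish` switch writing a TSV into that directory might look like. This is not the actual performance/asoranking script: the file name, the column names, the `--output-dir` option and the all-zero placeholder row are assumptions.

```python
import argparse
import csv
from pathlib import Path

# Directory named in this task; only used when --publish is passed.
PUBLISH_DIR = Path("/srv/published-datasets/performance/autonomoussystems")


def write_report(rows, directory, filename="asreport.tsv"):
    """Write the report rows as a tab-separated file and return its path.

    The file name and the column names are hypothetical.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / filename
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["country", "network", "score"])
        writer.writerows(rows)
    return path


def main(argv=None):
    parser = argparse.ArgumentParser(description="Generate the AS report (sketch).")
    parser.add_argument("--publish", action="store_true",
                        help="write to the published-datasets directory")
    parser.add_argument("--output-dir", default=".",
                        help="where to write when --publish is not given")
    args = parser.parse_args(argv)

    # Placeholder data; a real run would compute these rows from RUM data.
    rows = [("FR", "Example Network", 0)]
    target = PUBLISH_DIR if args.publish else Path(args.output_dir)
    return write_report(rows, target)
```

Calling `main(["--output-dir", "/tmp/asreport-test"])` writes the placeholder file locally, which is a convenient way to test the rest of the pipeline before publishing for real.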

Change 494250 merged by jenkins-bot:
[performance/asoranking@master] Initial commit

https://gerrit.wikimedia.org/r/494250

Change 484994 merged by Ottomata:
[operations/puppet@production] Add pycountry to analytics cluster

https://gerrit.wikimedia.org/r/484994