Create ISP ranking based on RUM data
Open, Normal, Public

Description

Such a ranking should be fair to the ISPs and compensate for confounding factors, such as the fact that an ISP offering cheaper plans is likely to correlate with users who have cheaper, lower-performing devices.

We need to come up with a score that represents the ISP's responsibility in the performance mix, factoring out our own share (code, physical distance to the DC) and the device's (effective processing power).

Gilles created this task. Nov 19 2018, 4:01 PM
Gilles triaged this task as Normal priority. Nov 19 2018, 4:01 PM
Gilles added a comment. Nov 19 2018, 4:15 PM

Given the data we have, I think we can use:

MaxMind to map IPs to ASNs

I'll start with France as an initial case study, as it's the market I know best. I'll be able to check how ASNs map to real ISPs, see what happens with MVNOs and also compare the results to the ISPs' existing reputation.

Without any change to data retention, a "live" model will therefore be limited to 90 days of data. I think that's fine, as a monthly or quarterly update of the ranking would be great. We certainly don't need to update it more frequently.

MaxMind to map IPs to location

This should allow us to approximate the physical distance between the client and the nearest Wikimedia datacenter. Combined with the speed of light, that distance gives us the absolute minimum latency achievable in a perfect world. We can subtract that (times the minimum number of roundtrips) from the effective TTFB, to compensate for our own responsibility for where we've physically located our edge caches.

Ideally we would look at the distance to the DC the user was actually routed to, as the DNS georouting is also our own responsibility.
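As a rough sketch of that subtraction, and not something taken from the actual analysis: the haversine formula gives the great-circle distance, and dividing the round-trip distance by the speed of light gives the theoretical floor. The function names, the coordinates in the usage note and the `roundtrips` default are all assumptions made for illustration. The vacuum speed of light is used, which understates the floor since light in fiber travels more slowly.

```python
import math

# Speed of light in a vacuum, in kilometers per millisecond.
# Light in optical fiber is slower, so this is a deliberately generous floor.
SPEED_OF_LIGHT_KM_PER_MS = 299.792458
EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_latency_ms(distance_km, roundtrips=1):
    """Lowest possible time for `roundtrips` round trips over `distance_km`."""
    return roundtrips * 2 * distance_km / SPEED_OF_LIGHT_KM_PER_MS


def adjusted_ttfb_ms(ttfb_ms, client, datacenter, roundtrips=1):
    """TTFB minus the share that physical distance alone accounts for.

    `client` and `datacenter` are (lat, lon) tuples; `roundtrips` is an
    assumption the caller has to choose (TCP handshake, TLS, request...).
    """
    distance = great_circle_km(*client, *datacenter)
    return ttfb_ms - min_latency_ms(distance, roundtrips)
```

For example, `adjusted_ttfb_ms(150.0, (48.8566, 2.3522), (52.3676, 4.9041), roundtrips=3)` uses approximate coordinates for Paris and Amsterdam, roughly 430 km apart, and removes a little under 9 ms from the measured value.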

TTFB

In the form of responseStart - connectStart. This would limit our own client-side code's responsibility in potential slowdowns, by focusing only on the first roundtrips.

Effective connection type

The effective connection type ratios should be quite representative of the ISP's infrastructure.

Device memory

This should allow us to get a sense of how overloaded a particular device might be, which isn't the ISP's fault. If practical, it could be used to simply filter out devices where the available memory is low.

CPU benchmark

Either through blacklisting, bucketizing or factoring in percentiles into the score calculation, this should be the main way we normalize the device mix between ISPs, compensating for the cheap ISP => cheap underpowered device correlation.
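To make the bucketizing option concrete, here is a minimal sketch under stated assumptions: samples are grouped by CPU benchmark bucket, the median TTFB is taken per ISP within each bucket, and only buckets that every ISP populates are averaged, with equal weight. The bucket edges, the `min_per_bucket` cutoff and the equal weighting are illustrative choices, not a method decided in this task.

```python
from collections import defaultdict
from statistics import median


def bucket_for(cpu_score, edges):
    """Index of the bucket a CPU score falls into; `edges` are ascending upper bounds."""
    for index, edge in enumerate(edges):
        if cpu_score <= edge:
            return index
    return len(edges)


def normalized_ttfb(samples, edges, min_per_bucket=1):
    """Average of per-bucket median TTFB for each ISP.

    `samples` is an iterable of (isp, cpu_score, ttfb_ms) tuples.
    Only buckets where every ISP has at least `min_per_bucket` samples are
    used, so each ISP is compared on the same device classes.
    """
    grouped = defaultdict(list)
    for isp, cpu_score, ttfb_ms in samples:
        grouped[(isp, bucket_for(cpu_score, edges))].append(ttfb_ms)

    isps = {isp for isp, _ in grouped}
    buckets = sorted({bucket for _, bucket in grouped})
    shared = [
        bucket for bucket in buckets
        if all(len(grouped.get((isp, bucket), [])) >= min_per_bucket for isp in isps)
    ]
    if not shared:
        return {}
    return {
        isp: sum(median(grouped[(isp, bucket)]) for bucket in shared) / len(shared)
        for isp in isps
    }
```

With this approach, two ISPs whose users perform identically on comparable devices get the same score even if one of them has a much larger share of slow devices, which is the correlation the ranking is meant to compensate for.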

Gilles moved this task from Inbox to Doing on the Performance-Team board. Nov 19 2018, 9:12 PM
Gilles updated the task description. Nov 20 2018, 4:46 PM
Gilles added a comment. Nov 21 2018, 12:02 PM

I've got initial results for France, looking at NavigationTiming data in November. Lower values are better (except for sample size, obviously).

The definition of "TTFB" here is responseStart - connectStart
And the definition of "PLT" is loadEventStart - responseStart
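For reference, the two definitions and the per-ISP aggregation can be sketched as follows. This is not the query that produced the tables in this task; the event dictionaries, the `isp` field (standing in for the result of the IP to ASN lookup) and the `min_samples` cutoff are assumptions.

```python
from collections import defaultdict
from statistics import median


def ttfb(event):
    """TTFB as defined above: responseStart - connectStart (milliseconds)."""
    return event["responseStart"] - event["connectStart"]


def plt(event):
    """PLT as defined above: loadEventStart - responseStart (milliseconds)."""
    return event["loadEventStart"] - event["responseStart"]


def rank_isps(events, min_samples=1):
    """Group events by ISP and sort by median TTFB, lowest (best) first.

    `isp` is a hypothetical field holding the result of the IP -> ASN lookup;
    `min_samples` is an assumed cutoff for dropping thinly sampled networks.
    """
    per_isp = defaultdict(list)
    for event in events:
        per_isp[event["isp"]].append(event)

    rows = []
    for isp, isp_events in per_isp.items():
        if len(isp_events) < min_samples:
            continue
        rows.append({
            "isp": isp,
            "median_ttfb": median(ttfb(e) for e in isp_events),
            "median_plt": median(plt(e) for e in isp_events),
            "samples": len(isp_events),
        })
    return sorted(rows, key=lambda row: row["median_ttfb"])
```

Running this separately on mobile-site and desktop-site events would give one ranking per platform.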

Desktop

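| ISP/Network | Median TTFB (ms) | Median PLT (ms) | Median CPU benchmark score | Navtiming sample size | CPU sample size |
| --- | --- | --- | --- | --- | --- |
| Renater | 94.0 | 732.5 | 102.0 | 7618 | 1759 |
| Bouygues Telecom | 145.0 | 933.0 | 140.5 | 9360 | 2280 |
| SFR | 150.0 | 912.0 | 133.0 | 18212 | 4298 |
| Free | 155.0 | 978.0 | 141.0 | 17414 | 4248 |
| Orange | 162.0 | 932.0 | 131.0 | 36213 | 8686 |
| Free Mobile | 248.0 | 1035.5 | 118.0 | 1208 | 274 |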

Mobile

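| ISP/Network | Median TTFB (ms) | Median PLT (ms) | Median CPU benchmark score | Navtiming sample size | CPU sample size |
| --- | --- | --- | --- | --- | --- |
| Free | 151.0 | 707.0 | 341.0 | 19371 | 5205 |
| Bouygues Telecom | 186.0 | 647.0 | 305.0 | 21515 | 5720 |
| Orange | 195.0 | 689.0 | 312.0 | 48236 | 12973 |
| SFR | 203.0 | 661.0 | 318.0 | 31701 | 8484 |
| Free Mobile | 246.0 | 740.0 | 318.0 | 10799 | 2807 |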

These results make sense for various reasons:

  • Renater is the very high-speed fiber network linking universities and research facilities. It makes absolute sense for it to be at the top of that ranking. It also explains why its users have significantly more powerful CPUs than those of actual ISPs. From personal experience when I was at university, it's always been way ahead of general public ISPs in terms of performance. It was so far ahead of its time that, when I was a student, I destroyed my laptop's hard drive by downloading too much stuff. My PC's hardware was the bottleneck, basically.
  • Orange is the historical operator and has a legal obligation to cover remote areas with older infrastructure. In very rural areas Orange is often the only option, which explains why on Desktop they have slightly worse performance than their competitors.
  • For mobile, Free is an odd one, because it shows up twice. In reality, it shows up three times. For mobile internet access in France, operators are allowed to leverage the Orange network for a fixed price determined by the government. Free makes heavy use of this and offloads a lot of its mobile traffic to Orange, by having its customers simply connect to Orange cell towers when there's no Free tower around. Additionally, Free uses a little-known SIM card feature that automatically enables authenticated login to WiFi hotspots that belong to subscribers of Free's fiber/DSL offering. This means that as you're walking the streets, your phone will automatically connect to the WiFi networks of random people who have Free at home. Add to that people doing this at their own home if they're also Free subscribers for home internet (this is very common, as Free incentivises it with all-inclusive home+mobile packages), and this is probably why "Free" shows up in the mobile ranking, which would suggest the home internet IP range. It's possible that they're also using IPs in all their ranges interchangeably, but I wouldn't be surprised if this difference is explained by the WiFi feature alone. In the end "Free Mobile" has far fewer hits than you might expect given their market share, because Free essentially installed very little cell tower capacity (probably close to the legal minimum) and relies heavily on Orange and WiFi. Again, from first-hand experience, as a Free Mobile customer you *know* when you've connected to a Free Mobile cell tower, as it's overloaded and slow, and sometimes downright unusable during peak times. I'm not surprised at all that they're the worst in the ranking.

Using this local knowledge to explain the differences seen here confirms to me that the data is sound and that we can extract a ranking from it.

Gilles added a subscriber: Peter. Nov 21 2018, 12:25 PM

For Sweden we get:

Desktop

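| ISP/Network | Median TTFB (ms) | Median PLT (ms) | Median CPU benchmark score | Navtiming sample size | CPU sample size |
| --- | --- | --- | --- | --- | --- |
| Telenor Norge AS | 94.0 | 582.0 | 160.0 | 3279 | 21 |
| Bahnhof AB | 95.0 | 555.0 | 110.0 | 2037 | 11 |
| A3 Sverige AB | 101.0 | 605.0 | 113.0 | 1079 | 8 |
| Telia Company AB | 105.0 | 622.0 | 136.5 | 6357 | 26 |
| Bredband2 AB | 113.0 | 586.0 | 120.0 | 1309 | 3 |
| Com Hem AB | 115.0 | 614.0 | 174.0 | 3441 | 28 |
| TELE2 | 128.0 | 689.0 | 184.5 | 1256 | 4 |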

Mobile

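| ISP/Network | Median TTFB (ms) | Median PLT (ms) | Median CPU benchmark score | Navtiming sample size | CPU sample size |
| --- | --- | --- | --- | --- | --- |
| Telenor Norge AS | 113.0 | 388.5 | 228.0 | 3528 | 19 |
| Bahnhof AB | 116.0 | 351.0 | 312.0 | 2045 | 6 |
| A3 Sverige AB | 127.0 | 406.0 | 408.0 | 1251 | 3 |
| Bredband2 AB | 131.5 | 379.0 | 322.0 | 1160 | 8 |
| Com Hem AB | 134.0 | 381.0 | 185.5 | 4573 | 26 |
| Telia Company AB | 143.0 | 428.0 | 207.5 | 11300 | 34 |
| TELE2 | 221.0 | 532.0 | 262.0 | 3956 | 33 |
| Telenor Sverige AB | 235.0 | 486.0 | 827.0 | 2488 | 4 |
| Hi3G Access AB | 262.0 | 550.0 | 126.0 | 1882 | 5 |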

@Peter do these figures make sense? For instance Tele2 having much worse latency than competitors and their customers having slower phones?


And now, 'murica:

Desktop

| ISP/Network | Median TTFB (ms) | Median PLT (ms) | Median CPU benchmark score | Navtiming sample size | CPU sample size |
| --- | --- | --- | --- | --- | --- |
| Frontier Communications of America, Inc. | 96.0 | 702.0 | 80.0 | 7128 | 12 |
| Comcast Cable Communications, LLC | 97.0 | 635.0 | 115.0 | 82438 | 253 |
| Cablevision Systems Corp. | 103.0 | 638.0 | 134.5 | 10751 | 76 |
| Time Warner Cable Internet LLC | 112.0 | 661.0 | 123.5 | 44199 | 118 |
| MCI Communications Services, Inc. d/b/a Verizon Business | 112.0 | 627.0 | 114.0 | 28003 | 117 |
| AT&T Services, Inc. | 132.0 | 735.0 | 110.0 | 36504 | 79 |
| Cox Communications Inc. | 132.0 | 693.0 | 122.0 | 16916 | 25 |
| Charter Communications | 140.0 | 721.5 | 137.5 | 15864 | 20 |
| CenturyLink Communications, LLC | 146.0 | 783.0 | 104.0 | 11785 | 26 |

Mobile

| ISP/Network | Median TTFB (ms) | Median PLT (ms) | Median CPU benchmark score | Navtiming sample size | CPU sample size |
| --- | --- | --- | --- | --- | --- |
| Comcast Cable Communications, LLC | 105.0 | 354.0 | 186.0 | 111695 | 248 |
| Frontier Communications of America, Inc. | 106.0 | 418.0 | 174.0 | 9889 | 17 |
| MCI Communications Services, Inc. d/b/a Verizon Business | 117.0 | 346.0 | 203.0 | 33220 | 106 |
| Cablevision Systems Corp. | 117.0 | 370.0 | 200.0 | 12835 | 68 |
| Time Warner Cable Internet LLC | 132.0 | 403.0 | 165.0 | 62121 | 121 |
| Cox Communications Inc. | 147.0 | 458.0 | 260.0 | 22332 | 21 |
| Charter Communications | 157.0 | 469.0 | 191.0 | 24494 | 25 |
| CenturyLink Communications, LLC | 164.0 | 518.5 | 161.0 | 16338 | 23 |
| AT&T Services, Inc. | 165.0 | 476.0 | 214.0 | 54500 | 85 |
| BRIGHT HOUSE NETWORKS, LLC | 188.0 | 488.0 | 208.0 | 9059 | 15 |
| T-Mobile USA, Inc. | 221.0 | 632.0 | 205.5 | 52025 | 200 |
| Cellco Partnership DBA Verizon Wireless | 240.0 | 576.0 | 198.0 | 76614 | 89 |
| Sprint | 271.0 | 663.5 | 374.5 | 6392 | 6 |
| Sprint Personal Communications Systems | 271.0 | 694.0 | 259.0 | 19604 | 25 |
| AT&T Mobility LLC | 286.0 | 679.0 | 115.0 | 58872 | 94 |

@Imarlier @aaron does this ranking match your expectations of US ISPs?

aaron added a comment. Nov 22 2018, 8:22 AM

It looks sane, though I wonder why Comcast is so high in usage for mobile? Is that mostly from touchpad devices instead of smartphones?

Peter added a comment. Nov 22 2018, 8:38 AM

Interesting, thanks @Gilles for sharing! In Sweden the mobile situation is kind of like this: we have a main provider, Telia Company AB, that was previously owned by the public sector. What sets them apart from the rest is that they have coverage for the whole of Sweden. For example, if you are up in the north of Sweden, out in the wilderness, that's the only provider that works, so I guess their p-high is higher than the rest, but that's because they are the only one that supports faraway places. Also (if I remember correctly) Hi3G only supports larger cities, so it's interesting that they don't have the best score.

One thing for mobile: I think Bahnhof AB is only a provider for desktop, so could it be that those values are from phones using WiFi?

@aaron is Comcast only a home internet provider, not a mobile one? If that's the case, I think the simple explanation is WiFi usage of mobile phones (see more below about that).

@Peter That's what I expected with the historical operator (legal obligation to support remote areas), which will probably be the case in a lot of European countries. However, while probably not lucrative, there's no theoretical barrier to them providing fast, low-latency internet to the outer regions of Sweden :) They just have to lay more fiber. I.e. I think it's fair that they show up lower than others in the ranking; it highlights how rural areas suffer from unequal internet connection quality.

As for Bahnhof, those are definitely people accessing the mobile site on WiFi. I've only split by mobile site/desktop site, not by device type or network type. Network type would be nice, but it's only supported by Chrome and Opera, which would force us to reduce the dataset to those users only. I think what we can do, though, since we don't have that local knowledge for every country, is record the network type (not the effective type) and use that information to filter out ASNs that correspond to WiFi connections. The issue of femtocells will remain, but I doubt there's an ISP out there that provides a femtocell by default and no WiFi option by default. So we'll probably be able to do something like "if > 80% of requests on that ASN range are WiFi, we can remove it from the mobile dataset".
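The 80% rule could look something like the sketch below. The input shape (pairs of ASN and connection type) and the function name are assumptions; `"wifi"` is one of the values the NetworkInformation `type` property can report.

```python
from collections import Counter


def asns_to_keep(requests, wifi_threshold=0.8):
    """Return the ASNs that stay in the mobile dataset.

    `requests` is an iterable of (asn, connection_type) pairs, where
    connection_type is the NetworkInformation `type` value ("wifi",
    "cellular", ...). An ASN is dropped when more than `wifi_threshold`
    of its requests were made over WiFi.
    """
    totals = Counter()
    wifi = Counter()
    for asn, connection_type in requests:
        totals[asn] += 1
        if connection_type == "wifi":
            wifi[asn] += 1
    return {asn for asn, count in totals.items() if wifi[asn] / count <= wifi_threshold}
```

Only requests that actually report a connection type should be fed in, since browsers without the API would otherwise dilute the WiFi share.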

I'll need to add the connection type to NavigationTiming; we only collect effectiveConnectionType at the moment. There's a caveat to it, though, which is "The desktop implementation in Chrome 61 excludes connection type and max bandwidth except for ChromeOS." That's still true in the current stable version of desktop Chrome. Navigator.connection as a whole is unavailable on iOS Chrome. Which leaves Chrome on Android (plus, less importantly, Firefox Mobile, Android Browser, Samsung Internet and Baidu Browser). I verified that it does work on Android Chrome.

Hopefully there are enough Android Chrome users on each ASN hit for the mobile site for us to make that determination. And if there aren't, it probably means that it's not a mobile provider.

Change 475297 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Collect NetworkInformation connection type

https://gerrit.wikimedia.org/r/475297

aaron added a comment. Nov 22 2018, 7:19 PM

@Gilles: Comcast only has cable infrastructure in terms of what the ISP provides itself. Customers with cable can also get XFinity Mobile (https://www.tomsguide.com/us/xfinity-mobile-faq,news-25223.html). That's basically just a bunch of Wi-Fi hotspots built off of Verizon. I don't know how many people are using it, and it seems new-ish. Also, the latency figures are quite low, which makes me doubt that it is XFinity Mobile; more likely it's regular wireless/xfinity.

I'm curious how often people bother activating wifi on their smartphone when it already has a data plan, browsing on their phone even though they presumably have a computer in range (if they are on their own wlan over Comcast Xfinity). Perhaps it's just more than I suspected.

Change 475297 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Collect NetworkInformation connection type

https://gerrit.wikimedia.org/r/475297

As @aaron noted, Comcast is only a home network provider, not a mobile one. I presume that the rest of that is people on their home WiFi with a phone or tablet. That makes sense to me.

The first network in the mobile list that is actually a mobile provider is T-Mobile. Everything above that is a residential high-speed provider (generally cable).

The PLTs for the mobile side really show how much faster the mobile frontend is...

In order to allow future work on this based on the CPU benchmark results, I need to expand the scope of the CPU benchmark beyond the perception survey. I think a sub-sampling ratio of NavTiming samples would make sense.

Change 476021 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Run CPU benchmark for a portion of non-survey samples

https://gerrit.wikimedia.org/r/476021

Change 476021 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Run CPU benchmark for a portion of non-survey samples

https://gerrit.wikimedia.org/r/476021

Change 483377 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Set CPU benchmark sampling factor

https://gerrit.wikimedia.org/r/483377

Change 483377 merged by jenkins-bot:
[operations/mediawiki-config@master] Set CPU benchmark sampling factor

https://gerrit.wikimedia.org/r/483377

Mentioned in SAL (#wikimedia-operations) [2019-01-10T10:10:02Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T209857 Run CPU benchmark for a portion of navtiming pageloads (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2019-01-10T10:26:41Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T209857 Run CPU benchmark for a portion of navtiming pageloads (duration: 00m 52s)

Change 483382 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Increase CPU benchmark sampling factor

https://gerrit.wikimedia.org/r/483382

Change 483382 merged by jenkins-bot:
[operations/mediawiki-config@master] Increase CPU benchmark sampling factor

https://gerrit.wikimedia.org/r/483382

Mentioned in SAL (#wikimedia-operations) [2019-01-10T10:59:29Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T209857 Increase CPU benchmark sampling rate (duration: 00m 53s)

Initial results show that this might work, but I need to wait until the extra data has been collected before I can claim success on devising a fair ranking. Right now for January I only have 12-151 samples per ISP for the US, for example (which is logical, since pretty much only ruwiki was contributing to the dataset so far), and with that amount I'm seeing big variations in median transferSize. The number of samples used to generate the ranking needs to be large enough that the median transferSize is in the same ballpark for all ranked ISPs. If we don't get there with more data, it would be a peculiar finding: that users might look at bigger articles depending on which ISP they're subscribed to (could be a rural vs urban thing if that's the case).
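One way to make "same ballpark" checkable is to compare the largest and smallest per-ISP median transferSize. This is only a sketch: the 1.2 ratio is an arbitrary placeholder, and the function names and input shape are assumptions.

```python
from statistics import median


def transfer_size_spread(sizes_by_isp):
    """Ratio of the largest to the smallest per-ISP median transferSize.

    `sizes_by_isp` maps an ISP name to a list of transferSize values (bytes).
    ISPs with no samples are ignored.
    """
    medians = [median(sizes) for sizes in sizes_by_isp.values() if sizes]
    return max(medians) / min(medians)


def sizes_comparable(sizes_by_isp, max_ratio=1.2):
    """True when all per-ISP medians are within `max_ratio` of each other."""
    return transfer_size_spread(sizes_by_isp) <= max_ratio
```

If the check still fails once more data is in, that would point to the ISP-dependent article size effect rather than to sampling noise.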

On staff IRC there was a discussion of rural vs urban. However, I think mobile is what's worth focusing on, and there's no mapping between IP addresses and city location for mobile. Without asking users to share their location (which we won't do), there's no way to assess whether they're in a rural or urban area. This means that we'll have to stick with national rankings. Those shouldn't inform local decisions about which ISP is best, since that can vary greatly based on location, but if an ISP makes an effort to improve its service, it should surface in our updated rankings month-to-month or year-to-year.