Define turn-up process and scope for eqsin service to regional countries
Closed, Resolved (Public)

Description

To be defined here:

  • The total geographic target scope of the initial turn-up process at the per-country level without reference to ambiguous region/continent names (but note also temporary restrictions against rollout to certain countries necessitated by Zero in T189250).
  • The process by which we're going to evaluate the state of our peering/transit to customers in the target countries before turning up service for each in our geodns config, and where/how we'll record key dates (geodns initial turn-up -> done w/ post-turn-up optimizations/fixups) for each country for later impact analysis by Performance-Team and/or Analytics.

EQSIN Target Country List:
In "Asia" (as defined by MaxMind):

| CC | Country | Turn up | Status | Reason |
| --- | --- | --- | --- | --- |
| BD | Bangladesh | Y | Done | Peering, Transit, Atlas |
| BN | Brunei Darussalam | Y | Done | Tests |
| BT | Bhutan | Y | Done | Atlas |
| CC | Cocos (Keeling) Islands | Y | Done | Manual testing/guessing |
| CN | China | N | - | Out of the prefixes tested in our top hit CN ASN, traffic always route through the US |
| CX | Christmas Island | Y | Done | Manual testing/guessing |
| HK | Hong Kong | Y | Done | Atlas |
| ID | Indonesia | Y | Done | Atlas |
| IN | India | Y | Done | Transit, Peering, Atlas |
| IO | British Indian Ocean Territory | N | - | Satellite to London area (esams closer) |
| JP | Japan | Y | Done | Peering, Transit, Atlas, Tests |
| KH | Cambodia | Y | Done | Peering, Tests, Atlas |
| KP | Korea, Democratic People's Republic of | N | - | Tests (last ASN (china Unicom) lower latency to ulsfo) |
| KR | Korea, Republic of | Y | Done | Peering, Tests |
| LA | Lao People's Democratic Republic | Y | Done | Peering, Tests |
| LK | Sri Lanka | Y | Done | Atlas |
| MM | Myanmar | Y | <ZERO> | Tests, Peering |
| MN | Mongolia | Y | Done | Tests, Atlas (esams a few ms behind eqsin) |
| MO | Macao | Y | Done | Tests, Transit |
| MV | Maldives | Y | Done | Peering |
| MY | Malaysia | Y | Done | Atlas |
| NP | Nepal | Y | Done | Atlas, Peering |
| PH | Philippines | Y | Done | Atlas, Peering, Tests |
| PK | Pakistan | Y | Done | Peering, Transit, Atlas |
| SG | Singapore | Y | Done | Peering, Transit |
| TH | Thailand | Y | <ZERO> | Peering, Transit, Atlas |
| TL | Timor-Leste | Y | <ZERO> | Peering, Tests |
| TW | Taiwan | Y | Done | Atlas |
| VN | Viet Nam | Y | Done | Transit, Atlas |

In "Oceania" (as defined by MaxMind):

| CC | Country | Turn Up | Status | Reason/Note |
| --- | --- | --- | --- | --- |
| AS | American Samoa | N | - | Tests (lower latency to ulsfo) |
| AU | Australia | Y | Done | T189252#4108532 |
| CK | Cook Islands | N | - | Tests (lower latency to ulsfo, via Hawaii) |
| FJ | Fiji | N | <ZERO> | Tests (lower latency to ulsfo) |
| FM | Micronesia, Federated States of | Y | Done | |
| GU | Guam | Y | Done | Tests, variable between eqsin/ulsfo, geographically closer to eqsin |
| KI | Kiribati | Y | Done | Transit |
| MH | Marshall Islands | Y | Done | Transit, Tests |
| MP | Northern Mariana Islands | Y | Done | Via Guam |
| NC | New Caledonia | Y | Done | Tests |
| NF | Norfolk Island | N | - | Tests (lower latency to ulsfo) |
| NR | Nauru | Y | <ZERO> | Tests |
| NU | Niue | N | - | Via NZ |
| NZ | New Zealand | N | - | Tests (lower latency to ulsfo) |
| PF | French Polynesia | N | - | Tests (lower latency to ulsfo) |
| PG | Papua New Guinea | N | - | Tests (lower latency to ulsfo) |
| PN | Pitcairn | ? | TBD | Probably Satellite |
| PW | Palau | Y | Done | Tests, Transit |
| SB | Solomon Islands | N | - | Tests (lower latency to ulsfo) |
| TK | Tokelau | ? | TBD | Probably Satellite |
| TO | Tonga | Y | <ZERO> | Tests |
| TV | Tuvalu | Y | Done | Transit, Tests |
| UM | United States Minor Outlying Islands | Y | Done | Transit |
| VU | Vanuatu | N | <ZERO> | Tests (lower latency to ulsfo) |
| WF | Wallis and Futuna | N | - | tests (lower latency to ulsfo) |
| WS | Samoa | N | - | Tests (lower latency to ulsfo) |

Reasons legend:

  • <ZERO> in the "Status" column indicates that turn-up here is currently blocked by Zero, regardless of network-layer analysis
  • Atlas: Latency shown by the RIPE Atlas probes matches the distance
  • Peering: We peer with a major ISP in the country (per T186835)
  • Transit: We share transit providers with major ISPs in the country
  • Tests: Manual comparison of latency and path to eqsin and to the 2nd closest DC

Event Timeline

BBlack triaged this task as Medium priority. Mar 8 2018, 9:41 PM
BBlack created this task.
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.
Krinkle added a subscriber: Krinkle.
BBlack updated the task description. Mar 9 2018, 2:54 PM

Updated with actual target country lists above. Process and batching of this for actual turn-up work still TODO :)

See also the peering information tracked in T186835

ayounsi updated the task description. Mar 15 2018, 2:16 AM
ayounsi added a comment (edited). Mar 15 2018, 2:25 AM

The methodology is the following:

1/ Identify the countries' main ISPs
For countries with a low number of ISPs, lookup country's ISPs in https://bgp.he.net/country/[country code]

For larger countries use hive (thanks elukey for the help):

hive (wmf)> select isp_data['autonomous_system_number'] as AS, count(1) as hits from webrequest where webrequest_source='text' and year=2018 and month=3 and day=13 and hour=0 and geocoded_data['country_code'] = 'LA' group by isp_data['autonomous_system_number'] order by hits DESC limit 10;

This only processes 1h worth of requests, but I assume that in the middle of the day, that's enough to rank ISPs.

2/ Look at their peers (or peer's peers) to identify common ones.
3/ Pick an IP out of their prefixes, run mtr -z <IP> from eqsin and ulsfo, analyse the overall latency + AS path
4/ Use https://openipmap.ripe.net/ to locate the IP if in doubt about its real location
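For anyone working from an exported sample of request records rather than from hive directly, the ranking in step 1 can be sketched in Python as below. This is only an illustration of the same group-by/count/order logic: the record layout (`asn`, `country_code` keys) and the sample values are assumptions, and the AS numbers are taken from the range reserved for documentation, not from real traffic.

```python
from collections import Counter


def rank_asns(requests, country_code, limit=10):
    """Return up to `limit` (asn, hits) pairs for one country, most hits first."""
    hits = Counter(
        r["asn"] for r in requests if r["country_code"] == country_code
    )
    return hits.most_common(limit)


# Hypothetical sample records; ASNs 64500-64502 are documentation-only numbers.
sample = [
    {"asn": 64500, "country_code": "LA"},
    {"asn": 64501, "country_code": "LA"},
    {"asn": 64500, "country_code": "LA"},
    {"asn": 64502, "country_code": "SG"},
    {"asn": 64500, "country_code": "LA"},
]

top_la = rank_asns(sample, "LA")
print(top_la)
```

As with the hive query, a single hour of data is only good for a rough ranking, not for absolute traffic figures.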

This is without doing any routing changes/preferences. Some tuning will have to be done for optimal routing. For example:

  • Some transit networks internally route traffic through less-optimal paths
  • Some networks are satellite-based
  • Some traffic prefers one AS path, while another seems shorter
  • Found some potential networks to peer with

The most important point is to not degrade latency. For example, some Pacific islands have better connectivity to the US even though they are geographically closer to Singapore.
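The "do not degrade latency" rule can be sketched as a simple comparison of round-trip times measured from each site. This is a hedged illustration, not the actual tooling used here: the function name, the `min_gain_ms` safety margin, the use of the median, and the sample RTT values are all assumptions made for the example.

```python
from statistics import median


def pick_site(rtts_ms, current="ulsfo", candidate="eqsin", min_gain_ms=0.0):
    """Pick `candidate` only if its median RTT beats `current` by more than min_gain_ms.

    rtts_ms maps a site name to a list of RTT samples (milliseconds) towards
    the same destination prefix. Ties and small gains keep the current site,
    so a move never makes latency worse.
    """
    gain = median(rtts_ms[current]) - median(rtts_ms[candidate])
    return candidate if gain > min_gain_ms else current


# Hypothetical measurements, for illustration only.
near_singapore = {"ulsfo": [180.0, 175.0, 190.0], "eqsin": [60.0, 65.0, 58.0]}
pacific_island = {"ulsfo": [120.0, 118.0, 125.0], "eqsin": [210.0, 205.0, 220.0]}

print(pick_site(near_singapore))   # candidate wins on latency
print(pick_site(pacific_island))   # stays on the current site despite geography
```

A non-zero `min_gain_ms` would keep borderline countries on their existing site until routing has been tuned.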

Additional resources:

ayounsi updated the task description. Mar 15 2018, 2:47 AM
ayounsi updated the task description. Mar 15 2018, 5:37 AM
ayounsi updated the task description. Mar 15 2018, 5:54 AM
ema moved this task from Triage to Network on the Traffic board. Mar 19 2018, 9:37 AM
ayounsi updated the task description. Mar 20 2018, 5:08 PM
BBlack updated the task description. Mar 20 2018, 5:13 PM
ayounsi updated the task description. Mar 20 2018, 8:00 PM
Krinkle updated the task description. Mar 20 2018, 10:52 PM

(Added a Status column to separate network analysis [Y/N/?] from current status [blocked/planned/done].)

Change 421089 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] geo-maps: mark eqsin+zero issues, split out OC

https://gerrit.wikimedia.org/r/421089

Change 421089 merged by BBlack:
[operations/dns@master] geo-maps: mark eqsin+zero issues, split out OC

https://gerrit.wikimedia.org/r/421089

We've talked about this a bit this week. Basic initial steps of the plan at this point are:

  1. Turn up SG itself today as a limited test, and then probably turn it back off before the weekend since we won't have anyone watching closely at that point.
  2. On Monday, roll into an initial set of 5 countries that look fairly good: SG, MY, ID, VN, NC. All of these appear to have reasonably-good latency to eqsin over ulsfo. MY, ID, and VN are fairly close to SG itself, and NC is a special case as it's a small country over in Oceania with limited ISP options that are known-good.

Change 421361 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] eqsin: turn-up test, SG-only

https://gerrit.wikimedia.org/r/421361

Change 421361 merged by BBlack:
[operations/dns@master] eqsin: turn-up test, SG-only

https://gerrit.wikimedia.org/r/421361

Liuxinyu970226 added a comment (edited). Mar 25 2018, 7:17 AM

What about Antarctica (AQ)?

What about Antarctica (AQ)?

Not in scope here. In general, the purpose of the list above is to define and limit the initial scope. After we're done with this list (including dealing with whatever optimizations/fixups occur during the process) and the Zero issues are more-or-less resolved, we'll consider eqsin a full peer to the other edge sites. From that point forward, it will be open to whatever future changes we see fit in the moment, wherever we see that latency or failover scenarios could be improved by changes.

Change 421855 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] eqsin: turn-up ID, MY, VN, NC

https://gerrit.wikimedia.org/r/421855

Change 421855 merged by BBlack:
[operations/dns@master] eqsin: turn-up ID, MY, VN, NC

https://gerrit.wikimedia.org/r/421855

What about Antarctica (AQ)?

In addition, Antarctica uses satellite Internet only. I can't find documentation on where the base stations (which connect the satellite network to the cabled Internet) are, but I'd assume they use GeoIP matching those countries.

ayounsi updated the task description. Mar 28 2018, 12:45 AM

Change 422394 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] eqsin: turn-up HK + PH + JP

https://gerrit.wikimedia.org/r/422394

Change 422395 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] eqsin: turn-up India

https://gerrit.wikimedia.org/r/422395

Change 422396 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] eqsin: turn-up BD, LK, NP, PK

https://gerrit.wikimedia.org/r/422396

Change 422401 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/mediawiki-config@master] wmf-config/InitialiseSettings.php: Enable oversample for additional countries

https://gerrit.wikimedia.org/r/422401

Change 422394 merged by BBlack:
[operations/dns@master] eqsin: turn-up HK + PH + JP

https://gerrit.wikimedia.org/r/422394

Change 422419 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/mediawiki-config@master] wmf-config: Enable oversampling for remaining countries in Asia

https://gerrit.wikimedia.org/r/422419

ayounsi updated the task description. Mar 28 2018, 3:34 PM

Change 422401 merged by jenkins-bot:
[operations/mediawiki-config@master] wmf-config/InitialiseSettings.php: Enable oversample for additional countries

https://gerrit.wikimedia.org/r/422401

Mentioned in SAL (#wikimedia-operations) [2018-03-28T15:37:25Z] <catrope@tin> Synchronized wmf-config/InitialiseSettings.php: Enable oversampling for IN, GU, MP in preparation for eqsin (T189252) (duration: 01m 18s)

ayounsi updated the task description. Mar 28 2018, 4:57 PM

Change 422396 merged by BBlack:
[operations/dns@master] eqsin: turn-up BD, LK, NP, PK

https://gerrit.wikimedia.org/r/422396

Change 422395 merged by BBlack:
[operations/dns@master] eqsin: turn-up India

https://gerrit.wikimedia.org/r/422395

Cwek added a subscriber: Cwek. Mar 29 2018, 12:05 PM
Cwek added a comment. Mar 29 2018, 12:36 PM

Can I ask something?
How to measure where traffic should route through? latency?
I suggest a website for measuring network traffic in China, https://ipip.net, which has a lot of measurement points and covers the three main ISPs in China. Since the major Internet outlets are in Guangzhou, Shanghai, and Beijing, I think it is sufficient to measure traffic from points in these three cities, or nearby cities, to the eqsin data center.

Can I ask something?
How to measure where traffic should route through? latency?

Presumably by things like RIPE Atlas distributed measures https://blog.wikimedia.org/2014/07/09/how-ripe-atlas-helped-wikipedia-users/

How to measure where traffic should route through? latency?

See legend in the description as well as this comment: T189252#4052304.

ayounsi updated the task description. Mar 29 2018, 8:34 PM

So, intersecting our info at the top ("Y" for eqsin as best site, not zero-blocked), the peering updates in the other private task, and running down Asia (vs Oceania) targets by highest-pop-first, these seem like the next set that's probably all good-to-go to turn up next week (we can revisit Oceania later, esp after sorting out AU carrier issues). None of these appear to me to have major peering issues, but @ayounsi can you confirm this whole list is good-to-go by early next week?

KR (South Korea)
TW (Taiwan)
KH (Cambodia)
LA (Laos)
MN (Mongolia)
BT (Bhutan)
MO (Macau)
BN (Brunei)
MV (Maldives)

If we flip all of these on (can probably be a single batch, as the countries are getting fairly small near the end of that list), I think all that's left uncovered in our initial list of Asian countries at the top are things that are zero-blocked or with known suboptimal routing (the special case of CN, and countries that for whatever reason just route better to ulsfo or esams in ways that don't seem tractable at present). At that point, we might want to look at moving the MaxMind virtuals/defaults for AS (Asia continent-level) and/or AP (Asia-Pacific pseudo-country) to eqsin as well to catch other edge-cases.
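The selection described above ("Y" for eqsin as best site, not zero-blocked, not already turned up) amounts to a filter over the table in the description. A minimal sketch follows, assuming a hypothetical `Row` record and illustrative status values ("Planned" follows the blocked/planned/done wording used when the Status column was added); it is not a snapshot of the table at any particular date. Ordering by population is left out because no population figures are recorded in this task.

```python
from typing import NamedTuple


class Row(NamedTuple):
    cc: str       # ISO country code
    turn_up: str  # network analysis result: "Y", "N" or "?"
    status: str   # "Done", "Planned", "<ZERO>", "TBD" or "-"


def next_batch(rows):
    """Country codes where eqsin is the best site and nothing blocks turn-up."""
    return sorted(
        r.cc
        for r in rows
        if r.turn_up == "Y" and r.status not in ("Done", "<ZERO>")
    )


# Illustrative rows only; statuses here are assumptions, not historical data.
rows = [
    Row("SG", "Y", "Done"),
    Row("KR", "Y", "Planned"),
    Row("TW", "Y", "Planned"),
    Row("TH", "Y", "<ZERO>"),
    Row("CN", "N", "-"),
]

batch = next_batch(rows)
print(batch)
```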

Change 422419 merged by jenkins-bot:
[operations/mediawiki-config@master] wmf-config: Enable oversampling for remaining countries in Asia

https://gerrit.wikimedia.org/r/422419

Mentioned in SAL (#wikimedia-operations) [2018-03-29T23:47:08Z] <ebernhardson@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: T189252: Enable perf oversampling for remaining countries in Asia (duration: 01m 16s)

[...] but @ayounsi can you confirm this whole list is good-to-go by early next week?

Yup, all good.
Some more (non-representative) RIPE measurement: https://atlas.ripe.net/measurements/11868941/ (No probes in LA or MO).

BBlack updated the task description. Mar 30 2018, 2:08 PM

CC and CX are the last two in the Asia list that are truly-unknown cases (where we're not really even sure what the state is). Any ideas how to get better intel on their situation? Note they're fairly geographically close to Jakarta and thus SG, so map-wise you'd expect eqsin. However, they're external territories of Australia, so they might satellite-link back there first?

Change 423159 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] eqsin: BN, BT, KH, KR, LA, MN, MO, MV, TW

https://gerrit.wikimedia.org/r/423159

Change 423160 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] eqsin: default for AS continent + AP fake-country

https://gerrit.wikimedia.org/r/423160

BBlack added a comment (edited). Mar 30 2018, 2:39 PM

Note as a hint (may be irrelevant in practice!) that the cablemap shows a future cable-landing for CX on a line between SG + Perth @ https://www.submarinecablemap.com/#/submarine-cable/australia-singapore-cable-asc , with a claimed future RFS date of July 2018. There's a commercial site for the cabling project at https://australiasingaporecable.com/ . No idea what carriers might actually use it in the future though or how routes will change. Could be interesting to look into who's using this future cable for routing Western Australia to eqsin, too.

BBlack added a comment (edited). Mar 30 2018, 3:02 PM

Ok I dug a bit this morning on CX. They basically have one non-satellite broadband provider, and the nic.cx service runs via the same routes (through networks owned by the same). The routes from pretty much anywhere in the world to CX head through Perth first, then show a big latency jump straight out of Perth to reach CX, and the routers for that big jump are also owned by the same company. MTR/ping to these is significantly faster from eqsin bastion than ulsfo. Interesting is that in the MTR to these .CX IPs, the jump from eqsin->perth along the way is very tight (~48ms) and runs over Vocus ( https://www.peeringdb.com/asn/4826 ) which we're picking up at the SG exchange (probably via RS). Vocus is also who's doing that new cable mentioned earlier, I'm sure by no coincidence. This might be a useful hint when we look deeper at AU in general....

CX has the larger population of the two (~1600 vs ~600) and I see some CC references in various whois records related to CX, and they're both geographically-close to each other and administered similarly. It's probably reasonable, to the level of initial care we can take for ~2K pop, to route them both to eqsin for now. Will update tables at the top and add them to the list for Monday for now.

BBlack updated the task description. Mar 30 2018, 3:04 PM

Change 423159 merged by BBlack:
[operations/dns@master] eqsin: BN, BT, CC, CX, KH, KR, LA, MN, MO, MV, TW

https://gerrit.wikimedia.org/r/423159

Change 423160 merged by BBlack:
[operations/dns@master] eqsin: default for AS continent + AP fake-country

https://gerrit.wikimedia.org/r/423160

BBlack updated the task description. Apr 2 2018, 4:10 PM
BBlack updated the task description.
Krinkle updated the task description. Apr 3 2018, 12:21 AM

Change 423680 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] eqsin: turn up FM GU KI MH MP PW TV UM

https://gerrit.wikimedia.org/r/423680

Change 423681 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] eqsin: temporary test for AU

https://gerrit.wikimedia.org/r/423681

Change 423680 merged by BBlack:
[operations/dns@master] eqsin: turn up FM GU KI MH MP PW TV UM

https://gerrit.wikimedia.org/r/423680

Change 423681 merged by BBlack:
[operations/dns@master] eqsin: temporary test for AU

https://gerrit.wikimedia.org/r/423681

Change 423909 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] AU: experiment with splitting WA

https://gerrit.wikimedia.org/r/423909

Change 423909 merged by BBlack:
[operations/dns@master] AU: experiment with splitting WA

https://gerrit.wikimedia.org/r/423909

BBlack added a comment. Apr 5 2018, 1:00 PM

Australia experimental results with current peering arrangements: over three consecutive 24h periods during the week, we observed the "tcp" metric (effectively TCP+TLS negotiation time) under 3 different geodns routing conditions (other metrics follow the same basic pattern):

| AU GeoDNS | 24h avg "tcp" time |
| --- | --- |
| ulsfo | 226ms |
| eqsin | 161ms |
| WA->eqsin, rest->ulsfo | 213ms |

Note that by various manual testing and RIPE probes, we expect that while eqsin is better in the net average, what's actually going on in finer detail is that some AU networks have good routing to eqsin and are dramatically improved by it (~75% reduction in tcp time vs ulsfo), while others do not and get slightly worse (~10-20% increase in tcp time). With the final experiment (splitting Western Australia routing from the rest), we were hoping for some geographic correlation in the connectivity, but that didn't pan out well enough to matter.

For now, I think we're best off leaving it mapped to eqsin as it's the best of the 3 scenarios, but we'll need to continue our efforts to improve our network-level routing to AU going forward.
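To put the three 24h averages on a common scale, the relative change against the all-ulsfo baseline can be worked out as below. The millisecond figures are the ones from the table above; the helper name and the rounding to one decimal place are choices made for this sketch.

```python
def reduction_pct(baseline_ms, new_ms):
    """Percentage reduction of new_ms relative to baseline_ms (positive = faster)."""
    return 100.0 * (baseline_ms - new_ms) / baseline_ms


BASELINE_ULSFO_MS = 226  # AU mapped entirely to ulsfo

scenarios_ms = {
    "eqsin": 161,
    "WA->eqsin, rest->ulsfo": 213,
}

# Net-average improvement of each scenario over the ulsfo baseline.
improvements = {
    name: round(reduction_pct(BASELINE_ULSFO_MS, ms), 1)
    for name, ms in scenarios_ms.items()
}

for name, pct in improvements.items():
    print(f"{name}: {pct}% lower avg tcp time than ulsfo")
```

These are net averages across all AU networks, so they sit well below the per-network gains seen on the well-routed networks and hide the networks that got slightly worse.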

BBlack updated the task description. Apr 5 2018, 1:03 PM
BBlack closed this task as Resolved. May 15 2018, 4:11 PM

Closing this as well, we're through the basic turn-up process. Trailing work on network engineering and Zero is tracked elsewhere.

Change 446997 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] geo-maps: cleanup ordering for OC

https://gerrit.wikimedia.org/r/446997

Change 446997 merged by BBlack:
[operations/dns@master] geo-maps: cleanup ordering for OC

https://gerrit.wikimedia.org/r/446997