
Craft geo-maps file to create lowest-latency routes from South America
Open, Low, Public

Assigned To
None
Authored By
BCornwall
Apr 29 2024, 3:41 PM
Referenced Files
F53600438: image.png
May 17 2024, 1:28 PM
F53533701: image.png
May 16 2024, 10:22 PM
F50001873: image.png
May 3 2024, 4:58 PM
F50001634: image.png
May 3 2024, 4:58 PM

Description

Now that we have a new South American data center (magru), we need to revisit the latencies for each of the countries on that continent: is each country better served by magru or by e.g. ulsfo?

https://gitlab.wikimedia.org/repos/sre/pop-latency-measurement was developed for this purpose, so use that. I also hear murmurings of https://gitlab.wikimedia.org/repos/sre/probenet but I've not used it and the docs could be improved somewhat.

Event Timeline

magru is a clear win for:
UY, CL, AR, BR, PY

It's better for some but not all users in:
BO, PE

Initial user-measured magru latency as violin plots, per country, Latin/South America (6×755 px, 421 KB)

We could choose to use subdivision-level mapping in cases where it makes sense.

Unfortunately, subdivision-level mapping didn't help in PE -- in many regions magru is better than eqiad for some users and worse for others.

And over half our data points so far are in one region:

image.png (547×990 px, 72 KB)

Subdivision data helps for two (perhaps three) departments in Bolivia, although not for others:

image.png (1×1 px, 146 KB)

For what it's worth, when I was setting up the São Paulo cache with Oracle Cloud for Inkbunny, I found some ISPs used undersea cables along the western seaboard to NA rather than running through the jungle to SP; the catchment I came up with, by pinging IP addresses listed as being in certain towns on different ISPs, was (2,-27),(-90,-75) lat/long, i.e. (2,-75) as the top-left coordinate.

As you found, Peru ends up being split right through the middle at -75 longitude (while Ecuador and all the states along the north edge are out). In this situation I was free to choose São Paulo even when it was equal with NA, because it wouldn't go over allocated transfer either way, but if you want it to be strictly for results which are unambiguously better you might draw the line more tightly.
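As a rough illustration of the box described above, here is a minimal Python sketch of a point-in-catchment check. The names `BoundingBox`, `SAO_PAULO_CATCHMENT` and `in_catchment` are made up for this example; only the corner coordinates come from the comment. It assumes a plain lat/long rectangle with no handling of the antimeridian, which is not an issue for this region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned lat/long rectangle (hypothetical helper for this sketch)."""
    north: float  # maximum latitude
    south: float  # minimum latitude
    west: float   # minimum longitude
    east: float   # maximum longitude

    def contains(self, lat: float, lon: float) -> bool:
        # Inclusive on all four edges.
        return self.south <= lat <= self.north and self.west <= lon <= self.east


# Corners (2,-27) and (-90,-75) from the comment, i.e. top-left at (2,-75).
SAO_PAULO_CATCHMENT = BoundingBox(north=2.0, south=-90.0, west=-75.0, east=-27.0)


def in_catchment(lat: float, lon: float) -> bool:
    """Return True if the point falls inside the São Paulo catchment box."""
    return SAO_PAULO_CATCHMENT.contains(lat, lon)


if __name__ == "__main__":
    # Approximate city coordinates, used only to exercise the box.
    print(in_catchment(-23.55, -46.63))  # São Paulo: inside
    print(in_catchment(-13.53, -71.97))  # Cusco (eastern Peru): inside
    print(in_catchment(-12.05, -77.04))  # Lima (western Peru): outside, west of -75
    print(in_catchment(4.71, -74.07))    # Bogotá: outside, north of 2
```

Lima and Cusco landing on opposite sides of the -75 line is the Peru split mentioned above. Tightening the box for "unambiguously better" results would just mean moving the `west` or `north` edge.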

@GreenReaper thanks so much for the helpful contribution :) I'll see if I can reproduce your results.

Adding the 3rd transit link in magru greatly improved the latency for many users in Argentina.

The transit link went live midway through Monday the 13th.

Here's a comparison between Argentina user latency as measured on the days of Sunday the 12th and Tuesday the 14th. It plots eqiad (as a control) and magru (the experiment). The sample sizes for each day are the same order of magnitude (2877 for Sunday and 2348 for Tuesday).

image.png (490×553 px, 47 KB)
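For anyone wanting to reproduce this kind of before/after comparison numerically, here is a minimal Python sketch. It is not taken from pop-latency-measurement; the function names are hypothetical and the sample values are placeholders, not measurements. The idea is to subtract the control's (eqiad) median shift from the experiment's (magru) median shift, so that day-to-day drift unrelated to the transit link cancels out.

```python
from statistics import median


def median_shift(before, after):
    """Change in median latency (ms) from the 'before' day to the 'after' day."""
    return median(after) - median(before)


def control_adjusted_shift(exp_before, exp_after, ctl_before, ctl_after):
    """Experiment's median shift minus the control's median shift.

    A negative result means the experiment improved by more than the control did.
    """
    return median_shift(exp_before, exp_after) - median_shift(ctl_before, ctl_after)


if __name__ == "__main__":
    # Placeholder latencies in ms, purely illustrative (NOT real data).
    magru_sunday = [180, 200, 220, 240]
    magru_tuesday = [60, 70, 80, 90]
    eqiad_sunday = [150, 160, 170, 180]
    eqiad_tuesday = [152, 158, 172, 178]

    shift = control_adjusted_shift(
        magru_sunday, magru_tuesday, eqiad_sunday, eqiad_tuesday
    )
    print(f"control-adjusted median shift: {shift} ms")
```

Medians are used here because user-measured latency tends to have a long tail. A fuller comparison would also look at the upper percentiles, as the violin plots do.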

The 3rd transit was also of great help to Chile, and probably Peru (although the sample size there is a bit small).

image.png (1×508 px, 100 KB)

Latest results: magru is a clear win for BR, AR, CL, PY, UY, BO

This adds BO to the "clear win" set. I am guessing this is another consequence of the 3rd transit link.

I think that's an improvement for approx 3% of overall global pageviews, and for about 52% of the pageviews in Central+South America.

F53633438

user-measured latency towards all datacenters from Central/South America, data 2024-05-14 -- 2024-05-17 (8×478 px, 637 KB)

We're still waiting on IX.br for final results.

Results after adding IX.br are in.

The set of countries that magru improves hasn't changed:
BR, CL, AR, UY, PY, BO

PE has magru better for many but not nearly all users -- there are at least two of the largest ISPs where magru is now better than eqiad -- but given the big increase in the 75th percentile we can't switch PE to magru, IMO.

For the rest of the countries looked at, magru is strictly worse than other options.

plot at F53633438
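One possible way to turn the three buckets used in this thread ("clear win", better for some but not all, strictly worse) into a mechanical rule is sketched below in Python. This is an assumption for illustration, not the method actually used for the analysis: it calls a country a clear win only when both the median and the 75th percentile of magru latency are lower than those of the best alternative. The function names and sample values are hypothetical.

```python
from statistics import median, quantiles


def p75(samples):
    """75th percentile: the third of the three quartile cut points."""
    return quantiles(samples, n=4, method="inclusive")[2]


def classify(magru_ms, alt_ms):
    """Bucket a country by comparing magru against the best alternative site.

    Assumed rule (not from the task): both median and p75 must improve
    for a "clear win"; exactly one improving is "mixed"; neither is "worse".
    """
    median_better = median(magru_ms) < median(alt_ms)
    p75_better = p75(magru_ms) < p75(alt_ms)
    if median_better and p75_better:
        return "clear win"
    if median_better or p75_better:
        return "mixed"
    return "worse"


if __name__ == "__main__":
    # Placeholder latencies in ms, purely illustrative (NOT real data).
    alternative = [150, 160, 170, 180, 190]
    print(classify([50, 60, 70, 80, 90], alternative))
    print(classify([50, 60, 70, 300, 320], alternative))
    print(classify([200, 210, 220, 230, 240], alternative))
```

The second placeholder case mirrors the PE situation: the typical user is better off, but the upper tail gets worse, so the country stays out of the "clear win" set.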

Thanks for the great analysis, Chris!