
Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps
Open, Medium, Public

Description

As we move closer to setting up the new data center in Brazil, we should discuss how we plan to ramp up traffic to the new data center, slowly and progressively to warm the caches, before turning it on for the rest of the continent. This task is meant to be a discussion about that with the aim of making a decision before April.

Currently, as per the geo-maps file in the operations/dns repository, we do not have a defined geo-map for any country in South America, so traffic from there goes to codfw (since the September 2023 switchover) or otherwise eqiad. This means whatever entry we add will be under a new country, subdivision, city, or subnet.
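The fallback behaviour described above can be sketched as a lookup with a default. This is only an illustration in Python, not the actual syntax of the geo-maps file; the names `GEO_MAP`, `DEFAULT_ORDER`, and `datacenters_for` are hypothetical, and the Brazil entry in the comment is just what a future mapping might look like.

```python
# Hypothetical model of a geo-map: country code -> ordered list of data centers.
# No South American country is listed, so every one of them uses the default order.
DEFAULT_ORDER = ["codfw", "eqiad"]

GEO_MAP = {}  # after a rollout this might hold e.g. {"BR": ["magru", "codfw", "eqiad"]}


def datacenters_for(country, geo_map=GEO_MAP, default=DEFAULT_ORDER):
    """Return the preferred data centers for a country, falling back to the default."""
    return list(geo_map.get(country, default))
```

With an empty map, `datacenters_for("BR")` returns the default order; once an entry for a country exists, that entry takes precedence.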

For a very high-level overview, traffic from each continent as a percentage of our global traffic (30 days):

North America: 35.95%
Europe: 33.63%
Asia: 21.75%
South America: 4.46%
Africa: 1.72%
Oceania: 1.26%
Unknown: 1.24%
Antarctica: 0.00%

The current traffic trends from South America indicate that we get the most traffic from Brazil and the least from Paraguay:

Brazil
Argentina
Colombia
Chile
Peru
Venezuela
Ecuador
Bolivia
Uruguay
Paraguay

2024-03-04-091552_1679x905_scrot.png (905×1 px, 268 KB)

There is additionally one more important thing to keep in mind: Brazil is the only Portuguese-speaking country in South America, so there will be variations in user traffic between pt.wikipedia.org vs es.wikipedia.org, which should further affect our decision on how to warm up the site.

Within Brazil, we get the most traffic from:

Unknown
São Paulo
Rio de Janeiro
Brasília
Belo Horizonte
Curitiba
Fortaleza
Porto Alegre
Salvador
Campinas

Ignoring "Unknown", about which we can't do anything, we unsurprisingly get the most traffic from São Paulo and Rio.

2024-03-04-091637_1679x902_scrot.png (902×1 px, 175 KB)

With the above information, we can turn on traffic for one of the Brazilian cities above that is not São Paulo or Rio*, and additionally for one or more Spanish-speaking countries from which we don't get a lot of traffic. There has been some discussion on this topic, and it is worth mentioning that perhaps we should turn on traffic for all of Brazil before moving anywhere else in the continent. So while that still means figuring out the answers to the above questions, the DNS changes won't happen at the same time.

(* - we can look into subdivision as well, however, the data looks comparable to the city-level data.)

Sometimes, the percentage of traffic helps visualize things better than graphs. For South America, the percentage split of traffic (for just a given day, but the trend holds true):

Brazil: 37.81%
Argentina: 15.54%
Colombia: 14.44%
Chile: 9.28%
Peru: 8.02%
Venezuela: 6.27%
Ecuador: 3.85%
Bolivia: 2.68%
Uruguay: 1.46%
Paraguay: 0.65%

Percentage of traffic from South America excluding Brazil (thus, only Spanish-speaking countries):

Argentina: 24.99%
Colombia: 23.22%
Chile: 14.92%
Peru: 12.90%
Venezuela: 10.08%
Ecuador: 6.20%
Bolivia: 4.30%
Uruguay: 2.34%
Paraguay: 1.05%
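The second list is the first one renormalized after dropping Brazil. A minimal sketch of that arithmetic in Python follows; the function name `renormalize_excluding` is hypothetical. Because the inputs are already rounded to two decimals, the recomputed values can differ from the list above by about 0.01 percentage points.

```python
# Per-country share of South American traffic, taken from the list above.
SOUTH_AMERICA_SHARE = {
    "Brazil": 37.81, "Argentina": 15.54, "Colombia": 14.44, "Chile": 9.28,
    "Peru": 8.02, "Venezuela": 6.27, "Ecuador": 3.85, "Bolivia": 2.68,
    "Uruguay": 1.46, "Paraguay": 0.65,
}


def renormalize_excluding(shares, excluded):
    """Drop the excluded countries and rescale the rest so they sum to 100%."""
    remaining = {k: v for k, v in shares.items() if k not in excluded}
    total = sum(remaining.values())
    return {k: round(100 * v / total, 2) for k, v in remaining.items()}


without_brazil = renormalize_excluding(SOUTH_AMERICA_SHARE, {"Brazil"})
```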

Once we have decided this, we can finalize a time as well for rolling out the change. Note that with the lowering of TTL in T140365, traffic will move over in five minutes instead of ten, so the change will be more "sudden", at least relative to how it has been in the past.
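A rough way to see why the shorter TTL makes the shift more abrupt: assuming resolver caches were filled at uniformly distributed times (a simplification, since real resolvers and clients do not always honour TTLs), the fraction of caches that have expired and picked up the new record t seconds after the change is min(t / TTL, 1). The function name `fraction_moved` is hypothetical.

```python
def fraction_moved(seconds_since_change, ttl_seconds):
    """Estimate the share of cached DNS answers that have expired since the change.

    Assumes cache fill times are spread uniformly over one TTL window.
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    elapsed = max(seconds_since_change, 0)
    return min(elapsed / ttl_seconds, 1.0)
```

Under this model, 150 seconds after a change half of the caches have moved with a five-minute TTL (`fraction_moved(150, 300)`), compared with a quarter under a ten-minute TTL (`fraction_moved(150, 600)`).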

To summarize, the decisions we are looking to make are:

  • Should we turn on the site for all of Brazil before moving on to the rest of the continent, or should we do it at the same time and pick Brazil and one other Spanish-speaking country?
  • Regardless of the above, we should figure out which regions in Brazil and which country in South America we should consider for warming up the site.
  • The northern part of South America -- Ecuador, Colombia, Venezuela, Guyana -- should traffic from there continue to go to eqiad/codfw (perhaps codfw) or should we send that traffic to magru as well? (This question might be better answered by actual data and I notice we have RIPE Atlas probes in each of these countries but perhaps some internal data as well?)

Event Timeline

Thanks for the task!

Should we turn on the site for all of Brazil before moving on to the rest of the continent, or should we do it at the same time and pick Brazil and one other Spanish-speaking country?

I think it probably makes sense to do it for Brazil and one of the Spanish-speaking countries. Or specific cities from one of each to start slowly.

Regardless of the above, we should figure out which regions in Brazil and which country in South America we should consider for warming up the site.

Yeah that's a hard one to call. I'm not sure exactly how we can make that decision, there is some guesswork I suppose. Can we just start small/conservative and observe?

The northern part of South America -- Ecuador, Colombia, Venezuela, Guyana -- should traffic from there continue to go to eqiad/codfw (perhaps codfw) or should we send that traffic to magru as well? (This question might be better answered by actual data and I notice we have RIPE Atlas probes in each of these countries but perhaps some internal data as well?)

Yeah, I think we need to take a careful look at those. It would definitely be good to check RIPE Atlas measurements (we are setting one up in magru, aren't we?) and see what the relative latency is. Peru and Ecuador in particular (and possibly parts of Colombia) may be better served from the US: given the scarcity of paths across the Andes, traffic from those places may end up routing via Florida to magru, so we need to see whether the latency to codfw would actually be better.

ssingh triaged this task as Medium priority.Mar 4 2024, 5:46 PM
ssingh edited projects, added SRE; removed WMF-NDA.
ssingh updated the task description. (Show Details)
ssingh changed the visibility from "Custom Policy" to "Public (No Login Required)".

I'd recommend to start by turning up a small country/region on that continent (Uruguay/Paraguay for example), ideally outside of peak time. That will help warm up the caches nice and slowly and reduce the impact of an issue. Then ramping it up progressively.

While Brazil is the only Portuguese-speaking country in the region, Portuguese-speaking users in any neighboring country will also help warm up the cache in that language.

I'd also recommend not splitting up Brazil in sub-regions/cities as that will only add complexity to the setup with no direct benefit.

For the northern South American countries, we will study them one by one with RIPE data and the work done in T332024: GeoIP mapping experiments.

I largely agree with Arzhel's assessment. At a cursory glance, Uruguay or Paraguay look ideal as first candidates.

I don't think it would be bad to temporarily turn on just one sub-region/city of Brazil for testing and warm-up, but I agree it doesn't make sense to run it that way for longer than a day or something.

I peeked at the 'final' intern project results from almost a year ago for all of: Peru, Ecuador, Colombia, Venezuela, Guyana, Suriname, and French Guiana. Generally eqiad was a clear winner for those locations, and generally the RTT was 75ms or worse. Usually codfw was close behind. So there is a lot of room here to do better for those locations.

As discussed, very happy to re-use the work of T332024: GeoIP mapping experiments to do actual measurements once the site infrastructure is further along :) This doesn't necessarily have to be a fully ready for production state, some minimal configurations could easily be made workable for mapping.

Change #1025366 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] geo-maps: define initial mapping for South America (magru)

https://gerrit.wikimedia.org/r/1025366

magru is a clear win for:
UY, CL, AR, BR, PY

It's better for some but not all users in:
BO, PE

Initial user-measured magru latency as violin plots, per country, Latin/South America (6×755 px, 421 KB)

Oh, and I think magru is a win for SV as well.

Edit: don't listen to me, I confused the magru and codfw plots.

Adding the 3rd transit link in magru greatly improved the latency for many users in Argentina.

The transit link went live midway through Monday the 13th.

Here's a comparison between Argentina user latency as measured on the days of Sunday the 12th and Tuesday the 14th. It plots eqiad (as a control) and magru (the experiment). The sample sizes for each day are the same order of magnitude (2877 for Sunday and 2348 for Tuesday).

image.png (490×553 px, 47 KB)

Adding the 3rd transit link in magru greatly improved the latency for many users in Argentina.

Fascinating to see the effect of this in the stats! Will advise when the IXP is connected in São Paulo.

The 3rd transit was also of great help to Chile, and probably Peru (although sample size there is a bit small).

image.png (1×508 px, 100 KB)

Latest results: magru is a clear win for BR, AR, CL, PY, UY, BO

This adds BO to the "clear win" set. I am guessing this is another consequence of the 3rd transit link.

I think that's an improvement for approx 3% of overall global pageviews, and for about 52% of the pageviews in Central+South America.
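As a rough cross-check of the ~3% figure, the shares from the task description can be combined: the six "clear win" countries' share of South American traffic, multiplied by South America's share of global traffic. This is only a sketch: those figures are traffic shares rather than pageviews and come from a different time window, so agreement is approximate at best. The variable names are hypothetical.

```python
# Shares quoted in the task description (percent).
SOUTH_AMERICA_OF_GLOBAL = 4.46
SHARE_OF_SOUTH_AMERICA = {
    "BR": 37.81, "AR": 15.54, "CL": 9.28, "PY": 0.65, "UY": 1.46, "BO": 2.68,
}

# Share of South American traffic covered by the "clear win" countries.
clear_win_of_south_america = sum(SHARE_OF_SOUTH_AMERICA.values())

# The same set expressed as a share of global traffic.
clear_win_of_global = clear_win_of_south_america * SOUTH_AMERICA_OF_GLOBAL / 100
```

The result lands close to 3% of global traffic, consistent with the estimate above. The 52% figure for Central+South America cannot be reproduced this way because the description gives no Central American breakdown.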

F53633438

user-measured latency towards all datacenters from Central/South America, data 2024-05-14 -- 2024-05-17 (8×478 px, 637 KB)

We're still waiting on IX.br for final results.

Results after adding IX.br are in.

The set of countries that magru improves hasn't changed:
BR, CL, AR, UY, PY, BO

PE has magru better for many but not nearly all users -- there are at least two of the largest ISPs where magru is now better than eqiad -- but given the big increase in the 75th percentile we can't do it IMO.

For the rest of the countries looked at, magru is strictly worse than other options.
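The kind of judgement made above (better for the typical user, but worse at the 75th percentile) can be expressed as a simple rule. The sketch below is an assumption about how such a rule could look, not the method actually used for these measurements; the function names, the labels, and the sample latencies are all hypothetical and synthetic.

```python
from statistics import median, quantiles


def p75(samples):
    """75th percentile of a list of latency samples (needs at least two samples)."""
    return quantiles(samples, n=4)[2]


def classify(magru_ms, best_other_ms):
    """Label magru against the best existing data center for one country."""
    median_better = median(magru_ms) < median(best_other_ms)
    p75_better = p75(magru_ms) < p75(best_other_ms)
    if median_better and p75_better:
        return "clear win"
    if median_better:
        return "mixed"  # typical user improves, but the tail gets worse
    return "worse"


# Synthetic samples: most users are faster via magru, but a large tail is slower.
synthetic_magru = [40, 45, 50, 55, 200, 220, 240]
synthetic_eqiad = [100, 105, 110, 115, 120, 125, 130]
label = classify(synthetic_magru, synthetic_eqiad)
```

For the synthetic samples above the median improves while the 75th percentile regresses, so the rule returns "mixed", which mirrors the reasoning given for PE.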

plot at F53633438

Change #1025366 merged by Ssingh:

[operations/dns@master] geo-maps: define initial mapping for South America (magru)

https://gerrit.wikimedia.org/r/1025366

Change #1052144 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] geo-maps: send BR (Brazil) to magru

https://gerrit.wikimedia.org/r/1052144

Change #1052144 merged by Ssingh:

[operations/dns@master] geo-maps: send BR (Brazil) to magru

https://gerrit.wikimedia.org/r/1052144

Change #1100084 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/dns@master] Enable new countries for magru (Cohort 3)

https://gerrit.wikimedia.org/r/1100084

Change #1100084 merged by Fabfur:

[operations/dns@master] Enable new countries for magru (Cohort 3)

https://gerrit.wikimedia.org/r/1100084

Argentina, Chile and Uruguay now land on magru by default.