
Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps
Open, Medium, Public

Description

As we move closer to setting up the new data center in Brazil, we should discuss how we plan to ramp up traffic to the new data center, slowly and progressively to warm the caches, before turning it on for the rest of the continent. This task is meant to be a discussion about that with the aim of making a decision before April.

Currently, as per the geo-maps file in the operations/dns repository, we do not have a defined geo-map for any country in South America, so traffic from there goes to codfw (since the September 2023 switchover) or otherwise eqiad. This means whatever entry we add will be under a new country, subdivision, city, or subnet.
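The fallback behavior described above (no South American entry, so requests fall through to a default) can be modeled roughly as a most-specific-match lookup. The following is only a minimal Python sketch: the dictionary layout, the `resolve_datacenters` name, and the example `BR` entry are illustrative assumptions, not the actual syntax or contents of the geo-maps file in operations/dns.

```python
# Hypothetical, simplified geo-map: continent -> country -> subdivision.
# Leaves are ordered data center lists; "default" applies when nothing
# more specific matches. This is NOT the real geo-maps file format.
GEO_MAP = {
    "default": ["codfw", "eqiad"],
    "SA": {
        "BR": {
            # Assumed example entry, for illustration only.
            "default": ["magru", "codfw", "eqiad"],
        },
    },
}


def resolve_datacenters(continent, country=None, subdivision=None):
    """Return the data center list of the most specific matching entry."""
    node = GEO_MAP
    best = node["default"]
    for key in (continent, country, subdivision):
        if key is None or key not in node:
            break  # nothing more specific: keep the best default seen so far
        node = node[key]
        if isinstance(node, list):
            return node  # an explicit list at this level wins outright
        best = node.get("default", best)
    return best


print(resolve_datacenters("SA", "BR"))  # matches the assumed Brazil entry
print(resolve_datacenters("SA", "UY"))  # no entry, falls back to the default
```

In this model, adding a country, subdivision, or city is just a deeper key in the map; anything without an entry keeps the existing codfw/eqiad behavior.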

For a very high-level overview, here is traffic from each continent, including South America, as a percentage of our global traffic (30 days):

North America: 35.95%
Europe: 33.63%
Asia: 21.75%
South America: 4.46%
Africa: 1.72%
Oceania: 1.26%
Unknown: 1.24%
Antarctica: 0.00%

The current traffic trends from South America indicate that we get the most traffic from Brazil and the least from Paraguay:

Brazil
Argentina
Colombia
Chile
Peru
Venezuela
Ecuador
Bolivia
Uruguay
Paraguay

2024-03-04-091552_1679x905_scrot.png (905×1 px, 268 KB)

There is additionally one more important thing to keep in mind: Brazil is the only Portuguese-speaking country in South America, so there will be variations in user traffic between pt.wikipedia.org vs es.wikipedia.org, which should further affect our decision on how to warm up the site.

Within Brazil, we get the most traffic from:

Unknown
São Paulo
Rio de Janeiro
Brasília
Belo Horizonte
Curitiba
Fortaleza
Porto Alegre
Salvador
Campinas

Ignoring "Unknown", about which we can't do anything, we unsurprisingly get the most traffic from São Paulo and Rio.

2024-03-04-091637_1679x902_scrot.png (902×1 px, 175 KB)

With the above information, we could first turn on traffic for one of the Brazilian cities listed above other than São Paulo or Rio*, and additionally for one or more Spanish-speaking countries from which we don't get a lot of traffic. There has been some discussion on this topic, and it is worth mentioning that perhaps we should turn on traffic for all of Brazil before moving anywhere else in the continent. In that case we still need to answer the above questions, but the DNS changes for Brazil and for the other countries won't happen at the same time.

(* - we can look into subdivision as well, however, the data looks comparable to the city-level data.)

Sometimes, the percentage of traffic helps visualize things better than graphs. For South America, the percentage split of traffic (for just a given day, but the trend holds true):

Brazil: 37.81%
Argentina: 15.54%
Colombia: 14.44%
Chile: 9.28%
Peru: 8.02%
Venezuela: 6.27%
Ecuador: 3.85%
Bolivia: 2.68%
Uruguay: 1.46%
Paraguay: 0.65%

Percentage of traffic excluding Brazil (thus, only Spanish-speaking countries) from South America:

Argentina: 24.99%
Colombia: 23.22%
Chile: 14.92%
Peru: 12.90%
Venezuela: 10.08%
Ecuador: 6.20%
Bolivia: 4.30%
Uruguay: 2.34%
Paraguay: 1.05%
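The "excluding Brazil" split is a re-normalization of the South America split over the remaining countries. A short Python sketch of that arithmetic follows; the function name `renormalize_excluding` is made up for illustration. Because the inputs are the rounded percentages listed in this task, the results agree with the figures above only to within about 0.01 percentage points.

```python
# Per-country share of South American traffic, as listed in this task (%).
SOUTH_AMERICA_SHARE = {
    "Brazil": 37.81,
    "Argentina": 15.54,
    "Colombia": 14.44,
    "Chile": 9.28,
    "Peru": 8.02,
    "Venezuela": 6.27,
    "Ecuador": 3.85,
    "Bolivia": 2.68,
    "Uruguay": 1.46,
    "Paraguay": 0.65,
}


def renormalize_excluding(shares, excluded):
    """Drop one entry and rescale the rest so they sum to 100%."""
    remaining = {name: pct for name, pct in shares.items() if name != excluded}
    total = sum(remaining.values())
    return {name: 100.0 * pct / total for name, pct in remaining.items()}


without_brazil = renormalize_excluding(SOUTH_AMERICA_SHARE, "Brazil")
for country, pct in without_brazil.items():
    print(f"{country}: {pct:.2f}%")
```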

Once we have decided this, we can finalize a time as well for rolling out the change. Note that with the lowering of TTL in T140365, traffic will move over in five minutes instead of ten, so the change will be more "sudden", at least relative to how it has been in the past.
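To give a feel for why a shorter TTL makes the move more "sudden", here is a rough Python sketch. It assumes an idealized model that is not taken from this task: every resolver honors the TTL, and cached records expire uniformly over one TTL window after the change. Real resolver behavior will differ, so treat the numbers as an approximation only; `fraction_moved` is a hypothetical helper name.

```python
def fraction_moved(seconds_since_change, ttl_seconds):
    """Estimated fraction of clients already resolving to the new site.

    Idealized assumption: cached records expire uniformly over one TTL
    window after the DNS change, and all resolvers honor the TTL.
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    elapsed = max(seconds_since_change, 0)
    return min(elapsed / ttl_seconds, 1.0)


# Compare a ten-minute TTL with a five-minute TTL.
for ttl in (600, 300):
    for t in (60, 150, 300, 600):
        print(f"ttl={ttl}s t={t}s moved={fraction_moved(t, ttl):.0%}")
```

Under this model the same traffic shift completes in half the time with the five-minute TTL, which is one more reason to start with a small region.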

To summarize, the decisions we are looking to make are:

  • Should we turn on the site for all of Brazil before moving on to the rest of the continent, or should we do it at the same time and pick Brazil and one other Spanish-speaking country?
  • Regardless of the above, we should figure out which regions in Brazil and which country in South America we should consider for warming up the site.
  • The northern part of South America -- Ecuador, Colombia, Venezuela, Guyana -- should traffic from there continue to go to eqiad/codfw (perhaps codfw) or should we send that traffic to magru as well? (This question might be better answered by actual data and I notice we have RIPE Atlas probes in each of these countries but perhaps some internal data as well?)
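For the last question, one possible way to turn latency measurements into a decision is to compare the median RTT per data center for each country and pick the lowest. The Python sketch below assumes the measurements have already been collected into plain lists of RTTs in milliseconds; it does not call any RIPE Atlas API, `best_datacenter` is a hypothetical helper, and the sample values are made-up placeholders, not real measurements.

```python
from statistics import median


def best_datacenter(rtt_samples_ms):
    """Return the data center with the lowest median RTT.

    rtt_samples_ms maps a data center name to a list of RTT samples (ms).
    Data centers with no samples are ignored.
    """
    medians = {
        dc: median(samples)
        for dc, samples in rtt_samples_ms.items()
        if samples
    }
    if not medians:
        raise ValueError("no RTT samples provided")
    return min(medians, key=medians.get)


# PLACEHOLDER values for illustration only; these are NOT measurements.
example_country = {
    "eqiad": [80.0, 82.0, 79.0],
    "codfw": [85.0, 90.0, 88.0],
    "magru": [60.0, 120.0, 65.0],
}
print(best_datacenter(example_country))
```

The median is used here rather than the mean so that a single slow sample does not dominate the comparison; other statistics (for example a high percentile) may be more appropriate depending on what we want to optimize.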

Event Timeline

Thanks for the task!

> Should we turn on the site for all of Brazil before moving on to the rest of the continent, or should we do it at the same time and pick Brazil and one other Spanish-speaking country?

I think it probably makes sense to do it for Brazil and one of the Spanish-speaking countries. Or specific cities from one of each to start slowly.

> Regardless of the above, we should figure out which regions in Brazil and which country in South America we should consider for warming up the site.

Yeah that's a hard one to call. I'm not sure exactly how we can make that decision, there is some guesswork I suppose. Can we just start small/conservative and observe?

> The northern part of South America -- Ecuador, Colombia, Venezuela, Guyana -- should traffic from there continue to go to eqiad/codfw (perhaps codfw) or should we send that traffic to magru as well? (This question might be better answered by actual data and I notice we have RIPE Atlas probes in each of these countries but perhaps some internal data as well?)

Yeah, I think we need to take a careful look at those. It would definitely be good to check RIPE Atlas measurements (we are setting one up in magru, aren't we?) and see what the relative latency is. Peru and Ecuador in particular (possibly parts of Colombia) may be better served from the US: given the scarcity of paths across the Andes, traffic from those places may end up routing via Florida to magru, so we need to see if the latency to codfw would actually be better.

ssingh triaged this task as Medium priority. (Mar 4 2024, 5:46 PM)
ssingh edited projects, added SRE; removed WMF-NDA.
ssingh updated the task description. (Show Details)
ssingh changed the visibility from "Custom Policy" to "Public (No Login Required)".

I'd recommend starting by turning up a small country/region on that continent (Uruguay/Paraguay for example), ideally outside of peak time. That will help warm up the caches nice and slowly and reduce the impact of any issue. Then ramp it up progressively.

While Brazil is the only Portuguese-speaking country in the region, the Portuguese-speaking users in any neighboring country will also help warm up the cache in that language.

I'd also recommend not splitting up Brazil into sub-regions/cities, as that will only add complexity to the setup with no direct benefit.

For the northern South American countries, we will study them one by one with RIPE data and the work done in T332024: GeoIP mapping experiments.

I largely agree with Arzhel's assessment. At a cursory glance, Uruguay or Paraguay look ideal as first candidates.

I don't think it would be bad to temporarily turn on just one sub-region/city of Brazil for testing and warm-up, but I agree it doesn't make sense to run it that way for longer than a day or something.

I peeked at the 'final' intern project results from almost a year ago for all of: Peru, Ecuador, Colombia, Venezuela, Guyana, Suriname, and French Guiana. Generally eqiad was a clear winner for those locations, and generally the RTT latency was 75ms or worse. Usually codfw was close behind. So there is a lot of room here to do better for those locations.

As discussed, very happy to re-use the work of T332024: GeoIP mapping experiments to do actual measurements once the site infrastructure is further along :) This doesn't necessarily have to be a fully ready for production state, some minimal configurations could easily be made workable for mapping.

Change #1025366 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] geo-maps: define initial mapping for South America (magru)

https://gerrit.wikimedia.org/r/1025366