Maps hardware planning for FY16/17
Closed, ResolvedPublic

Description

We need to figure out what to request. My current thoughts are:

  • HTTP caching infrastructure, in case we're not getting it this year. Per Brandon, this would be 4 varnishes × 4 datacenters = 16 boxes, standard Varnish spec.
  • 4 tile servers in eqiad, to match what we have right now in codfw. Spec needs polishing.

Event Timeline

MaxSem created this task.Jan 28 2016, 6:57 PM
MaxSem raised the priority of this task from to Needs Triage.
MaxSem updated the task description.
MaxSem added projects: Maps-Sprint, Operations.
MaxSem added subscribers: MaxSem, Yurik, EBernhardson and 3 others.
Restricted Application added subscribers: StudiesWorld, Aklapper.Jan 28 2016, 6:57 PM
BBlack added a subscriber: mark.Jan 28 2016, 9:01 PM

Are the 4x tile servers in codfw "maps-test200x"? Are those being renamed / reused as production?

@BBlack, the 4 test servers have been performing admirably, so if possible, it would be good to keep them as production and match them in another DC for redundancy.

This will likely fit under the strategic budget so we'll need a brief narrative about going default on Wikipedia and any other projects to explain the increase of machines.

akosiaris added a comment.EditedJan 29 2016, 10:35 AM

Just making sure of something. Currently the maps cache cluster is 2 boxes and is performing quite well (with minimal load). 4 does not sound bad to me, but do we have any numbers that justify 4 or even suggest 4?

The 4x4 varnishes + 4x2 backends was initially suggested by @BBlack as the minimal platform to serve all of Wikipedias. I tried stress-testing maps from multiple labs instances, but the results were inconclusive as far as the maximum number of clients. When enwiki geohack was enabled for a few days, we were handling 250k unique IP+user-agent combinations per day, 3.6m requests for tiles (4m all requests), or about 50 requests per second, about 2.9 "unique" users per second. Varnish and backend servers showed insignificant CPU load during that time.
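As a rough check of the per-second figures quoted above, the daily totals can be divided by the number of seconds in a day. This is only a sketch: the variable and function names are illustrative, and it assumes traffic was spread evenly across the day, which real traffic is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(daily_count: float) -> float:
    """Average rate per second for a daily total, assuming an even spread."""
    return daily_count / SECONDS_PER_DAY


# Daily totals quoted in the comment above (enwiki geohack test period).
all_requests_per_day = 4_000_000
tile_requests_per_day = 3_600_000
unique_clients_per_day = 250_000  # unique IP + user-agent combinations

all_rps = per_second(all_requests_per_day)
tile_rps = per_second(tile_requests_per_day)
unique_per_sec = per_second(unique_clients_per_day)

print(f"all requests:   {all_rps:.1f}/s")
print(f"tile requests:  {tile_rps:.1f}/s")
print(f"unique clients: {unique_per_sec:.1f}/s")
```

The averages come out slightly under the rounded "about 50 requests per second" and match the "about 2.9 unique users per second" figure; peak rates would be higher than these daily averages.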

All metrics are at https://wikitech.wikimedia.org/wiki/Maps#Monitoring

I don't think I've ever made recommendations about the backend service, just the 4x4 cache/termination layer. That part isn't really a "suggestion", it's an operational minimum if we're going to support this as a production service (which it still isn't as far as I'm officially aware. We're not monitoring it, and it doesn't have reliable infrastructure to our normal standards). This was talked about way back in T109162#1542421 .

OK, that answers my question. Thanks!

Yurik added a comment.Jan 29 2016, 1:21 PM

@BBlack, thanks for the link. The purpose of this task is exactly that - to have enough hardware for this service to gain a full production status.

Yurik added a comment.Feb 2 2016, 9:47 PM

OK, it seems some confusion has been cleared up. The maps team's immediate need to launch maps for all wiki projects is 16 varnish servers (4 per cluster), and 8 backend servers (4 in each of two clusters). The backend machines need 1+ TB SSD drives. The backend machines we have been using have 96 GB of RAM and 12 cores, and this has been working out very well. @mark, please comment regarding the state of Discovery hardware funds for this fiscal year, and whether we need to allocate anything for the next year. Other than the above hardware requirements, I think the Maps project should have enough hardware to satisfy our needs for the next year. Thanks!

mark added a comment.Feb 3 2016, 3:56 PM

Are we able to use all backends across both eqiad/codfw simultaneously? Perhaps 2x 3 backend machines would be reasonable in that case?

Yurik added a project: Maps.Feb 3 2016, 7:55 PM
Yurik set Security to None.
Restricted Application added a project: Discovery.Feb 3 2016, 7:55 PM
EBernhardson added a comment.EditedFeb 3 2016, 8:24 PM

Talked with Yuri and Max about this today. Yuri is going to try to get some numbers around how many tiles we can generate with the current hardware. The varnishes, having previously served the entirety of mobile traffic, are likely plenty to handle any increase in maps usage over the next 18 months, but the actual tile rendering we don't really have hard numbers on. Choosing between 3x or 4x servers per datacenter is going to depend on what kind of sustained rendering load they can handle. Unfortunately, even after that, the biggest unknown is cache hit rate. We can only wildly guess at what the hit rate will look like. Even so, I think we can come up with a worst-case hit rate that is within reason and extrapolate from there how many users we can serve maps to, based on what Yuri finds in terms of sustained rendering throughput. We should also keep in mind the aim to have enough hardware for one datacenter to go offline and have the remaining datacenters handle the load.

In terms of budgeting, I'm under the impression that we wouldn't need to just add to one DC, but also replace the existing nodes? My understanding is that those were spare hardware that was not fit for production usage for some reason. Even if this is the case, it looks like 6 or 8 nodes will fit within the FY15-16 budget for "Search & Query infrastructure". We just need some more solid numbers to decide between 6 or 8.

Yurik added a comment.EditedFeb 4 2016, 12:29 AM

In benchmarks, maps2003 (slower, 8-core, 64 GB) handles 88 tiles/second, and maps2004 (faster, 12-core, 96 GB) handles 131 tiles/second. Assuming we get the 12-core ones for the backend, 8 backend servers should get us about 1000 tiles/second, so with a 90% cache hit rate only 10% of requests would reach the backends and the cluster would handle about 10K requests/second.
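The extrapolation above can be written out explicitly. This is a minimal sketch, assuming rendering is the only bottleneck and that every cache miss costs exactly one tile render; the function name and the extra hit rates are illustrative, not measured values.

```python
TILES_PER_SEC_12CORE = 131  # maps2004 benchmark figure from this comment
BACKENDS = 8                # proposed total number of backend servers


def max_request_rate(render_rate: float, hit_rate: float) -> float:
    """Total requests/second sustainable when only cache misses hit the renderers."""
    if not 0 <= hit_rate < 1:
        raise ValueError("hit_rate must be in [0, 1)")
    return render_rate / (1 - hit_rate)


render_capacity = BACKENDS * TILES_PER_SEC_12CORE  # 1048 tiles/second

# Hypothetical hit rates; only 90% is mentioned in the thread.
for hit_rate in (0.5, 0.8, 0.9):
    total = max_request_rate(render_capacity, hit_rate)
    print(f"hit rate {hit_rate:.0%}: about {total:,.0f} requests/second")
```

Because the result scales with 1 / (1 - hit rate), the estimate is very sensitive to the assumed hit rate, which is why a worst-case hit rate matters more here than the exact per-server benchmark.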

Analyzing the ganglia graphs above, it appears that CPU was the limiting factor in these tests, so it might be possible to increase performance by separating storage (Cassandra & Postgres) from CPU-bound rendering, and using cheaper CPU boxes as renderers. On the other hand, our hope is to switch to the client-rendered map, which means we will serve data directly from Cassandra without processing it first. Lastly, @akosiaris performed some benchmarks showing that switching from Cassandra to Postgres for vector tile storage would give an almost 10x performance increase, and I suspect that most of it will come from CPU. So in short, I propose we continue with the undivided cluster (each backend machine performing all functions) until we have more significant traffic and a better understanding of performance bottlenecks.

Deskana moved this task from Needs triage to Maps on the Discovery board.Feb 4 2016, 6:14 AM
Yurik moved this task from All map-related tasks to Kartotherian on the Maps board.Feb 7 2016, 9:57 PM
mark added a comment.Feb 8 2016, 2:57 PM

The varnishes, having previously served the entirety of mobile traffic, are likely plenty to handle any increase in maps usage over the next 18 months, but the actual tile rendering we don't really have hard numbers on.

Agreed, but we also can't simply assume that those existing Varnish servers are "free" and don't need to be budgeted for in some way. They are then no longer available as capacity for the (now combined) text + mobile cache clusters, and a portion of them will also need to be replaced in the next fiscal year, and the FY after that. Over 4-5 years, all Varnish caches used everywhere for maps need refresh. A portion of that should be part of the Maps (strategic?) budget for next FY, I think.

Choosing between 3x or 4x servers per datacenter is going to depend on what kind of sustained rendering load they can handle. Unfortunately, even after that, the biggest unknown is cache hit rate. We can only wildly guess at what the hit rate will look like. Even so, I think we can come up with a worst-case hit rate that is within reason and extrapolate from there how many users we can serve maps to, based on what Yuri finds in terms of sustained rendering throughput. We should also keep in mind the aim to have enough hardware for one datacenter to go offline and have the remaining datacenters handle the load.

Indeed. Ideally a single data center is (just) able to handle normal complete load, and in the normal situation with 2 data centers we can handle load comfortably as well as any spikes.
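To make the 3x versus 4x trade-off concrete, the rendering capacity with both data centers online can be compared with the capacity left when one is offline. This is a hedged sketch: it reuses the 131 tiles/second benchmark for a 12-core backend reported earlier in the thread and assumes capacity scales linearly with the number of servers; the function and parameter names are illustrative.

```python
# Benchmark figure reported earlier in this thread for a 12-core backend.
TILES_PER_SEC_PER_BACKEND = 131


def render_capacity(backends_per_dc: int, datacenters: int = 2, offline: int = 0) -> int:
    """Aggregate tiles/second across the data centers that are still online."""
    if not 0 <= offline <= datacenters:
        raise ValueError("offline must be between 0 and the number of data centers")
    return backends_per_dc * (datacenters - offline) * TILES_PER_SEC_PER_BACKEND


for backends_per_dc in (3, 4):
    normal = render_capacity(backends_per_dc)
    degraded = render_capacity(backends_per_dc, offline=1)
    print(f"{backends_per_dc} per DC: {normal} tiles/s normally, "
          f"{degraded} tiles/s with one DC offline")
```

Under these assumptions the degraded figure is the one to compare against the expected cache-miss load, since that is what a single data center would have to absorb on its own.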

In terms of budgeting, I'm under the impression that we wouldn't need to just add to one DC, but also replace the existing nodes? My understanding is that those were spare hardware that was not fit for production usage for some reason. Even if this is the case, it looks like 6 or 8 nodes will fit within the FY15-16 budget for "Search & Query infrastructure". We just need some more solid numbers to decide between 6 or 8.

They were repurposed, yes. I'll gather the data on them to see if/when they need refresh.

mark added a comment.Feb 9 2016, 3:32 PM

The varnishes, having previously served the entirety of mobile traffic, are likely plenty to handle any increase in maps usage over the next 18 months, but the actual tile rendering we don't really have hard numbers on.

Agreed, but we also can't simply assume that those existing Varnish servers are "free" and don't need to be budgeted for in some way. They are then no longer available as capacity for the (now combined) text + mobile cache clusters, and a portion of them will also need to be replaced in the next fiscal year, and the FY after that. Over 4-5 years, all Varnish caches used everywhere for maps need refresh. A portion of that should be part of the Maps (strategic?) budget for next FY, I think.

I think it's reasonable to say that over a 4 year cycle, we'll maintain 4x 4 Varnish cache boxes to support Maps caching traffic. Very roughly speaking, we'd be spending one fourth of that each year for refresh.

For the next FY that is actually reality: 4 of the Varnish servers currently being considered for repurposing for maps are due for refresh next FY. So I think the strategic budget request for Maps for next FY should include the cost of 4 Varnish caches.
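The refresh arithmetic above amounts to spreading the fleet evenly over the hardware lifecycle. A small sketch, with illustrative names and the simplifying assumption that purchases are spread uniformly across the cycle:

```python
def boxes_refreshed_per_year(total_boxes: int, cycle_years: int) -> float:
    """Average number of boxes to replace each year over a refresh cycle."""
    if cycle_years <= 0:
        raise ValueError("cycle_years must be positive")
    return total_boxes / cycle_years


varnish_boxes = 4 * 4  # 4 Varnish caches in each of 4 data centers
yearly_refresh = boxes_refreshed_per_year(varnish_boxes, cycle_years=4)
print(f"{varnish_boxes} boxes on a 4-year cycle: {yearly_refresh:.0f} per year")
```

That average agrees with the 4 Varnish caches proposed for next fiscal year's strategic budget request.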

In terms of budgeting, I'm under the impression that we wouldn't need to just add to one DC, but also replace the existing nodes? My understanding is that those were spare hardware that was not fit for production usage for some reason. Even if this is the case, it looks like 6 or 8 nodes will fit within the FY15-16 budget for "Search & Query infrastructure". We just need some more solid numbers to decide between 6 or 8.

They were repurposed, yes. I'll gather the data on them to see if/when they need refresh.

I looked up their data, and they are all approximately 4 years old; fine for this test, but they will indeed need refresh soon for (continued) production use. So let's assume we buy all required backends new, either in the remainder of this FY or next FY.

So it looks like the next steps are

  1. Include 4 backend boxes and 4 varnish boxes in the budget ask for strategic goals, FY16-17
  2. Acquire 4 backend boxes for eqiad from search and query budget, FY15-16

I'm not sure what timeline we should put on the 4 backend boxes for FY15-16. Since the budget is already in place we could start working on this relatively soon (as in, create a ticket and put it in the procurement queue). I'm wondering if there is any reason to wait until the FY16-17 strategic budget is accepted though, or if we should go ahead and get these boxes soon. I don't think we need the results of the strategic budget to move ahead, but I'm looking for input.

We are completing 15 years of Wikipedia. Maps in Wikipedia need an update. The effort required to create static maps for many projects is tiresome. For example, I come from India, where there are thousands of articles which require maps. Maps make the subject much easier to understand.

Usually I work on the articles in WP:HWY. In this project, all the articles need a map.

Just wondering how come budget is becoming a problem for this critical project.

Gehel added a subscriber: Gehel.Feb 18 2016, 1:17 PM

To answer @Naveenpf, to the best of my knowledge, there is no budget issue in the sense of "we do not have enough money for this, let's kill the project". There is a budgeting process that needs to be followed, we need to understand how much budget we need and where it comes from.

I'm fairly new here, so I might have misunderstood a few things. The only thing I'm fairly certain of is that we all understand that maps are important and that @Yurik will fight as much as needed to move them forward!

To answer @Naveenpf, to the best of my knowledge, there is no budget issue in the sense of "we do not have enough money for this, let's kill the project". There is a budgeting process that needs to be followed, we need to understand how much budget we need and where it comes from.
I'm fairly new here, so I might have misunderstood a few things. The only thing I'm fairly certain of is that we all understand that maps are important and that @Yurik will fight as much as needed to move them forward!

That's correct. :-)

MaxSem added a subscriber: Tnegrin.Feb 24 2016, 9:35 PM

@mark We have solidified our budget planning for next year (good thing, because it's due tomorrow!). This is our plan:

4x maps backend servers from FY15-16
4x maps backend servers from FY16-17 (strategic)
4x varnish servers from FY16-17 (strategic)

We split the maps backends over two separate FYs so that the data centers aren't in strict lockstep. I've been looking over the FY16-17 planning guidelines and one issue is that the final budget is not announced until mid-June. The budget goes live July 1st. I know we want to wait on ordering maps backend servers from the FY15-16 budget until we have a reasonable level of confidence in maps being approved in the FY16-17 strategic budget.

The C-Levels will publish their plan by the end of March. The FDC is scheduled to have their feedback in by May 16th. If, by May 16th, we have support from the executive and FDC level for maps service, can operations commit to being able to have the maps servers purchased between May 16th and June 30th?

Yurik moved this task from Backlog to Tracking on the Maps (Kartotherian) board.Mar 10 2016, 5:48 AM
mark added a comment.Mar 10 2016, 6:28 PM

We'd need to order a bit earlier than May 16th, as otherwise we risk them not arriving in time and hitting next FY budget. Let's aim for end of April, with the information we have then?

cmarqu added a subscriber: cmarqu.Mar 15 2016, 12:51 PM

@mark Yes, let's move forward with this. Thanks.

Andrew triaged this task as High priority.Apr 14 2016, 7:55 PM
Andrew added a subscriber: Andrew.

@Tfinc, this is just a drive-by, but I think that the correct next step is to open a subtask of this ticket with the specific hardware request and add the ops-hardware-requests tag.

@Gehel Given that this is approved by Jaime and in the plan pending FDC approval, do you want to close it out and re-open when we finalize the hardware, or do you want to keep this open?

Gehel added a comment.Apr 14 2016, 8:14 PM

The related hardware requests are T131180 and T131880. So yes, I think we can close this for the moment.

Gehel closed this task as Resolved.Apr 14 2016, 8:14 PM