
Can we have more RAM on stat machines
Closed, Resolved · Public

Description

Can we have more RAM on stats machines?

This is mostly a question for @elukey and dc-ops. Could we explore whether it is easy to add more RAM to the stat machines, especially the newer ones we have? To give you a sense of the desired state, the ideal situation is 1-2 TB of RAM.

Event Timeline

Nuria created this task.Jun 17 2020, 8:55 PM
leila added a subscriber: leila.Jun 17 2020, 9:15 PM

For context, Nuria and I discussed this topic just now, and I'm going to make some changes to the title and description of the task to expand it and make it more specific.

leila renamed this task from Can we have more RAM on stats machines to Can we have more RAM and/or Storage on stat machines.Jun 17 2020, 9:21 PM
leila updated the task description.
Nuria updated the task description.Jun 17 2020, 9:25 PM
leila renamed this task from Can we have more RAM and/or Storage on stat machines to Can we have more RAM on stat machines.Jun 17 2020, 9:47 PM
leila updated the task description.

@leila Usually this works if there are free slots for extra RAM banks, or if we can replace the existing ones with bigger banks; but 1-2 TB of RAM is not used even by the most powerful database host, so I am wondering why we need such a big amount :)

What are the use cases? Maybe something around 128/256G could work as well? (seems more doable).

@elukey beat me to comment (as usual :)
Just as a comparison, the hadoop cluster has 4TB total over 54 nodes.
Machines with more than 256G or 512G (which is already a lot!) are usually small super-computers (clusters).

Nuria added a subscriber: diego.Jun 18 2020, 4:07 PM

Other things we talked about:

  • can't this use case benefit from GPU resources? (cc @diego)
  • could we work on these models running distributed on hadoop?
leila added a comment.Jun 18 2020, 7:24 PM

Thanks for chiming in, all. I'm coordinating the ask on behalf of my team to make sure I'm capturing the needs of all the folks on the team. I'm not going to ping specific people on our team to chime in on this task, so that I can convey the bigger ask clearly. If you have specific questions that I can't answer, I'm happy to take them back to folks on my team or ping specific people on the task. With this in mind, I will try to address the points raised above. :)

  • why not hadoop? Nuria and I discussed this yesterday. At a high level, the reason is that certain technology stacks we want to experiment with are not hadoop compatible, at least not without research scientists spending a lot of time making them compatible, which we want to avoid at the experimentation stages of the research. Two examples are graph processing and language models using embedding vectors. Both of these require large RAM. The former is typically not doable in hadoop (compatible code from available packages such as graphframes provides only limited functionality). An example of the latter is fastText (see the sketch after this list).
  • RAM size? I'm representing the needs of the whole team here. I see experiments in which one research scientist needs 256GB of RAM when a spike of exploration and research work is happening. I have 5 heavy users of these resources on my team, 3 of whom are very heavy users. We have interns who are working on annual plan commitments on stat machines (they need exploration and piloting space and, for similar reasons as mentioned earlier, can't do all of their work in hadoop), and the team activity in this space is growing during FY21. So we will be hitting these machines more. I want to have 1TB+ (if possible) in one of the stat machines so we don't have to come back to this request in 6 months again.
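(Illustrative aside: a minimal sketch of the kind of single-machine embedding training referred to above, assuming fastText's official Python bindings are available on a stat host; the corpus path and hyperparameters are hypothetical, not taken from this task.)

```python
# Minimal sketch: train skip-gram embeddings with fastText's Python bindings.
# Assumptions: the `fasttext` package is installed and a plain-text corpus
# exists at the (hypothetical) path below. fastText keeps the full vocabulary
# and embedding matrix in the RAM of a single machine, which is where the
# memory requirement comes from.
import fasttext

model = fasttext.train_unsupervised(
    "/srv/home/researcher/corpus.txt",  # hypothetical local corpus
    model="skipgram",
    dim=300,     # embedding dimension
    thread=16,   # multi-threaded training, but single-machine only
)

# The whole (vocabulary x dim) matrix lives in memory; queries are local lookups.
vector = model.get_word_vector("wikipedia")
model.save_model("/srv/home/researcher/embeddings.bin")
```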

I mentioned this to Nuria yesterday and will underline it here as well: I empathize with the goal of sending more and more of the computation work to hadoop, and my team and I are committed to doing that when possible (we're in fact going to hire a research engineer who will help us do even more of that). However, not all the early exploratory work that research scientists do can happen in hadoop, and we need much more RAM on the stat machines in order to be able to do that work. The cost of not having that RAM at the moment is a significant decrease in efficiency on our end (not your fault, this is the first time I'm bringing this request up so clearly to you all :).

I want to have 1TB+ (if possible) in one of the stat machines so we don't have to come back to this request in 6 months again.

Probably RAM can be increased on the smaller machines (that have 32G) to 10 times that amount. As far as I can tell, not even AWS offers "hardware" with 1TB of RAM to do model training (they go as far as 0.7 TB, and those cost $4 an hour, so it is quite an expensive offering). We can look into how much RAM we can provision and will update the ticket with that.

(for the record, Nuria and I talked briefly now.)

Seeking information on how much RAM we can add to one of the existing machines sounds good to me. Once we have the technical constraints and the cost, we can decide what makes sense to do. I appreciate you looking into this.

I will try to add a last minute request for the budget, but realistically I think that only two/three stat100x boxes may go up to 128G of RAM from their current status. There are multiple reasons for this:

  1. Some stat100x are very old (like stat1004), so spending money on them is not cost effective (basically already out of warranty etc..).
  2. The hosts are different in hw specs etc.., and I am not sure if we can upgrade RAM at the current stage and by how much.
  3. The biggest hosts that we have in the WMF are running 512G of RAM, and are database hosts :D

Also @leila, computations can be spread over multiple hosts; they are all the same now (config-wise), and data can be freely copied via rsync when needed. We spent a lot of time making this happen to ease the usage of these machines for all analytics users; the goal was to avoid clusters of usage like on stat1007 (a lot of resource usage complaints in the past were related to specific hosts being overwhelmed). This is to say that even if there are 3/5 users that need a lot of capacity, they can be spread among multiple nodes rather than only one.
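(Illustrative aside: a minimal sketch of copying an intermediate dataset between stat hosts with rsync, as described above; the hostnames and paths are hypothetical examples, not taken from this task.)

```python
# Minimal sketch: push a working directory from the current stat host to
# another one so a heavy job can continue on a less loaded machine.
# Assumptions: rsync and SSH access between the hosts are available; the
# source/destination paths and the stat1008 hostname are illustrative.
import subprocess

subprocess.run(
    [
        "rsync", "-av", "--progress",
        "/srv/home/researcher/embeddings/",                          # local source
        "stat1008.eqiad.wmnet:/srv/home/researcher/embeddings/",     # remote destination
    ],
    check=True,  # raise if rsync exits with a non-zero status
)
```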

In my opinion, an experiment that saturates 64G/128G of ram so fast is a good watermark to delimit what should be "distributed" and what should not, and we can probably help when needed. I completely get your point about being flexible in hw specs so we don't spend too much development/research time on early optimizations, but we'd also need to find a good trade-off to avoid oversizing hw specs. My 2c :)

Hi, I did some research in relation to the topic. Writing my findings here (sorry for the disorganization).
Software:

  • Fasttext is a C++ library built at Facebook (FAIR) to do machine learning on text (classification and embeddings). It is an extension of the original word2vec library, adding learning on words' morphological structure (particularly interesting for languages where word morphology conveys meaning, such as German or Turkish). Internally, it reads the input data from a single thread and uses multi-threading for model building. It doesn't provide ways to distribute the computation across multiple machines. More on how FastText works internally here (blog post), here (blog post) and here (scientific paper).
  • GenSim - A project trying to provide a single entry point for libraries working with text (it is supposedly built for topic modeling, but I assume you can do more with it).
  • Spark has a distributed version of the Word2Vec algorithm, but this Quora page explains in a not-too-complex way why it is difficult to distribute embeddings computation (a memory limitation, as every worker needs to store the full vocabulary) - see the sketch below.
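(Illustrative aside: a minimal sketch of Spark's built-in distributed Word2Vec from pyspark.ml; the tiny inline dataframe and column names are made up, and in practice the input would come from a Hive table or Parquet files on HDFS.)

```python
# Minimal sketch of Spark's distributed Word2Vec (pyspark.ml.feature.Word2Vec).
# Assumptions: run through a SparkSession on the Hadoop cluster; the inline
# dataframe and column names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("word2vec-sketch").getOrCreate()

# Each row is a tokenized sentence; a real job would read from Hive/Parquet.
df = spark.createDataFrame(
    [(["free", "knowledge", "for", "everyone"],),
     (["wikipedia", "is", "free", "knowledge"],)],
    ["tokens"],
)

w2v = Word2Vec(vectorSize=100, minCount=1, inputCol="tokens", outputCol="vec")
model = w2v.fit(df)

# The trained vectors form a single (vocabulary x vectorSize) table that each
# worker needs during training -- the memory limitation mentioned above.
model.getVectors().show(truncate=False)
```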

Hardware:

  • AWS offers VERY large RAM machines. The biggest one I found is u-24tb1.metal, offering 448 logical CPUs (224 physical cores, multithreaded) and 24576GB of RAM (24TB!). Interestingly there is no pricing for such a beast :) An example of a less high-end one with a price (to get an idea) is x1e.32xlarge, offering 128 logical CPUs, 3.9TB of RAM and 2x2TB SSDs - costing $26.6 per hour...
  • I found examples of servers with 64 cores and 1TB of RAM to buy on the internet. The examples I found were not that expensive (~$5k).
Nuria added a comment.Jun 19 2020, 2:51 PM

Let's do some exploration with the dc-ops team as to what kind of hardware we could provision, both in terms of bumping up the RAM of existing machines and in terms of ordering new refreshes for these when the time comes.

leila added a comment.Jun 19 2020, 4:03 PM

I will try to add a last minute request for the budget,

In case it helps, I gave Faidon a heads up about this conversation yesterday and he was okay with us not capturing the cost as part of the CapEx budget count now, given that the expected additional cost will be relatively minor (w.r.t. the whole CapEx). This being said, if we can get a number in CapEx, that's even better. Thanks!

but realistically I think that only two/three stat100x boxes may go up to 128G of RAM from their current status.

I do hope at the end you have better news for me. ;)

There are multiple reasons for this:

  1. Some stat100x are very old (like stat1004), so spending money on them is not cost effective (basically already out of warranty etc..).
  2. The hosts are different in hw specs etc.., and I am not sure if we can upgrade RAM at the current stage and by how much.

I totally understand this.

  3. The biggest hosts that we have in the WMF are running 512G of RAM, and are database hosts :D

right. :) not a strong reason not to upgrade though, imho.

Also @leila, computations can be spread over multiple hosts; they are all the same now (config-wise), and data can be freely copied via rsync when needed. We spent a lot of time making this happen to ease the usage of these machines for all analytics users; the goal was to avoid clusters of usage like on stat1007 (a lot of resource usage complaints in the past were related to specific hosts being overwhelmed). This is to say that even if there are 3/5 users that need a lot of capacity, they can be spread among multiple nodes rather than only one.

I did mention this in my chat with Nuria but not here: the team is really thankful for all the work you have done so far to make the stat machines better. There were a few things they specifically called out: stat1008 now having a GPU, SWAP notebooks having access to all machines, and expanded availability of data sources on each machine (better hadoop access, etc.). This is to say: the fundamental work you all have done is visible to us and we really appreciate it. :)

The above being said, some types of research (such as graph processing, as I mentioned earlier) can be heavy users of RAM, and one researcher can easily take up the whole 128G in one go and still run out of RAM. (One thing I talked with Nuria about that you may want to know as well is that these needs are spiky and yet real. We're not talking about 1TB of RAM being used at capacity every day of the week, but when the need arises (once a week, once a month for one researcher), we sometimes need a lot of RAM.) I should say that we have also considered doing some of the processing outside of the WMF servers. That, as you can imagine, quickly runs into challenges: part of the data that we work with is private and can't go to third parties. Even when it's not private, we need access to the whole data infrastructure in many cases, so taking parts of the data out won't help anyway.

In my opinion, an experiment that saturates 64G/128G of ram so fast is a good watermark to delimit what should be "distributed" and what should not, and we can probably help when needed.

Assuming that adding hw is not causing a lot of challenges for you all and the DC Ops folks, I gently disagree, for two reasons: first, your team's resources are also very limited and we're being very cautious about asking your team to support us in figuring out how to optimize code (we're going to hire a research engineer for that). Second, we're relatively frequently experimenting with state-of-the-art or very experimental machine learning models, and the cost of waiting at experiment time is very high. To give you a sense: I'm building a prediction model to detect an outcome. I read a paper and see that model X, which has just been put out, may do really well for my problem. I want to be able to quickly test this model X on the problem we have. If at this stage I need to wait even just a few days for code to be ready for me to test, the research flow will get broken.

I want to be very clear that at the research experimentation level, in many cases, we can use hadoop and we do it. The challenge is that in many other cases we can't, and that's why I'm asking for more RAM here.

I completely get your point about being flexible in hw specs so we don't spend too much development/research time on early optimizations, but we'd also need to find a good trade-off to avoid oversizing hw specs. My 2c :)

I appreciate this. :) Can you expand on your concern about oversizing hw? Is it about spending where we don't need it and inflating the budget, or some other reason(s)? :)

leila added a comment.Jun 19 2020, 4:06 PM

Hi, I did some research in relation to the topic. Writing my findings here (sorry for the disorganization).
Software:

  • Fasttext is a C++ library built at Facebook (FAIR) to do machine learning on text (classification and embeddings). It is an extension of the original word2vec library, adding learning on words' morphological structure (particularly interesting for languages where word morphology conveys meaning, such as German or Turkish). Internally, it reads the input data from a single thread and uses multi-threading for model building. It doesn't provide ways to distribute the computation across multiple machines. More on how FastText works internally here (blog post), here (blog post) and here (scientific paper).
  • GenSim - A project trying to provide a single entry point for libraries working with text (it is supposedly built for topic modeling, but I assume you can do more with it).
  • Spark has a distributed version of the Word2Vec algorithm, but this Quora page explains in a not-too-complex way why it is difficult to distribute embeddings computation (a memory limitation, as every worker needs to store the full vocabulary).

Thanks for sharing these. And to underline an earlier point: these were just two examples. There are more, and, as I explained in response to elukey's point just now, they delay experimentation and exploration in the early stages of research, which can sometimes cause major inefficiencies.

Hardware:

  • AWS offers VERY large RAM machines. The biggest one I found is u-24tb1.metal, offering 448 logical CPUs (224 physical cores, multithreaded) and 24576GB of RAM (24TB!). Interestingly there is no pricing for such a beast :) An example of a less high-end one with a price (to get an idea) is x1e.32xlarge, offering 128 logical CPUs, 3.9TB of RAM and 2x2TB SSDs - costing $26.6 per hour...
  • I found examples of servers with 64 cores and 1TB of RAM to buy on the internet. The examples I found were not that expensive (~$5k).

Thanks for looking into this. It's very helpful. I did some research in parallel, and while my numbers were higher than yours, they were not more than 2x higher.

I will try to add a last minute request for the budget,

In case it helps, I gave Faidon a heads up about this conversation yesterday and he was okay with us not capturing the cost as part of the CapEx budget count now, given that the expected additional cost will be relatively minor (w.r.t. the whole CapEx). This being said, if we can get a number in CapEx, that's even better. Thanks!

This is done, Faidon added some budget for extra RAM!

There are multiple reasons for this:

  1. Some stat100x are very old (like stat1004), so spending money on them is not cost effective (basically already out of warranty etc..).
  2. The hosts are different in hw specs etc.., and I am not sure if we can upgrade RAM at the current stage and by how much.

I totally understand this.

  3. The biggest hosts that we have in the WMF are running 512G of RAM, and are database hosts :D

right. :) not a strong reason not to upgrade though, imho.

It is a good indicator that we may be over-provisioning a client node; this is why I am bringing it up. Workloads are different of course, but it is always good to have some perspective on the rest of the infra :)

The above being said, some types of research (such as graph processing, as I mentioned earlier) can be heavy users of RAM, and one researcher can easily take up the whole 128G in one go and still run out of RAM. (One thing I talked with Nuria about that you may want to know as well is that these needs are spiky and yet real. We're not talking about 1TB of RAM being used at capacity every day of the week, but when the need arises (once a week, once a month for one researcher), we sometimes need a lot of RAM.) I should say that we have also considered doing some of the processing outside of the WMF servers. That, as you can imagine, quickly runs into challenges: part of the data that we work with is private and can't go to third parties. Even when it's not private, we need access to the whole data infrastructure in many cases, so taking parts of the data out won't help anyway.

Sure, makes sense, I completely understand. One TB of RAM still feels like a lot :)

In my opinion, an experiment that saturates 64G/128G of ram so fast is a good watermark to delimit what should be "distributed" and what should not, and we can probably help when needed.

Assuming that adding hw is not causing a lot of challenges for you all and the DC Ops folks, I gently disagree, for two reasons: first, your team's resources are also very limited and we're being very cautious about asking your team to support us in figuring out how to optimize code (we're going to hire a research engineer for that). Second, we're relatively frequently experimenting with state-of-the-art or very experimental machine learning models, and the cost of waiting at experiment time is very high. To give you a sense: I'm building a prediction model to detect an outcome. I read a paper and see that model X, which has just been put out, may do really well for my problem. I want to be able to quickly test this model X on the problem we have. If at this stage I need to wait even just a few days for code to be ready for me to test, the research flow will get broken.

Completely understand.

I completely get your point about being flexible in hw specs so we don't spend too much development/research time on early optimizations, but we'd also need to find a good trade-off to avoid oversizing hw specs. My 2c :)

I appreciate this. :) Can you expand on your concern about oversizing hw? Is it about spending where we don't need it and inflating the budget, or some other reason(s)? :)

My main concern is that we are striving for a lot of RAM based on some high-level calculations that may not hold if we simply follow best practices (for example, what Joseph was suggesting: more use of Spark, better evangelization of tools, etc.).

To be clear: I'll do my best to see how much RAM can be added to the stat100x boxes; I am only suggesting that we establish a better set of guidelines about how to use the full potential of what we have right now. There might be some best practices that we could work on together to alleviate the pain points of using hadoop, etc.

Nuria added a comment.Jun 19 2020, 4:51 PM

To give you a sense: I'm building a prediction model to detect an outcome. I read a paper and see that model X, which has just been put out, may do really well for my problem. I want to be able to quickly test this model X on the problem we have. If at this stage I need to wait even just a few days for code to be ready for me to test, the research flow will get broken.

This makes a lot of sense; it is hard to do science if you are stopped by engineering hurdles at the exploratory phase.

elukey mentioned this in Unknown Object (Task).Jun 22 2020, 3:42 PM

Opened T256017 to SRE :)

Aklapper removed a project: Analytics.Jul 4 2020, 7:59 AM
elukey closed this task as Resolved.Aug 14 2020, 9:00 AM
elukey claimed this task.

Closing this task since we reached a consensus about what to buy in T256017.