Limit resources used by ORES
Closed, Resolved (Public)

Description

The ORES service uses significant resources on the SCB cluster. Currently, we don't enforce resource limits.

We should set up hard resource limits for ORES. The most pressing resources are RAM and disk quota. CPU priority is likely already taken care of by default systemd cgroup behavior.

Looking at http://0pointer.de/blog/projects/resources.html, limiting memory usage should not be much more effort than adding something like

[Service]
MemoryLimit=1G

Setting up disk quotas looks a bit harder. An alternative to native user quotas might be to move ORES storage & logging to a dedicated partition.

Event Timeline

elukey triaged this task as Medium priority. Oct 19 2016, 11:06 AM

The Reading team is currently investigating whether it needs to procure additional hardware to support the Mobile-Content-Service and other Reading focused[1] services through FY2018.

Since the MCS and ORES run on the same hardware, the outcome of this ticket will largely determine the answer to that question.

Is there any update on this conversation? It has been a few months since this was filed so I am not sure if this is still seen as a need.

@Halfak have you looked into this at all? I know with annual planning and your proposal for a team this could be on your radar.

[1]: Currently just the Trending-Service

Hey! Thanks for the ping. It turns out that the referenced CPU/memory issues weren't due to ORES over-use at all! Also, in the meantime, we've made several advances in performance/memory footprint of ORES. To my knowledge, we have not had issues.

@Halfak, the referenced issues were very definitely ORES using almost all memory. Things have certainly improved since, but given the fairly experimental nature a service like ORES will always have more risk of unexpected behavior than most other production services.

The last occurrence was on 2017-02-04 when the ORES MW API calls and CP pre-caching requests had to be disabled so as not to bring SCB completely down. FTR, SRE also agree on moving ORES out of SCB.

@mobrovac, I'd not been notified about SRE coming to a conclusion about moving ORES out of SCB. Can you link me to a discussion or a statement about that conclusion?

Regarding the event on 2/4, that was not a threat to bringing SCB down. ORES was operating at the top of its hard-limited capacity because of a severe over-utilization event. Our own backpressure systems were returning 503s as expected and we decided to cut off a user because their behavior was limiting the usefulness of ORES to others. At the peak, combined CPU usage across the SCB nodes was 50-60%. Not great for sustained usage, but certainly not threatening to take SCB down.

@GWicke, I remember looking into that event and determining that it was not ORES using all of the memory, but that there were things we could do to improve (reduce) ORES memory usage. Thus, the advances I referenced earlier. The experimental nature of ORES is about how it interacts with the social production dynamics of our wikis -- not in how it acts as a web service. Running a job queue isn't rocket science and I think our uptime reflects that.

So, rather than continue this debate on an unrelated phab task, here's what I propose.

  1. @GWicke, if you think that this ticket is still relevant, please make a statement about how you believe ORES resource usage continues to be problematic so that we can iterate on solutions.
  2. @Fjalapeno, please reference T157222: Estimate ORES capex for FY2017-18 for discussions about what hardware resources we'll be investing in for FY2018. I think *that* ticket will be your determining factor in whether or not ORES will be running on the same hardware.

Halfak updated the task description.

@Halfak ahh… thanks for the link! Forgot about that ticket when looking around.

@mobrovac, I'd not been notified about SRE coming to a conclusion about moving ORES out of SCB. Can you link me to a discussion or a statement about that conclusion?

Please stop distorting other people's words. This is honestly getting annoying. Re-post of my earlier comment:

FTR, SRE also agree on moving ORES out of SCB.

(bold added)

@mobrovac, let me try again. Who from SRE did you talk to? Was that agreement public? Can you link me to any discussion? I'd like to re-hash that conversation since I'm keen on resolving T157222: Estimate ORES capex for FY2017-18

Ok, let's try again :) Ops and Services have brought this up multiple times in our sync-up meetings. There is no written (public) record of it, though. For context, the discussion points were that (a) ORES is much more computationally intensive than the rest of the services hosted there; and (b) it is the only service on SCB that is not stateless. So the consensus has been basically that both in conceptual and practical terms it would be more beneficial for all parties involved if ORES were to be moved to dedicated hardware. In that sense I'm glad to see T157222: Estimate ORES capex for FY2017-18.

Great! But note that ORES is stateless unless you consider our cache to be "state". Surely "consensus" can't happen for "all parties" unless I'm involved in the conversation. If I were, I could have probably cleared up some misconceptions and asked the ops reps about how they'd like to move forward with purchasing more hardware. Maybe we can try that discussion again. I'll work that out and make sure we have a rep from services.

I don't know about disk quota, but regarding RAM: this can easily be added in puppet using the MemoryHigh= and MemoryMax= options in the systemd config. However, it's not easy to guess how much memory ORES is using now (other services on the same hosts use memory too, so the total memory figure is useless here). From an SRE perspective, what should the limit be?
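
One rough way to get a starting number for MemoryHigh= / MemoryMax= is to sum the resident memory of the processes that belong to the service. The Python sketch below does this by reading /proc; it is only an illustration, not how ORES or puppet measure anything. The function names and the "ores" match pattern are hypothetical, and matching on the command line is an assumption about how the worker processes are named. Summing VmRSS also counts pages shared between forked workers once per process, so the result should be read as an upper bound.

```python
import os
import re

# Matches the "VmRSS:   123456 kB" line in /proc/<pid>/status.
_VMRSS = re.compile(r"^VmRSS:\s+(\d+)\s+kB", re.MULTILINE)


def parse_vmrss_kib(status_text):
    """Return VmRSS in KiB from the text of /proc/<pid>/status, or 0 if absent."""
    match = _VMRSS.search(status_text)
    return int(match.group(1)) if match else 0


def total_rss_kib(pattern, proc_root="/proc"):
    """Sum VmRSS over all processes whose command line contains `pattern`.

    `pattern` is a plain substring (hypothetical example: "ores").
    `proc_root` is a parameter only so the function can be tried on a fake tree.
    """
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "cmdline"), "rb") as f:
                # Arguments are NUL-separated in /proc/<pid>/cmdline.
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
            if pattern not in cmdline:
                continue
            with open(os.path.join(base, "status")) as f:
                total += parse_vmrss_kib(f.read())
        except OSError:
            continue  # process exited or is not readable; skip it
    return total
```

Calling `total_rss_kib("ores") / 1024` on a host gives an approximate figure in MiB. Sampling it over a period of normal and peak load, then adding headroom, would be one way to pick a limit. If memory accounting is enabled for the unit, the cgroup's own counters are likely a better source than this per-process sum.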

awight claimed this task.
awight subscribed.

We've moved to a dedicated cluster—the best possible way to limit resources ;-)