
Estimate ORES capex for FY2017-18
Closed, Resolved · Public

Description

In T142046 we extrapolated ORES memory usage per worker based on current trends.

In the last couple of quarters, we did a lot of work to implement PCFGs (T148867, T151819) and we're looking to implement some more CPU intensive methods (e.g. T145812).

This task is done when we discuss the hardware needs of ORES and settle on a proposal for capex.

To-do:

  • Summarize current usage and recent trends in scoring rates
  • Estimate additional resources necessary for PCFGs, hash vectors, more models, etc.

Event Timeline

Halfak updated the task description.
Halfak renamed this task from Estimate ORES capex for FY2017 to Estimate ORES capex for FY2018. (Feb 14 2017, 6:11 PM)

Currently, a scoring request for a single revision with all available models takes 1.35s on average. We have 45 workers per node and we're currently running on 4 SCB nodes, for a total of 180 workers. That means we should be able to handle approximately 133 scoring jobs per second at max capacity, assuming traffic flows as it does under regular load. Note that this doesn't account for cached scores -- for which we could handle a much higher capacity. Currently, we start overloading at 67 scoring requests per second (non-cached), so there's some inefficiency in there.
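To make that arithmetic easy to check, here's a minimal sketch (Python, using only the figures quoted above; nothing here queries production):

```
# Back-of-the-envelope check of the capacity figures above.
MEAN_SCORE_SECONDS = 1.35    # one revision, all models
WORKERS_PER_NODE = 45
NODES = 4

total_workers = WORKERS_PER_NODE * NODES              # 180
theoretical_rps = total_workers / MEAN_SCORE_SECONDS  # ~133 scores/second

observed_overload_rps = 67  # where we actually start overloading
efficiency = observed_overload_rps / theoretical_rps  # ~0.50

print(f"{total_workers} workers -> ~{theoretical_rps:.0f} scores/s theoretical; "
      f"overload begins at ~{efficiency:.0%} of that")
```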

Currently, we see about 10-18 precaching requests per second depending on the time of day and bot spikes. With the usual usage on top of that, it's boosted to 11-20 requests per second. With the recent api.php-tastrophy, we saw 67 scoring requests per second sustained for a short period.

On each SCB node, we use ~80GB of memory and 5% of 24 CPU cores. This usage spiked to 45% of 24 CPU cores during our max load (67 scoring requests per second).

Currently, we are using the following hosts:

Web/worker servers

  • scb1001.eqiad.wmnet
  • scb1002.eqiad.wmnet (canary)
  • scb1003.eqiad.wmnet
  • scb1004.eqiad.wmnet
  • scb2001.codfw.wmnet (not usable)
  • scb2002.codfw.wmnet (not usable)

Redis servers

  • oresrdb1001.eqiad.wmnet
  • oresrdb1002.eqiad.wmnet (mirrored from oresrdb1001)

We can't fail over to codfw because we have no redis nodes there.

Aklapper renamed this task from Estimate ORES capex for FY2018 to Estimate ORES capex for FY2017-18. (Feb 17 2017, 3:53 AM)

Does this relate to the Kubernetes work at all? Could ORES be run as a Kube container?

Currently, a scoring request for a single revision with all available models takes 1.35s on average.

My estimate, factoring in the time requests spend in the queue (when the queue is almost full), is 3.3 seconds.

On each SCB node, we use ~80GB of memory and 5% of 24 CPU cores.

scb1001 and scb1002 each have a total of 32 GB of memory, so we definitely are not using 80 GB ;) It seems the celery and/or uwsgi workers are sharing memory. Can't say which.

Eek! Yeah, you're certainly right about memory usage. I got that one from our Grafana dashboard. I'll find another way to work that out. :S

Does this relate to the Kubernetes work at all? Could ORES be run as a Kube container?

It definitely could. Last I heard, that wasn't an option right now. I'd be interested in pursuing kube if ops gave the OK for a production service.

I don't know if ORES is already doing this, but we've been interested in including the ORES wp10 score in the search indices as a scoring factor. To utilize this, we would need to be sending requests for all content edits as they happen. I think this is already being done, but I wanted to make sure the hardware estimate assumes that all edits will be scored in approximately real time.

I don't know if ORES is already doing this, but we've been interested in including the ORES wp10 score in the search indices as a scoring factor. To utilize this, we would need to be sending requests for all content edits as they happen. I think this is already being done, but I wanted to make sure the hardware estimate assumes that all edits will be scored in approximately real time.

If the wp10 scoring gets done alongside other precaching tasks, it would have virtually no impact on computational resources (due to sharing features, etc.), but if it's going to be done in another way, we'd need a hell of a lot more celery workers/uwsgi workers.

FYI, over in T143743, we are talking about creating a public stream endpoint that contains various ORES scores. I betcha @EBernhardson could use this to update indices, rather than requesting every content edit score directly from ORES.

If the wp10 scoring gets done alongside other precaching tasks, it would have virtually no impact on computational resources (due to sharing features, etc.), but if it's going to be done in another way, we'd need a hell of a lot more celery workers/uwsgi workers.

It sounds like, as long as we hit the same endpoint in ORES that the precaching does, the current hardware estimates cover all the necessary work?

FYI, over in T143743, we are talking about creating a public stream endpoint that contains various ORES scores. I betcha @EBernhardson could use this to update indices, rather than requesting every content edit score directly from ORES.

Not currently, but a re-architecture could allow this. The problem is that search needs to batch together all the updates to a document into one single update, rather than sending the content updates in one piece and the wp10 scores in another. This is because a document update in search is equivalent to a delete + recreate of the whole doc, even for a single property. To utilize a streamed endpoint, we would need to trigger the update jobs based on some join of multiple streams (edit + ores) rather than the current method that triggers from mediawiki edit hooks.
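A minimal sketch of what such a join could look like, assuming events keyed by rev_id; the event field names and the emit_index_update() sink are illustrative stand-ins, not the real search pipeline:

```
from collections import defaultdict

pending = defaultdict(dict)  # rev_id -> partial pieces of the document

def emit_index_update(rev_id, doc):
    # Stand-in for the real Elasticsearch update job.
    print(f"single index update for rev {rev_id}: {sorted(doc)}")

def on_event(stream, event):
    """Buffer events until both halves of a revision's document exist,
    then emit one combined update (a doc update is delete + recreate,
    so partial updates are wasteful)."""
    rev_id = event["rev_id"]
    pending[rev_id][stream] = event
    piece = pending[rev_id]
    if "revision-create" in piece and "revision-score" in piece:
        doc = dict(piece["revision-create"])
        doc["wp10"] = piece["revision-score"]["score"]
        emit_index_update(rev_id, doc)
        del pending[rev_id]

# e.g.:
on_event("revision-create", {"rev_id": 42, "title": "Example"})
on_event("revision-score", {"rev_id": 42, "score": 0.73})
```

A production version would also need a timeout for revisions whose score event never arrives, so the buffer doesn't grow without bound.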

It sounds like, as long as we hit the same endpoint in ORES that the precaching does, the current hardware estimates cover all the necessary work?

Sorry. I wasn't clear enough. There are three scenarios:

  • If the scores are received via ChangePropagation or EventStream (which would use CP AFAIK), no technical pressure would be put on ORES.
  • If you start fetching them right after edits are made and these scores are not precached (we don't precache wp10 scores, but that can change easily), we would need a lot more capacity.
  • If they get precached (which is easy and doesn't put pressure on the server) and you request scores after a while (say, several minutes after the edit is made), we would only need more uwsgi workers (not expensive, but still some pressure). A sketch of this pattern follows below.

I hope that's clear enough.
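For concreteness, scenario 3 boils down to something like this client-side pattern (the exact ORES API version and URL shape here are assumptions for illustration, as is the revision ID):

```
import requests

# Scenario 3 in practice: ask for a wp10 score a few minutes after the
# edit. If the score was precached, this is just a cache lookup on the
# uwsgi side; no celery worker re-does the feature extraction.
rev_id = 123456  # hypothetical revision ID
resp = requests.get(
    "https://ores.wikimedia.org/v3/scores/enwiki",
    params={"models": "wp10", "revids": rev_id},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```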

To utilize a streamed endpoint we would need to trigger the update jobs based on some join of multiple streams (edit+ores) rather than the current method that triggers from mediawiki edit hooks.

@Ladsgroup From my understanding, it seems that Review Stream will likely trigger two separate events: one for the edit, and then one for the ORES score shortly after.

For the use cases of updating RESTBase summaries and processing edits for the Trending Edits API, we were planning on just ignoring the first event since we know that the ORES scores aren't ready - and just waiting for the second event to make the update.

The downside is that there will be a time delay in processing the edit… but I don't think it is a significant delay.

@Ottomata is my understanding correct here?

The current discussion is leaning that way, yes, but note that the revision-score (ORES) event will not have all of the same information that the revision-create event has. We were previously talking about merging these events into one in a new delayed stream, but it looks like we don't want to go that route right now (unless there is a huge need for it).

Given @Fjalapeno's assessment, I think we are fine if @EBernhardson gets the scores from EventStream and doesn't hit the ORES endpoint directly.

but note that the revision-score (ORES) event will not have all of the same information that the revision-create event has.

@Ottomata curious, would it be possible to keep it as 2 separate events, BUT include the information from the revision-create event on the revision-score event as well?

That gets tricky, since the source of the revision-score (change-prop? ORES?) is not the same as the source of the revision-create (MediaWiki). But! The answer is not no! I think I'd prefer not to, but we aren't sure yet.

OK. This is mostly settled, and it happened over private email. I'm going to copy some of the highlights here.

  • @Joe said that he'd like to see us increase capacity (5x) if we're going to continue to serve api.php.
  • I pushed back, saying that we can serve ~66 scores per second, which is actually quite a lot under sane query patterns (e.g., 2 parallel requests for 50 scores is ~14 scores per second, or 20% of our capacity).
  • @mark noted that we can double our capacity by running "active-active" in eqiad and codfw, so we started working out caching patterns for that.
  • @Ottomata suggested that we consider using the kubernetes setup that ops is working on, but @yuvipanda recommended not betting on that being ready soon. Instead, we're looking at spec'ing capex for a set of kubernetes-class machines for hosting ORES that can be merged into the kubernetes cluster when it's ready.
  • kubernetes servers have 16 cores and 64GB of memory. We can fit 32 web and 32 celery workers in that. Based on that estimate, we can match our current capacity in eqiad with 6 kubernetes servers. However, if our per-celery-worker memory use doubles from where it is now (as is currently estimated to happen in ~2 years), then we'll need 9 kubernetes servers. (Rough arithmetic sketched below.)
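A quick arithmetic check of the server count above, as a sketch (assuming celery workers are the binding constraint):

```
import math

# Matching today's eqiad capacity on kubernetes-class machines,
# using the figures from the bullet above.
current_celery_workers = 45 * 4   # 180 across the 4 SCB nodes
celery_per_kube_server = 32       # per 16-core / 64 GB machine

servers_needed = math.ceil(current_celery_workers / celery_per_kube_server)
print(servers_needed)  # 6; the 9-server figure adds headroom for the
                       # projected doubling of per-worker memory use
```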

The conclusion is that we're going to use some capex from FY2017 (this year) to purchase kubernetes-class machines this year to run in codfw. This will allow us to experiment with active-active and maybe even take some load off of the SCB nodes in the short term. We'll also budget for 9 additional kubernetes machines in FY2018 to run in eqiad. Once those nodes are ready, we'll be able to fully move off of the SCB machines and we'll be running active-active at double the capacity. We'll also be prepared for the increased memory usage we've projected for the next two years.

So, it looks like we have a solid plan. Really the only open question that remains is if we can use the capex from this year (was going to be for PAWS dbs, but that didn't work out) for ORES nodes in codfw. That will be up to @DarTar.

  • kubernetes servers have 16 cores and 64GB of memory. We can fit 32 web and 32 celery workers in that. Based on that estimate, we can match our current capacity in eqiad with 6 kubernetes servers. However, if our per-celery-worker memory use doubles from where it is now (as is currently estimated to happen in ~2 years), then we'll need 9 kubernetes servers.

Is it worth trying to decrease memory usage first? Shared memory or dedicated workers for the more frequently needed models could mean a pretty large reduction if I understand correctly. The usual wisdom is that hardware is cheaper than programmers, not sure if that applies to the WMF though...

On each SCB node, we use ~80GB of memory and 5% of 24 CPU cores.

scb1001 and scb1002 each have a total of 32 GB of memory, so we definitely are not using 80 GB ;) It seems the celery and/or uwsgi workers are sharing memory. Can't say which.

It's the standard sharing done by the kernel for copy-on-write data. What is actually happening is that all the model data is kept in memory only once. As to why the stats differ, it's a very long story with regards to how memory gets allocated and managed in the VM subsystem. If you are interested in the technical side, the following two blog articles are useful:

https://techtalk.intersec.com/2013/07/memory-part-1-memory-types/ and https://techtalk.intersec.com/2013/07/memory-part-2-understanding-process-memory/. The ORES model data case falls into quadrant 1, hence Anonymous Private memory. That gets counted multiple times by the usual tools, as each process believes it has a whole copy to itself. The kernel of course knows better, and since no write happens to that memory, CoW deduplication works wonders. But I would use the 80GB memory number for CAPEX planning. It's a way less infrastructure/installation-dependent metric.
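For what it's worth, if we ever want a per-process figure that doesn't multiply-count the shared pages, PSS (proportional set size) from /proc is the standard tool. A minimal, Linux-only sketch (pid selection left to the reader):

```
def pss_kb(pid):
    """Sum the Pss fields from /proc/<pid>/smaps, in kB.

    Unlike RSS, PSS splits each shared page evenly among the processes
    mapping it, so summing PSS over all celery/uwsgi workers does not
    multiply-count the copy-on-write model data."""
    total = 0
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            if line.startswith("Pss:"):
                total += int(line.split()[1])
    return total
```

Summing pss_kb() over all the celery and uwsgi workers would give a fairer cluster-wide figure than the RSS-based ~80GB from Grafana, though as noted, the RSS number is the safer one for capex.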