
Extrapolate memory usage per worker forward 2 years
Closed, Resolved (Public)

Description

We need to know how much memory we expect workers to need in the next two years. Let's plot how memory has grown over time and use that to extrapolate.

Event Timeline

| Date | Models added |
| --- | --- |
| 2015-09-21 | enwiki-reverted, enwiki-damaging, enwiki-goodfaith, ptwiki-reverted, ptwiki-damaging, ptwiki-goodfaith, fawiki-reverted, fawiki-damaging, fawiki-goodfaith, enwiki-wp10 |
| 2015-10-07 | frwiki-wp10 |
| 2015-10-28 | dewiki-reverted, eswiki-reverted, hewiki-reverted, itwiki-reverted, idwiki-reverted, nlwiki-reverted |
| 2015-11-09 | etwiki-reverted, frwiki-reverted, trwiki-reverted, trwiki-damaging, trwiki-goodfaith, viwiki-reverted, ukwiki-reverted |
| 2016-01-11 | wikidatawiki-reverted |
| 2016-02-23 | plwiki-reverted |
| 2016-02-24 | arwiki-reverted |
| 2016-04-09 | ruwiki-reverted |
| 2016-04-13 | ruwiki-goodfaith, ruwiki-damaging |
| 2016-04-22 | huwiki-reverted |
| 2016-04-26 | nlwiki-damaging, nlwiki-goodfaith |
| 2016-04-30 | wikidatawiki-damaging, wikidatawiki-goodfaith |
| 2016-05-19 | svwiki-reverted |
| 2016-06-04 | nowiki-reverted |
| 2016-06-07 | ruwiki-wp10 |
| 2016-06-30 | enwiktionary-reverted, cswiki-reverted |
| 2016-07-03 | plwiki-damaging, plwiki-goodfaith |
| Date | RES memory (KB) |
| --- | --- |
| 2015-09-21 | 358736 |
| 2015-10-07 | 410172 |
| 2015-10-28 | 500972 |
| 2015-11-09 | 601660 |
| 2016-01-11 | 609792 |
| 2016-02-23 | 613712 |
| 2016-04-09 | 632268 |
| 2016-04-13 | 666828 |
| 2016-04-22 | 691328 |
| 2016-04-26 | 731096 |
| 2016-04-30 | 741752 |
| 2016-05-19 | 749536 |
| 2016-06-04 | 791620 |
| 2016-06-07 | 847384 |
| 2016-06-30 | 899496 |
| 2016-07-03 | 923400 |
Halfak triaged this task as High priority. Aug 4 2016, 2:06 PM
Halfak removed a project: Machine-Learning-Team.

OK. Extrapolation complete. See https://commons.wikimedia.org/wiki/File:Ores_worker_memory.linear_model_extrapolation.svg for the plot.

If we keep up the pace, we'll need about 2.1 GB of memory per worker.
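
For reference, here is a minimal sketch of how a projection like this could be reproduced from the table above. This is not the exact script behind the linked plot; numpy.polyfit is just one convenient way to fit the trend line, and the result depends on the horizon and fitting choices.

```
# Rough sketch: fit a straight line to the RES figures from the table above
# and read it off two years past the last measurement.
from datetime import date

import numpy as np

samples = [
    (date(2015, 9, 21), 358736), (date(2015, 10, 7), 410172),
    (date(2015, 10, 28), 500972), (date(2015, 11, 9), 601660),
    (date(2016, 1, 11), 609792), (date(2016, 2, 23), 613712),
    (date(2016, 4, 9), 632268), (date(2016, 4, 13), 666828),
    (date(2016, 4, 22), 691328), (date(2016, 4, 26), 731096),
    (date(2016, 4, 30), 741752), (date(2016, 5, 19), 749536),
    (date(2016, 6, 4), 791620), (date(2016, 6, 7), 847384),
    (date(2016, 6, 30), 899496), (date(2016, 7, 3), 923400),
]

origin = samples[0][0]
days = np.array([(d - origin).days for d, _ in samples])
res_kb = np.array([kb for _, kb in samples], dtype=float)

# Least-squares line: RES (KB) as a function of days since the first sample.
slope, intercept = np.polyfit(days, res_kb, 1)

# Project two years (~730 days) past the last measurement.
future_day = days[-1] + 2 * 365
projected_kb = slope * future_day + intercept
print("Projected RES per worker: %.2f GB" % (projected_kb / 1024 ** 2))
```

With this data a simple straight-line fit lands around 2 GB, in the same ballpark as the figure read off the linked plot.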

Right now, we have 16 * 6 = 96 workers in labs. If we were to replicate our labs capacity (assuming similar CPU power), we would need 96 * 2.1 GB = 201.6 GB of memory just for the ORES workers.

Assuming that in about two years most models and wikis will have been added, 2.1 GB of memory per worker is a pretty good guideline. I suppose the number of workers translates directly to the number of scores we can compute at any given point in time. Taking batching into account and guesstimating a mean number of scores per request, that lets us calculate the number of uncached requests per second we can serve, right? That would be a helpful number to have.
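
As a rough sketch of that throughput arithmetic: the worker count comes from the comment above, but the per-score time and the mean scores-per-request value below are hypothetical placeholders, not measurements from this task.

```
# Back-of-the-envelope capacity estimate.
workers = 96                  # 16 worker processes * 6 VMs, as above
seconds_per_score = 1.0       # assumed mean time to compute one score
mean_scores_per_request = 20  # assumed batching factor per request

scores_per_second = workers / seconds_per_score
uncached_requests_per_second = scores_per_second / mean_scores_per_request
print("~%.1f uncached requests/sec" % uncached_requests_per_second)
```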

Now to answer your question: no, 200 GB does not sound crazy. It sounds quite reasonable. We will have to split it over a number of boxes; possible numbers are 2, 4, and 6.

I ran some numbers for CPU usage. Using https://tools.wmflabs.org/nagf/?project=ores#h_ores-worker-05_cpu, and taking as a baseline the ~20% CPU usage across the 4 cores, the ~5300 bogomips of processing power reported by the CPUs, and the 6 VMs, I've got an estimate of roughly 25.5k bogomips of current usage.

Keeping in mind that:

a) we want to keep roughly the same average CPU usage (~20%) in the new production cluster (maybe even less, to allow us room to grow), and
b) average per-core bogomips these days is around 5k,

we will need around 30 cores in total, which is a nice low number we can easily outdo.
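
A worked version of that arithmetic, using only the inputs quoted above (and keeping in mind that bogomips is just a rough proxy, as noted at the end):

```
import math

# Estimated current usage from the nagf graphs cited above.
vms = 6
cores_per_vm = 4
bogomips_per_core = 5300
avg_utilization = 0.20

current_usage = vms * cores_per_vm * bogomips_per_core * avg_utilization
print("Current usage: ~%.1fk bogomips" % (current_usage / 1000))  # ~25.4k

# Capacity needed to serve the same load at ~20% average utilization,
# assuming ~5k bogomips per core on the new hardware.
target_capacity = current_usage / avg_utilization   # ~127k bogomips
cores_needed = math.ceil(target_capacity / 5000)    # ~26; round up toward ~30 for headroom
print("Cores needed: at least %d" % cores_needed)
```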

I'll start the hardware request process, looking into the best way we can get those numbers into production.

Note: I know bogomips is not an exact measure of the power of a CPU, but I only want a ballpark, so it should suffice for now.