
Extrapolate memory usage per worker forward 2 years
Closed, Resolved (Public)

Description

We need to know how much memory we expect workers to need in the next two years. Let's plot how memory has grown over time and use that to extrapolate.

Event Timeline

| Date | Models added |
| --- | --- |
| 2015-09-21 | enwiki-reverted, enwiki-damaging, enwiki-goodfaith, ptwiki-reverted, ptwiki-damaging, ptwiki-goodfaith, fawiki-reverted, fawiki-damaging, fawiki-goodfaith, enwiki-wp10 |
| 2015-10-07 | frwiki-wp10 |
| 2015-10-28 | dewiki-reverted, eswiki-reverted, hewiki-reverted, itwiki-reverted, idwiki-reverted, nlwiki-reverted |
| 2015-11-09 | etwiki-reverted, frwiki-reverted, trwiki-reverted, trwiki-damaging, trwiki-goodfaith, viwiki-reverted, ukwiki-reverted |
| 2016-01-11 | wikidatawiki-reverted |
| 2016-02-23 | plwiki-reverted |
| 2016-02-24 | arwiki-reverted |
| 2016-04-09 | ruwiki-reverted |
| 2016-04-13 | ruwiki-goodfaith, ruwiki-damaging |
| 2016-04-22 | huwiki-reverted |
| 2016-04-26 | nlwiki-damaging, nlwiki-goodfaith |
| 2016-04-30 | wikidatawiki-damaging, wikidatawiki-goodfaith |
| 2016-05-19 | svwiki-reverted |
| 2016-06-04 | nowiki-reverted |
| 2016-06-07 | ruwiki-wp10 |
| 2016-06-30 | enwiktionary-reverted, cswiki-reverted |
| 2016-07-03 | plwiki-damaging, plwiki-goodfaith |
| Date | RES memory (KB) |
| --- | --- |
| 2015-09-21 | 358736 |
| 2015-10-07 | 410172 |
| 2015-10-28 | 500972 |
| 2015-11-09 | 601660 |
| 2016-01-11 | 609792 |
| 2016-02-23 | 613712 |
| 2016-04-09 | 632268 |
| 2016-04-13 | 666828 |
| 2016-04-22 | 691328 |
| 2016-04-26 | 731096 |
| 2016-04-30 | 741752 |
| 2016-05-19 | 749536 |
| 2016-06-04 | 791620 |
| 2016-06-07 | 847384 |
| 2016-06-30 | 899496 |
| 2016-07-03 | 923400 |
Halfak triaged this task as High priority. Aug 4 2016, 2:06 PM
Halfak removed a project: Machine-Learning-Team.

OK. Extrapolation complete. See https://commons.wikimedia.org/wiki/File:Ores_worker_memory.linear_model_extrapolation.svg for the plot.

If we keep up the pace, we'll need about 2.1 GB of memory per worker.
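
For reference, here is a minimal sketch of how a projection like this could be reproduced from the table above. This is not the exact script behind the linked plot; numpy.polyfit is just one convenient way to fit the trend line, and the result depends on the horizon and fitting choices.

```
# Rough sketch: fit a straight line to the RES figures from the table above
# and read it off two years past the last measurement.
from datetime import date

import numpy as np

samples = [
    (date(2015, 9, 21), 358736), (date(2015, 10, 7), 410172),
    (date(2015, 10, 28), 500972), (date(2015, 11, 9), 601660),
    (date(2016, 1, 11), 609792), (date(2016, 2, 23), 613712),
    (date(2016, 4, 9), 632268), (date(2016, 4, 13), 666828),
    (date(2016, 4, 22), 691328), (date(2016, 4, 26), 731096),
    (date(2016, 4, 30), 741752), (date(2016, 5, 19), 749536),
    (date(2016, 6, 4), 791620), (date(2016, 6, 7), 847384),
    (date(2016, 6, 30), 899496), (date(2016, 7, 3), 923400),
]

origin = samples[0][0]
days = np.array([(d - origin).days for d, _ in samples])
res_kb = np.array([kb for _, kb in samples], dtype=float)

# Least-squares line: RES (KB) as a function of days since the first sample.
slope, intercept = np.polyfit(days, res_kb, 1)

# Project two years (~730 days) past the last measurement.
future_day = days[-1] + 2 * 365
projected_kb = slope * future_day + intercept
print("Projected RES per worker: %.2f GB" % (projected_kb / 1024 ** 2))
```

With this data a simple straight-line fit lands around 2 GB, in the same ballpark as the figure read off the linked plot.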

Right now, we have 16 * 6 = 96 workers in labs. If we were to replicate our labs capacity (assuming similar CPU power), we would need 96 * 2.1 GB = 201.6 GB of memory just for the ORES workers.

Assuming that in about two years most models and wikis will have been added, 2.1 GB of memory per worker is a pretty good guideline. I suppose the number of workers translates directly to the number of scores we can compute at any given point in time. Taking batching into account and guesstimating a mean number of scores per request, that lets us calculate the number of uncached requests per second we can serve, right? That would be a helpful number to have.
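
As a rough sketch of that throughput arithmetic: the worker count comes from the comment above, but the per-score time and the mean scores-per-request value below are hypothetical placeholders, not measurements from this task.

```
# Back-of-the-envelope capacity estimate.
workers = 96                  # 16 worker processes * 6 VMs, as above
seconds_per_score = 1.0       # assumed mean time to compute one score
mean_scores_per_request = 20  # assumed batching factor per request

scores_per_second = workers / seconds_per_score
uncached_requests_per_second = scores_per_second / mean_scores_per_request
print("~%.1f uncached requests/sec" % uncached_requests_per_second)
```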

Now to answer your question: no, 200 GB does not sound crazy. It sounds quite reasonable. We will have to split it over a number of boxes; possible numbers are 2, 4, and 6.

I ran some numbers for CPU usage. Using https://tools.wmflabs.org/nagf/?project=ores#h_ores-worker-05_cpu, and taking as a baseline the ~20% CPU usage across the 4 cores, the ~5300 bogomips of processing power reported by the CPUs, and the 6 VMs, I've got an estimate of roughly 25.5k bogomips of current usage.

Keeping in mind that:

a) we want to keep roughly the same average CPU usage (~20%) in the new production cluster (maybe even less, to allow us room to grow), and
b) average per-core bogomips these days is around 5k,

we will need around 30 cores in total, which is a nice low number we can easily outdo.
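
A worked version of that arithmetic, using only the inputs quoted above (and keeping in mind that bogomips is just a rough proxy, as noted at the end):

```
import math

# Estimated current usage from the nagf graphs cited above.
vms = 6
cores_per_vm = 4
bogomips_per_core = 5300
avg_utilization = 0.20

current_usage = vms * cores_per_vm * bogomips_per_core * avg_utilization
print("Current usage: ~%.1fk bogomips" % (current_usage / 1000))  # ~25.4k

# Capacity needed to serve the same load at ~20% average utilization,
# assuming ~5k bogomips per core on the new hardware.
target_capacity = current_usage / avg_utilization   # ~127k bogomips
cores_needed = math.ceil(target_capacity / 5000)    # ~26; round up toward ~30 for headroom
print("Cores needed: at least %d" % cores_needed)
```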

I'll start the hardware request process, looking into the best way we can get those numbers into production.

Note: I know bogomips is not an exact measure of the power of a CPU, but I only want a ballpark, so it should suffice for now.