We need to know how much memory we expect workers to need in the next two years. Let's plot how memory has grown over time and use that to extrapolate.
Date | Models added |
2015-09-21 | enwiki-reverted, enwiki-damaging, enwiki-goodfaith, ptwiki-reverted, ptwiki-damaging, ptwiki-goodfaith, fawiki-reverted, fawiki-damaging, fawiki-goodfaith, enwiki-wp10 |
2015-10-07 | frwiki-wp10 |
2015-10-28 | dewiki-reverted, eswiki-reverted, hewiki-reverted, itwiki-reverted, idwiki-reverted, nlwiki-reverted |
2015-11-09 | etwiki-reverted, frwiki-reverted, trwiki-reverted, trwiki-damaging, trwiki-goodfaith, viwiki-reverted, ukwiki-reverted |
2016-01-11 | wikidatawiki-reverted |
2016-02-23 | plwiki-reverted |
2016-02-24 | arwiki-reverted |
2016-04-09 | ruwiki-reverted |
2016-04-13 | ruwiki-goodfaith, ruwiki-damaging |
2016-04-22 | huwiki-reverted |
2016-04-26 | nlwiki-damaging, nlwiki-goodfaith |
2016-04-30 | wikidatawiki-damaging, wikidatawiki-goodfaith |
2016-05-19 | svwiki-reverted |
2016-06-04 | nowiki-reverted |
2016-06-07 | ruwiki-wp10 |
2016-06-30 | enwiktionary-reverted, cswiki-reverted |
2016-07-03 | plwiki-damaging, plwiki-goodfaith |
Date | RES memory (KB) |
2015-09-21 | 358736 |
2015-10-07 | 410172 |
2015-10-28 | 500972 |
2015-11-09 | 601660 |
2016-01-11 | 609792 |
2016-02-23 | 613712 |
2016-04-09 | 632268 |
2016-04-13 | 666828 |
2016-04-22 | 691328 |
2016-04-26 | 731096 |
2016-04-30 | 741752 |
2016-05-19 | 749536 |
2016-06-04 | 791620 |
2016-06-07 | 847384 |
2016-06-30 | 899496 |
2016-07-03 | 923400 |
OK. Extrapolation complete. See https://commons.wikimedia.org/wiki/File:Ores_worker_memory.linear_model_extrapolation.svg for the plot.
If we keep up the pace, we'll need about 2.1 GB of memory per worker.
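For reproducibility, here is a minimal sketch of the kind of linear extrapolation behind the linked plot, fitted to the RES memory table above. Using numpy for the fit is an assumption (the original analysis tooling isn't stated), and the exact slope and horizon in the plot may differ slightly, so treat the ~2.1 GB figure above as the authoritative estimate.

```python
# Linear fit of worker RES memory over time, extrapolated two years ahead.
# Data points are the RES memory table above (KB).
from datetime import date

import numpy as np

samples = [
    (date(2015, 9, 21), 358736),
    (date(2015, 10, 7), 410172),
    (date(2015, 10, 28), 500972),
    (date(2015, 11, 9), 601660),
    (date(2016, 1, 11), 609792),
    (date(2016, 2, 23), 613712),
    (date(2016, 4, 9), 632268),
    (date(2016, 4, 13), 666828),
    (date(2016, 4, 22), 691328),
    (date(2016, 4, 26), 731096),
    (date(2016, 4, 30), 741752),
    (date(2016, 5, 19), 749536),
    (date(2016, 6, 4), 791620),
    (date(2016, 6, 7), 847384),
    (date(2016, 6, 30), 899496),
    (date(2016, 7, 3), 923400),
]

origin = samples[0][0]
days = np.array([(d - origin).days for d, _ in samples])
res_kb = np.array([kb for _, kb in samples])

# Ordinary least-squares line: RES memory (KB) as a function of days elapsed.
slope, intercept = np.polyfit(days, res_kb, 1)

# Extrapolate two years past the last sample.
horizon = (samples[-1][0] - origin).days + 2 * 365
projected_kb = slope * horizon + intercept
print("Projected RES per worker: %.2f GB" % (projected_kb / 1024 ** 2))
```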
Right now, we have 16 * 6 = 96 workers in labs. If we were to replicate that labs capacity in production (assuming similar CPU power), we would need 96 * 2.1 GB = 201.6 GB of memory just for the ORES workers.
Assuming that in about two years most models and wikis will have been added, 2.1 GB of memory per worker is a pretty good guideline. I suppose the number of workers translates directly to the number of scores we can compute at any given point in time. Taking batching into account and guesstimating a mean number of scores per request, that would let us calculate the number of uncached requests per second we can serve, right? Which would be a helpful number to have.
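A back-of-the-envelope version of that calculation might look like the sketch below. Every parameter value here is a hypothetical placeholder (the task doesn't state per-score timing, batching gains or the scores-per-request mix); only the shape of the formula is the point.

```python
# Hypothetical throughput estimate: workers -> scores/s -> uncached requests/s.
# All numbers below are placeholders, not measurements from the task.
workers = 96                # 16 worker processes * 6 VMs, as above
mean_score_seconds = 1.0    # assumed time to compute a single score
batching_speedup = 2.0      # assumed gain from batched feature extraction
scores_per_request = 10.0   # assumed mean scores asked for per API request

scores_per_second = workers / mean_score_seconds * batching_speedup
uncached_requests_per_second = scores_per_second / scores_per_request
print("~%.0f uncached requests/s" % uncached_requests_per_second)
```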
Now to answer your question: no, 200 GB does not sound crazy. It sounds quite reasonable. We will have to split it over a number of boxes; possible numbers are 2, 4 and 6, which works out to roughly 101, 50 or 34 GB per box respectively.
I ran some numbers for CPU usage. Using https://tools.wmflabs.org/nagf/?project=ores#h_ores-worker-05_cpu, and taking as a baseline the ~20% CPU usage across the 4 cores, the ~5300 bogomips of processing power reported by the CPUs, and the 6 VMs, I've got an estimate of roughly 25.5k bogomips of current usage.
Keeping in mind that:
a) we want to have around the same average CPU usage (~20%) in the new production cluster (maybe even less, to allow us room to grow), and
b) average per-core bogomips these days is around 5k,
we will need around 30 cores in total, which is a nice low number we can easily outdo. A rough version of this arithmetic is sketched below.
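Here is a minimal sketch of that sizing arithmetic, using only the figures quoted above (the 5k bogomips-per-core planning figure is the stated assumption; bogomips remains a ballpark measure, per the note below).

```python
# Back-of-the-envelope CPU sizing from the labs figures above.
vms = 6
cores_per_vm = 4
bogomips_per_core = 5300     # reported by the labs CPUs
avg_utilisation = 0.20       # ~20% average CPU usage observed in labs

# Current consumption across the labs workers, in bogomips.
current_usage = vms * cores_per_vm * bogomips_per_core * avg_utilisation
print("current usage: ~%.1fk bogomips" % (current_usage / 1000))  # ~25.4k

# Capacity needed to stay at ~20% average utilisation in production,
# expressed in present-day cores of ~5k bogomips each.
total_capacity_needed = current_usage / avg_utilisation
cores_needed = total_capacity_needed / 5000
print("cores needed: ~%.0f" % cores_needed)  # ~25; round up to ~30 for headroom
```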
I'll start the hardware request process and look into the best way we can get those numbers into production.
Note: I know bogomips is not an exact measure of CPU power, but I only want a ballpark figure, so it should suffice for now.