In order to take the Recommendation API to production, we're thinking of using Cassandra as a true storage engine (as opposed to a caching layer) for storing model predictions. The amount of data we need to store will grow over time, but to start with, we'll need about 2.2 GB of space (a greatly simplified estimate).
According to Wikipedia's per-wiki article counts, and the top 50 most-used language pairs in Content Translation, we'll need to store 5,671,519 rows of Wikidata item IDs and 50 columns of double-precision floating point numbers for those language pairs. Along with the Wikidata ID (assuming it fits in 8 bytes), 50 floating point numbers for each row will yield about 2.2 GB for all the rows. Not all language pairs need to store more than 5M rows, though. For example, while en-nb needs 5,671,519 rows, nn-nb needs only 138,664. So most rows will contain 0's for some language pairs. Here's the number of items each language pair needs to store:
'nn-nb' => 138664
'es-pt' => 428172
'pt-es' => 428172
'fr-es' => 570126
'ru-uk' => 682171
'uk-ru' => 682171
'es-ca' => 842702
'ca-es' => 842702
'ru-hy' => 1238608
'ru-kk' => 1254911
'es-gl' => 1277457
'ru-be' => 1325837
'es-ast' => 1350141
'fr-ca' => 1412828
'ru-ba' => 1437744
'en-de' => 3475689
'en-fr' => 3674604
'en-nl' => 3736523
'en-ru' => 4190437
'en-it' => 4226307
'en-es' => 4244730
'es-en' => 4244730
'en-vi' => 4492727
'en-ja' => 4560015
'en-zh' => 4659402
'en-pt' => 4672902
'en-uk' => 4872608
'en-fa' => 5040708
'en-sr' => 5063177
'en-ca' => 5087432
'en-ar' => 5089311
'en-id' => 5236201
'en-ko' => 5250857
'en-cs' => 5263312
'en-ro' => 5284228
'en-tr' => 5360692
'en-he' => 5444917
'en-gl' => 5522187
'en-el' => 5523046
'en-hi' => 5543140
'en-th' => 5546225
'en-ta' => 5547128
'en-sq' => 5589814
'en-tl' => 5589976
'en-bn' => 5612051
'en-ml' => 5613020
'en-af' => 5621210
'en-pa' => 5639996
'en-or' => 5657558
'en-nb' => 5671519
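To sanity-check the ~2.2 GB figure, here's the arithmetic for the worst case where every language pair stores a value for every row (the 8-byte ID and 8-byte double sizes are the assumptions stated above; this is raw payload, before any Cassandra storage overhead):

```python
# Back-of-the-envelope storage estimate for the full prediction table.
ROWS = 5_671_519          # rows needed by the largest pair (en-nb)
LANG_PAIRS = 50           # top 50 Content Translation language pairs
ID_BYTES = 8              # assuming the Wikidata item ID fits in 8 bytes
DOUBLE_BYTES = 8          # double-precision float

bytes_per_row = ID_BYTES + LANG_PAIRS * DOUBLE_BYTES   # 408 bytes
total_bytes = ROWS * bytes_per_row                     # 2,313,979,752 bytes
print(f"{total_bytes / 2**30:.2f} GiB")                # ≈ 2.16 GiB
```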
The above numbers will change over time, but the biggest change will come from adding new language pairs. Judging by the numbers above, on average, a new language pair will add 3,795,215 rows of floating point numbers, or ≈ 30 MB.
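The ≈ 30 MB per-pair increment follows directly from that average row count, since each new pair adds one extra double per row (same 8-byte assumption as above):

```python
# Incremental storage for one new language pair: one extra double per row.
AVG_ROWS_PER_PAIR = 3_795_215   # average of the 50 row counts listed above
DOUBLE_BYTES = 8

extra_bytes = AVG_ROWS_PER_PAIR * DOUBLE_BYTES   # 30,361,720 bytes
print(f"{extra_bytes / 1e6:.1f} MB")             # ≈ 30.4 MB
```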
We haven't decided how often to re-run our models to generate new data; I'm thinking once a quarter, and no more frequently than once a month. The new data will replace the old data.
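For concreteness, a sparse layout along these lines would avoid storing 0's for the language pairs a row doesn't have (the keyspace, table, and column names are hypothetical; nothing here is decided):

```python
# Hypothetical CQL schema (illustrative only). Clustering on lang_pair keeps
# the table sparse: a (wikidata_id, lang_pair) row exists only for pairs that
# actually have a score, instead of 50 fixed columns padded with 0's.
SCHEMA = """
CREATE TABLE IF NOT EXISTS recommendation.predictions (
    wikidata_id bigint,   -- numeric part of the Q-id, matching the 8-byte assumption
    lang_pair   text,     -- e.g. 'en-de'
    score       double,   -- model prediction
    PRIMARY KEY (wikidata_id, lang_pair)
);
"""
```

One convenient property of this layout for the refresh question: Cassandra writes are upserts, so a quarterly regeneration that rewrites the same primary keys would replace the old scores in place.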
- Where's the best place to store the data? Is it the Cassandra cluster maintained by the Analytics team (and used by AQS) or the one used by RESTBase and maintained by the Services team?
- Can the Cassandra cluster used by RESTBase handle such a use case or is it only used as a caching layer?
- Any other concerns?