In order to take the Recommendation API to production, we're thinking of using Cassandra as a true storage engine (as opposed to a caching layer) for storing model predictions. The amount of data we need to store will grow over time, but to start with, we'll need about 2.2 GB of space (a greatly simplified estimate).
According to Wikipedia's per-wiki article counts, and the top 50 most-used language pairs in Content Translation, we'll need to store 5,671,519 rows of Wikidata item IDs and 50 columns of double-precision floating point numbers for those language pairs. Along with the Wikidata ID (assuming it fits in 8 bytes), 50 floating point numbers for each row will yield about 2.2 GB for all the rows. Not all language pairs need to store more than 5M rows, though. For example, while en-nb needs 5,671,519 rows, nn-nb needs only 138,664. So most rows will contain 0's for some language pairs. Here's the number of items each language pair needs to store:
'nn-nb' => 138664
'es-pt' => 428172
'pt-es' => 428172
'fr-es' => 570126
'ru-uk' => 682171
'uk-ru' => 682171
'es-ca' => 842702
'ca-es' => 842702
'ru-hy' => 1238608
'ru-kk' => 1254911
'es-gl' => 1277457
'ru-be' => 1325837
'es-ast' => 1350141
'fr-ca' => 1412828
'ru-ba' => 1437744
'en-de' => 3475689
'en-fr' => 3674604
'en-nl' => 3736523
'en-ru' => 4190437
'en-it' => 4226307
'en-es' => 4244730
'es-en' => 4244730
'en-vi' => 4492727
'en-ja' => 4560015
'en-zh' => 4659402
'en-pt' => 4672902
'en-uk' => 4872608
'en-fa' => 5040708
'en-sr' => 5063177
'en-ca' => 5087432
'en-ar' => 5089311
'en-id' => 5236201
'en-ko' => 5250857
'en-cs' => 5263312
'en-ro' => 5284228
'en-tr' => 5360692
'en-he' => 5444917
'en-gl' => 5522187
'en-el' => 5523046
'en-hi' => 5543140
'en-th' => 5546225
'en-ta' => 5547128
'en-sq' => 5589814
'en-tl' => 5589976
'en-bn' => 5612051
'en-ml' => 5613020
'en-af' => 5621210
'en-pa' => 5639996
'en-or' => 5657558
'en-nb' => 5671519
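To sanity-check the ~2.2 GB figure, here's the arithmetic for the worst case where every language pair stores a value for every row (the 8-byte ID and 8-byte double sizes are the assumptions stated above; this is raw payload, before any Cassandra storage overhead):

```python
# Back-of-the-envelope storage estimate for the full prediction table.
ROWS = 5_671_519          # rows needed by the largest pair (en-nb)
LANG_PAIRS = 50           # top 50 Content Translation language pairs
ID_BYTES = 8              # assuming the Wikidata item ID fits in 8 bytes
DOUBLE_BYTES = 8          # double-precision float

bytes_per_row = ID_BYTES + LANG_PAIRS * DOUBLE_BYTES   # 408 bytes
total_bytes = ROWS * bytes_per_row                     # 2,313,979,752 bytes
print(f"{total_bytes / 2**30:.2f} GiB")                # ≈ 2.16 GiB
```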
The above numbers will change over time, but the biggest change will come from adding new language pairs. Judging by the numbers above, on average, a new language pair will add 3,795,215 rows of floating point numbers, or ≈ 30 MB.
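The ≈ 30 MB per-pair increment follows directly from that average row count, since each new pair adds one extra double per row (same 8-byte assumption as above):

```python
# Incremental storage for one new language pair: one extra double per row.
AVG_ROWS_PER_PAIR = 3_795_215   # average of the 50 row counts listed above
DOUBLE_BYTES = 8

extra_bytes = AVG_ROWS_PER_PAIR * DOUBLE_BYTES   # 30,361,720 bytes
print(f"{extra_bytes / 1e6:.1f} MB")             # ≈ 30.4 MB
```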
We haven't decided how often to re-run our models to generate new data; I'm thinking once a quarter, and no more frequently than once a month. The new data will replace the old data.
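For concreteness, a sparse layout along these lines would avoid storing 0's for the language pairs a row doesn't have (the keyspace, table, and column names are hypothetical; nothing here is decided):

```python
# Hypothetical CQL schema (illustrative only). Clustering on lang_pair keeps
# the table sparse: a (wikidata_id, lang_pair) row exists only for pairs that
# actually have a score, instead of 50 fixed columns padded with 0's.
SCHEMA = """
CREATE TABLE IF NOT EXISTS recommendation.predictions (
    wikidata_id bigint,   -- numeric part of the Q-id, matching the 8-byte assumption
    lang_pair   text,     -- e.g. 'en-de'
    score       double,   -- model prediction
    PRIMARY KEY (wikidata_id, lang_pair)
);
"""
```

One convenient property of this layout for the refresh question: Cassandra writes are upserts, so a quarterly regeneration that rewrites the same primary keys would replace the old scores in place.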
- Where's the best place to store the data? Is it the Cassandra cluster maintained by the Analytics team (and used by AQS) or the one used by RESTBase and maintained by the Services team?
- Can the Cassandra cluster used by RESTBase handle such a use case or is it only used as a caching layer?
- Any other concerns?