
Add Link engineering: Convert mwaddlink to read/write to MySQL instead of SQLite
Closed, Resolved (Public)

Description

Per discussion in T258978, using SQLite is going to be problematic for the production service. So instead,

  1. mwaddlink needs to create and then write to MySQL tables instead of SQLite files during the model training phase
  2. mwaddlink needs to read from MySQL tables instead of SQLite when responding to queries

Remaining tasks:

  • Training pipeline to write directly to MySQL (?)

Event Timeline

I'll try to add some more depth to this. tl;dr: there are two problems we are trying to solve: 1) we currently use SQLite files to load the data needed by the query service, but we can't use files in production, and 2) we need to keep RAM usage low.


The link recommendation service has a training component, where we generate datasets on the stats machines. There are 5 datasets per language (anchors, page IDs, redirects, word2vec, nav2vec) and they are all key/value datasets that are stored in SQLite files.

In the query component of the link recommendation service, when we try to get recommended links for an article, the code loads those datasets as Python dictionary-like objects using SqliteDict:

anchors = SqliteDict("./data/{0}/{0}.anchors.sqlite".format(lang))
page_ids = SqliteDict("./data/{0}/{0}.pageids.sqlite".format(lang))
redirects = SqliteDict("./data/{0}/{0}.redirects.sqlite".format(lang))
word2vec = SqliteDict("./data/{0}/{0}.w2v.filtered.sqlite".format(lang))
nav2vec = SqliteDict("./data/{0}/{0}.nav.filtered.sqlite".format(lang))

Creating a new SqliteDict object essentially opens a connection to the SQLite file but does not run any queries, so it requires next to no CPU or RAM.

The SqliteDict class extends DictClass and overrides its core methods:

def iterkeys(self):
    GET_KEYS = 'SELECT key FROM "%s" ORDER BY rowid' % self.tablename
    for key in self.conn.select(GET_KEYS):
        yield key[0]

def itervalues(self):
    GET_VALUES = 'SELECT value FROM "%s" ORDER BY rowid' % self.tablename
    for value in self.conn.select(GET_VALUES):
        yield self.decode(value[0])

def iteritems(self):
    GET_ITEMS = 'SELECT key, value FROM "%s" ORDER BY rowid' % self.tablename
    for key, value in self.conn.select(GET_ITEMS):
        yield key, self.decode(value)

That allows our code in research/mwaddlink to do things like for anchor in anchors, with the SqliteDict class running SQLite queries behind the scenes to make iterating over the dictionary possible.
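
For example (a simplified sketch, not the exact mwaddlink code; the language and lookup values are illustrative):

from sqlitedict import SqliteDict

lang = "simple"  # illustrative value

anchors = SqliteDict("./data/{0}/{0}.anchors.sqlite".format(lang))

# Iterating over the dict goes through iterkeys(), i.e. a SELECT behind the scenes.
for anchor in anchors:
    print(anchor)

# Membership tests and key lookups also translate into single-row SELECTs.
if "mountain" in anchors:
    print(anchors["mountain"])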

If we switch to a MySQL backend, we need to write a new class that extends UserDict and overrides those methods to make MySQL queries instead.

But given that these are all key/value lookups, do we want MySQL or should we be looking at some other tool (Cassandra for example)?

(cc @Marostegui and @Joe on that last comment, we can discuss in our meeting later today.)

Actually, it looks like the key methods overridden by SqliteDict are:

def __contains__(self, key):
    HAS_ITEM = 'SELECT 1 FROM "%s" WHERE key = ?' % self.tablename
    return self.conn.select_one(HAS_ITEM, (key,)) is not None

def __getitem__(self, key):
    GET_ITEM = 'SELECT value FROM "%s" WHERE key = ?' % self.tablename
    item = self.conn.select_one(GET_ITEM, (key,))
    if item is None:
        raise KeyError(key)
    return self.decode(item[0])

So, our class would extend UserDict, override those two methods, and use a MySQL connection instead of the SQLite one.
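
For illustration, a minimal sketch of such a class, assuming the pymysql client and a hypothetical key/value table layout with lookup and value columns (not the actual mwaddlink schema):

from collections import UserDict

import pymysql


class MySqlDict(UserDict):
    """Read-only, dict-like access to a MySQL key/value table.

    Hypothetical sketch: the real implementation may differ in naming,
    connection handling and value (de)serialization.
    """

    def __init__(self, connection, tablename):
        super().__init__()
        self.conn = connection
        self.tablename = tablename

    def __contains__(self, key):
        query = "SELECT 1 FROM `{}` WHERE lookup = %s".format(self.tablename)
        with self.conn.cursor() as cursor:
            cursor.execute(query, (key,))
            return cursor.fetchone() is not None

    def __getitem__(self, key):
        query = "SELECT value FROM `{}` WHERE lookup = %s".format(self.tablename)
        with self.conn.cursor() as cursor:
            cursor.execute(query, (key,))
            row = cursor.fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]


# Usage (connection parameters are placeholders):
conn = pymysql.connect(host="localhost", user="research", password="secret", database="staging")
anchors = MySqlDict(conn, "simple_anchors")
if "mountain" in anchors:
    print(anchors["mountain"])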

@kostajh I thought we had a template, but I cannot really find it.
Once you are ready to request a new database, please create a new task for the DBA team with the following fields filled in, if you know them:

QPS:
Size: (An estimation is fine)
DB Name:
User:
Accessed from server(s):
Backup Policy: Needed? Frequency?
Grants needed:

kostajh added a subscriber: MGerlach.

> @kostajh I thought we had a template, but I cannot really find it.
> Once you are ready to request a new database, please create a new task for the DBA team with the following fields filled in, if you know them:
>
> QPS:
> Size: (An estimation is fine)
> DB Name:
> User:
> Accessed from server(s):
> Backup Policy: Needed? Frequency?
> Grants needed:

Got it, thanks @Marostegui

I wonder if the staging database would be good here?

stat1008:~$ analytics-mysql staging

mysql:research@dbstore1005.eqiad.wmnet [staging]> show tables;

+-----------------------------------------------------------------+
| Tables_in_staging                                               |
+-----------------------------------------------------------------+
...
| halfak_eligible_user                                            |
...
| nettrom_creations_from_page                                     |
| nettrom_creations_from_page_sources                             |
| nettrom_creations_from_revision                                 |
| nettrom_creations_from_revision_sources                         |

@nettrom_WMF I tried to create a table but was denied. Is there a special request process for getting tables in staging?

> I wonder if the staging database would be good here?
>
> stat1008:~$ analytics-mysql staging
>
> mysql:research@dbstore1005.eqiad.wmnet [staging]> show tables;
>
> +-----------------------------------------------------------------+
> | Tables_in_staging                                               |
> +-----------------------------------------------------------------+
> ...
> | halfak_eligible_user                                            |
> ...
> | nettrom_creations_from_page                                     |
> | nettrom_creations_from_page_sources                             |
> | nettrom_creations_from_revision                                 |
> | nettrom_creations_from_revision_sources                         |
>
> @nettrom_WMF I tried to create a table but was denied. Is there a special request process for getting tables in staging?

Hmm, possibly? Just noting, at launch of this feature we will have 4 language wikis (plus test wiki, so 5), and each of them will have 5 datasets (each dataset is a table), so that will be 25 tables. Sometime in Q3 we would expand this to 20 wikis, so we'd end up with 125 tables in that staging DB (which currently has 153).

> Hmm, possibly? Just noting, at launch of this feature we will have 4 language wikis (plus test wiki, so 5), and each of them will have 5 datasets (each dataset is a table), so that will be 25 tables. Sometime in Q3 we would expand this to 20 wikis, so we'd end up with 125 tables in that staging DB (which currently has 153).

Ah, okay! Yeah that's a good point, you wouldn't want to use staging for that.

Also @nettrom_WMF: oops, I mixed up being denied database creation with table creation. So never mind :)

To elaborate on my last comment: the staging database might actually be fine for local testing of the research/mwaddlink dataset output; I could imagine @MGerlach using a fixed number of tables there (5) for training a model and verifying the datasets.

But whatever we end up using for the pipeline (T266826) will, I imagine, need something different from the staging DB.

If we end up with anything more complicated or unusual than MySQL, it should be configurable (since the interface is just dict field access, that should be easy), so we can switch back to SQLite or something similar in developer environments, where we probably don't want to bother installing Cassandra just for key-value lookups. The same might even go for MySQL: it's ubiquitous for MediaWiki developers, but maybe not for everyone else.
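
For illustration, the backend switch could be something as simple as the following sketch (the environment variable names, table naming, and the MySqlDict class from the sketch in the earlier comment are all hypothetical, not actual mwaddlink code):

import os

from sqlitedict import SqliteDict


def get_dataset(lang, name):
    """Return a dict-like object for one dataset, picking the backend from
    an environment variable. Hypothetical sketch: env-var names, table naming
    and MySqlDict (the sketch class above) are illustrative only."""
    backend = os.environ.get("MWADDLINK_BACKEND", "sqlite")
    if backend == "sqlite":
        return SqliteDict("./data/{0}/{0}.{1}.sqlite".format(lang, name))
    if backend == "mysql":
        import pymysql

        conn = pymysql.connect(
            read_default_file=os.environ.get("MWADDLINK_MYSQL_CONF", "~/.my.cnf"),
            database="staging",
        )
        return MySqlDict(conn, "{0}_{1}".format(lang, name))
    raise ValueError("Unknown backend: {}".format(backend))


anchors = get_dataset("simple", "anchors")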

Change 638484 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Add support for querying from MySQL

https://gerrit.wikimedia.org/r/638484

Change 638677 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Implement MySQL read/write support

https://gerrit.wikimedia.org/r/638677

Change 638484 merged by jenkins-bot:
[research/mwaddlink@main] Add support for querying from MySQL

https://gerrit.wikimedia.org/r/638484

Change 638677 merged by jenkins-bot:
[research/mwaddlink@main] Implement MySQL read/write support

https://gerrit.wikimedia.org/r/638677

Change 639789 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Add mariadb-client

https://gerrit.wikimedia.org/r/639789

Change 639789 merged by jenkins-bot:
[research/mwaddlink@main] Add mariadb-client

https://gerrit.wikimedia.org/r/639789

I think it probably makes sense to continue writing to SQLite files, then reading from and copying into MySQL tables.

We need to integrate this step into the run-pipeline.sh script, though.

We also need to look into some edge cases where insertions fail due to what look like character encoding issues.

> I think it probably makes sense to continue writing to SQLite files, then reading from and copying into MySQL tables.

Agree.

> We need to integrate this step into the run-pipeline.sh script, though.

I can add this to the script.

> We also need to look into some edge cases where insertions fail due to what look like character encoding issues.

When we copy, I can make a quick check and count how often this happens for a few languages. If the number is not too large, this should not affect the performance too much except that we might miss a few recommendations.
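
As a rough sketch of what such a copy-and-count step might look like (assuming pymysql and a hypothetical table layout with lookup and value columns; not the actual pipeline code):

import pymysql
from sqlitedict import SqliteDict


def copy_sqlite_to_mysql(sqlite_path, tablename, conn):
    """Copy one key/value dataset from SQLite into a MySQL table, counting
    rows that fail to insert (e.g. due to character encoding issues).
    Hypothetical sketch: the real pipeline step, table layout and value
    serialization may differ."""
    failed = 0
    source = SqliteDict(sqlite_path)
    insert = "INSERT INTO `{}` (lookup, value) VALUES (%s, %s)".format(tablename)
    with conn.cursor() as cursor:
        for key, value in source.items():
            try:
                # Assumes values are strings/bytes; pickled objects would
                # need to be re-serialized before insertion.
                cursor.execute(insert, (key, value))
            except pymysql.Error:
                failed += 1
    conn.commit()
    return failed

The returned count could then be logged per language to see how often the encoding issue actually occurs.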

Change 643489 had a related patch set uploaded (by Kosta Harlan; owner: MGerlach):
[research/mwaddlink@main] Add read_default_file as additional argument for mysql-read/write access

https://gerrit.wikimedia.org/r/643489

Change 643489 merged by jenkins-bot:
[research/mwaddlink@main] Add read_default_file as additional argument for MySQL access

https://gerrit.wikimedia.org/r/643489

> We also need to look into some edge cases where insertions fail due to what look like character encoding issues.

> When we copy, I can make a quick check and count how often this happens for a few languages. If the number is not too large, this should not affect the performance too much except that we might miss a few recommendations.

I tried adding tables to MySQL with data from the trained models for 7 languages (ar, cs, de, ko, pt, simple, vi). Errors from bad formatting happen but are rare (100 or fewer cases out of millions across languages); one outlier is the redirects table for ar with 2144 exceptions, which is still small compared to the overall number of entries (~1M). Thus I don't think this will have a major impact on the performance of the algorithm. We should keep it in mind and fix it at some point, but it's not urgent.

Change 643911 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Load all anchor keys before checking for mention presence

https://gerrit.wikimedia.org/r/643911

Change 643911 merged by jenkins-bot:
[research/mwaddlink@main] Log additional query detail

https://gerrit.wikimedia.org/r/643911

> I think it probably makes sense to continue writing to SQLite files, then reading from and copying into MySQL tables.

> Agree.

Added data from the SQLite files and the model into MySQL tables (in the staging database) for 7 languages (ar, cs, de, ko, pt, simple, vi).

> We need to integrate this step into the run-pipeline.sh script, though.

> I can add this to the script.

I added the step to the training pipeline in run-pipeline.sh.
I encountered an issue with making sure the link recommendation service uses the updated model after re-running the training pipeline: T268901

I think the two items in the task description are done, and the training pipeline work is in T266826.