
Migrate database from SQLite to MySQL
Open, Needs Triage, Public

Description

Context

The Image Suggestion API needs to move to production, which will require the database to change from SQLite to MySQL.

Acceptance Criteria
  • Convert the API to use MySQL
  • Convert the SQLite data to MySQL dump format (see the sketch below)
  • Coordinate with a DBA to do a one-time import
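
For the dump-format conversion, something along these lines could work. This is only a rough sketch: the table name, columns, and file paths are placeholders rather than the real schema, and the output would still need CREATE TABLE statements, type mapping, and character-set handling before the one-time import.

# Rough sketch: read rows out of the SQLite file and emit MySQL-compatible
# INSERT statements. "image_suggestions" and the file names are placeholders,
# not the actual schema or paths.
import sqlite3

def dump_table(sqlite_path, table, out):
    conn = sqlite3.connect(sqlite_path)
    conn.row_factory = sqlite3.Row
    for row in conn.execute(f"SELECT * FROM {table}"):
        cols = ", ".join(row.keys())
        vals = ", ".join(
            "NULL" if v is None
            else str(v) if isinstance(v, (int, float))
            else "'" + str(v).replace("\\", "\\\\").replace("'", "\\'") + "'"
            for v in row
        )
        out.write(f"INSERT INTO {table} ({cols}) VALUES ({vals});\n")
    conn.close()

with open("image_suggestions.sql", "w") as f:
    dump_table("image_suggestions.sqlite", "image_suggestions", f)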

Event Timeline

We need to submit a request to the SRE Data Persistence team for the Maria or MySQL instance. More info on what we should ask for here: https://wikitech.wikimedia.org/wiki/MariaDB#Database_creation_template

FWIW there is T280042: New database request: image_matching, although that task may specify more traffic and storage capacity than we need in the short-to-medium term (I'm not sure about other stakeholders' needs).

As @kostajh mentioned, T288042: Create ImportFailure concept contained some broad estimates, because it attempted to anticipate how the current PoC values might scale to production.

@sdkim, @MMiller_WMF, what does our roadmap look like now regarding when/if the PoC might scale up to include all 300+ wikis? Can we simplify this request by setting more modest goals that might not meet eventual full-deployment requirements, but which look far enough ahead to be worth our colleagues' time? We don't want to request something that won't satisfy our medium-term needs, as we'd just have to re-request. But if we're still in a somewhat experimental phase with limited anticipated traffic and a limited number of wikis, over-requesting might cost more time and money than necessary.

Some of the concerns in the previous ticket were around ingestion and snapshots. Do we have ambitious production-esque SLA requirements at this stage, or can some of those be relaxed? Specifically, can we tolerate some downtime on the order of minutes or even hours while data is being ingested and any ingestion errors are manually resolved? One option is to make the service fail gracefully during ingestion.

If our goals in the relevant timeframe include working out details of the data pipeline, relaxing those things would not be helpful. But if our goals are more narrowly focused around things like reader/editor response to image suggestions and perceived quality of suggestions, maybe a simpler storage request would suffice.

@BPirkle -- thank you for working on this. I can give some info from my perspective, and @kostajh or @Tgr can speak to the ingestion, pipeline, and SLA questions.

  • For our first iteration, it is important to us that we do use an API built by your team (as opposed to loading the data ourselves). The collaboration and interlock between our code is a goal unto itself for this project. That said, we feel fine about the API not being fully featured (as you know).
  • We'll start by deploying Iteration 1 to between 3 and 6 wikis in November. @kostajh or @Tgr might be able to tell us how many API requests per day that might add up to, based on looking at usage for the similar "add a link" workflow.
  • We would want to increase to all, or nearly all, Wikipedias within about six months after the initial deployment. That would put us in June 2022. I think that @sdkim and I estimated that would align with your team's planned work around the next version of the API.

Thanks, @MMiller_WMF.

Increasing to (nearly) all Wikipedias within about six months means we'll probably want sufficient capacity for that in the initial request. Or at least, make sure expanding to full capacity is doable in that time frame.

I'm still curious how much downtime we feel we could tolerate and how automated we feel the ingestion would need to be for the first iteration. If we can tolerate enough downtime to load a new dataset (and potentially roll back in case of errors) then we only need to be able to store the data once. If we need very minimal downtime, then we'll need sufficient capacity for snapshots. So our total capacity will be at least 2x the working data size. More if we anticipate needing variants for things like A/B testing in the first iteration.

I'm assuming it wouldn't be that difficult, technically speaking, to make the service recognize that an ingestion was occurring and return an informative message to the client, which could then pass it along to the user in whatever way made the most sense. Requiring the service to be able to respond normally during an ingestion raises the bar in terms of both capacity and performance requirements. I don't know if that's acceptable from a product perspective, though.
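
As a very rough illustration of the "fail gracefully" idea (the route, flag file, and response shape below are made up for this sketch, not the actual Image Suggestion API):

# Sketch: if an ingestion-in-progress flag is set, answer with 503 plus a
# Retry-After header instead of querying the half-loaded dataset.
import os
from flask import Flask, jsonify

app = Flask(__name__)
INGESTION_FLAG = "/tmp/image-suggestions-ingestion-in-progress"  # placeholder

@app.route("/image-suggestions/<wiki>/<int:page_id>")
def suggestions(wiki, page_id):
    if os.path.exists(INGESTION_FLAG):
        resp = jsonify({"error": "dataset is being reloaded, try again later"})
        resp.status_code = 503
        resp.headers["Retry-After"] = "600"
        return resp
    # ... normal lookup against the database would go here ...
    return jsonify({"wiki": wiki, "page_id": page_id, "suggestions": []})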

We will, of course, want to provide our best estimate on requests per [day|hour|insert-time-frame]. I'm guessing we'll be okay there - the fact that the dataset is read-only outside ingestion periods helps a lot - but it is reasonable that we'll need to supply a number with our request. @kostajh, @Tgr, any insights based on your experience with "add a link"?

The Add Link API is used differently so the load it gets won't be informative. We'll request the Image Suggestion API when starting a new suggested edit session, which is roughly equivalent to analytics/legacy/homepagemodule events with action=se-task-click and a link-recommendation task type associated with them, or analytics/legacy/newcomertask events with action=postedit-task-click and the same task type. Here are some stats for that from September:

SELECT
    COUNT(*) count
FROM event.homepagemodule hpm
    LEFT JOIN event.newcomertask nt
        ON STR_TO_MAP(hpm.event.action_data, ';', '=')['newcomerTaskToken'] = nt.event.newcomer_task_token
            AND nt.year=hpm.year
            AND nt.month=hpm.month
WHERE
    hpm.event.module = 'suggested-edits'
    AND hpm.event.action = 'se-task-click'
    AND nt.event.task_type = 'link-recommendation'
    AND hpm.year=2021
    AND hpm.month=9
;

count
4443


SELECT
    COUNT(*) count
FROM event.helppanel hp
    LEFT JOIN event.newcomertask nt
        ON STR_TO_MAP(hp.event.action_data, ';', '=')['newcomerTaskToken'] = nt.event.newcomer_task_token
            AND nt.year=hp.year
            AND nt.month=hp.month
WHERE
    hp.event.action = 'postedit-task-click'
    AND nt.event.task_type = 'link-recommendation'
    AND hp.year=2021
    AND hp.month=9
;

count
5196

So something like 320 requests per day, with the feature enabled on 13 wikis (ar, bn, cs, de, es, fa, fr, hu, nl, pl, ro, ru, vi).

I think a reasonable predictor for API usage is the number of edits done by user accounts that are less than 30 days old. That's 95K for the wikis listed above, and 360K for all Wikipedias. So I would guesstimate something like 1200 requests per day after all deployments are done. This doesn't take into account major changes to the feature such as T240513: Newcomer tasks: entry point in reading experience, which are hard to predict.
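
For anyone checking, the back-of-the-envelope arithmetic behind those figures (monthly event counts from the queries above, scaled by the ratio of new-account edit counts):

# ~320/day today, ~1200/day after expanding to all Wikipedias
homepage_clicks = 4443          # se-task-click, September 2021
postedit_clicks = 5196          # postedit-task-click, September 2021
per_day_current = (homepage_clicks + postedit_clicks) / 30   # ~321

edits_current_wikis = 95_000    # edits by accounts <30 days old, current 13 wikis
edits_all_wikipedias = 360_000  # same metric, all Wikipedias
per_day_all = per_day_current * edits_all_wikipedias / edits_current_wikis
print(round(per_day_current), round(per_day_all))            # 321 1218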

A drive-by comment, since T280042 was mentioned - the conversation in that task moved away from MySQL and towards Cassandra (CC'ing @gmodena).

Thanks for the heads up @LSobanski.

We are indeed prototyping a storage/serving solution for ImageMatching data predicated on Cassandra. There's some initial discussion re datasets and schemas at https://phabricator.wikimedia.org/T293808.