Context
The Image Suggestion API needs to move to production, which will require changing the database from SQLite to MySQL.
Acceptance Criteria
- Convert API to use MySQL
- Convert SQLite data to MySQL dump format
- Coordinate with a DBA to do a one-time import
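As a rough sketch of the conversion step, assuming simple table schemas and a hypothetical SQLite file, something like this could emit MySQL-compatible INSERT statements (the CREATE TABLE statements would still be ported by hand, since type names differ between the two engines):

```python
import sqlite3

def sqlite_to_mysql_dump(sqlite_path, out):
    """Emit MySQL-compatible INSERT statements for every table in an
    SQLite database. Illustrative sketch only: it handles NULLs,
    numbers, and text, but not BLOBs or exotic types."""
    conn = sqlite3.connect(sqlite_path)
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
    for (table,) in cur.fetchall():
        rows = conn.execute(f"SELECT * FROM `{table}`")
        cols = [d[0] for d in rows.description]
        col_list = ", ".join(f"`{c}`" for c in cols)
        for row in rows:
            vals = ", ".join(
                "NULL" if v is None
                else str(v) if isinstance(v, (int, float))
                # Escape backslashes and double single quotes for MySQL.
                else "'" + str(v).replace("\\", "\\\\").replace("'", "''") + "'"
                for v in row
            )
            out.write(f"INSERT INTO `{table}` ({col_list}) VALUES ({vals});\n")
    conn.close()
```

The resulting file could then be handed to the DBA for the one-time import.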
We need to submit a request to the SRE Data Persistence team for the MariaDB or MySQL instance. More info on what we should ask for here: https://wikitech.wikimedia.org/wiki/MariaDB#Database_creation_template
FWIW there is T280042: New database request: image_matching, although that task may specify more traffic and storage than we need in the short-to-medium term (I'm not sure about other stakeholders' needs).
As @kostajh mentioned, T288042: Create ImportFailure concept contained some broad estimates, because it attempted to anticipate how the current PoC values might scale to production.
@sdkim, @MMiller_WMF, what does our roadmap look like now with regard to when/if the PoC might scale up to include all 300+ wikis? Can we simplify this request by having more modest goals that might not meet eventual full-deployment requirements, but which look far enough ahead to be worth our colleagues' time? We don't want to request something that won't satisfy our medium-term needs, as we'd just have to re-request. But if we're still in a somewhat experimental phase with limited anticipated traffic and a limited number of wikis, over-requesting might cost more time and money than necessary.
Some of the discussion in the previous ticket surrounded ingestion and snapshots. Do we have ambitious production-grade SLA requirements at this stage, or can some of those be relaxed? Specifically, can we tolerate some downtime, on the order of minutes or even hours, while data is being ingested and any ingestion errors are manually resolved? One option is to make the service fail gracefully during ingestion.
If our goals in the relevant timeframe include working out details of the data pipeline, relaxing those things would not be helpful. But if our goals are more narrowly focused around things like reader/editor response to image suggestions and perceived quality of suggestions, maybe a simpler storage request would suffice.
@BPirkle -- thank you for working on this. I can give some info from my perspective, and @kostajh or @Tgr can speak to the ingestion, pipeline, and SLA questions.
Thanks @MMiller_WMF .
Increasing to (nearly) all Wikipedias within about six months means we'll probably want sufficient capacity for that in the initial request. Or at least, make sure expanding to full capacity is doable in that time frame.
I'm still curious how much downtime we feel we could tolerate and how automated we feel the ingestion would need to be for the first iteration. If we can tolerate enough downtime to load a new dataset (and potentially roll back in case of errors) then we only need to be able to store the data once. If we need very minimal downtime, then we'll need sufficient capacity for snapshots. So our total capacity will be at least 2x the working data size. More if we anticipate needing variants for things like A/B testing in the first iteration.
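To make the capacity question concrete, here's a back-of-envelope sketch (the 50 GB working-set figure is purely hypothetical): one live copy, optionally a second copy for snapshot/staged ingestion, plus any dataset variants.

```python
def required_capacity_gb(working_set_gb, keep_snapshot=True, variants=0):
    """Back-of-envelope storage estimate: the live copy, plus an
    optional second copy for snapshot/staged ingestion, plus any
    dataset variants (e.g. for A/B testing). Illustrative only."""
    copies = 1 + (1 if keep_snapshot else 0) + variants
    return working_set_gb * copies

print(required_capacity_gb(50))              # 100 (at least 2x working size)
print(required_capacity_gb(50, variants=1))  # 150
```

If we can tolerate downtime and load in place, `keep_snapshot=False` drops the estimate back to 1x.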
I'm assuming it wouldn't be that difficult, technically speaking, to make the service recognize that an ingestion was occurring and return an informative message to the client, which could then pass it along to the user in whatever way made the most sense. Requiring the service to be able to respond normally during an ingestion raises the bar in terms of both capacity and performance requirements. I don't know if that's acceptable from a product perspective, though.
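A minimal sketch of that idea, with hypothetical names (the real service's framework and flag mechanism would differ; the flag might live in a status table or lock file): check an ingestion flag before serving and return a 503 with a machine-readable error, so the client can show the user a friendly message and retry later.

```python
import json
import threading

# Hypothetical flag; the real service would set this while a dataset
# load is in progress (e.g. via a lock file or a status-table row).
ingestion_in_progress = threading.Event()

def handle_suggestions_request(wiki, page_id):
    """Return (status, body) for an image-suggestion lookup,
    failing gracefully while data is being ingested."""
    if ingestion_in_progress.is_set():
        return 503, json.dumps({
            "error": "dataset_refresh_in_progress",
            "message": "Image suggestions are temporarily unavailable "
                       "while a new dataset is loaded. Please retry later.",
            "retry_after_seconds": 300,
        })
    # Normal read-only lookup against the suggestion tables goes here.
    return 200, json.dumps({"wiki": wiki, "page_id": page_id,
                            "suggestions": []})
```

In a real HTTP service the 503 would also carry a Retry-After header.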
We will, of course, want to provide our best estimate of requests per [day|hour|insert-time-frame]. I'm guessing we'll be okay there - the fact that the dataset is read-only outside ingestion periods helps a lot - but it is reasonable that we'll need to supply a number with our request. @kostajh, @Tgr, any insights based on your experience with "add a link"?
The Add Link API is used differently, so the load it gets won't be informative. We'll request the Image Suggestion API when starting a new suggested-edit session, which is roughly equivalent to analytics/legacy/homepagemodule events with action=se-task-click and a link-recommendation task type associated with them, or analytics/legacy/newcomertask events with action=postedit-task-click and the same task type. Here are some stats for that from September:
SELECT COUNT(*) count
FROM event.homepagemodule hpm
LEFT JOIN event.newcomertask nt
    ON STR_TO_MAP(hpm.event.action_data, ';', '=')['newcomerTaskToken'] = nt.event.newcomer_task_token
    AND nt.year = hpm.year AND nt.month = hpm.month
WHERE hpm.event.module = 'suggested-edits'
    AND hpm.event.action = 'se-task-click'
    AND nt.event.task_type = 'link-recommendation'
    AND hpm.year = 2021 AND hpm.month = 9;
-- count: 4443

SELECT COUNT(*) count
FROM event.helppanel hp
LEFT JOIN event.newcomertask nt
    ON STR_TO_MAP(hp.event.action_data, ';', '=')['newcomerTaskToken'] = nt.event.newcomer_task_token
    AND nt.year = hp.year AND nt.month = hp.month
WHERE hp.event.action = 'postedit-task-click'
    AND nt.event.task_type = 'link-recommendation'
    AND hp.year = 2021 AND hp.month = 9;
-- count: 5196
So something like 320 requests per day, with the feature enabled on 13 wikis (ar, bn, cs, de, es, fa, fr, hu, nl, pl, ro, ru, vi).
I think a reasonable predictor for API usage is the number of edits made by user accounts that are less than 30 days old. That's 95K for the wikis listed above, and 360K for all Wikipedias. So I would guesstimate something like 1200 requests per day after all deployments are done. This doesn't take into account major changes to the feature, such as T240513: Newcomer tasks: entry point in reading experience, which are hard to predict.
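For the record, that guesstimate just scales the observed traffic by the ratio of newcomer edits, using the numbers quoted above (a sanity check, not a forecast):

```python
# Scale the observed request volume by the ratio of edits from
# accounts under 30 days old (figures quoted in the comments above).
requests_per_day_13_wikis = 320
newcomer_edits_13_wikis = 95_000
newcomer_edits_all_wikipedias = 360_000

projected = requests_per_day_13_wikis * (
    newcomer_edits_all_wikipedias / newcomer_edits_13_wikis
)
print(round(projected))  # 1213, i.e. roughly the 1200/day above
```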
Thanks for the heads up @LSobanski.
We are indeed prototyping a storage/serving solution for ImageMatching data based on Cassandra. There's some initial discussion of datasets and schemas at https://phabricator.wikimedia.org/T293808.