Page MenuHomePhabricator

Design storage for depicts suggestions queue
Closed, ResolvedPublic

Description

The current plan for suggesting depicts claims for structured data on Commons is to have a table with image key (probably sha1) + wikidata item ID, and then provide a review queue from which users can pick an image to review suggestions for.

This means that

  1. it should be possible to query images which have at least one unreviwed suggestion;
  2. presumably the list of those images would have to be sorted in some way (possibly in multiple ways) which makes sense to the user, such as most recently uploaded, most used/viewed, highest quality, my own uploads...
  3. there should be some mechanism to ensure we don't give the same image to many different users in parallel - by having a pop operation, or making the results sufficiently random.

Together these are nontrivial and probably not a great fit for a relational database. We should consider alternatives.

Event Timeline

Possibilities that come to mind:

  • Poor man's approach: sort by sha1. That will probably be part of the primary key of the depicts suggestion table (images do not have proper IDs, and it's more convenient than image name when considering renames and multiple revisions), so in theory everything can fit into a single index and no filesort is needed. (I couldn't get this to work when joining with the image table; maybe because img_sha1 is a non-unique index. Still, making two selects, one from the depicts suggestions table, and another to look up the images by sha1, would still be reasonable.) From the user's POV this would mean images in the queue are basically randomized. Removing from the queue would be managed by changing the review state flag to "under review".
  • Use an actual queue table, run a (slow) query regularly to fill it up with candidates. Copy all the image data to that table which is relevant to sort/search, and index by it (and keep it in sync with the primary data). Pop by deleting from the table. Keeping data in sync would probably be a major pain, plus this would take up extra storage space. As long as the sorting criteria are well defined and not too numerous, the queue can be kept small though - only the first X results for each sort criteria need to be in there. (That would not allow "dynamic" filtering like a given user's images, though.)
    • Use a queue, but use something other than SQL. I don't see any point in that but I'm not too familiar with the potential technologies (Redis? Kafka?) either.
  • Use ElasticSearch. This seems like the method of choice if we need to support multiple or complex sort criteria, or searches; there would have to be an "image has pending depicts suggestions" field on file objects in the ES index, updated as needed; that doesn't seem hard. I'm not sure ES can help avoid multiple people getting the same result; there would have to be some kind of locking mechanism using some different storage (again, does not seem hard).

When discussing the CirrusSearch implementation for SuggestedEdits, Erik B. suggested that he might be willing to implement a random sort order in CirrusSearch, though it would take some testing to gauge any adverse load/performance effects. Then we also get powerful search criteria and near-real-time indexing for free. As things stand today, that seems to me the best option.

Redis is a dream come true for the random suggestions case: we could populate a queue or set at our leisure based on whatever criteria we want, and then pop members on demand in O(1) time. Redis in active-active is possible via dynomite but opsen don't think that's worth the complexity/maintenance overhead. Yet we don't have a good alternative solution for a task queue available, and it comes up constantly. Special treatment for images uploaded by the logged-in user complicates matters a bit, but could probably be worked in somehow.

I do wonder if we could do something clever with Kafka here. The model seems a bit off, but maybe it's something to play with if we get some time after the basics are nailed down.

A queue table in MySQL sounds like a nightmare.

I'm sure the Growth team is thinking about this as well, for newcomer task suggestions. What architecture do they have in mind?

I'm not sure a MySQL queue is much different from other technologies. Popping elements by random is possible if a bit hacky (store a random column). You can search and sort to some extent (which the other queue options don't support, I think) but the exact details of that have to go into the schema design and are hard to change. Keeping duplicated data up to date is a chore (still easier than with Kafka, though), although for the obvious search/sort criteria (upload time, uploader) the details can't change. I think the main argument against it is that the production DB is optimized for high load (which won't really be the case here) so data gets replicated to many servers, making it relatively expensive.

CirrusSearch would be my top pick as well. In theory keeping details up to date is still a lot of work, but that already has to be done to support generic search so we'd be freeriding on that effort. One drawback is development speed: writing ES queries is harder than SQL queries. I'm unsure what the perf limitations are.

I'm sure the Growth team is thinking about this as well, for newcomer task suggestions. What architecture do they have in mind?

Nothing as of yet, still reviewing options, and much more focused on product concerns (which recommendation types actually teach freshly registered editors instead of distracting them?) than architectural ones.

Mholloway assigned this task to Tgr.

We've got the QueryPage setup in place and it seems perfectly workable. If we want to get fancier, we can open a new ticket.

This comment has been deleted.