
Define DB table schemas for MV depicts suggestions
Closed, ResolvedPublic

Description

Memorialize schemas for MV-generated depicts suggestions on-wiki to enable DBA review.

Event Timeline

Mholloway created this task.Jul 5 2019, 5:22 PM
Restricted Application added a subscriber: Aklapper.Jul 5 2019, 5:22 PM
Tgr added a subscriber: Tgr.Jul 19 2019, 11:11 AM

I suppose the plan is https://www.mediawiki.org/wiki/Readers/NSFW_image_filter/Storage_notes ? Some comments on that:

  • At least for the NSFW part, there doesn't seem to be any point in limiting it to Commons (vandals certainly aren't limited to it, and I don't think it would simplify much), so you'd need a wiki-id column.
  • Files get renamed; if the service uses file names as primary keys it'll have to track that in its own DB.
  • Files can have multiple versions. The service would have to discard scores when new versions are uploaded, or include the upload timestamp in the key.

I wonder if it's easier to use the sha1 as a primary key. It makes lookups (both by and to the service) slightly more complicated, but images are indexed by sha1 so performance-wise it's not a problem, and sha1s are immutable so you'd avoid having to deal with all of the above issues. The allimages API can look up images by sha1, in MediaWiki the sha1 is available from the File object, and files can be retrieved by sha1 from Repo/RepoGroup, so no new functionality would be needed as far as I can see. (Searching old versions of images by sha1 is theoretically possible but not currently exposed; but I don't think there's a use case for it here.)
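A minimal sketch of the sha1-keyed approach, using sqlite3 for illustration. All table and column names here are hypothetical, not the deployed schema:

```python
import sqlite3

# Hypothetical schema sketch: label suggestions keyed by the image's sha1,
# which is immutable across renames and distinguishes file versions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mv_image_label (
        mvil_image_sha1 TEXT NOT NULL,   -- sha1 of the file content
        mvil_label      TEXT NOT NULL,   -- suggested depicts item, e.g. a Wikidata Q-id
        mvil_provider   TEXT NOT NULL,   -- which machine vision service produced it
        mvil_confidence REAL,            -- provider-reported score, if any
        PRIMARY KEY (mvil_image_sha1, mvil_label, mvil_provider)
    )
""")

# A rename or re-upload changes the file name but not the sha1 of a given
# version, so existing suggestion rows stay valid without extra bookkeeping.
conn.execute(
    "INSERT INTO mv_image_label VALUES (?, ?, ?, ?)",
    ("0123abcd0123abcd0123abcd0123abcd01234567", "Q146", "google", 0.92),
)
rows = conn.execute(
    "SELECT mvil_label FROM mv_image_label WHERE mvil_image_sha1 = ?",
    ("0123abcd0123abcd0123abcd0123abcd01234567",),
).fetchall()
print(rows)  # [('Q146',)]
```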

(+@brion, @MusikAnimal for continuity with previous discussions)

@Tgr I think @brion's intent was for the table(s) to be on wiki-local DBs, meaning a wiki ID column wouldn't be necessary. That said, I'm not sure what the advantage would be of keeping the tables local rather than putting them in the shared DB.

I personally favor the idea of using sha1 as the primary key. I was worried (particularly for the AbuseFilter use case) about a possible lookup performance penalty for converting file names to sha1 values, but it sounds like that shouldn't be an issue.

Tgr added a comment.Jul 19 2019, 3:21 PM

Ah, OK, I thought this would be a standalone webservice with its own storage. If it's a MediaWiki extension, following file moves / reuploads is easier. Wrt performance, if you want to be able to look it up just from the wikitext that contains the image names, sha1 would be slower (although then per-wiki tables would not be ideal either; you'd probably want a shared table with name + wiki ID as key so you can check local + Commons with a single lookup). If the image record is to be loaded anyway, then sha1 doesn't make any difference.
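The shared-table idea above can be sketched as follows; the table and column names are invented for illustration:

```python
import sqlite3

# Sketch (hypothetical names): a shared table keyed by (wiki, name) so a
# wikitext-driven lookup can check the local wiki and Commons in one query.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mv_label_by_name (
        mvln_wiki  TEXT NOT NULL,  -- wiki ID, e.g. 'enwiki' or 'commonswiki'
        mvln_name  TEXT NOT NULL,  -- file name as it appears in wikitext
        mvln_label TEXT NOT NULL,
        PRIMARY KEY (mvln_wiki, mvln_name, mvln_label)
    )
""")
conn.executemany("INSERT INTO mv_label_by_name VALUES (?, ?, ?)", [
    ("commonswiki", "Cat.jpg", "Q146"),
    ("enwiki", "Local_dog.jpg", "Q144"),
])

# One SELECT covers both repositories for a name found in enwiki wikitext:
rows = conn.execute(
    "SELECT mvln_wiki, mvln_label FROM mv_label_by_name "
    "WHERE mvln_name = ? AND mvln_wiki IN (?, ?)",
    ("Cat.jpg", "enwiki", "commonswiki"),
).fetchall()
print(rows)  # [('commonswiki', 'Q146')]
```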

Added draft table schemas at https://www.mediawiki.org/wiki/Extension:MachineVision/Schema and subpages.

+@Eevans, @daniel: Adam suggested that I ping you to have a look at the storage/lookup architecture for machine vision metadata, so please see schemas and related discussion above. Thanks!

Mholloway triaged this task as Normal priority.Jul 19 2019, 3:52 PM
Tgr added a comment.Jul 19 2019, 4:15 PM

Do we expect depict qualifiers to be relevant here? E.g. P2677 is something I'd expect most machine vision services to be able to return.

Good point. Relative position is something we are hoping providers will return along with labels.


Do you have an estimate of how many rows each of these tables will have, which exact queries (selects, inserts, updates) will be run against them, and how often each such query would be run?

Also, how stable do you assume this schema to be? How likely is it to change in, say, a year or two?

Mholloway renamed this task from Define DB table schemas for machine vision metadata to Define DB table schemas for MV depicts suggestions.Jul 24 2019, 6:48 PM
Mholloway updated the task description.
Tgr added a comment.Jul 30 2019, 9:24 AM

First shot at the schema. (This assumes one vote per label; changing that would require another table. Quite possibly we'll want another table to better track votes, anyway.)

> Do you have an estimate of how many rows each of these tables will have,

One per image label. An image will probably have dozens of labels, and I don't think there is a reason we wouldn't want to label most images eventually.
Product is interested in preserving all label records for future use in training our own classifier; there is no particular reason to do that in production though. I'm not sure what exactly the alternatives are (filed T229314: Decide where to store past labeling data about that), but I imagine it should be possible to keep the table size below some arbitrary limit and ship data about reviewed image labels to some analytics data store.

So the theoretical maximum is something like 50M * 20 ~= 1B rows, with a 10%-ish yearly increase; the actual maximum depends on how we decide to store the data once it only functions as a log of past reviews.
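The back-of-envelope arithmetic above, written out:

```python
# Rough capacity estimate from the discussion: ~50M files on Commons,
# dozens of labels per image (20 used here as a round figure).
images = 50_000_000
labels_per_image = 20

rows = images * labels_per_image
print(rows)  # 1000000000, i.e. ~1B rows

# ~10% yearly increase in the file count implies roughly:
yearly_growth = int(rows * 0.10)
print(yearly_growth)  # ~100M new rows per year
```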
The initial focus will be high-quality images tagged via some community process (250K-ish) and images used on other wikis (millions?).

A side effect will be an increase in mediainfo data, as these suggestions will be reviewed and presumably most of them turned into actual labels. I assume that was already accounted for when planning for SDC storage.

> and which exact queries (selects, inserts, updates) will be run against them, and how often each such query would be run?

A multi-row insert (or more if we use multiple machine vision providers) potentially (but not initially) after every file upload, and one per existing file (on a schedule of our choosing). An update for a small number of rows every time a user reviews the labels (I don't think we have a good estimate of how much that will happen; but with a review queue serving different files to different users, it should be contention-free). Possibly a delete instead of an update, or a delete at some later point, depending on T229314. Selects are less clear; filed T229315: Design storage for depicts suggestions queue about that.
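The write patterns described above can be sketched against a hypothetical suggestions table (names and state values are illustrative, not the deployed schema):

```python
import sqlite3

# Hypothetical suggestions table with a review state per row.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mv_image_label (
        mvil_sha1     TEXT NOT NULL,
        mvil_label    TEXT NOT NULL,
        mvil_provider TEXT NOT NULL,
        mvil_state    INTEGER NOT NULL DEFAULT 0,  -- 0=suggested, 1=accepted, 2=rejected
        PRIMARY KEY (mvil_sha1, mvil_label, mvil_provider)
    )
""")

# Multi-row insert after an upload: one row per label, per provider.
suggestions = [
    ("abc123", "Q146", "google", 0),
    ("abc123", "Q42",  "google", 0),
    ("abc123", "Q146", "aws",    0),
]
conn.executemany("INSERT INTO mv_image_label VALUES (?, ?, ?, ?)", suggestions)

# Small, contention-free update when a user reviews one file's labels:
conn.execute(
    "UPDATE mv_image_label SET mvil_state = 1 "
    "WHERE mvil_sha1 = ? AND mvil_label = ?",
    ("abc123", "Q146"),
)
accepted = conn.execute(
    "SELECT COUNT(*) FROM mv_image_label WHERE mvil_state = 1"
).fetchone()[0]
print(accepted)  # 2 (the label was accepted for both providers' rows)
```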

> Also, how stable do you assume this schema to be? How likely is it to change in, say, a year or two?

This is an experimental project done in haste, so pretty likely.

Mholloway closed this task as Resolved.Sep 10 2019, 1:29 AM

This seems like essentially a duplicate of T227355: DBA review for the MachineVision extension. I'm going to close this task in favor of it. Please feel free to chime in there, or reopen this if you think it deserves to be a separate task.