
Data dumps for the MachineVision extension
Closed, Resolved · Public

Description

Please set up the following tables for mysqldumping:

  • machine_vision_provider
  • machine_vision_image
  • machine_vision_label
  • machine_vision_suggestion
  • machine_vision_freebase_mapping

All are present only on commonswiki and (the soon-to-be-deleted) testcommonswiki.

Event Timeline

@Mholloway I was thinking that an appropriate dump would have the following schema; what do you think?

+-------------------------------------------+-----------------+-------------+----------+
| image_file_title                          | wikidata_id     | freebase_id | accepted |
+-------------------------------------------+-----------------+-------------+----------+
| File:Women_at_work,_Gujarat_(cropped).jpg | Q5113           |   /m/01000j |     true |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q31528          |   /m/01000j |     true |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q31455          |   /m/01000j |    false |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q316342         |   /m/01000j |     true |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q212771         |   /m/01000j |    false |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q13360514       |   /m/01000j |     true |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q43238          |   /m/01000j |     true |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q241741         |   /m/01000j |    false |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q25978          |   /m/01000j |    false |
+-------------------------------------------+-----------------+-------------+----------+

Would it be desirable to output unmapped freebase_ids as well?
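
For illustration, something like the query below could produce rows in that shape. This is an untested sketch only: the mvl_*/mvi_*/mvfm_* column names, and deriving "accepted" from a review flag, are assumptions based on the extension's naming conventions, not the confirmed schema.

  -- Hypothetical reconstruction of the proposed dump rows.
  SELECT CONCAT('File:', img.img_name)           AS image_file_title,
         mvl.mvl_wikidata_id                     AS wikidata_id,
         mvfm.mvfm_freebase_id                   AS freebase_id,
         IF(mvl.mvl_review > 0, 'true', 'false') AS accepted  -- assumed review flag
  FROM machine_vision_label mvl
  JOIN machine_vision_image mvi
    ON mvi.mvi_id = mvl.mvl_mvi_id
  JOIN image img
    ON img.img_sha1 = mvi.mvi_sha1
  LEFT JOIN machine_vision_freebase_mapping mvfm
    ON mvfm.mvfm_wikidata_id = mvl.mvl_wikidata_id;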

My presumption would be that we'd dump the machine_vision_label, machine_vision_suggestion, and machine_vision_provider tables in their entirety with little to no filtering or transforming, but I'm not sure what's usually done.

We're not currently storing Google entity IDs that can't be mapped to Wikidata IDs upon receipt.

From the linked draft doc, it looks like XML is the prescribed format for extension data dumps.

Ping @ArielGlenn as a heads-up that this is something we're thinking about.

Mholloway renamed this task from Figure out data dumps to Figure out data dumps for the MachineVision extension. Oct 29 2019, 3:29 PM

The easy option is just to mysqldump the tables, in which case I add the specific tables to the list of tables we dump (see: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/snapshot/files/dumps/table_jobs.yaml ) and that's that.

This has the advantage of dumping only IDs and scores, or in the worst case a Q-id number, with no fields that could contain problematic data. For example, if we dumped image titles: titles are user-created, so a bad actor could create an image whose title contains someone's personally identifying information, vulgarities, etc., which we would not want to dump.

It's also much quicker; XML processing is done entry by entry and is decidedly slow in comparison.

I have a few questions we'd want answered, though, before deciding this is the way to go.

  • How big would these tables be after a year, after 3 years, for the biggest wikis (enwiki, wikidatawiki, commonswiki, dewiki)? How about across all wikis?
  • When an image file page is deleted or moved without redirect, what happens to corresponding entries in the tables?
  • When a Wikidata item is deleted or moved without redirect, what happens to entries in the tables for which this item was used as a 'depicts' suggestion?

The main impetus for doing this is to provide data to Google to allow them to quantify the usefulness of the labels to the projects and the effect on Google Search discoverability of labeled images, in the event that they provide us with free Cloud Vision credits. That said, I don't know where that discussion currently stands, so we may or may not be going forward with this. In the meantime, I've answered your questions below.

The easy option is just to mysqldump the tables, in which case I add the specific tables to the list of tables we dump (see: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/snapshot/files/dumps/table_jobs.yaml ) and that's that.

This has the advantage of dumping only IDs and scores, or in the worst case a Q-id number, with no fields that could contain problematic data. For example, if we dumped image titles: titles are user-created, so a bad actor could create an image whose title contains someone's personally identifying information, vulgarities, etc., which we would not want to dump.

It's also much quicker; XML processing is done entry by entry and is decidedly slow in comparison.

Ideally we'd go that route. Specifically, we'd want to dump machine_vision_provider, machine_vision_image, machine_vision_label, and machine_vision_suggestion. Images are identified in machine_vision_image by SHA1 rather than title (and users are identified by numeric ID), so we shouldn't have trouble of the kind you mentioned.

That said, my understanding is that we'll have to include information that allows associating machine_vision_image rows with image titles and/or public (upload.wikimedia.org) URLs, which will require pulling in data from other sources (the image table, or wherever upload.wikimedia.org URLs are stored).

I have a few questions we'd want answered, though, before deciding this is the way to go.

  • How big would these tables be after a year, after 3 years, for the biggest wikis (enwiki, wikidatawiki, commonswiki, dewiki)? How about across all wikis?

The extension is active only on commonswiki, and that's the only place it's expected to be activated (aside from testcommonswiki, which will be going away, possibly this month). The current sizes of these tables, after loading an initial batch of label suggestions for the feature launch, are as follows:

  • machine_vision_provider is currently a single row and expected to stay in the single digits for the foreseeable future. Its size is negligible.
  • machine_vision_image is currently 225,947 rows and 17.30 MB (~80.3 bytes/row).
  • machine_vision_label is currently 1,836,575 rows and 94.43 MB (~53.9 bytes/row).
  • machine_vision_suggestion is currently 1,836,575 rows and 38.31 MB (~21.9 bytes/row).

Going forward, the tables will grow as labels are continuously requested for newly uploaded bitmap images. From December 2018 through November 2019, bitmap images were uploaded at a rate of approximately 500,000/mo., with an average of 8.12 labels suggested per image. Assuming that the bytes/row for each table and the current rate of growth remain constant, the projected sizes of each table after one and three years are:

+----------------------------+-------------+--------------+--------------+--------------+
| table                      | rows (1 yr) | size (1 yr)  | rows (3 yrs) | size (3 yrs) |
+----------------------------+-------------+--------------+--------------+--------------+
| machine_vision_image       |   6,225,947 |    476.78 MB |   18,225,947 |  1,395.74 MB |
| machine_vision_label       |  50,556,575 |  2,598.76 MB |  147,996,575 |  7,607.47 MB |
| machine_vision_suggestion  |  50,556,575 |  1,055.90 MB |  147,996,575 |  3,090.98 MB |
+----------------------------+-------------+--------------+--------------+--------------+
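
To show the arithmetic, the one-year projections above work out as follows (sizes computed with 1 MB = 2^20 bytes, consistent with the current figures):

  images (1 yr): 225,947 + (12 mo. × 500,000/mo.)  =  6,225,947 rows
                 6,225,947 × 80.3 bytes ÷ 2^20     ≈  476.8 MB
  labels (1 yr): 1,836,575 + (12 × 500,000 × 8.12) = 50,556,575 rows
                 50,556,575 × 53.9 bytes ÷ 2^20    ≈  2,598.8 MB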

A couple of notes:

  1. There is currently a 1:1 relationship between the rows in machine_vision_label and those in machine_vision_suggestion. In the event that another machine vision provider is added and labels are requested from both upon each upload, machine_vision_suggestion would then grow more quickly than machine_vision_label, as only one row would be recorded in machine_vision_label for, say, the label "house cat" (Q146) for File:Foo.jpg, but a separate row would be added to machine_vision_suggestion for each provider that suggested that label.
  2. Depending on the popularity/usage of the feature, we may run an additional one-time import of labels for files used on two or more non-Commons wikis. There are currently approximately 2 million of these, so that would mean a one-time increase of 2m rows (153.16 MB) to machine_vision_image, and 16.24m rows to machine_vision_label and machine_vision_suggestion (834.79 MB and 339.18 MB, respectively).
  • When an image file page is deleted or moved without redirect, what happens to corresponding entries in the tables?

The entry for the image is deleted from machine_vision_image, and all entries in the other tables associated with the image are also deleted.

  • When a Wikidata item is deleted or moved without redirect, what happens to entries in the tables for which this item was used as a 'depicts' suggestion?

At present, nothing happens to these entries.

Thanks for these updates. The sizes look quite reasonable, even allowing for unexpected growth.

Image URLs aren't stored in any table, so we wouldn't be able to provide them. How are suggestions currently associated with specific images? Where's the reference to an image ID?

Even if Google doesn't use these files in the end, they may be valuable to researchers, so it will stay on my todo list unless the entire MachineVision approach is scrapped.

Thanks for these updates. The sizes look quite reasonable, even allowing for unexpected growth.

Image URLs aren't stored in any table, so we wouldn't be able to provide them. How are suggestions currently associated with specific images? Where's the reference to an image ID?

Each row in machine_vision_label and machine_vision_suggestion is associated with a row ID from machine_vision_image representing a single image, and the image SHA1 stored there identifies the image in the image table by img_sha1. (We are using SHA1 digests rather than titles to identify images in order to avoid duplicate files getting different label suggestions and votes.)

Even if Google doesn't use these files in the end, they may be valuable to researchers, so it will stay on my todo list unless the entire MachineVision approach is scrapped.

Sounds good to me!

Sorry, slight correction: every machine_vision_label entry is associated with a machine_vision_image row, and every machine_vision_suggestion row is in turn associated with a machine_vision_label row. But the point is that everything is ultimately associated with an SHA1 that can be looked up in the image table.
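
In SQL terms, the chain would look roughly like this (a sketch only; the mvs_*/mvl_*/mvi_* column names are assumptions based on the extension's prefix conventions):

  -- Resolve each suggestion back to an image title via the SHA1.
  SELECT img.img_name, mvl.mvl_wikidata_id, mvs.mvs_provider_id
  FROM machine_vision_suggestion mvs
  JOIN machine_vision_label mvl ON mvl.mvl_id = mvs.mvs_mvl_id
  JOIN machine_vision_image mvi ON mvi.mvi_id = mvl.mvl_mvi_id
  JOIN image img                ON img.img_sha1 = mvi.mvi_sha1;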

Right, the sha1 column is indexed then? I hadn't bothered to check that. We dump the image table in any case so that would be available a couple of times a month, not exactly matching the machine vision dumps but close enough.

Yep, the sha1 column is indexed. Good point about the image table getting dumped separately—that should be fine.
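
For the record, that's easy to verify on a replica (mvi_sha1 is the assumed name of the extension-side SHA1 column):

  SHOW INDEX FROM image WHERE Column_name = 'img_sha1';
  SHOW INDEX FROM machine_vision_image WHERE Column_name = 'mvi_sha1';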

Mholloway renamed this task from Figure out data dumps for the MachineVision extension to Data dumps for the MachineVision extension. Dec 12 2019, 2:56 PM

@Cparle Could you sync up with Ariel to get these CAT dumps finalized? Thanks! 😄

The easy option is just to mysqldump the tables, in which case I add the specific tables to the list of tables we dump (see: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/snapshot/files/dumps/table_jobs.yaml ) and that's that.

I think this is all that needs to happen.

Shall I add these as a weekly run?

That works for me! 😺

Shall I add these as a weekly run?

Change 573351 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] weekly dump of machine vision tables from commonswiki

https://gerrit.wikimedia.org/r/573351

I need to have one more look at the tables to be sure there's nothing in there we don't want public, and check the puppet manifests too before they go live.

Because these tables are only on commonswiki, it makes sense to do them as a separate job instead of folding them into the regular XML/SQL dumps, which run across all wikis.

With any luck, these will go live by the end of the week and next week sometime we'll have the first output for checking. If it looks good we can figure out what to put on the web page https://dumps.wikimedia.org/other/ to advertise them.

Adding @Reedy (please feel free to remove yourself and point me to someone else if you think that's better) to give a final thumbs up/down on making the contents of these tables public. They look fine to me but I'd rather go by the book.

https://www.mediawiki.org/wiki/Extension:MachineVision/Schema/machine_vision_freebase_mapping for what's in them.

I ran a little test of the new dump job and it worked just fine from the command line. Output files are in a temporary location only available on dumpsdata1001 for now.

@Reedy do we have a rough ETA on this? Our friends at Google keep asking. Thanks!

Sorry for the delay.

Is there any reason machine_vision_safe_search isn't being dumped too (for completeness, i.e., dumping everything)? I don't see anything particularly sensitive there. Or is it some sort of "private" data from the classifier as to how it ranks the images? I ask mostly for clarity/documentation purposes, especially if someone comes along later and wants to know why it's not being dumped, or not in the cloud replicas, etc.

I note that mvl_uploader_id and mvl_reviewer_id don't seem to be exposed anywhere by the extension, but the information isn't "private": you can work out who uploaded an image (and then eventually get their user ID; it's exposed in the API etc.), and from Special:Contributions and page history you can see who added which machine-assisted claims.

The same goes for the parent task: this is fine to go ahead, as is T238574: Create wiki replica views for MachineVision extension tables.

Edit:

It's noted from reading T238574 (Create wiki replica views for MachineVision extension tables):

Please create the views necessary to expose the MachineVision extension tables to the wiki replicas. Thanks!

That page lists machine_vision_safe_search too, but doesn't say it shouldn't be exposed on replicas

So that inconsistency does need clarifying and sorting out

I'm happy to dump the machine_vision_safe_search table too if folks want it. Is the output liable to be useful at all to downloaders?

I'm happy to dump the machine_vision_safe_search table too if folks want it. Is the output liable to be useful at all to downloaders?

Would need to be answered by Ramsey (though it would sound like it's giving Google back the data it gave us ;)) or Michael

Mostly, it's the discrepancy between the two requests. Either we should include it in both, or in neither, IMHO :)

(though it would sound like it's giving Google back the data it gave us ;))

Yeah, this was why I omitted it. But I think dumping it too is fine.

(though it would sound like it's giving Google back the data it gave us ;))

Yeah, this was why I omitted it. But I think dumping it too is fine.

Dumping it keeps things consistent, and then we don't need to document why it's not dumped but is exposed in labs, etc.

I'm gonna merge and deploy the updated patch by tomorrow morning if I hear no objections. That's EET morning, so get your complaints in now!

Thanks, everyone! Are the dumps available for download now? (and if so, where? 😺 )

Unfortunately not yet.

Change 573351 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] weekly dump of machine vision tables from commonswiki

https://gerrit.wikimedia.org/r/573351

Hasn't been merged

Change 573351 merged by ArielGlenn:
[operations/puppet@production] weekly dump of machine vision tables from commonswiki

https://gerrit.wikimedia.org/r/573351

Yeah, I gave folks an extra couple of days just in case, since the cron job runs on Saturday. Expect the first round of dumps then.

Change 587709 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] make sure machine vision dumps dir is created on all dumps servers

https://gerrit.wikimedia.org/r/587709

Change 587709 merged by ArielGlenn:
[operations/puppet@production] make sure machine vision dumps dir is created on all dumps servers

https://gerrit.wikimedia.org/r/587709

@Mholloway care to give me a one-line description that I can use for the index.html page mentioned above? See the existing page for examples. I could announce it on the xmldatadumps-l mailing list afterwards, unless you care to do the honours.

@ArielGlenn How about "MachineVision extension tables"? Keeps it simple.

Change 589561 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] add machine vision tables dump to the web page for 'other' dumps

https://gerrit.wikimedia.org/r/589561

Change 589561 merged by ArielGlenn:
[operations/puppet@production] add machine vision tables dump to the web page for 'other' dumps

https://gerrit.wikimedia.org/r/589561

This is done now. Feel free to announce it wherever you like. After I have sent mail to the xmldatadumps-l list, I will close this task unless there is something else you need.