
Data dumps for the MachineVision extension
Closed, Resolved · Public

Description

Please set up the following tables for mysqldumping:

  • machine_vision_provider
  • machine_vision_image
  • machine_vision_label
  • machine_vision_suggestion
  • machine_vision_freebase_mapping

All are present only on commonswiki and (the soon-to-be-deleted) testcommonswiki.

Event Timeline

@Mholloway I was thinking that an appropriate dump would have the following schema; what do you think?

+-------------------------------------------+-----------------+-------------+----------+
| image_file_title                          | wikidata_id     | freebase_id | accepted |
+-------------------------------------------+-----------------+-------------+----------+
| File:Women_at_work,_Gujarat_(cropped).jpg | Q5113           |   /m/01000j |     true |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q31528          |   /m/01000j |     true |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q31455          |   /m/01000j |    false |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q316342         |   /m/01000j |     true |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q212771         |   /m/01000j |    false |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q13360514       |   /m/01000j |     true |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q43238          |   /m/01000j |     true |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q241741         |   /m/01000j |    false |
| File:Women_at_work,_Gujarat_(cropped).jpg | Q25978          |   /m/01000j |    false |
+-------------------------------------------+-----------------+-------------+----------+

Would it be desirable to output unmapped freebase_ids as well?
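
For illustration, something like the query below could produce rows in that shape. This is an untested sketch only: the mvl_*/mvi_*/mvfm_* column names, and deriving "accepted" from a review flag, are assumptions based on the extension's naming conventions, not the confirmed schema.

  -- Hypothetical reconstruction of the proposed dump rows.
  SELECT CONCAT('File:', img.img_name)           AS image_file_title,
         mvl.mvl_wikidata_id                     AS wikidata_id,
         mvfm.mvfm_freebase_id                   AS freebase_id,
         IF(mvl.mvl_review > 0, 'true', 'false') AS accepted  -- assumed review flag
  FROM machine_vision_label mvl
  JOIN machine_vision_image mvi
    ON mvi.mvi_id = mvl.mvl_mvi_id
  JOIN image img
    ON img.img_sha1 = mvi.mvi_sha1
  LEFT JOIN machine_vision_freebase_mapping mvfm
    ON mvfm.mvfm_wikidata_id = mvl.mvl_wikidata_id;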

My presumption would be that we'd dump the machine_vision_label, machine_vision_suggestion, and machine_vision_provider tables in their entirety with little to no filtering or transforming, but I'm not sure what's usually done.

We're not currently storing Google entity IDs that can't be mapped to Wikidata IDs upon receipt.

From the linked draft doc, it looks like XML is the prescribed format for extension data dumps.

Ping @ArielGlenn as a heads-up that this is something we're thinking about.

Mholloway renamed this task from Figure out data dumps to Figure out data dumps for the MachineVision extension. Oct 29 2019, 3:29 PM

The easy option is just to mysqldump the tables, in which case I add the specific tables to the list of tables we dump (see: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/snapshot/files/dumps/table_jobs.yaml ) and that's that.

This has the advantage of dumping only IDs and scores, or in the worst case a Q-id number, with no fields that could contain problematic data. For example, if we dumped image titles: titles are user-created, so a bad actor could create an image whose title contains someone's personally identifying information, vulgarities, etc., which we would not want to dump.

It's also much quicker; XML processing is done entry by entry and is decidedly slow in comparison.

I have a few questions we'd want answered, though, before deciding this is the way to go.

  • How big would these tables be after a year, after 3 years, for the biggest wikis (enwiki, wikidatawiki, commonswiki, dewiki)? How about across all wikis?
  • When an image file page is deleted or moved without redirect, what happens to corresponding entries in the tables?
  • When a Wikidata item is deleted or moved without redirect, what happens to entries in the tables for which this item was used as a 'depicts' suggestion?

The main impetus for doing this is to provide data to Google to allow them to quantify the usefulness of the labels to the projects and the effect on Google Search discoverability of labeled images, in the event that they provide us with free Cloud Vision credits. That said, I don't know where that discussion currently stands, so we may or may not be going forward with this. In the meantime, I've answered your questions below.

The easy option is just to mysqldump the tables, in which case I add the specific tables to the list of tables we dump (see: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/snapshot/files/dumps/table_jobs.yaml ) and that's that.

This has the advantage of dumping only IDs and scores, or in the worst case a Q-id number, with no fields that could contain problematic data. For example, if we dumped image titles: titles are user-created, so a bad actor could create an image whose title contains someone's personally identifying information, vulgarities, etc., which we would not want to dump.

It's also much quicker; XML processing is done entry by entry and is decidedly slow in comparison.

Ideally we'd go that route. Specifically, we'd want to dump machine_vision_provider, machine_vision_image, machine_vision_label, and machine_vision_suggestion. Images are identified in machine_vision_image by SHA1 rather than title (and users are identified by numeric ID), so we shouldn't have trouble of the kind you mentioned.

That said, my understanding is that we'll have to include information that allows associating machine_vision_image rows with image titles and/or public (upload.wikimedia.org) URLs, which will require pulling in data from other sources (the image table, or wherever upload.wikimedia.org URLs are stored).

I have a few questions we'd want answered, though, before deciding this is the way to go.

  • How big would these tables be after a year, after 3 years, for the biggest wikis (enwiki, wikidatawiki, commonswiki, dewiki)? How about across all wikis?

The extension is active only on commonswiki, and that's the only place it's expected to be activated (aside from testcommonswiki, which will be going away, possibly this month). The current sizes of these tables, after loading an initial batch of label suggestions for the feature launch, are as follows:

  • machine_vision_provider is currently a single row and expected to stay in the single digits for the foreseeable future. Its size is negligible.
  • machine_vision_image is currently 225,947 rows and 17.30 MB (~80.3 bytes/row).
  • machine_vision_label is currently 1,836,575 rows and 94.43 MB (~53.9 bytes/row).
  • machine_vision_suggestion is currently 1,836,575 rows and 38.31 MB (~21.9 bytes/row).

Going forward, the tables will grow as labels are continuously requested for newly uploaded bitmap images. From December 2018 through November 2019, bitmap images were uploaded at a rate of approximately 500,000/mo., with an average of 8.12 labels suggested per image. Assuming that the bytes/row for each table and the current rate of growth remain constant, the projected sizes of each table after one and three years are:

+----------------------------+-------------+--------------+--------------+--------------+
| table                      | rows (1 yr) | size (1 yr)  | rows (3 yrs) | size (3 yrs) |
+----------------------------+-------------+--------------+--------------+--------------+
| machine_vision_image       |   6,225,947 |    476.78 MB |   18,225,947 |  1,395.74 MB |
| machine_vision_label       |  50,556,575 |  2,598.76 MB |  147,996,575 |  7,607.47 MB |
| machine_vision_suggestion  |  50,556,575 |  1,055.90 MB |  147,996,575 |  3,090.98 MB |
+----------------------------+-------------+--------------+--------------+--------------+
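
To show the arithmetic, the one-year projections above work out as follows (sizes computed with 1 MB = 2^20 bytes, consistent with the current figures):

  images (1 yr): 225,947 + (12 mo. × 500,000/mo.)  =  6,225,947 rows
                 6,225,947 × 80.3 bytes ÷ 2^20     ≈  476.8 MB
  labels (1 yr): 1,836,575 + (12 × 500,000 × 8.12) = 50,556,575 rows
                 50,556,575 × 53.9 bytes ÷ 2^20    ≈  2,598.8 MB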

A couple of notes:

  1. There is currently a 1:1 relationship between the rows in machine_vision_label and those in machine_vision_suggestion. In the event that another machine vision provider is added and labels are requested from both upon each upload, machine_vision_suggestion would then grow more quickly than machine_vision_label, as only one row would be recorded in machine_vision_label for, say, the label "house cat" (Q146) for File:Foo.jpg, but a separate row would be added to machine_vision_suggestion for each provider that suggested that label.
  2. Depending on the popularity/usage of the feature, we may run an additional one-time import of labels for files used on two or more non-Commons wikis. There are currently approximately 2 million of these, so that would mean a one-time increase of 2m rows (153.16 MB) to machine_vision_image, and 16.24m rows to machine_vision_label and machine_vision_suggestion (834.79 MB and 339.18 MB, respectively).
  • When an image file page is deleted or moved without redirect, what happens to corresponding entries in the tables?

The entry for the image is deleted from machine_vision_image, and all entries in the other tables associated with the image are also deleted.

  • When a Wikidata item is deleted or moved without redirect, what happens to entries in the tables for which this item was used as a 'depicts' suggestion?

At present, nothing happens to these entries.

Thanks for these updates. The sizes look quite reasonable, even allowing for unexpected growth.

Image URLs aren't stored in any table, so we wouldn't be able to provide them. How are suggestions currently associated with specific images? Where's the reference to an image ID?

Even if Google doesn't use these files in the end, they may be valuable to researchers, so it will stay on my todo list unless the entire MachineVision approach is scrapped.

Thanks for these updates. The sizes look quite reasonable, even allowing for unexpected growth.

Image URLs aren't stored in any table, so we wouldn't be able to provide them. How are suggestions currently associated with specific images? Where's the reference to an image ID?

Each row in machine_vision_label and machine_vision_suggestion is associated with a row ID from machine_vision_image representing a single image, and the image SHA1 stored there identifies the image in the image table by img_sha1. (We are using SHA1 digests rather than titles to identify images in order to avoid duplicate files getting different label suggestions and votes.)

Even if Google doesn't use these files in the end, they may be valuable to researchers, so it will stay on my todo list unless the entire MachineVision approach is scrapped.

Sounds good to me!

Sorry, slight correction: every machine_vision_label entry is associated with a machine_vision_image row, and every machine_vision_suggestion row is in turn associated with a machine_vision_label row. But the point is that everything is ultimately associated with an SHA1 that can be looked up in the image table.
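
In SQL terms, the chain would look roughly like this (a sketch only; the mvs_*/mvl_*/mvi_* column names are assumptions based on the extension's prefix conventions):

  -- Resolve each suggestion back to an image title via the SHA1.
  SELECT img.img_name, mvl.mvl_wikidata_id, mvs.mvs_provider_id
  FROM machine_vision_suggestion mvs
  JOIN machine_vision_label mvl ON mvl.mvl_id = mvs.mvs_mvl_id
  JOIN machine_vision_image mvi ON mvi.mvi_id = mvl.mvl_mvi_id
  JOIN image img                ON img.img_sha1 = mvi.mvi_sha1;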

Right, the sha1 column is indexed then? I hadn't bothered to check that. We dump the image table in any case so that would be available a couple of times a month, not exactly matching the machine vision dumps but close enough.

Yep, the sha1 column is indexed. Good point about the image table getting dumped separately—that should be fine.
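
For the record, that's easy to verify on a replica (mvi_sha1 is the assumed name of the extension-side SHA1 column):

  SHOW INDEX FROM image WHERE Column_name = 'img_sha1';
  SHOW INDEX FROM machine_vision_image WHERE Column_name = 'mvi_sha1';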

Mholloway renamed this task from Figure out data dumps for the MachineVision extension to Data dumps for the MachineVision extension. Dec 12 2019, 2:56 PM

@Cparle Could you sync up with Ariel to get these CAT dumps finalized? Thanks! 😄

The easy option is just to mysqldump the tables, in which case I add the specific tables to the list of tables we dump (see: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/snapshot/files/dumps/table_jobs.yaml ) and that's that.

I think this is all that needs to happen.

Shall I add these as a weekly run?

That works for me! 😺

Shall I add these as a weekly run?

Change 573351 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] weekly dump of machine vision tables from commonswiki

https://gerrit.wikimedia.org/r/573351

I need to have one more look at the tables to be sure there's nothing in there we don't want public, and check the puppet manifests too before they go live.

Because these tables are only on commonswiki, it makes sense to do them as a separate job instead of folding them into the regular XML/SQL dumps, which run across all wikis.

With any luck, these will go live by the end of the week and next week sometime we'll have the first output for checking. If it looks good we can figure out what to put on the web page https://dumps.wikimedia.org/other/ to advertise them.

Adding @Reedy (please feel free to remove yourself and point me to someone else if you think that's better) to give a final thumbs up/down on making the contents of these tables public. They look fine to me but I'd rather go by the book.

https://www.mediawiki.org/wiki/Extension:MachineVision/Schema/machine_vision_freebase_mapping for what's in them.

I ran a little test of the new dump job and it worked just fine from the command line. Output files are in a temporary location only available on dumpsdata1001 for now.

@Reedy do we have a rough ETA on this? Our friends at Google keep asking. Thanks!

Sorry for the delay.

Is there any reason machine_vision_safe_search isn't being dumped too (for completeness, i.e., dumping everything)? I don't see anything particularly sensitive there. Or is it some sort of "private" data from the classifier as to how it ranks the images? I ask mostly for clarity/documentation purposes, especially if someone comes along later and wants to know why it's not being dumped, or not in the cloud replicas, etc.

I note that mvl_uploader_id and mvl_reviewer_id don't seem to be exposed anywhere by the extension, but the information isn't "private": you can work out who uploaded an image (and then eventually get their user ID; it's exposed in the API etc.), and from Special:Contributions and page history you can see who added which machine-assisted claims.

The same goes for the parent task: this is fine to go ahead, as is T238574: Create wiki replica views for MachineVision extension tables.

Edit:

It's noted from reading T238574 (Create wiki replica views for MachineVision extension tables):

Please create the views necessary to expose the MachineVision extension tables to the wiki replicas. Thanks!

That page lists machine_vision_safe_search too, but doesn't say it shouldn't be exposed on replicas

So that inconsistency does need clarifying and sorting out

I'm happy to dump the machine_vision_safe_search table too if folks want it. Is the output liable to be useful at all to downloaders?

I'm happy to dump the machine_vision_safe_search table too if folks want it. Is the output liable to be useful at all to downloaders?

Would need to be answered by Ramsey (though it would sound like it's giving Google back the data it gave us ;)) or Michael

Mostly, it's the discrepancy between the two requests. Either we should include it in both, or in neither, IMHO :)

(though it would sound like it's giving Google back the data it gave us ;))

Yeah, this was why I omitted it. But I think dumping it too is fine.

(though it would sound like it's giving Google back the data it gave us ;))

Yeah, this was why I omitted it. But I think dumping it too is fine.

Dumping it keeps things consistent, and then we don't need to document why it's not dumped but is exposed in labs, etc.

I'm gonna merge and deploy the updated patch by tomorrow morning if I hear no objections. That's EET morning, so get your complaints in now!

Thanks, everyone! Are the dumps available for download now? (and if so, where? 😺 )

Unfortunately not yet.

Change 573351 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] weekly dump of machine vision tables from commonswiki

https://gerrit.wikimedia.org/r/573351

Hasn't been merged

Change 573351 merged by ArielGlenn:
[operations/puppet@production] weekly dump of machine vision tables from commonswiki

https://gerrit.wikimedia.org/r/573351

Yeah, I gave folks an extra couple of days just in case, since the cron job runs on Saturday. Expect the first round of dumps then.

Change 587709 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] make sure machine vision dumps dir is created on all dumps servers

https://gerrit.wikimedia.org/r/587709

Change 587709 merged by ArielGlenn:
[operations/puppet@production] make sure machine vision dumps dir is created on all dumps servers

https://gerrit.wikimedia.org/r/587709

@Mholloway care to give me a one-line description that I can use for the index.html page mentioned above? See the existing page for examples. I could announce it on the xmldatadumps-l mailing list afterwards, unless you care to do the honours.

@ArielGlenn How about "MachineVision extension tables"? Keeps it simple.

Change 589561 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] add machine vision tables dump to the web page for 'other' dumps

https://gerrit.wikimedia.org/r/589561

Change 589561 merged by ArielGlenn:
[operations/puppet@production] add machine vision tables dump to the web page for 'other' dumps

https://gerrit.wikimedia.org/r/589561

This is done now. Feel free to announce it wherever you like. After I have sent mail to the xmldatadumps-l list, I will close this task unless there is something else you need.