
MediaModeration extension MVP
Closed, ResolvedPublic

Description

  1. Develop an extension that can be invoked through the job queue when image/video/audio files are uploaded or can be called through a maintenance script to process already uploaded image/video/audio files.
  2. The tool would use PhotoDNA to compare the files with hashes in the industry-wide databases for child protection and terrorism content. The tool would pull directly from the hash corpus using the API, if feasible.
  3. If a match between the corpus and the file is detected, the tool will:
    • Send an email with the link to the matching media file to the Trust and Safety team to review.
    • If the content is flagged as a child protection issue, the tool will automatically delete (takedown) the matched content in a parallel action to the email.
    • Terrorism content will not trigger an automatic takedown.
    • There should be a flag to control whether the tool automatically takes down content or only flags it, so that both modes can be tested.
  4. It is preferable that this functionality be built into an extension that is invoked asynchronously through the job queue after a file is uploaded. It should not prevent the file from being uploaded and processed. Processing would happen in the background.
  • Create git project on gerrit
  • Create a wireframe.
  • PhotoDNA integration
  • Run requests asynchronously using JobSpecification (see the sketch after this list)
  • Define the strategy for deleting files - Don't delete.
  • Send emails to "Trust and Safety team"
  • Create a page on MediaWiki:Extension portal (https://www.mediawiki.org/wiki/Extension:MediaModeration)
  • Deployment - In Progress.
  • Acceptance testing
  • Turn off debug logging in production
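
For the JobSpecification item above, a minimal sketch of what enqueueing the check on upload could look like. The hook class, job name, and parameters here are illustrative assumptions, not the extension's actual identifiers:

```php
<?php
// Minimal sketch; all names (hook class, job name, params) are
// illustrative assumptions, not the extension's actual identifiers.

class MediaModerationHooks {
	/**
	 * UploadComplete is a core hook that fires after an upload
	 * finishes. We only enqueue a job here, so the upload itself is
	 * never blocked on the PhotoDNA call.
	 * @param UploadBase $uploadBase
	 */
	public static function onUploadComplete( $uploadBase ) {
		$file = $uploadBase->getLocalFile();
		$job = new JobSpecification(
			'mediaModerationCheck',          // assumed job name
			[ 'sha1' => $file->getSha1() ],  // enough to re-locate the file
			[],
			$file->getTitle()
		);
		JobQueueGroup::singleton()->push( $job );
	}
}
```

The takedown-versus-flag switch from item 3 could then be a plain configuration variable read inside the job, so both modes can be tested per environment.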

Some background from Foundation Legal: The purpose is to improve the Foundation’s existing workflows for child protection and terrorism related content. Each of these types of material will be treated differently, but there are aspects of the tools underlying both that can be built out in this MVP.

Currently, when the Foundation receives a report of images that depict child sexual abuse, we delete them from the projects and report them to law enforcement according to our legal requirements. This setup requires volunteers, who unlike staff have no professional training or mental health support, to be the first to deal with this very emotionally taxing content.

This MVP aims to protect the community from being exposed to such content in nearly all cases and get it off the platform a lot faster. It would check images against a database of hashed, known images of child sexual abuse to allow Foundation staff to remove them and report their existence to law enforcement.

This MVP could eventually plug into other Trust & Safety workflows dealing with terrorism content, so Foundation staff can review flagged material to see whether it meets our existing criteria for credible threats of immediate harm.

This MVP will not automatically remove any content without human review by Foundation staff.

Details

Due Date
Jun 30 2020, 4:00 AM

Related Objects

Status    Assigned
Declined  None
Resolved  Peter.ovchyn
Resolved  CCicalese_WMF
Resolved  Peter.ovchyn
Resolved  Peter.ovchyn
Resolved  Pchelolo
Resolved  Peter.ovchyn
Resolved  Peter.ovchyn
Resolved  Peter.ovchyn
Resolved  Peter.ovchyn
Resolved  eprodromou
Resolved  Peter.ovchyn
Resolved  Peter.ovchyn
Resolved  Pchelolo
Resolved  sbassett
Resolved  Peter.ovchyn
Invalid   None
Resolved  Peter.ovchyn
Resolved  CCicalese_WMF
Resolved  Art.tsymbar
Resolved  CCicalese_WMF
Resolved  Peter.ovchyn
Resolved  Pchelolo
Resolved  CCicalese_WMF

Event Timeline


I would propose to NOT make a standalone service for this, but instead make a MediaWiki extension and rely heavily on the JobQueue. The reasons for this are numerous:

  • The need for asynchronous processing will make us use the JobQueue regardless.
  • All the tasks downstream of calling the PhotoDNA API (flagging the image, deleting the image) will need to be done by MediaWiki, initiated either via the API in the case of a service or directly in the case of an extension.
  • Sending email is also currently done by MW.

So there's nothing for a standalone service to do except be called from the jobrunner, make an outgoing call to the PhotoDNA API, and call back into MW for the follow-up tasks. So I propose not to make a standalone service at all. If we end up extending PhotoDNA, or building custom methods for image processing, we could build a custom lambda-style service at that point, but the whole infrastructure proposed here would just call out to it instead of directly into PhotoDNA.

So, proposed architecture:

  • Build an ImageHashChecking (name TBD) extension
  • The extension would post a job when an image is uploaded
  • The job would do all the work: call out to PhotoDNA and perform all the subsequent actions described in the task

A good part of this is that this particular pattern is already in use in the new MachineVision extension. It calls out to Google Cloud Vision APIs and updates structured data for uploaded images. The approach I propose here is exactly the same. You can review the extension code and reuse all the same patterns.
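
To make that concrete, a sketch of what the job could look like, following the same shape as MachineVision's jobs. The class name and the two helper calls below are hypothetical placeholders, not real code from the extension; the job class would be registered under JobClasses in extension.json:

```php
<?php
use MediaWiki\MediaWikiServices;

// Hypothetical job class; 'MediaModerationCheckJob', the PhotoDNA
// client and the notifier below are placeholders.
class MediaModerationCheckJob extends Job {
	public function __construct( Title $title, array $params ) {
		parent::__construct( 'mediaModerationCheck', $title, $params );
	}

	/** @return bool */
	public function run() {
		$file = MediaWikiServices::getInstance()->getRepoGroup()
			->getLocalRepo()->findFile( $this->getTitle() );
		if ( !$file ) {
			// File was deleted before we got to it; nothing to do.
			return true;
		}
		// Call out to PhotoDNA (see the API notes further down).
		if ( MediaModerationPhotoDNAClient::check( $file ) ) {
			// Follow-up actions from the task description: notify the
			// Trust and Safety team and, behind a config flag, take down.
			MediaModerationNotifier::emailTrustAndSafety( $file );
		}
		return true;
	}
}
```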

I would propose to NOT make a standalone service for this, but instead make a MediaWiki extension and rely heavily on the JobQueue.

I totally agree. After research, I came to the same conclusion. In terms of the description, there's nothing for a standalone microservice to do. Moreover, we would most probably have to add an extension anyway to support the communication channel between the MW server and the service.

Still, a standalone service has some advantages:

  1. The extension (even if its code is separate) is still part of the monolith, and increases the load on the server and the codebase.
  2. A microservice architecture is more flexible in terms of extensibility: in the future, the microservice could be extended easily without the risk of breaking the main server.
  3. A microservice could communicate with the DB directly. This is controversial and a bit dangerous, but possible, and it would avoid running those operations on the main server and loading it.

Disadvantages:

  1. We would still need changes on the main server (extension, DB changes, etc.).
  2. It demands resources for maintenance. Say we use the DB directly somehow: any kind of DB change would have to be reflected in the service as well. Also, SQLite couldn't be used this way, as it's in-process.

Conclusion:

  • If we are eager to use microservices, this approach is worth trying as an experiment.
  • If we want a reliable and quick solution, the extension is the best option.

If CPT goes for a separate code base (extension), can you please request or set up a dedicated Phabricator project tag to tag all Hash Checking extension tasks with, so people can easily find related tasks by looking at that dedicated project and its associated tasks (like T245595 or T246206)? Thanks. Moved to T246246

Pchelolo renamed this task from Hash Checking MVP to MediaModeration extension MVP. (Feb 26 2020, 4:57 PM)

A couple of notes on what we've learned from a conversation with Microsoft regarding communication with the PhotoDNA platform:

  • Technically, a 160x160 thumbnail of the image is better for them, because that's the resolution they need for computing the hashes; anything above that resolution would not improve recognition quality and would just result in higher bandwidth usage and worse overall performance. Note: we do have thumbnailing infrastructure, and thumbnails are created on upload with a ThumbnailRender job. However, the minimum resolution at which we pre-render thumbnails is 320x320. So there's no real opportunity to optimize anything by trying to reuse pre-rendered thumbnails and waiting for the thumbnail render job; we just transform the images with the RENDER_NOW flag.
  • There is an API where we simply POST binary data. We will know more when we get through the application process and receive app keys. They recommended using that API since it's the most efficient way.
  • For the Edge Hash API, the idea is that we don't compute the PhotoDNA hash but a much simpler hash of the image, and they maintain a mapping from those signatures to the actual PhotoDNA hashes, to save on bandwidth. The hash to compute is not a simple md5, though; they distribute client libraries that perform the hashing. The libraries are available in Java and .NET, so for now we will not use this API, since it would require investing in running a Java service for hashing for an as-yet-unknown gain.
  • PhotoDNA actually does not provide terrorism content checks, only child protection checks, so we do not need to take different actions depending on the type of match. They might add it later on, so the overall system design still needs to allow for extension.
  • They provide a testing database and a testing set of images that will match positively; we will get those later on for integration testing.
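
Putting the first two bullets together, the check could look roughly like this. The endpoint URL, header names, and response shape follow Microsoft's public PhotoDNA Cloud Service documentation as I understand it and should be treated as assumptions until we have the app keys:

```php
<?php
use MediaWiki\MediaWikiServices;

// Sketch only: render a 160x160 thumbnail and POST its bytes to the
// PhotoDNA Match endpoint. URL, headers and response fields are
// assumptions based on public PhotoDNA docs, to be confirmed later.
function mediaModerationMatch( File $file, string $subscriptionKey ): bool {
	// 160x160 is the resolution PhotoDNA hashes at; RENDER_NOW renders
	// immediately instead of waiting for the ThumbnailRender job.
	$thumb = $file->transform( [ 'width' => 160, 'height' => 160 ], File::RENDER_NOW );
	if ( !$thumb || $thumb->isError() ) {
		throw new RuntimeException( 'Could not render thumbnail' );
	}
	$bytes = file_get_contents( $thumb->getLocalCopyPath() );

	$req = MediaWikiServices::getInstance()->getHttpRequestFactory()->create(
		'https://api.microsoftmoderator.com/photodna/v1.0/Match',
		[ 'method' => 'POST', 'postData' => $bytes ],
		__METHOD__
	);
	$req->setHeader( 'Content-Type', 'application/octet-stream' );
	$req->setHeader( 'Ocp-Apim-Subscription-Key', $subscriptionKey );
	if ( !$req->execute()->isOK() ) {
		throw new RuntimeException( 'PhotoDNA request failed' );
	}
	$response = json_decode( $req->getContent(), true );
	// 'IsMatch' is the assumed boolean field in the Match response.
	return (bool)( $response['IsMatch'] ?? false );
}
```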

Database upgrade (add flag to relevant tables; create 'upgrade' script)

Please don't get into implementing this just yet. For legal reasons it might be inappropriate to store information about the images we discover and remove on-wiki, so we might go the other direction and put this data off-wiki.

Added some background info in the task description.

Change 606239 had a related patch set uploaded (by Cicalese; owner: Cicalese):
[operations/mediawiki-config@master] DO NOT MERGE Remove temporary logging for mediamoderation

https://gerrit.wikimedia.org/r/606239

@eprodromou:

Since testing is complete, shall we disable debug logging in production and on beta?

The maintenance script parameters should be documented on mediawiki.org.
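
For the record, the rough shape of such a script; the class name, options, and job name below are placeholders rather than the extension's documented parameters:

```php
<?php
// Hypothetical maintenance script that queues a moderation check for
// every already-uploaded file; names and options are placeholders.
$IP = getenv( 'MW_INSTALL_PATH' );
if ( $IP === false ) {
	$IP = __DIR__ . '/../../..';
}
require_once "$IP/maintenance/Maintenance.php";

class QueueMediaModerationChecks extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Queue MediaModeration checks for existing files' );
		$this->setBatchSize( 200 );
	}

	public function execute() {
		$dbr = $this->getDB( DB_REPLICA );
		$last = '';
		do {
			// Walk the image table in primary-key order, in batches.
			$res = $dbr->select(
				'image',
				[ 'img_name' ],
				[ 'img_name > ' . $dbr->addQuotes( $last ) ],
				__METHOD__,
				[ 'ORDER BY' => 'img_name', 'LIMIT' => $this->getBatchSize() ]
			);
			$jobs = [];
			foreach ( $res as $row ) {
				$last = $row->img_name;
				$title = Title::makeTitle( NS_FILE, $row->img_name );
				$jobs[] = new JobSpecification( 'mediaModerationCheck', [], [], $title );
			}
			if ( $jobs ) {
				JobQueueGroup::singleton()->push( $jobs );
				$this->output( "Queued files up to {$last}\n" );
			}
		} while ( $res->numRows() === $this->getBatchSize() );
	}
}

$maintClass = QueueMediaModerationChecks::class;
require_once RUN_MAINTENANCE_IF_MAIN;
```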

@eprodromou Can the last box above be checked now since Petr resolved 259742 above? And then this ticket could be resolved?

@eprodromou, could you please sign off the ticket so we can close the MediaModeration project?

The MVP is working in production and the client is satisfied, so I'm going to mark this MVP as complete. Further user stories should go into T256982.