
MediaModeration extension MVP
Closed, ResolvedPublic

Description

  1. Develop an extension that can be invoked through the job queue when image/video/audio files are uploaded or can be called through a maintenance script to process already uploaded image/video/audio files.
  2. The tool would use PhotoDNA to compare the files with hashes in the industry-wide databases for child protection and terrorism content. The tool would pull directly from the hash corpus using the API, if feasible.
  3. If a match between the corpus and the file is detected, the tool will:
    • Send an email with the link to the matching media file to the Trust and Safety team to review.
    • If the content is flagged as a child protection issue, the tool will automatically delete (takedown) the matched content in a parallel action to the email.
    • Terrorism content will not trigger an automatic takedown.
    • There should be a flag to control whether the tool automatically takes down content or only flags it, so that both modes can be tested.
  4. It is preferable that this functionality be built into an extension that is invoked asynchronously through the job queue after a file is uploaded. It should not prevent the file from being uploaded and processed. Processing would happen in the background.
  • Create git project on gerrit
  • Create a wireframe.
  • PhotoDNA integration
  • Run requests asynchronously using JobSpecification (see the sketch after this list)
  • Define the strategy for deleting files - Don't delete.
  • Send emails to "Trust and Safety team"
  • Create a page on MediaWiki:Extension portal (https://www.mediawiki.org/wiki/Extension:MediaModeration)
  • Deployment - In Progress.
  • Acceptance testing
  • Turn off debug logging in production
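
For the JobSpecification item above, a minimal sketch of what enqueueing the check on upload could look like. The hook class, job name, and parameters here are illustrative assumptions, not the extension's actual identifiers:

```php
<?php
// Minimal sketch; all names (hook class, job name, params) are
// illustrative assumptions, not the extension's actual identifiers.

class MediaModerationHooks {
	/**
	 * UploadComplete is a core hook that fires after an upload
	 * finishes. We only enqueue a job here, so the upload itself is
	 * never blocked on the PhotoDNA call.
	 * @param UploadBase $uploadBase
	 */
	public static function onUploadComplete( $uploadBase ) {
		$file = $uploadBase->getLocalFile();
		$job = new JobSpecification(
			'mediaModerationCheck',          // assumed job name
			[ 'sha1' => $file->getSha1() ],  // enough to re-locate the file
			[],
			$file->getTitle()
		);
		JobQueueGroup::singleton()->push( $job );
	}
}
```

The takedown-versus-flag switch from item 3 could then be a plain configuration variable read inside the job, so both modes can be tested per environment.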

Some background from Foundation Legal: The purpose is to improve the Foundation’s existing workflows for child protection and terrorism related content. Each of these types of material will be treated differently, but there are aspects of the tools underlying both that can be built out in this MVP.

Currently, when the Foundation receives a report of images that depict child sexual abuse, we delete them from the projects and report them to law enforcement according to our legal requirements. This setup requires volunteers, who unlike staff have no professional training or mental health support, to be the first to deal with this very emotionally taxing content.

This MVP aims to protect the community from being exposed to such content in nearly all cases and get it off the platform a lot faster. It would check images against a database of hashed, known images of child sexual abuse to allow Foundation staff to remove them and report their existence to law enforcement.

This MVP could eventually plug into other Trust & Safety workflows dealing with terrorism content, so Foundation staff can review flagged material to see whether it meets our existing criteria for credible threats of immediate harm.

This MVP will not automatically remove any content without human review by Foundation staff.

Details

Due Date
Jun 30 2020, 4:00 AM

Related Objects

Status    Assigned
Declined  None
Resolved  Peter.ovchyn
Resolved  CCicalese_WMF
Resolved  Peter.ovchyn
Resolved  Peter.ovchyn
Resolved  Pchelolo
Resolved  Peter.ovchyn
Resolved  Peter.ovchyn
Resolved  Peter.ovchyn
Resolved  Peter.ovchyn
Resolved  eprodromou
Resolved  Peter.ovchyn
Resolved  Peter.ovchyn
Resolved  Pchelolo
Resolved  sbassett
Resolved  Peter.ovchyn
Invalid   None
Resolved  Peter.ovchyn
Resolved  CCicalese_WMF
Resolved  Art.tsymbar
Resolved  CCicalese_WMF
Resolved  Peter.ovchyn
Resolved  Pchelolo
Resolved  CCicalese_WMF

Event Timeline


I would propose to NOT make a standalone service for this, but instead make a MediaWiki extension and rely heavily on the JobQueue. The reasons for this are numerous:

  • The need for asynchronous processing will make us use the JobQueue regardless.
  • All the tasks downstream of calling the PhotoDNA API (flagging the image, deleting the image) will need to be done by MediaWiki, initiated either via the API in the case of a service or directly in the case of an extension.
  • Sending email is also currently done by MW.

So there's nothing for a standalone service to do except be called from the jobrunner, make an outgoing call to the PhotoDNA API, and call back into MW for the follow-up tasks. So I propose not to make a standalone service at all. If we end up extending PhotoDNA, or building custom methods for image processing, we could build a custom lambda-style service at that point, but the whole infrastructure proposed here would just call out to it instead of directly into PhotoDNA.

So, proposed architecture:

  • Build an ImageHashChecking (name TBD) extension
  • The extension would post a job when an image is uploaded
  • The job would do all the work: call out to PhotoDNA and perform all the subsequent actions described in the task

A good part of this is that this particular pattern is already in use in the new MachineVision extension. It calls out to Google Cloud Vision APIs and updates structured data for uploaded images. The approach I propose here is exactly the same. You can review the extension code and reuse all the same patterns.
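
To make that concrete, a sketch of what the job could look like, following the same shape as MachineVision's jobs. The class name and the two helper calls below are hypothetical placeholders, not real code from the extension; the job class would be registered under JobClasses in extension.json:

```php
<?php
use MediaWiki\MediaWikiServices;

// Hypothetical job class; 'MediaModerationCheckJob', the PhotoDNA
// client and the notifier below are placeholders.
class MediaModerationCheckJob extends Job {
	public function __construct( Title $title, array $params ) {
		parent::__construct( 'mediaModerationCheck', $title, $params );
	}

	/** @return bool */
	public function run() {
		$file = MediaWikiServices::getInstance()->getRepoGroup()
			->getLocalRepo()->findFile( $this->getTitle() );
		if ( !$file ) {
			// File was deleted before we got to it; nothing to do.
			return true;
		}
		// Call out to PhotoDNA (see the API notes further down).
		if ( MediaModerationPhotoDNAClient::check( $file ) ) {
			// Follow-up actions from the task description: notify the
			// Trust and Safety team and, behind a config flag, take down.
			MediaModerationNotifier::emailTrustAndSafety( $file );
		}
		return true;
	}
}
```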

I would propose to NOT make a standalone service for this, but instead make a MediaWiki extension and rely heavily on the JobQueue.

I totally agree. After research, I came to the same conclusion. In terms of the description, there's nothing for a standalone microservice to do. Moreover, we would most probably have to add an extension anyway to support the communication channel between the MW server and the service.

Still, a standalone service has some advantages:

  1. The extension (even if its code is separate) is still part of the monolith, and increases the load on the server and the codebase.
  2. A microservice architecture is more flexible in terms of extensibility: in the future, the microservice could be extended easily without the risk of breaking the main server.
  3. A microservice could communicate with the DB directly. This is controversial and a bit dangerous, but possible, and it would avoid running those operations on the main server and loading it.

Disadvantages:

  1. We would still need changes on the main server (extension, DB changes, etc.).
  2. It demands resources for maintenance. Say we use the DB directly somehow: any kind of DB change would have to be reflected in the service as well. Also, SQLite couldn't be used this way, as it's in-process.

Conclusion:

  • If we are eager to use microservices, this approach is worth trying as an experiment.
  • If we want a reliable and quick solution, the extension is the best option.

If CPT goes for a separate code base (extension), can you please request or set up a dedicated Phabricator project tag to tag all Hash Checking extension tasks with, so people can easily find related tasks by looking at that dedicated project and its associated tasks (like T245595 or T246206)? Thanks. Moved to T246246

Pchelolo renamed this task from Hash Checking MVP to MediaModeration extension MVP. (Feb 26 2020, 4:57 PM)

A couple of notes on what we've learned from a conversation with Microsoft regarding communication with the PhotoDNA platform:

  • Technically, a 160x160 thumbnail of the image is better for them, because that's the resolution they need for computing the hashes; anything above that resolution would not improve recognition quality and would just result in higher bandwidth usage and worse overall performance. Note: we do have thumbnailing infrastructure, and thumbnails are created on upload with a ThumbnailRender job. However, the minimum resolution at which we pre-render thumbnails is 320x320. So there's no real opportunity to optimize anything by trying to reuse pre-rendered thumbnails and waiting for the thumbnail render job; we just transform the images with the RENDER_NOW flag.
  • There is an API where we simply POST binary data. We will know more when we get through the application process and receive app keys. They recommended using that API since it's the most efficient way.
  • For the Edge Hash API, the idea is that we don't compute the PhotoDNA hash but a much simpler hash of the image, and they maintain a mapping from those signatures to the actual PhotoDNA hashes, to save on bandwidth. The hash to compute is not a simple md5, though; they distribute client libraries that perform the hashing. The libraries are available in Java and .NET, so for now we will not use this API, since it would require investing in running a Java service for hashing for an as-yet-unknown gain.
  • PhotoDNA actually does not provide terrorism content checks, only child protection checks, so we do not need to take different actions depending on the type of match. They might add it later on, so the overall system design still needs to allow for extension.
  • They provide a testing database and a testing set of images that will match positively; we will get those later on for integration testing.
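
Putting the first two bullets together, the check could look roughly like this. The endpoint URL, header names, and response shape follow Microsoft's public PhotoDNA Cloud Service documentation as I understand it and should be treated as assumptions until we have the app keys:

```php
<?php
use MediaWiki\MediaWikiServices;

// Sketch only: render a 160x160 thumbnail and POST its bytes to the
// PhotoDNA Match endpoint. URL, headers and response fields are
// assumptions based on public PhotoDNA docs, to be confirmed later.
function mediaModerationMatch( File $file, string $subscriptionKey ): bool {
	// 160x160 is the resolution PhotoDNA hashes at; RENDER_NOW renders
	// immediately instead of waiting for the ThumbnailRender job.
	$thumb = $file->transform( [ 'width' => 160, 'height' => 160 ], File::RENDER_NOW );
	if ( !$thumb || $thumb->isError() ) {
		throw new RuntimeException( 'Could not render thumbnail' );
	}
	$bytes = file_get_contents( $thumb->getLocalCopyPath() );

	$req = MediaWikiServices::getInstance()->getHttpRequestFactory()->create(
		'https://api.microsoftmoderator.com/photodna/v1.0/Match',
		[ 'method' => 'POST', 'postData' => $bytes ],
		__METHOD__
	);
	$req->setHeader( 'Content-Type', 'application/octet-stream' );
	$req->setHeader( 'Ocp-Apim-Subscription-Key', $subscriptionKey );
	if ( !$req->execute()->isOK() ) {
		throw new RuntimeException( 'PhotoDNA request failed' );
	}
	$response = json_decode( $req->getContent(), true );
	// 'IsMatch' is the assumed boolean field in the Match response.
	return (bool)( $response['IsMatch'] ?? false );
}
```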

Database upgrade (add flag to relevant tables; create 'upgrade' script)

Please don't get into implementing this just yet. For legal reasons it might be inappropriate to store information about the images we discover and remove on-wiki, so we might go the other direction and put this data off-wiki.

Added some background info in the task description.

Change 606239 had a related patch set uploaded (by Cicalese; owner: Cicalese):
[operations/mediawiki-config@master] DO NOT MERGE Remove temporary logging for mediamoderation

https://gerrit.wikimedia.org/r/606239

@eprodromou:

Since testing is complete, shall we disable debug logging in production and on beta?

The maintenance script parameters should be documented on mediawiki.org.
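
For the record, the rough shape of such a script; the class name, options, and job name below are placeholders rather than the extension's documented parameters:

```php
<?php
// Hypothetical maintenance script that queues a moderation check for
// every already-uploaded file; names and options are placeholders.
$IP = getenv( 'MW_INSTALL_PATH' );
if ( $IP === false ) {
	$IP = __DIR__ . '/../../..';
}
require_once "$IP/maintenance/Maintenance.php";

class QueueMediaModerationChecks extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Queue MediaModeration checks for existing files' );
		$this->setBatchSize( 200 );
	}

	public function execute() {
		$dbr = $this->getDB( DB_REPLICA );
		$last = '';
		do {
			// Walk the image table in primary-key order, in batches.
			$res = $dbr->select(
				'image',
				[ 'img_name' ],
				[ 'img_name > ' . $dbr->addQuotes( $last ) ],
				__METHOD__,
				[ 'ORDER BY' => 'img_name', 'LIMIT' => $this->getBatchSize() ]
			);
			$jobs = [];
			foreach ( $res as $row ) {
				$last = $row->img_name;
				$title = Title::makeTitle( NS_FILE, $row->img_name );
				$jobs[] = new JobSpecification( 'mediaModerationCheck', [], [], $title );
			}
			if ( $jobs ) {
				JobQueueGroup::singleton()->push( $jobs );
				$this->output( "Queued files up to {$last}\n" );
			}
		} while ( $res->numRows() === $this->getBatchSize() );
	}
}

$maintClass = QueueMediaModerationChecks::class;
require_once RUN_MAINTENANCE_IF_MAIN;
```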

@eprodromou Can the last box above be checked now since Petr resolved 259742 above? And then this ticket could be resolved?

@eprodromou, could you please sign off the ticket so we can close the MediaModeration project?

The MVP is working in production and the client is satisfied, so I'm going to mark this MVP as complete. Further user stories should go into T256982.