Evaluate PhotoDNA MediaModeration failures
Open, Needs Triage, Public

Description

Essex and I tried to run the MediaModeration script today, but as we looked in logstash, there were a ton of errors. Petr Pchelko confirmed it seemed like a high error rate so we paused to do more investigation.

To do this, we need to roll back the change made here https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/606239/9/wmf-config/InitialiseSettings.php#b6736 and set it to 'warning' so that we have more robust logs.

Findings from looking through the source

MediaModeration is not currently checking the file type of the thumbnail

We don't have visibility into load balancer/CDN behavior when PhotoDNA requests an image

  • We could send PhotoDNA the file content instead of URLs
    • this would eliminate network troubleshooting and allow for local testing

We are not sure that the script actually identifies problematic images

Additional info
It looks like the original implementation sent file content instead of URLs:

commit e5c6ee716b0230de5a0deec8f4a344cd02ffdc90
Author: Peter Ovchyn <peter.ovchyn@speedandfunction.com>
Date:   Tue Mar 3 15:45:27 2020 +0200

    Implement PhotoDNA integration using MWHttpRequest
    
    Bug: T246206
    Change-Id: I5a202c949436b9962e48dd52833aa12e37d129fa

but then it changed to sending urls as part of the thumbnail implementation:

commit 9028494fa014a423319090091592ee67994b1b44
Author: Peter Ovchyn <peter.ovchyn@speedandfunction.com>
Date:   Thu Mar 12 22:17:59 2020 +0200

    Send 160x160 thumbnails to photo DNA instead of real files
    
    Bug: T246915
    
    Change-Id: I6424f256bb4ba1cba6b115390b9f6e34f728cc5c

In T308451, I found that only 0.05% of images are making it to PhotoDNA. We still don't know the exact reason for these errors though.

Event Timeline

@mepps What do you think about putting this script behind a CRON job once we get it fully fleshed out and stable?

@eigyan A good idea! I think the only issue is the starting and ending timestamps...

Change 708815 had a related patch set uploaded (by Eigyan; author: Eigyan):

[operations/mediawiki-config@master] wmf-config: Restore logging for mediamoderation script to better understand high error rate occurring when running script

https://gerrit.wikimedia.org/r/708815

Change 708832 had a related patch set uploaded (by Eigyan; author: Eigyan):

[operations/mediawiki-config@master] wmf-config: Restore logging for mediamoderation script to better understand high error rate occurring when running script

https://gerrit.wikimedia.org/r/708832

Change 708815 merged by jenkins-bot:

[operations/mediawiki-config@master] wmf-config: Restore logging for mediamoderation script to better understand high error rate occurring when running script

https://gerrit.wikimedia.org/r/708815

Mentioned in SAL (#wikimedia-operations) [2021-08-02T11:05:05Z] <urbanecm@deploy1002> Synchronized wmf-config/InitialiseSettings.php: 26bcaafdcd57b1b7a78f9e0ad000325baaf36a72: Restore logging for mediamoderation script to better understand high error rate occurring when running script (T287511) (duration: 00m 57s)

@eigyan I found an error message! It looks like the API is returning "The given file could not be verified as an image". I'm still curious whether this is an appropriate rate for this error. I found this in mw-log.

I checked one of the images that got this error and it does look like a real image: https://commons.wikimedia.org/wiki/File:For%C3%AAt_@_Mont_Veyrier_(51122922841).jpg. So I'm not sure what to make of this.

@mepps Agreed. That type of failure at a high rate suggests the script may not be working as expected, perhaps because of a bug. Hence my question: how do we run this script in a sandbox so we can see these types of errors in a non-production environment?

@mepps Do we have an API documentation resource for the script? I'm guessing/hoping Petr would be the person to ask?

@eigyan There's some documentation on the PhotoDNA site, but we may need a login for more access.

@drochford got new credentials. I'm curious if we need to update the access token used on prod. @ARamirez_WMF @Madalina It might be worth bringing this into the sprint because we never did get to run the script all the way. The next step would be to ask David to look up the token on the dashboard, and to ask SRE which token is stored in $wgMediaModerationPhotoDNASubscriptionKey in the private config repo. We could also use the credentials to log in and see the PhotoDNA documentation.

Change 737499 had a related patch set uploaded (by EllenR; author: gerrit:ellenr):

[mediawiki/extensions/WikimediaMessages@master] Update language files to support Beta QA testing for QuickSurvey

https://gerrit.wikimedia.org/r/737499

Change 708832 abandoned by Eigyan:

[operations/mediawiki-config@master] wmf-config: Restore logging for mediamoderation script to better understand high error rate occurring when running script

Reason:

https://gerrit.wikimedia.org/r/708832

Hmm so I was able to confirm that the api key is current.

I'm re-running the script and still seeing these errors. I don't have a lot more information about why, or how many images are getting correctly scanned. I'd like the script to give more feedback in the long run.

ARamirez_WMF added a subscriber: eigyan.
ARamirez_WMF renamed this task from Investigate MediaModeration failures to Contact PhotoDNA regarding MediaModeration failures. Mar 7 2022, 4:16 PM
ARamirez_WMF assigned this task to mepps.

The error code is 3206.

From @ERayfield Status codes and corresponding descriptions:
3000: OK
3002: Invalid or missing request parameter(s)
3004: Unknown scenario or unhandled error occurred while processing request
3206: The given file could not be verified as an image
3208: Image size in pixels is not within allowed range (minimum size is 160x160 pixels; maximum size is 4MB)
EvaluateResponse: Collection of image evaluation flags:
AdultClassificationScore: Numeric score representing the likelihood of adult content
IsImageAdultClassified: Boolean representing whether or not adult content was found
RacyClassificationScore: Numeric score representing the likelihood of racy content
IsImageRacyClassified: Boolean representing whether or not racy content was found
AdvancedInfo: reserved for future use
Result: Boolean representing whether or not adult and/or racy content was found
Note: this object is null unless the header 'Enable-Evaluation' is present and a valid Content Moderator key has been provided in the PhotoDNA portal.
https://developer.microsoftmoderator.com/docs/services/57c7426e2703740ec4c9f4c3/operations/57c7426f27037407c8cc69e6?ref=mktg
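
For triage it can help to have the codes above in a lookup table. This is a minimal sketch in Python (not the extension's PHP); the function names `describe_status` and `is_success` are hypothetical, and the table only contains the codes quoted in this comment.

```python
# Status codes copied from the list above; any other code is reported as unrecognized.
PHOTODNA_STATUS = {
    3000: "OK",
    3002: "Invalid or missing request parameter(s)",
    3004: "Unknown scenario or unhandled error occurred while processing request",
    3206: "The given file could not be verified as an image",
    3208: "Image size in pixels is not within allowed range",
}


def describe_status(code: int) -> str:
    """Return a readable label for a PhotoDNA status code."""
    return PHOTODNA_STATUS.get(code, f"Unrecognized status code {code}")


def is_success(code: int) -> bool:
    """Only 3000 ("OK") indicates the request was processed without error."""
    return code == 3000
```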

A few more pieces of info I gathered today.

A sample api error response:

2022-03-07 20:41:22 [10315b4a0caeeb1c23e29f37] mw1338 commonswiki 1.38.0-wmf.24 mediamoderation WARNING: Hash check of file NOITE_ELEITORAL_06_10_2019_(Esquerda.Net_49105669418).jpg failed. Error response from PhotoDNA service: {
 "Status": {
  "Code": 3206,
  "Description": "The given file could not be verified as an image",
  "Exception": null
 },
 "ContentId": null,
 "IsMatch": false,
 "MatchDetails": null,
 "XPartnerCustomerId": null,
 "TrackingId": "WUS_8832767f55c743fbb0b73619d78ed0f1_57c7457ae3a97812ecf8bde9_eb0496fa2a2543a596b67912f0f22be3",
 "EvaluateResponse": null
}
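
To get a count of how many entries fail with which code, the JSON payload can be pulled out of each log entry. The sketch below is Python and assumes each entry contains the literal text "Error response from PhotoDNA service:" followed by the JSON body, as in the sample; `parse_error_entry` and `tally_status_codes` are hypothetical helper names, not part of the extension.

```python
import json
from collections import Counter

MARKER = "Error response from PhotoDNA service:"


def parse_error_entry(entry: str) -> dict:
    """Extract and decode the JSON payload that follows MARKER in one log entry."""
    _, _, payload = entry.partition(MARKER)
    if not payload.strip():
        raise ValueError("entry has no PhotoDNA error payload")
    # Carriage returns may appear as raw "\r" or as a literal "^M" in copied logs.
    cleaned = payload.replace("^M", "").replace("\r", "")
    return json.loads(cleaned)


def tally_status_codes(entries) -> Counter:
    """Count log entries per PhotoDNA status code."""
    counts = Counter()
    for entry in entries:
        counts[parse_error_entry(entry)["Status"]["Code"]] += 1
    return counts
```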

Also there are other logger->debug calls that don't seem to be getting logged.

Also, the failing images are JPGs, which should be supported per this in the docs:

The request body can be an image; the following MIME types are supported
Content-Type: image/gif
Content-Type: image/jpeg
Content-Type: image/png
Content-Type: image/bmp
Content-Type: image/tiff
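
One local check that follows from this list is to confirm that the bytes we intend to hand over really start like one of the supported formats. The Python sketch below compares the leading "magic" bytes against well-known file signatures; `sniff_supported_type` is a hypothetical helper, and whether PhotoDNA validates images this way is not known from this task.

```python
from typing import Optional

# Well-known leading byte signatures for the MIME types listed above.
_SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"BM", "image/bmp"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
]


def sniff_supported_type(data: bytes) -> Optional[str]:
    """Return the matching supported MIME type, or None if no signature matches."""
    for signature, mime in _SIGNATURES:
        if data.startswith(signature):
            return mime
    return None
```

If a thumbnail request returned an HTML error page instead of an image, this check would return None for that body.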

I just looked in Thumbnail provider and there is a logger->warning if a file can't be made into a thumbnail. The logs are currently set at debug so we wouldn't know if these images weren't being correctly thumbnailed.

It may be that we are not getting the correct grey scale, or that what is coming out is not being defined correctly - put more info into Slack

On log levels: Maggie and I looked at this in a call and determined that the current default log level for the extension set in wmf-config is warning, which excludes any debug log output. If we want debug info in our logs, we will need to adjust the log level in wmf-config to debug. We may still need to reach out to Microsoft though, as it looks like our code is working (images are being sent to the API, responses are coming back), and the issue is that we don't understand what it is about our images that is not image-like.
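
As an illustration of the level filtering described here, this is a small sketch using Python's standard `logging` module as a stand-in for the MediaWiki logger; the logger name and messages are made up for the example.

```python
import io
import logging


def capture(level: int) -> str:
    """Emit one debug and one warning message and return what was actually logged."""
    stream = io.StringIO()
    logger = logging.getLogger(f"mediamoderation-sketch-{level}")
    logger.setLevel(level)
    logger.propagate = False  # keep the example output isolated from the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.debug("thumbnail generated")
    logger.warning("hash check failed")
    logger.removeHandler(handler)
    return stream.getvalue()
```

With the level at warning only the warning line is kept; with the level at debug both lines appear.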

Thought:
I can see that createModerationRequest is sending thumbnail urls rather than the bytestream for the thumbnails directly in the post. We should check to see that those URLs can actually be accessed by the photodna service. If photodna is getting back an error message, then it would make sense that it's saying that it hasn't received a file.

We could also just update the implementation to send the image directly in the request, since that would be more testable/deterministic, and not dependent on the performance of our cdn, transient configuration issues, etc.
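
A rough way to check whether a thumbnail URL is reachable from outside is to fetch it and look at the status and content type. This is a Python sketch, not the extension's code; `probe_thumbnail` and `is_fetchable_image` are hypothetical names, the User-Agent string is a placeholder, and a probe from our own network may not see what the PhotoDNA service sees.

```python
import urllib.error
import urllib.request


def probe_thumbnail(url: str, opener=urllib.request.urlopen, timeout: float = 10.0):
    """Fetch the URL and return (status, content_type).

    `opener` is injectable so the check can be exercised without network access.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "thumb-probe-sketch/0.1"})
    try:
        with opener(request, timeout=timeout) as response:
            return response.status, response.headers.get("Content-Type", "")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report them as data instead.
        headers = err.headers
        content_type = headers.get("Content-Type", "") if headers is not None else ""
        return err.code, content_type


def is_fetchable_image(status: int, content_type: str) -> bool:
    """True when the response looks like a successfully served image."""
    return status == 200 and content_type.startswith("image/")
```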

I've reached out to @drochford in slack to make sure I contact the correct person.

For my own notes: I'll be asking about the error we're seeing, as well as developer credentials for a sandbox account, if available.

I've reached out to Nicholas to confirm the best way to reach PhotoDNA.

I have sent an email to PhotoDNA support.

I followed up today because I haven't heard any kind of response.

I got some replies:

I can confirm that the access to PhotoDNA Cloud Service is limited to a single user sign in and there isn’t a way to add additional users to that account. And whereas there are a couple of test images that you can use to “test” on the service, there is no sandbox/pre-production environment to test against.
This is a question I will need to pull in Engineering to answer – will need to get back to you on this…

I got a response that needs more investigation:

Regarding the subjected issue for Wikimedia, we see two major concerns for the error “3206: The given file could not be verified as an image”

  1. Could not find a part of the path, i.e. we are getting an incomplete URL, hence we are unable to download them.

     This is expected as we are not getting the complete URL from the request.

  2. Uploaded Image

     The remote server returned an error: (403) Forbidden, looks like there is an issue while we are downloading the images.
     LoadBytes error System.Net.WebException: The remote server returned an error: (403) Forbidden.

     at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)

For #2, the system is receiving 403 for some images while we are able to access those URLs from Postman. One possibility is that their server has some DoS protection and they have seen thousands or millions of requests from Content Moderator’s IP address, so their system thinks we are an attacker.

Wikimedia team – can you please check to see if you are somehow rejecting Content Moderator’s requests because your system is seeing too many requests coming from the same IP address?

It might also be worth considering changing the moderation check to upload the image instead of sending the URL.

I'd be interested in the DoS explanation (many requests from the same IP), although I am not sure why there would be one, TBH.

With some local testing I did find that if there's an error in the creation of the thumbnail it's not always caught. I'll need more info to know if that's what's happening here.

At the suggestion of the eng. channel, I have entered a request for information.

Waiting on a response from T305863, then I will wrap this up.

Question from engineering talk: should we stop sending urls and send the photo instead? Why wasn't this done in the past? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaModeration/+/576338 seems to have done this before. The next step might be asking Cindy Cicalese.

mepps renamed this task from Contact PhotoDNA regarding MediaModeration failures to Evaluate PhotoDNA MediaModeration failures. Apr 26 2022, 3:46 PM

From Cindy on sending files versus urls: "I recall that change being made. I will check my email archive to see if I can reconstruct the reason. I have a feeling it was because they have a relatively small size limit if you send the file itself."
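
If a size limit was the reason, one option is to send the bytes when they fit and fall back to the URL otherwise. The Python sketch below assumes the limit is the "4MB" mentioned in the 3208 status description, which has not been confirmed as the upload limit; `choose_payload` and its return shape are hypothetical.

```python
# Assumption: the upload limit equals the "4MB" quoted in the 3208 description.
MAX_UPLOAD_BYTES = 4 * 1024 * 1024


def choose_payload(image_bytes: bytes, fallback_url: str) -> dict:
    """Prefer sending the image bytes; fall back to the URL if they exceed the limit."""
    if len(image_bytes) <= MAX_UPLOAD_BYTES:
        return {"mode": "bytes", "body": image_bytes}
    return {"mode": "url", "body": fallback_url}
```

Since the script sends 160x160 thumbnails rather than the original files, the thumbnail bytes would normally be well under such a limit.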

I can confirm that I saw no success messages in the logs today.

Do we know if the extension changes have been deployed to whatever group the maintenance script runs on? I know I haven't been tracking the train for additional output changes.

It doesn't look like they have, but I was searching for "Hash match found for file" and "No hash match found for file". I wish we had Ellen's messages in now, because it would have been helpful.

will output
#_EXCEPTION with more information about exception,
#_WARNING with more information on warning,
. (dot or period) if file was processed with no errors,
# (pound/bang) if file was processed with errors
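
Once those markers are in the script output, a success rate could be computed from a captured run. This is a Python sketch under two assumptions that are not confirmed in this task: the error marker is a literal "#", and the #_EXCEPTION / #_WARNING details are printed on their own lines. `summarize_progress` is a hypothetical helper.

```python
def summarize_progress(output: str) -> dict:
    """Count '.' (processed OK) and '#' (processed with errors) progress markers."""
    ok = failed = details = 0
    for line in output.splitlines():
        if line.startswith("#_"):
            # Detail lines such as "#_WARNING ..." are not progress markers.
            details += 1
            continue
        ok += line.count(".")
        failed += line.count("#")
    total = ok + failed
    return {
        "ok": ok,
        "failed": failed,
        "details": details,
        "success_rate": ok / total if total else 0.0,
    }
```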

just as an fyi

Oh! I got exactly one "No hash match found" message now. Also, David Rochford says that there have been a total of 25 emails sent with matches.

As noted in the task description above, I found that there is still a huge failure rate, with only 0.05% of images being sent to PhotoDNA successfully.

This blog post from Aug. 2021 concerning PhotoDNA speaks to how easily it can be hacked and to issues within the codebase of PhotoDNA itself. It raises some relevant questions about the use of PhotoDNA as the primary arbiter of images which exploit children.
https://www.hackerfactor.com/blog/index.php?/archives/931-PhotoDNA-and-Limitations.html