Page MenuHomePhabricator

Security Concept Review for the machine vision middleware project
Closed, ResolvedPublic

Description

Project Information

Description of the project and how it will be used

This is a project to support the incorporation of machine vision (MV) generated metadata into Foundation products. Specifically, the project will support:

  • Requesting MV-generated image metadata from machine vision providers (both internal and third-party external)
  • Providing temporary storage for MV results pending human editor verification
  • Serving MV data to Commons users for verification and promotion to Structured Data on Commons
  • Providing the results of human editor verification back to third-party MV providers for model refinement`

For full discussion, see the project page at https://www.mediawiki.org/wiki/Wikimedia_Product/Machine_vision_middleware (including linked Phab tickets).

Description of any sensitive data to be collected or exposed

None

Technologies employed

MediaWiki, MySQL, possibly (<50% likelihood) a Node.js service similar to cxserver

Dependencies and vendor code

Working test environment

This project in an early stage, but a couple of dev/testing APIs are set up in WMCS:

Event Timeline

sbassett triaged this task as Medium priority.Jul 9 2019, 6:32 PM

@Mholloway - thanks for submitting this request. The Security-Team will plan to review this soon and probably post some follow-up questions here. After that, we can provide some guidance on potential risks.

@Mholloway - from a high-level perspective, I think the Security-Team is fine with this and would assign a low risk for now, especially given the precedent of things like CX-cxserver. Some considerations:

  1. Has this been reviewed by WMF-Legal yet? I believe cx server went through a similar process; see Acceptance Criteria in T76185. This should probably happen prior to deployment.
  2. Some previous security reviews of the cx service (and related components) might be helpful to peruse at your leisure, as it's a close-ish parallel: T85686, T144467, T143185. Particularly the conversation starting at T143185#2632101, T144467#4795794 and T85686#958661. Mainly just as an idea of what we would look for during a more formal code review. Of course, the attack surface for this service would seem a bit smaller since we're talking about media metadata in a standard format that should not be easily manipulated by potential attackers, unlike the wikitext used by cx server.
  3. Though it wasn't formally noted, I assume every component of this will be using TLS.
  4. Even though some headers can be less critical for these types of services, I would strongly advise setting appropriate security headers, including a robust CSP for each component of this service.
  5. I'd also recommend getting this on the radar of the Performance-Team to see if they have any additional concerns or recommendations for best practices. It may also make sense to reach out to the Language-Team to see if they have run into any performance issues with their usage of the various 3rd party MTs for the cx server.
  6. Finally, once more code has been developed for this service (closer to production-ready), we can definitely perform a more formal security review if you'd like.

Hi @Mholloway , Do you have any additional questions or issues related to this ticket that we can help with? Please let us know! Otherwise we'll plan on closing this one out.

Cheers,

Jennifer

Thanks @sbassett for the review and @Jcross for checking in!

  1. Has this been reviewed by WMF-Legal yet? I believe cx server went through a similar process; see Acceptance Criteria in T76185. This should probably happen prior to deployment.

We're talking with them about what we're doing, and will follow up with them when we're code-complete.

  1. Though it wasn't formally noted, I assume every component of this will be using TLS.

Yes.

  1. Even though some headers can be less critical for these types of services, I would strongly advise setting appropriate security headers, including a robust CSP for each component of this service.
  2. I'd also recommend getting this on the radar of the Performance-Team to see if they have any additional concerns or recommendations for best practices.

Filed follow-up tasks for these.

  1. Finally, once more code has been developed for this service (closer to production-ready), we can definitely perform a more formal security review if you'd like.

I will plan on doing that; there's a placeholder task for it as well.

I'm currently in the process of adding support for Google Cloud Vision as a prospective third-party labeling provider. I plan for it to interact with the API through the official PHP client library. Is that something you would want to review specifically? I haven't looked at the implementation details yet, but I would expect that it follows recommended security practices.

That's all I've got, so feel free to close if we're all set for now. Thanks again.

We're talking with them about what we're doing, and will follow up with them when we're code-complete.

Great.

  1. Even though some headers can be less critical for these types of services, I would strongly advise setting appropriate security headers, including a robust CSP for each component of this service.
  2. I'd also recommend getting this on the radar of the Performance-Team to see if they have any additional concerns or recommendations for best practices.

Filed follow-up tasks for these.

Great, do you have the task numbers? I'd like to reference them here if possible.

I'm currently in the process of adding support for Google Cloud Vision as a prospective third-party labeling provider. I plan for it to interact with the API through the official PHP client library. Is that something you would want to review specifically? I haven't looked at the implementation details yet, but I would expect that it follows recommended security practices.

Ideally, for a security readiness review, we'd like to review any relevant source we have access to, including any and all third-party libraries. While we may not perform an exhaustive review of all third-party libraries (e.g. we aren't going to review 5,000 Node packages for some app) we would at least like to understand what is being used and why.

That's all I've got, so feel free to close if we're all set for now. Thanks again.

Sounds good. I'll resolve for now, but we can always post any relevant follow-up here (legal, related tasks, etc.)

Great, do you have the task numbers? I'd like to reference them here if possible.

Sure. Here are the follow-up tasks:
T230811: Add appropriate security headers to MachineVision
T230813: Performance review for the MachineVision extension
T227346: Security readiness review for the MachineVision extension

I expect to have code ready for the final round of reviews this week or next. Also, I should mention that the NSFW classification stuff is on hold for the time being; as of now the extension will only deal with image labeling.

Thanks again!

Thank you for the quick reply @Mholloway - please just let us know when we can be of further assistance. Cheers!

Per the MachineVision Concept Review (T227591), can we confirm that WMF-Legal's review of the extension/service was completed? Is there any corroborating documentation or a new privacy policy/ToU we could reference and review?

Hello! We have continued to keep Legal in the loop after T227591 was originally filed. Since then, the targeted Machine Vision API provider changed (from Clarifai to Google) and with that change came a slight change in requirements that meant we no longer needed to send usage data directly back to the provider (and data dumps contain no user information). As such, the only actual action the user is taking is making a structured data edit, just via a different means than usual. So there's no new policy/ToU other than the prominent notifications about the CC0 license of the structured data contribution. We still intend to review with legal once the tool is up for testing, before release.

I defer to @Slaporte for any additional input he may have