Investigate DMCA Takedown Form for requested updates
Closed, Resolved · Public · 5 Estimated Story Points

Description

The Support and Safety team requests some improvements to some self-written tools to help them complete important tasks as part of their workflows. This ticket will be an investigation into the DMCA Takedown Form, while T159467 covers the Child Protection Takedown Form.

The tools are currently found at http://lcatools.corp.wikimedia.org. @Jalexander will be able to provide login credentials to pertinent developers on request for investigation and/or development. Existing code is here: https://github.com/jamesryanalexander/lca-tools

What the tool does:

This tool is found either from the main page or the left rail of the Trust and Safety tools wiki as "DMCA Takedown Form".

This is a form to fill out information about a DMCA takedown (both data about the takedown itself, as well as the file or page that is being taken down), and attach a file with additional information as needed. The form sends data (and the file) to Lumen Database (formerly Chilling Effects) via their API, and receives the Lumen ID/URL in return (for the log and the Sugar case).

The tool formats posts for:

  • WMFwiki -- provides link to post on WMFWiki and allows copy and paste of the main post
  • the User Talk page of the uploader -- also allows you to make the edit directly via MW OAuth in addition to copy/paste
  • When the takedown is on Commons, it formats posts for Commons Village pump and the Commons DMCA noticeboard. (It also allows one click posting of those via OAuth in addition to copy/paste).
  • The tool also creates a SugarCRM case with basic info about the takedown.

There is a separate sub-tool that only reports to Lumen, without formatting the posts or creating the Sugar case. It is mostly used when the API was down or other issues occurred.

Current screenshots

LegalTakedown1.png (1×2 px, 544 KB)

LegalTakedown3.png (1×2 px, 253 KB)

Requested changes:

  • Right now, the tool only tracks the files that are taken down. We also want to track the requests that don't lead to a takedown, so there should be a full log. There should likely be an option added for request granted (image removed) or denied (image kept.)
  • The 'Project' dropdown is currently hardcoded, can it be expanded or have an 'Other' that allows for a textbox? (with validation?)
  • Some of the fields should not be mandatory.
    • Anything required by the Lumen API should be required
    • If the request was not complied with, then nothing should be required
  • This should write to SalesForce instead of Sugar. The exact same data should be stored.
  • Potentially add multi-file support.
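
A minimal sketch of how the conditional "required field" rules above could be expressed, in Python for illustration only (the existing tool is written in PHP). The field names in `LUMEN_REQUIRED`, the values in `KNOWN_PROJECTS`, and the `outcome` / `project_other` keys are all hypothetical; the real set of required fields would have to come from the Lumen API documentation.

```python
# Hypothetical field names: the real required set must be taken from the Lumen API docs.
LUMEN_REQUIRED = ("title", "sender_name", "infringing_url")
# Stand-in for the values of the currently hardcoded 'Project' dropdown.
KNOWN_PROJECTS = ("commons", "enwiki", "wikidata")


def validate_takedown(form):
    """Return a list of validation errors; an empty list means the form is acceptable."""
    outcome = form.get("outcome")
    if outcome not in ("granted", "denied"):
        return ["outcome must be 'granted' or 'denied'"]

    # If the request was not complied with, nothing else is required.
    if outcome == "denied":
        return []

    errors = []
    # Anything (assumed to be) required by the Lumen API is required here too.
    for name in LUMEN_REQUIRED:
        if not str(form.get(name) or "").strip():
            errors.append(name + " is required")

    # 'Other' opens a free-text box, which must then be filled in.
    project = form.get("project") or ""
    if project == "other":
        if not str(form.get("project_other") or "").strip():
            errors.append("project_other is required when project is 'other'")
    elif project and project not in KNOWN_PROJECTS:
        errors.append("unknown project: " + project)
    return errors
```

Returning a list of messages rather than raising on the first problem lets the form show every missing field at once. Whether denied requests are still reported to Lumen is not settled in this ticket, so the sketch simply skips all checks for them.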

Open questions:

  • Will these requested changes require a complete rewrite, or small-scale fixes?
  • Where should these tools live — on the existing private server, or ToolLabs?
  • Should we merge the sub-tool with the main tool, or keep them separate?

Deliverables:

  • Written answers to open questions in this ticket
  • Written proposal for how to implement requested changes to the two tools
  • Documented knowledge learned to help with further development of this project
  • Any needed additional cards created

In a CommTech estimation meeting on March 7, 2017 this card was sized as a '5'.

Investigation results:


The tool is currently a mixture of several different tools, which don't necessarily have a lot in common.

  1. DMCA takedown tool - A form for submitting information about the requested DMCA takedown.
  2. Child protection takedown tool - A form for submitting information about the requested child-protection takedown.
  3. Global search, Global link search, Global text search - Standalone tools to search across all of WMF wikis.
  4. Strategy tools - Also standalone tools for strategic decision making. Very infrequently used.

The first two tools are basically just forms, with some mandatory and some optional fields. There is an option to fetch data from CentralAuth for the file author(s), as well as for the person filling out the form, to auto-populate some of the fields. After the form is submitted, the tool runs a number of checks to make sure the data won't be rejected at the database end, and then sends it to the database. There is a smaller tool to retrieve a particular entry but, from my understanding, it's also infrequently used.

The global search tools let you search for a string/link across all wikis deployed on the Wikimedia cluster. They should be made public for anybody to use. James mentioned that there was no particular reason that they were a part of the same tool besides lack of developer resources at the time of writing them.

The strategy tools contain private data and hence they shouldn't be made public. James expressed interest in keeping them clubbed with the DMCA and CP takedown tools for now.

To answer the open questions...

Will these requested changes require a complete rewrite, or small-scale fixes?

From my understanding of the codebase, the requested changes are not small-scale. The code is more complex than it needs to be. It could use a fair bit of structure (classes etc.). There's a lot of code duplication in places, and the way it handles data (it gathers data from the form, sends it to the database, retrieves it back, and then uses it to structure wikitext for posting on Village pumps etc.) is also unnecessarily complicated. It doesn't use Composer or any sort of external packages as far as I can see. I feel that if we attempt to patch up the current code, it will only lead to more complexity and more time spent trying to understand the current code. I propose doing a rewrite, borrowing code from the current codebase as and when needed.

Where should these tools live — on the existing private server, or ToolLabs?

Due to the sensitive nature of some of the data being handled, we will probably need to leave it on the existing private server.

Should we merge the sub-tool with the main tool, or keep them separate?

I don't see a reason why they are two separate tools for posting on wiki/not posting. I would suggest we make it a checkbox at the end and merge the two tools into one.

Next steps:

Make a new tool for the DMCA and CP takedown tools

  1. The forms should replicate the functionality of the existing tool.
  2. Merge the two tools (one which posts on wiki and one which doesn't) into one (additional form checkbox).
  3. Pull in data from wikis wherever possible (like information about a commons file/uploader etc.)
  4. Validate data wherever possible
  5. Do away with local accounts and use OAuth (make sure to only allow access to specific staff accounts etc.)
  6. Log actions by users
  7. Retrieval function for given case ID (From the database)
  8. Copy over functionality for strategy tools (If needed?)
  9. Host on private server
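
As a rough illustration of steps 5 and 6 (OAuth with access limited to specific staff accounts, plus logging of user actions), a sketch in Python follows. It assumes the OAuth identify step has already been completed and yields a dict with a `username` key; the allowlist entry, the action names and the log format are hypothetical.

```python
import io
import json
import time

# Hypothetical allowlist of staff accounts permitted to use the tool.
ALLOWED_USERS = frozenset({"ExampleStaff (WMF)"})


def is_authorized(identity):
    """identity: dict from the OAuth identify step (assumed to carry 'username')."""
    return identity.get("username") in ALLOWED_USERS


def log_action(log, username, action, case_id=None, now=None):
    """Append one JSON line describing a user action to a file-like log."""
    entry = {
        "ts": int(time.time() if now is None else now),
        "user": username,
        "action": action,
        "case_id": case_id,
    }
    log.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


# Usage: an in-memory buffer stands in for an append-only log file.
audit_log = io.StringIO()
if is_authorized({"username": "ExampleStaff (WMF)"}):
    log_action(audit_log, "ExampleStaff (WMF)", "dmca_submit", case_id="example-1", now=0)
```

Checking an explicit allowlist keeps the access decision inside the tool instead of relying on wiki user groups. One JSON object per line keeps the action log easy to append to and to search.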

And also, everything in Requested changes in the task description:

  • Right now, the tool only tracks the files that are taken down. We also want to track the requests that don't lead to a takedown, so there should be a full log. There should likely be an option added for request granted (image removed) or denied (image kept.)
  • The 'Project' dropdown is currently hardcoded, can it be expanded or have an 'Other' that allows for a textbox? (with validation?)
  • Some of the fields should not be mandatory.
    • Anything required by the Lumen API should be required
    • If the request was not complied with, then nothing should be required
  • This should write to SalesForce instead of Sugar. The exact same data should be stored.
  • Potentially add multi-file support. (CP takedown already has this, DMCA very rarely needs it)

Event Timeline

TBolliger updated the task description. (Show Details)
TBolliger set the point value for this task to 5.

This is more of a combined investigation for this ticket and T159467: Investigate Child Protection Takedown Form for requested updates.


The tool is currently a mixture of several different tools, which don't necessarily have a lot in common.

  1. DMCA takedown tool - A form for submitting information about the requested DMCA takedown.
  2. Child protection takedown tool - A form for submitting information about the requested child-protection takedown.
  3. Global search, Global link search, Global text search - Standalone tools to search across all of WMF wikis.
  4. Strategy tools - Also standalone tools for strategic decision making. Very infrequently used.

The first two tools are basically just forms, with some mandatory and some optional fields. There is an option to fetch data from CentralAuth for the file author(s), as well as for the person filling out the form, to auto-populate some of the fields. After the form is submitted, the tool runs a number of checks to make sure the data won't be rejected at the database end, and then sends it to the database. There is a smaller tool to retrieve a particular entry but, from my understanding, it's also infrequently used.

The global search tools let you search for a string/link across all wikis deployed on the Wikimedia cluster. They should be made public for anybody to use. James mentioned that there was no particular reason that they were a part of the same tool besides lack of developer resources at the time of writing them.

The strategy tools contain private data and hence they shouldn't be made public. James expressed interest in keeping them clubbed with the DMCA and CP takedown tools for now.

To answer the open questions...

Will these requested changes require a complete rewrite, or small-scale fixes?

From my understanding of the codebase, the requested changes are not small-scale. The code is more complex than it needs to be. It could use a fair bit of structure (classes etc.). There's a lot of code duplication in places, and the way it handles data (it gathers data from the form, sends it to the database, retrieves it back, and then uses it to structure wikitext for posting on Village pumps etc.) is also unnecessarily complicated. It doesn't use Composer or any sort of external packages as far as I can see. I feel that if we attempt to patch up the current code, it will only lead to more complexity and more time spent trying to understand the current code. I propose doing a rewrite, borrowing code from the current codebase as and when needed.

Where should these tools live — on the existing private server, or ToolLabs?

I don't think there is a need for the tools to live on the private server (since the tool itself doesn't store any private data) but I will confirm this with James.

Should we merge the sub-tool with the main tool, or keep them separate?

I don't see a reason why they are two separate tools for posting on wiki/not posting. I would suggest we make it a checkbox at the end and merge the two tools into one.

Next steps:
  1. Take out the standalone tools from the current repo and deploy them as tool labs tools (maybe a single tool with the three features). This should be relatively simple and useful for the community too.
  2. Make a new tool for the DMCA+CP takedowns + strategy tools
    1. The forms should replicate the functionality of the existing tool.
    2. Merge the two tools (one which posts on wiki and one which doesn't) into one (additional form checkbox).
    3. Pull in data from wikis wherever possible (like information about a commons file/uploader etc.)
    4. Validate data wherever possible
    5. Do away with local accounts and use OAuth (make sure to only allow access to specific staff accounts etc.)
    6. Log actions by users
    7. Retrieval function for given case ID (From the database)
    8. Copy over functionality for strategy tools (If needed?)

And also, everything in Requested changes in the task description:

  • Right now, the tool only tracks the files that are taken down. We also want to track the requests that don't lead to a takedown, so there should be a full log. There should likely be an option added for request granted (image removed) or denied (image kept.)
  • The 'Project' dropdown is currently hardcoded, can it be expanded or have an 'Other' that allows for a textbox? (with validation?)
  • Some of the fields should not be mandatory.
    • Anything required by the Lumen API should be required
    • If the request was not complied with, then nothing should be required
  • This should write to SalesForce instead of Sugar. The exact same data should be stored.
  • Potentially add multi-file support. (CP takedown already has this, DMCA very rarely needs it)

Great write-up, Niharika. I'm fairly confident we can ignore the strategy tools for now. There aren't any requested changes, they're functional, and they're infrequently used.

I don't think there is a need for the tools to live on the private server (since the tool itself doesn't store any private data) but I will confirm this with James.

I believe the CP tool needs to live on a private server because we store a hashed version of the offending image(s).

That's stored in the database, not the tool itself. The "tool" is just the form and processing part. The actual images/hashes are in the DB, as far as I understood it.

The sha1 hashes are stored in the database in the file and filearchive tables. The images are stored on the scaling server. The hashes themselves are not private or sensitive.

I don't think there is a need for the tools to live on the private server (since the tool itself doesn't store any private data) but I will confirm this with James.

@Niharika: Were you able to confirm this with James?

@kaldari, here's the response I got from James when I asked him if the tool can live on tool labs.

While not necessarily against it I'd have to check with legal on that to be sure. While we're not storing anything in the way of private data we are processing it (offending images are uploaded/processed and sent externally for example and IP data is processed) and we'd be storing OAuth login information for people who could use it to do Checkusers or access suppressed information on the sites. I'm not sure what level of exposure we're ok with on labs given that it may be a bit more exposed than, say, production infrastructure.

As for the user information, I don't believe we need to store anything that would be private/sensitive once they have approved the tool, as with our other tools.

Yeah, the OAuth info should not be an issue. That is all handled securely. I'm not sure I understand what James is saying about "offending images are uploaded/processed and sent externally for example and IP data is processed". @Jalexander, could you elaborate on that? When you say that images are uploaded/processed, where is that occurring? Are you uploading copies of the images into the tool itself? Or just referring to the copies on Commons? What do you mean by "sent externally"? Does the tool email a copy of the images to law enforcement? If so, we would need to double check who has access to the Tool Labs email server, but I don't imagine it would be a blocker. Mostly, we're worried about the tool storing its own copy of private/sensitive data. If that isn't happening, it's probably OK to host on Tool Labs.

Thanks for checking. You're right that we upload the file into the tool but we don't do any emails at the moment. They are sent externally via an API post. The highest risk spot for this is the Child Protection form where you upload the offending image to the tool and it sends it to the National Center for Missing and Exploited Children (who shares with law enforcement along with our other data and additional data they may have). The DMCA form has an option for this as well where, if the DMCA was sent as a PDF, it is uploaded and posted to Lumen. That's less of a damaging thing since we're generally posting it to commons anyway after that with some courtesy redaction.

IP data processing is similar in that for the CP form we give checkuser information for the offending users and that's processed and sent off to NCMEC. It's not stored by the tool itself, however, after processing.

... upload the offending image to the tool and it sends it to the National Center for Missing and Exploited Children

@Jalexander: Do you know if the images are stored locally to the file system (even temporarily)? If so, using Tool Labs might be risky.

In its current implementation, yeah, because the PHP uploader stores it as a temporary file (and then, when it submits it to the processing script, does so by telling it the temporary location).

Assuming we don't want to change that workflow, it probably means that we don't want to host the tool on Tool Labs. According to Bryan, it's best to believe that anything in Tools can be seen by anyone else. There are only a small number of people with root access, but lots and lots of people have shell access and local root exploits are possible.

Yeah, we definitely need the ability to send images for the CP tool so unfortunately can't remove it. It's possible we could find a way to do the whole form processing bit client side, including the upload and send, but I remember that at the time I couldn't find an easy way to do so. I fully admit that I wasn't working with the same knowledge or server restrictions so there may be a better way that it could be done and stay within tools.
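
One possible alternative to the temporary-file workflow, sketched in Python purely as an illustration, is to keep the uploaded bytes in memory and hand them straight to whatever performs the outgoing API post. The `send` callable and the size limit are assumptions. Whether the web server or framework spools the request body to disk before application code runs is outside this sketch, and it does nothing about data merely passing through a shared host.

```python
import io


def forward_upload(stream, send, max_bytes=20 * 1024 * 1024):
    """Read an uploaded file into memory and pass it to `send` without writing it to disk.

    `stream` is any binary file-like object; `send` is a hypothetical callable
    that performs the outgoing API post and returns its result.
    """
    data = stream.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError("upload exceeds size limit")
    buffer = io.BytesIO(data)
    try:
        return send(buffer)
    finally:
        # Release the in-memory copy as soon as the post has been attempted.
        buffer.close()
```

The size cap is there because holding uploads in memory trades disk exposure for memory pressure.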

They are sent externally via an API post. The highest risk spot for this is the Child Protection form where you upload the offending image to the tool and it sends it to the National Center for Missing and Exploited Children (who shares with law enforcement along with our other data and additional data they may have).

Are we doing any sort of encryption on the data before sending it? Should we? Also for when the user uploads the offending image to the tool?

As for Tool Labs, I feel like the risk is very low given how few requests the CP tool sees in a year. We might be able to minimize it further by encrypting the data, maybe?
I wonder if @bd808 has anything to add here.

@kaldari and I talked about this ticket a bit on irc yesterday. As he noted in T159898#3290732 my general advice for security in Tool Labs is that you should assume that anyone can see anything that is on disk there. We do what we can to ensure that security best practices are followed both for the virtual machines and the shared tool accounts, but zero-day local root exploits are found and it takes a while to get them patched. Storing or even transiting any sort of PII or sensitive information in Tool Labs is discouraged for this reason.

I also really really really do not want image content related to Child Protection form submissions to exist on the Tool Labs disks ever. This is asking for legal and social problems.

I've moved over the investigation results to the task description. Will create new tasks.

Great job, Niharika.

My only feedback is that I don't think we need to worry about re-writing the strategy tool.

@Niharika: I took the liberty of changing the results of the "Where should this live?" question based on subsequent discussion. Feel free to tweak further.

kaldari updated the task description. (Show Details)

Great job, Niharika.

My only feedback is that I don't think we need to worry about re-writing the strategy tool.

Yeah, good point. I took it out.