Give users a download of their 'Contribution data'
Closed, Invalid · Public

Description

Value proposition

As a user, I want to get a report of the contribution data that the WMF keeps about me. The report, one of two (see T208636), will be in a machine-readable format.

Currently, the plan is to provide data that is similar in content to the Contributions page, i.e., links to the user's contributions rather than the actual content (diffs) of the contributions.

Contents of the report

The report will contain the following information about the user:
  • All edits and logged actions I PERFORMED: e.g., page edits, page creations, page moves, page deletions, thanks, patrols, page protections, etc.
  • DON’T include deleted/suppressed edits or deleted summaries.

Title of the report

When the user saves the report, the filename will follow this format: contribution-data_username_dd-mm-yyyy (where username is the username of the downloader and the date is the date of the download).
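A minimal sketch of how that filename could be built. The function name `report_filename` is hypothetical, and the ticket does not say how usernames containing spaces or slashes should be handled, nor which file extension applies, so neither is addressed here.

```python
from datetime import date


def report_filename(username: str, on: date) -> str:
    """Return a name in the form contribution-data_username_dd-mm-yyyy.

    Assumption: the username is used as-is and no extension is appended,
    since the ticket leaves the output format open.
    """
    return f"contribution-data_{username}_{on.strftime('%d-%m-%Y')}"


# A download made on 7 November 2018 by a hypothetical user "ExampleUser".
print(report_filename("ExampleUser", date(2018, 11, 7)))
```

Taking the date as a parameter, rather than calling `date.today()` inside the function, keeps the naming rule easy to test.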

Does this need QA?

Yes

@Mooeypoo, do we need to divide this up into sub-tasks? E.g., do you need separate methods for the edits I performed vs. things like Thanks (which is a logged action, I suppose)? Or can this stay as one ticket? Also, do you need an investigation on any of this, or is it good to go?

kaldari added a subscriber: kaldari. Nov 7 2018, 7:55 AM

@jmatazzoni, @Mooeypoo - If this task ends up being complicated, for example, if it requires setting up a storage solution, let me know and we will reconsider whether or not it is necessary. From my conversations with Moriel, it sounds like the user data export should be straightforward, but this one may be complicated due to edge cases.

jmatazzoni updated the task description. Nov 7 2018, 9:52 PM

Also see T208636 for why CSV might not be the best idea.

Is there a documented rationale for why CSV specifically was chosen?

There's an idea to split the way we serve the data to users for better support, but we should look into this further. The personal account data should be straightforward and will probably not require any storage solution or any heavy database access, but the contribution data can be very heavy both on database queries and on the resulting file (be it CSV, XML, JSON, or whatever other format we choose to use).

If the query takes a long time and/or is heavy on the database, we might need to use a job queue, which means the user may not immediately get their file. In that case, we'll need to store either the file or the gathered data somewhere temporarily until the user downloads it. This can be a potential problem if we're doing it in production. A user with a high-capacity bot, or a user who is generally very active, can have a huge number of records for us to fetch, parse, and collate.

We could consider splitting the mechanism up so that personal data is given directly (in core or an extension, but within the production cluster), since the database access is minor. For the contribution data, we could consider creating a Cloud tool that collects data from the replicas. This will allow us to use some small temporary storage outside of production, and will reduce load on production wikis in case of high demand or when data for very prolific users is requested. Unlike profile data, contribution data is public and is taken from public records, so there's no privacy issue within Cloud services.

We should take a look at the pros and cons of this and see if it is realistic and acceptable, given the scope of this project, its time constraints, and its goals.

jmatazzoni renamed this task from Give users a downloadable CSV of their 'Contribution data' to Give users a downloadable report of their 'Contribution data'. Nov 11 2018, 7:16 PM

Changed the title to "report" from "CSV." If this is supposed to be machine readable, then there is no reason for it to be CSV.

kaldari triaged this task as Low priority. Mon, Nov 19, 7:46 PM

Setting to low priority after discussing with legal. The main priority is to provide "personal data", i.e. data about a user's account rather than content created by that account.

jmatazzoni renamed this task from Give users a downloadable report of their 'Contribution data' to Give users a download of their 'Contribution data'. Tue, Nov 20, 10:09 PM
jmatazzoni closed this task as Invalid. Wed, Nov 28, 8:01 PM

I’d previously said that as part of this project we would be providing users with a download of their wiki contributions. But the Foundation has decided that such a download is not required, since we already provide users with reasonable access to their contribution data via the Contributions page. Moreover, providing such a download would be a bigger job than we thought it would be, especially for users with a long contribution history. (Because their files could be quite large, we’d have to defer generation of the file, processing the request in the background, and then create a user interface both for downloading the file later and for notifying the user when the file is ready...)

For this reason, the Annual Plan no longer requires a contributions download, and I’m marking this ticket (which I wrote) as invalid.

Tgr added a subscriber: Tgr. Fri, Dec 7, 1:04 AM

"But the Foundation has decided that such a download is not required, since we already provide users with reasonable access to their contribution data via the Contributions page"

Which is not a reasonable format for data download. We also provide access via the API, though, so no problem there.

What about deleted/revdeleted contributions, though? Those are still data about the user (what I write is very much data about me: it can contain personal information, it can be used to identify me via style matching, etc.) but are not accessible to them. I think that still needs fixing, by tweaking permissions so that people can access their own deleted edits.

"Moreover, providing such a download would be a bigger job than we thought it would be, especially for users with a long contribution history. (Because their files could be quite large, we’d have to defer generation of the file, processing the request in the background, and then create a user interface both for downloading the file later and for notifying the user when the file is ready...)"

Users can download their data via the API, which does not have any of those problems. I don't think a nice UI for data download is a requirement. GDPR Article 20 requires us to make user data available in a machine-readable format; it does not mean we have to hold their hands.
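For illustration, a rough sketch of what such an API-based download could look like, using the MediaWiki Action API's `list=usercontribs` module and following its `continue` tokens so that a long history arrives in batches instead of one large file. The endpoint URL, the User-Agent string, and the function names `http_fetch` and `iter_contributions` are assumptions made for this example. It covers edits only; logged actions would need a separate query (for example `list=logevents`).

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator

# Assumption: any wiki's api.php endpoint would work here.
API_URL = "https://en.wikipedia.org/w/api.php"


def http_fetch(params: Dict[str, str]) -> dict:
    """Send one GET request to the Action API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "contribs-export-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_contributions(
    user: str, fetch: Callable[[Dict[str, str]], dict] = http_fetch
) -> Iterator[dict]:
    """Yield a user's edits one by one, following API continuation."""
    params = {
        "action": "query",
        "list": "usercontribs",
        "ucuser": user,
        "uclimit": "500",
        "ucprop": "ids|title|timestamp|comment",
        "format": "json",
    }
    cont: Dict[str, str] = {}
    while True:
        data = fetch({**params, **cont})
        yield from data.get("query", {}).get("usercontribs", [])
        if "continue" not in data:
            break  # no further batches
        cont = data["continue"]  # merged into the next request
```

Because results are streamed batch by batch, the caller can write each record to disk as it arrives, e.g. `for edit in iter_contributions("ExampleUser"): ...`, without server-side file generation or temporary storage.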

(In fact, I would be more worried about the opposite: T208636 makes no mention of exposing that data via an API. Per Article 20(2), the data subject shall have the right to have the personal data transmitted directly from one controller to another, where technically feasible, i.e., the data export needs to be in a form that another MediaWiki installation is able to initiate (with the user's consent). Clicking a button in your preferences does not really meet that requirement; an API available via OAuth would.)