
Allow users to access their user data via an API
Closed, Resolved · Public · 5 Estimated Story Points

Description

Value proposition

As a user, I want to get a machine-readable export of the personal information that the WMF has about me. It should be possible through the MediaWiki API to get a JSON export of the data (after authenticating as the user).

Contents of the report

The API response should provide the following information about the user:

User Data

  • User ID #
  • Username
  • Email address (if we have it)
  • Email verification date (if we have it)
  • Account registration date
  • Timestamp of latest edit (on current wiki)
  • User rights conferred and/or user group memberships (on current wiki)
  • Preferences (on current wiki, including hidden preferences—just a dump)

Most of these are already provided by https://www.mediawiki.org/wiki/API:Userinfo. The ones that aren't are listed below (an illustrative response sketch follows the list):

  • Email verification date (if we have it)
  • Timestamp of latest edit (on current wiki)
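For illustration only, a response containing all of the fields above might look roughly like the sketch below. The field names follow API:Userinfo plus the latestcontrib property discussed later in this task; the values are invented.

```python
# Hypothetical, hand-written example of the desired export, expressed as a
# Python dict. Values are invented; field names are the API:Userinfo
# properties plus the new latestcontrib property.
example_export = {
    "id": 12345,                                   # User ID #
    "name": "ExampleUser",                         # Username
    "email": "user@example.org",                   # Email address (if set)
    "emailauthenticated": "2018-11-01T12:00:00Z",  # Email verification date
    "registrationdate": "2015-06-01T08:30:00Z",    # Account registration date
    "latestcontrib": "2019-03-20T17:45:00Z",       # Timestamp of latest edit
    "rights": ["read", "edit"],                    # User rights (current wiki)
    "groups": ["*", "user", "autoconfirmed"],      # Group memberships
    "options": {"gender": "unknown", "skin": "vector"},  # Preferences dump
}
```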

Does this need QA?

Yes

Event Timeline


I'd add to the chorus that JSON probably makes more sense. CSV is an underspecified format with many different implementations and different escaping rules, making it less portable than JSON. Its encoding is also underspecified, making it less compatible/portable than JSON, and it assumes tabular data, which probably makes it less suitable for most of the use cases here than JSON. The only real benefit is that it's very easy to import into a DB or Excel and has excellent support in legacy applications.

I agree with @Bawolff. This data doesn't seem very suitable for csv.
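To make the escaping/structure point concrete, here is a minimal sketch (the data is invented for the example) showing how the same nested user data round-trips cleanly as JSON, while CSV forces a flattening step and a dialect-dependent quoting choice:

```python
import csv, io, json

# Hypothetical slice of a user-data export: nested, non-tabular values.
user_data = {
    "username": 'Example "User", Jr.',
    "preferences": {"gender": "unknown", "skin": "vector"},
}

# JSON round-trips the structure and escaping unambiguously.
as_json = json.dumps(user_data)
assert json.loads(as_json) == user_data

# CSV forces the data into rows and leaves quoting up to the dialect;
# nested values have to be flattened or stringified by hand.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)  # one of several dialect choices
writer.writerow(["username", "preferences"])
writer.writerow([user_data["username"], json.dumps(user_data["preferences"])])
print(buf.getvalue())
```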

jmatazzoni renamed this task from Give users a downloadable CSV of their "User Data' to Give users a downloadable report of their "User Data'. Nov 11 2018, 7:20 PM

Changed the title to "report" from "CSV." If this is going to be machine-readable instead of human-readable, then there is no reason for CSV.

In the Steering meeting yesterday, Moriel made a decision to go with CSV. But we should keep an eye on this; if it becomes clear that it's a big job we can reverse course.

The decision to go with CSV came out of a discussion about the user being able to immediately load it into a spreadsheet. If our goal is for the user to use the data, then CSV (or, perhaps, XML?) is the right way to go. But if the goal is machine-readable output, then JSON is probably the better choice.

I'm a little wary of XML for the contribution data (the other ticket) because XML has a lot of extra overhead content, so if the content is already large, adding a bunch of overhead will make the file itself a lot bigger. That, however, is mostly only true for outputs with a potentially large amount of data, like the contribution list.

I'm not really sure what the expected use case is for this feature. If it is expected that the majority of people will consume this data using spreadsheets, then I agree that CSV may make sense, regardless of my distaste for it as a data interchange format.

I'm a little wary of XML for the contribution data (the other ticket) because XML has a lot of extra overhead content, so if the content is already large, adding a bunch of overhead will make the file itself a lot bigger

One thing to keep in mind is that XML will probably compress very well, although I think JSON is more popular now for basically everything people used to use XML for.

To be honest, this task confuses me a bit in general. The stated value proposition is "As a user, I want to get a report of the personal information that the WMF has about me", but it then goes on to talk about (with the exception of email & preferences) information that is all a matter of public record. I assume that a user wanting all the personal information the WMF has on them would be primarily interested in personally identifying information that is kept privately by the WMF, and not so much the public contents of Special:Contributions and similar pages.

The decision has already been made to go with JSON. The original confusion came from a misunderstanding about whether we needed a human-readable or a machine-readable format.

As for the purpose of this, it's just one of a series of features for users to extract their data.

It should probably be made very clear to users exactly what data they are not being provided. E.g. it sounds like this will not include their CU data, and various deleted/suppressed things about them that are actually still held by the site.

@Krenair That's a good point I think. An included readme/explanation describing the boundaries of the included information seems wise.

P.S. I'd advise also including file uploads in such a report.
And we have the optional gender... that probably also belongs in there.

I want to point out that the word "report" here might be problematic. We aren't going to deliver a report but will provide the actual stored data. So, it's not just a list but is more like a data dump.

Hopefully, that makes it a bit clearer about the goals here.

Okay but crucially it's not a data dump of *all* data relating to them, is it? There are some deliberate omissions?

It's the first iteration of such a tool. Future iterations may get closer to *all* than this first iteration will.

In other words, we're endeavoring to deliver data that is reasonably gathered and then add new sources of data as we get feedback and solve some technical issues (like how to give users large amounts of data).

Okay. Will it be clear to the users that this is essentially a beta with data possibly missing? And that it is known that certain types of data are missing (deliberately excluded) and that some data may be missing in error, with more to come in future?

I wouldn't call it a beta personally but I take your point.

I'll leave it to @jmatazzoni to answer how we'll message users about what the tool does.

jmatazzoni renamed this task from Give users a downloadable report of their "User Data' to Give users a download of their "User Data'. Nov 15 2018, 4:39 PM

I've changed the name of the ticket again, to be more accurate and less "report-y".

@kaldari Can we give people the IP address associated with their account (at present—I believe this changes, right)?

In T208636#4742023, @TheDJ wrote:

And we have the optional gender... that probably also belongs in there.

That will be part of your Preferences, which are included.

@jmatazzoni - That proposal was rejected by the Architecture Committee (see T387). Otherwise, we could have just deployed the AccountInfo extension.


Do you mean about the IP address?

Yes, the Architecture Committee decided against making users' IP addresses available. You can read through T387 for the gory details. It could perhaps be revisited, but that was the last decision on the matter.

kaldari renamed this task from Give users a download of their "User Data' to Give users a download of their "User Data". Nov 20 2018, 7:48 PM

Rather than simply sending a JSON file to the user, I think we should provide a .zip file with one folder per wiki (see the sketch after this list):

  • Users expect "their data", and a zip is an archive format (maybe a silly expectation, but I guess that for naive users it "makes sense").
  • The help page T210007 (or a simplified version of it) can be included with the data. The user won't need to go hunting for the format used on production when looking at the user data they downloaded 20 years 2 months ago.
  • These archives will get stored as backups (probably not even read). It makes sense to provide them in a compressed format.
  • Extensible. We can easily add or remove pieces, and not just between software versions. Choosing between downloading just the user data (T208636), the contributions data (T210007) or both is only a matter of which files are included in the zip.
  • Supports multiple formats: user preferences are clearly JSON, while a watchlist or list of edits could be a .txt with URLs separated by CRLFs.
  • Multiwiki: the WMF may provide everything in a single dump (at least, this would be the user expectation), while third-party installs will be single-wiki. If multiwiki downloads are just N folders, a single wiki is just one folder, and the contents are otherwise the same, that's much easier than having a completely separate format (wrap everything into an array?) when there are multiple wikis. [Naturally, for the sake of efficiency, we will want a radio button defaulting to "just this wiki, please" in the UI]
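A minimal sketch of the folder-per-wiki archive idea, assuming invented file names and example data (this is one possible layout, not a decided format):

```python
import json
import zipfile

# Hypothetical per-wiki export data; the file names and layout are just one
# possible interpretation of the folder-per-wiki suggestion above.
exports = {
    "enwiki": {"user_data.json": {"id": 12345, "name": "ExampleUser"}},
    "commonswiki": {"user_data.json": {"id": 67890, "name": "ExampleUser"}},
}

with zipfile.ZipFile("user_data_export.zip", "w", zipfile.ZIP_DEFLATED) as archive:
    # A top-level README describing what is (and is not) included, per the
    # earlier comments about documenting the boundaries of the export.
    archive.writestr("README.txt", "Export format documented at <help page URL>.\n")
    for wiki, files in exports.items():
        for filename, payload in files.items():
            archive.writestr(f"{wiki}/{filename}", json.dumps(payload, indent=2))
```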

Thanks for those suggestions @Platonides. At this point our goal is to get this out expeditiously. Once we see how much interest users have in this feature, we’ll know how far we should go to make improvements—and which improvements people value most.

Cool project! There should probably be some hook system so extensions can add in more info (and maybe form controls so the user can select which type of data they want, although that might be overcomplicating it). On Wikimedia wikis that extension would be ReadingLists (maybe Echo, although notifications do not seem like useful data; maybe CheckUser, but T387 would have to be renegotiated; maybe ContentTranslate which I believe has private drafts).
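MediaWiki hooks are implemented in PHP, but as a language-agnostic illustration of the idea, here is a sketch of a provider registry where each extension contributes its own slice of the export. Every name here is hypothetical; this is not an existing MediaWiki hook.

```python
from typing import Callable, Dict

# Hypothetical registry standing in for a hook such as "UserDataExport"
# (name invented here); each extension registers a callable that returns
# its own slice of the export.
providers: Dict[str, Callable[[int], dict]] = {}

def register_provider(name: str, provider: Callable[[int], dict]) -> None:
    providers[name] = provider

def build_export(user_id: int) -> dict:
    # Core data first, then whatever each registered extension contributes.
    export = {"core": {"id": user_id}}
    for name, provider in providers.items():
        export[name] = provider(user_id)
    return export

# Example: a reading-lists extension contributing its (empty) data.
register_provider("readinglists", lambda user_id: {"lists": []})
print(build_export(12345))
```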

Yes, the Architecture Committee decided against making users' IP addresses available. You can read through T387 for the gory details. It could perhaps be revisited, but that was the last decision on the matter.

It was discussed long before GDPR though so the landscape has definitely changed since then.

This may be obvious. We should have a check deep in the class/function that is grabbing this data to ensure that the user data we are gathering is for the same user that is requesting it. This check should likely happen at multiple layers in the code so we can be as defensive as possible in preventing leakage.

We should think more about how to deal with the possibility that gathering this data will take more time than we want to block the user's browser for. Additionally, if we are collecting all this data into memory before writing it back to the browser as a CSV download or whatever, there could be some performance concerns there.

In an ideal world, we'd offload this to an async process and store the file in a URL-addressable object storage system and then have the UI simply show the user a link when the file is available. However, we (the Foundation) are missing some technical pieces to make that easy to do. So, we may need to get creative.
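A rough sketch of the layered ownership check described above; class and method names are invented for illustration, and the real implementation would live in MediaWiki's API and permission framework:

```python
class UserDataExporter:
    """Sketch of the defence-in-depth ownership check; names are hypothetical."""

    def __init__(self, requesting_user_id: int):
        self.requesting_user_id = requesting_user_id

    def export(self, target_user_id: int) -> dict:
        # Layer 1: check at the entry point that users only request their own data.
        if target_user_id != self.requesting_user_id:
            raise PermissionError("Users may only export their own data.")
        return self._gather(target_user_id)

    def _gather(self, target_user_id: int) -> dict:
        # Layer 2: repeat the check where the data is actually assembled,
        # so a future caller that skips the outer layer still cannot leak data.
        if target_user_id != self.requesting_user_id:
            raise PermissionError("Ownership check failed in data-gathering layer.")
        return {"id": target_user_id}
```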


@aezell, do you want to add something about these requirements to the Description, above, so that they become part of the spec?

Also, thanks for mentioning the idea of user feedback. If you think this might take a while, I will ask Prateek about adding a "working" animation to T208889.

In an ideal world, we'd offload this to an async process and store the file in a URL-addressable object storage system and then have the UI simply show the user a link when the file is available.

The upload stash could probably be abused to do that.
More generally, the file could be put in some private container in Swift and then proxied by MediaWiki.

The upload stash could probably be abused to do that.

I'll have to look up more about that. I don't know what that is.

More generally, the file could be put in some private container in Swift and then proxied by MediaWiki.

I wasn't aware that we maintained a Swift cluster anywhere.

This line of thought makes me wonder about how to make this reasonable for MediaWiki users that don't have such systems in place.

We serve images and thumbnails from Swift (there are docs but they are pretty bad). There is an abstraction layer (FileBackend), so on normal wikis it just goes to disk. I'm not sure how easy it is to manage visibility via that abstraction layer (@aaron probably knows more; he is the primary author of the file backend code). Then again, this functionality looks like it will not really be widely applicable, given the aversion to addressing real-world use cases, so just name the extension WikimediaSomething and don't bother with third-party users, IMO.

The upload stash is used for chunked uploads (required above some size limit, 100MB if I remember correctly) - the chunks are assembled in a staging area, and then a separate API request is used to publish the file. (Again, docs exist but are poor.) There is some cronjob to clean up old files. In theory the stash is private, in practice it is not really security-sensitive (it is used to stash files which are going to be published soon) so if it is used, someone from the security team should probably give it a long hard look... Also we have an entirely different set of security checks for malicious file content, and I'm not sure if those are applied on publishing or on uploading (although that's only a problem if the user data export format is something you couldn't upload as a public file). @Bawolff might be the person who knows this area best. Worst case, you can use it as a template of how to interact with FileRepo/FileBackend for storing private files.

This may be obvious. We should have a check deep in the class/function that is grabbing this data to ensure that the user data we are gathering is for the same user that is requesting it. This check should likely happen at multiple layers in the code so we can be as defensive as possible in preventing leakage.

We should think more about how to deal with the possibility that gathering this data will take more time than we want to block the user's browser for. Additionally, if we are collecting all this data into memory before writing it back to the browser as a CSV download or whatever, there could be some performance concerns there.

I agree with that completely, but isn't that why we originally agreed to only serve straightforward data that, for the most part, is already viewable on screen (and whatever isn't seems to be quick to query from the database)?
I think this would become a big problem if we allow any other data to be added in.

That's to say: if we decide that we are only interested in basic user data (which is what appears in the spec) and our priority for the start is a simple, quick data fetch, then we can limit what data we include in the dump to fit this functionality, rather than try to come up with ways to store/serve files so that we can include bigger data.

Bigger data would also mean bigger loads on our database, so the solutions we'd need would likely involve more than "just" serving the file asynchronously.

I suggest we gather a set of this data for a "representative" user via manual means (SQL queries) and see just how much data this is before we go into all this work.

I suspect the actions logged about a user may be the first thing we cut here for expediency.

My suggestion is to drop logged actions completely. As far as I know and as far as I've seen, logged actions are indexed by the performer, not by the user that the action was performed on, which means that even collecting this data will be risky performance-wise, and it will increase the level of complexity to consider async delivery and then storage/delivery of files, etc.

  • Without logged actions, from a quick glance over the other spec, this is a fairly straightforward operation that would most likely be done in memory and served to the user immediately.
  • With logged actions performed on the user, we increase complexity, performance concerns, storage/serving systems, etc., which would make the effort about 3x harder and more elaborate.

Considering the note about whether it's "hard", I leave it to @jmatazzoni to decide whether the product requires it or not.


@kaldari and @DannyH, the passage below, about tracking logged actions performed ON the user, was originally part of the Description of this task. As you can see, the last bullet point records your advice, back when we were formulating the plan, that if this feature turned out to be difficult it could be dropped from the spec. Based on Moriel's note above, which says this project is "3x harder and more elaborate" with that feature, I have removed it from the spec. Please let me know if you have any comments re: dropping this.

Logged Actions done TO THE USER, e.g.,

  • User group changes (including comments unless suppressed) [There is user_former_groups that we can grab this from]
  • Number of times blocked; info about the blocks (including comments unless suppressed)
  • [Note: logged actions performed on the user may be difficult to round up. We should check how feasible this is, but Danny and Ryan affirm that if it is hard we can leave it out.]

I removed these two requirements from the Description above and am moving them to a separate ticket.

  • Global user groups joined (e.g. global interface editor)
  • Wikis that the user has an account on [Not absolutely required, but the idea is to tell them this as an alternative to actually going and fetching all the global data]

Joe: It's totally okay to leave logged actions out. We can revisit this decision post-release, if we hear from users that they really want logged actions. I think it's likely that we won't get many post-release feature requests.

kaldari renamed this task from Give users a download of their "User Data" to Create an API for a user to download their "User Data". Mar 19 2019, 11:00 PM
kaldari updated the task description.
kaldari updated the task description.
kaldari updated the task description.
kaldari renamed this task from Create an API for a user to download their "User Data" to Allow users to access their user data via API. Mar 19 2019, 11:38 PM
kaldari updated the task description.

Change 498520 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/core@master] Add meta=userinfo&uiprop=latestcontrib

https://gerrit.wikimedia.org/r/498520

The latest-contribution patch is above; meanwhile, it turns out the email confirmation date is already returned when the email itself is requested.

jmatazzoni renamed this task from Allow users to access their user data via API to Allow users to access their user data via an API. Mar 26 2019, 10:05 PM

Change 498520 merged by jenkins-bot:
[mediawiki/core@master] Add meta=userinfo&uiprop=latestcontrib

https://gerrit.wikimedia.org/r/498520

dom_walden subscribed.

latestcontrib will return the last entry for that user or IP in the revisions table.

@jmatazzoni It does not matter if that revision has been deleted. You can change the visibility of the latest revision so that it does not appear in Special:Contributions/$user, but the date of that revision will be returned by the API. Is that OK?

When does https://www.mediawiki.org/wiki/API:Userinfo need to be updated? (And is there other documentation that needs updating?)

Value proposition

As a user, I want to get a machine-readable export of the personal information that the WMF has about me. It should be possible through the MediaWiki API to get a JSON export of the data (after authenticating as the user).

Uses the standard JSON response from the API. The date format of latestcontrib is the same as other date formats in the API.

If a user has made no contributions on the wiki they are making the API call to, the latestcontrib field will not be returned.

This is consistent with some fields (such as emailauthenticated and registrationdate if the user is anonymous). But, this is not consistent with other fields (such as email where {"email": ""} is returned for anonymous users). I don't know if we are working to a particular standard here, or if this fact just needs to be documented.
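For QA purposes, a minimal client-side check of the "field absent vs. empty string" behaviour described above might look like the sketch below; it assumes the standard action=query JSON envelope, and authentication/session handling is omitted:

```python
import requests

# Query the fields under test against the beta cluster endpoint cited below.
resp = requests.get(
    "https://en.wikipedia.beta.wmflabs.org/w/api.php",
    params={
        "action": "query",
        "meta": "userinfo",
        "uiprop": "latestcontrib|email|registrationdate",
        "format": "json",
    },
)
userinfo = resp.json()["query"]["userinfo"]

# latestcontrib is simply absent when the user has no edits on this wiki,
# whereas email comes back as an empty string for anonymous users.
if "latestcontrib" in userinfo:
    print("Latest edit:", userinfo["latestcontrib"])
else:
    print("No contributions on this wiki.")
```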

Contents of the report

The API response should provide the following information about the user:

User Data

  • User ID #
  • Username
  • Email address (if we have it)
  • Email verification date (if we have it)
  • Account registration date
  • Timestamp of latest edit (on current wiki)
  • User rights conferred and/or user group memberships (on current wiki)
  • Preferences (on current wiki, including hidden preferences—just a dump)

...

  • Email verification date (if we have it)
  • Timestamp of latest edit (on current wiki)

I have checked that all of these can be returned by a call such as: https://en.wikipedia.beta.wmflabs.org/w/api.php?action=query&meta=userinfo&uiprop=latestcontrib|email|registrationdate|rights|groups|options