
Omit private data from being generated during dump runs
Closed, Resolved · Public

Description

I was just looking through the dumps that we generate and store (e.g., https://dumps.wikimedia.org/enwiki/20161120/) and noticed that many items are marked "(private)" and unlinked. IRC conversation with @ArielGlenn revealed that the files are unlinked and stored in a separate directory outside of the web tree.

Is it possible to avoid generating these archives containing private data, so that they never hit the server?

Event Timeline

Restricted Application added a subscriber: Aklapper.Nov 30 2016, 6:06 PM
ArielGlenn added a project: Dumps-Generation.
ArielGlenn changed the visibility from "Custom Policy" to "All Users".Nov 30 2016, 6:14 PM
ArielGlenn added a project: Analytics.

I've added Analytics to see if they know of any users of these files on the stats* hosts (they would have to be rsynced specially from the dataset host). If they don't, I'll update the code so we skip generation.

@ArielGlenn: Was your visibility change on this task intentional? If so, please set it to "Public" instead of "All users".

Yes it was, ok fixing.

ArielGlenn changed the visibility from "All Users" to "Public (No Login Required)".Nov 30 2016, 6:18 PM
Bawolff added a subscriber: Bawolff.Dec 1 2016, 4:13 AM

Aren't these important as a sort of backup of last resort?

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.Dec 1 2016, 12:07 PM

Change 324702 had a related patch set uploaded (by ArielGlenn):
allow dumps of private tables to be skipped via config setting

https://gerrit.wikimedia.org/r/324702

Aren't these important as a sort of backup of last resort?

I don't think so. We have db backups for that. @jcrespo can say more about those.

The gerrit changeset could be merged at any time; it's been tested, and won't actually change anything without a corresponding update of the configs in puppet.
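
For anyone following along, here is a minimal sketch of what "skipped via config setting, no change until the config is updated" could look like. It is not the code from the changeset: the section name `misc`, the option name `skip_private_tables`, and the function `tables_to_dump` are all hypothetical, and the table list is only the one named in this task's discussion.

```python
import configparser

# Private tables as named in this task's discussion; the real job list may differ.
PRIVATE_TABLES = {"user", "watchlist", "ipblocks", "archive", "logging", "oldimage"}


def tables_to_dump(all_tables, config_text):
    """Return the tables to dump, skipping private ones only when configured."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Hypothetical section/option names. The fallback is False, so behaviour
    # stays the same until the deployed config explicitly enables the skip.
    skip_private = parser.getboolean("misc", "skip_private_tables", fallback=False)
    if not skip_private:
        return list(all_tables)
    return [t for t in all_tables if t not in PRIVATE_TABLES]


if __name__ == "__main__":
    tables = ["page", "user", "logging", "categorylinks"]
    # No setting present: everything is still dumped.
    print(tables_to_dump(tables, ""))
    # Setting enabled: private tables are never generated.
    print(tables_to_dump(tables, "[misc]\nskip_private_tables = true\n"))
```

Defaulting to "off" is what lets the code change be merged ahead of the config change without altering any running dump.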

I don't think so. We have db backups for that. @jcrespo can say more about those.

I trust only the dumps as a last resort backup for external storage and very old revision data; for relational data (users, groups, metadata, etc.) we have 6-month retention of weekly backups.

Nuria added a subscriber: Nuria.Dec 5 2016, 4:45 PM

We have no knowledge as to what those files are used for. Untagging Analytics.

I've sent email to the analytics and research internal lists asking if anyone uses these files, so they can be directed to other forms of the data.

Pine added a subscriber: Pine.Dec 7 2016, 6:59 PM

Hi @jcrespo. Does the relational data consist partially or fully of information that is considered "personal information" under https://meta.wikimedia.org/wiki/Data_retention_guidelines?

leila added a subscriber: leila.Dec 7 2016, 8:39 PM

I was wrong in my previous statement (it was an informal question, and I
didn't check the exact figure). Backup retention is 60 days, not 6 months;
you can see from the history that it is not a recent change:

https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/role/manifests/backup/director.pp;e46f56e88de3191731440c162b622c40bfaa650a$48

It is currently technically impossible to have a longer retention
with the existing backup capacity:

https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=helium&var-network=eth0&panelId=17&fullscreen
https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=heze&var-network=eth0&panelId=17&fullscreen

Database backups do not create or gather any data beyond what already exists
live in the current database. The only non-public, not-willingly-provided PII
stored there is rc IPs, which are kept for only a month. Backups are not
accessible to anyone except site operators, and are stored using strong
encryption. Recovery requests are logged for audit.

Also, from the page you refer to:

Exceptions to these guidelines
Data may be retained in system backups for longer periods of time, not to
exceed 5 years.

Those 5 years currently apply only to long-term, one-time archival of files
(not to databases).

Ah. I got confused and thought we were talking about dumps of private wikis (instead of the private tables of public wikis). We're specifically talking about user, watchlist, ipblocks, archive, logging, and oldimage on public wikis, AFAICT. I'd be a little worried about losing the long-term-if-everything-else-fails-and-the-world-explodes backups of page text of private wikis, but it doesn't seem like there is much to worry about for the tables that this bug is about, and the fewer places the info in the user table is unnecessarily saved to, the better.

Bawolff triaged this task as Normal priority.Dec 13 2016, 9:35 PM

Change 324702 merged by ArielGlenn:
allow dumps of private tables to be skipped via config setting

https://gerrit.wikimedia.org/r/324702

Everything is merged and deployed and will be in effect for tomorrow's run. I'll keep an eye on it to make sure it runs properly.

Please also have a look at T153633 concerning central auth table dumps; I'd like to get rid of those too if possible.

Nuria removed a subscriber: Nuria.Dec 19 2016, 6:56 PM
ArielGlenn closed this task as Resolved.Dec 20 2016, 12:22 PM

Verified that private tables are no longer being dumped. Closing.

ArielGlenn moved this task from Active to Done on the Dumps-Generation board.Dec 20 2016, 12:22 PM