Page MenuHomePhabricator

Sanitise a Bugzilla database dump
Closed, ResolvedPublic

Description

Dump available at http://dumps.wikimedia.org/other/bugzilla/


A sanitised database dump for Bugzilla is needed to keep some functionality that Phabricator does not support.

Examples are storing saved queries, votes and default CCs. This dump would also allow some tool to be made on labs which keeps the functionality while Phabricator lacks the desired need.

http://git.mozilla.org/?p=webtools/bmo/bugzilla.git;a=blob;f=contrib/sanitizeme.pl;hb=HEAD seems like a viable option. Needs ops review and then Sean I assume needs to weigh on some how so I am just CCing just in case.

Event Timeline

JohnLewis renamed this task from Sanitise a Bugzilla database du,p to Sanitise a Bugzilla database dump.
JohnLewis raised the priority of this task from to Low.
JohnLewis updated the task description. (Show Details)
JohnLewis changed Security from none to None.

The script looks reasonable. Where's a good place to test it though? It feels wrong to just use db1001 but also to copy the private data to labs or my personal laptop.

while testing the script:

./sanitizeme.pl --dry-run

  • dry run : no changes to the database will be made **

Deleting product 'Security'...
DBD::mysql::st execute failed: The MariaDB server is running with the --read-only option so it cannot execute this statement [for Statement "INSERT INTO audit_log (user_id, class, object_id, field,

so "no changes" and "dry-run" means "but i still have to INSERT" ? lol?

now that we can write:

Deleting product 'Security'...
Deleting 0 bugs in security groups...
Done
DBD::mysql::db selectcol_arrayref failed: Table 'bugzilla.longdescs_tags' doesn't exist [for Statement "SELECT comment_id FROM longdescs_tags WHERE tag='deleted'"] at ./sanitizeme.pl line 140
main::delete_deleted_comments() called at ./sanitizeme.pl line 85
eval {...} called at ./sanitizeme.pl line 82

it's from:

sub delete_deleted_comments {
    # Delete all comments tagged as 'deleted'
    my $comment_ids = $dbh->selectcol_arrayref("SELECT comment_id FROM longdescs_tags WHERE tag='deleted'");

if i comment out delete_deleted_comments in main the next error is:

Table 'bugzilla.bug_user_last_visit' doesn't exist

if i also comment delete_bug_user_last_visit, the rest appears to work


  • dry run : no changes to the database will be made **

Deleting 0 bugs in security groups...
Done
Deleting 'insidergroup' comments and attachments...
Deleting security groups...
Deleting sensitive user account data...
Removing attachment data to preserve disk space...
Turning off email delivery...
All done!

our version: 4.4.5 - script says "Last validated against Bugzilla version 4.0"


06:52 -!- Irssi: Join to #bugzilla was synced in 0 secs

06:55 < mutante> hi. i'm trying to use "sanitizeme.pl" to sanitize a BZ database and remove security/private data. running it on a version 4.4.5 and the script is

"4.2/view/head:/contrib/sanitizeme.pl". i get 2 errors about tables not existing: longdesc_tags and bug_user_last_visit

06:55 < mutante> is there a sanitizeme for 4.4 ? or can i skip those? or?
06:55 < mutante> http://bzr.mozilla.org/bmo/4.2/view/head:/contrib/sanitizeme.pl

tried with the newer version that John found and after he hacked it a bit to skip the missing longdesc thing.

now this though:

@zirconium:/srv/org/wikimedia/bzsane# ./sanitizeme317.pl --dry-run
** dry run : no changes to the database will be made **
Deleting 0 bugs in security groups...
Done            
DBD::mysql::db selectcol_arrayref failed: Unknown column 'tag' in 'where clause' [for Statement "SELECT comment_id FROM longdescs WHERE tag LIKE 'deleted'"] at ./sanitizeme317.pl line 154
	main::delete_deleted_comments() called at ./sanitizeme317.pl line 101
	eval {...} called at ./sanitizeme317.pl line 98

the diff here was: https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-PSTE-ta6karxzgl6nfwb/

John,

here's the longdescs table:

mysql:bugs@db2030.codfw.wmnet [bugzilla]> describe longdescs;
+-----------------+--------------+------+-----+---------+----------------+
| Field           | Type         | Null | Key | Default | Extra          |
+-----------------+--------------+------+-----+---------+----------------+
| bug_id          | mediumint(9) | NO   | MUL | NULL    |                |
| who             | mediumint(9) | NO   | MUL | NULL    |                |
| bug_when        | datetime     | NO   | MUL | NULL    |                |
| work_time       | decimal(7,2) | NO   |     | 0.00    |                |
| thetext         | mediumtext   | NO   |     | NULL    |                |
| isprivate       | tinyint(4)   | NO   |     | 0       |                |
| already_wrapped | tinyint(4)   | NO   |     | 0       |                |
| comment_id      | int(11)      | NO   | PRI | NULL    | auto_increment |
| type            | smallint(6)  | NO   |     | 0       |                |
| extra_data      | varchar(255) | YES  |     | NULL    |                |
+-----------------+--------------+------+-----+---------+----------------+
10 rows in set (0.04 sec)

made this modification:

https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-PSTE-laovvu2r7b335db/

to work around the missing "tag" field, supposing it's "isprivate" instead (but there are none that are private anyways).

then ran the script again with "--trace" and "--exec" and this is the result:

https://phabricator.wikimedia.org/P320

without all the extra output from "--trace" it's just:

Deleting 0 bugs in security groups...
Done            
Deleting 'insidergroup' comments and attachments...
Deleting security groups...
Deleting sensitive user account data...
Removing attachment data to preserve disk space...
Turning off email delivery...
All done!
This comment was removed by Dzahn.

and here's a dump of just the table structure without actual data, made with mysqldump --no-data

Leaving my ideas here quickly before I go for @Dzahn (and public reasons):

Get group_id of the security group and then using that id, associate it with bug_id from the "group_bug_map" table. This should then give a list of all bugs considered "private" or "security". Then modify the paste above to use a "SELECT bug_id FROM group_bug_map WHERE group_id = $security_id". Then join these up and the script should be usable as it will delete the bug data from the database.

Verify this is correct before doing as this is my guess with the limited documentation and lack of access to how the bugzilla database looks like in production.

DBD::mysql::db selectcol_arrayref failed: Unknown column 'tag' in 'where clause'

Tags were introduced in 4.4 and are private and per-person. Can only be seen when *logged in*. As the static version won't support login this should be dropped.

DBD::mysql::db selectcol_arrayref failed: Unknown column 'tag' in 'where clause'

Tags were introduced in 4.4 and are private and per-person. Can only be seen when *logged in*. As the static version won't support login this should be dropped.

The problem is that the script is _trying_ to drop them but the column does not exist in the database. (and this was on the 4.4.5 database), so either they were introduced after 4.4.5 or the db schema has changed.

To get this moving (again), here is a new plan of attack:

  • Get the bugid of all security bugs (as the schema is seriously out of date, just grep the docroot for all bugid with an error message as these will be security ones)
  • Create a list of the ids and drop the data from _ALL_ tables where the bug_id is specified so there is no data left with these.
  • Run the sanatise script on a database with the security bugs removed.
  • Done! (security reviews or whatever is deemed necessary of the dump after)

grepping the docroot over trying to figure out the undocumented db schema is a good idea. let's do that. would you have one specific number as an example where i can find the string to look for?

  • Get the bugid of all security bugs

P440 if you want to hardcode (accessible by mutante, not sure how paranoid I should be)

would you have one specific number as an example

See my Paste above (if I understand correctly, if not: ignore me).

Thank you very much Andre, that was helpful and what we wanted.

and this is the way John meant, basically. and it let's us verify this neatly:

[zirconium:/srv/org/wikimedia/static-bugzilla] $ grep "You are not authorized to access bug" bug* | cut -d " " -f12

i amended to the same pastebin

.. and compared our 2 lists with diff after some formatting. it's confirmed identical.

Does the script keep emails in place? Without user IDs, the dump is not worth much. (The link in task description goes nowhere.)

Does the script keep emails in place? Without user IDs, the dump is not worth much.

Depends how strong your use case is vs. the privacy complaint that Bugzilla users brought up for all those years in T2148: Bugzilla email address privacy concerns (together with "requires a separate login" it felt like the most popular complaint).

(The link in task description goes nowhere.)

Thanks, I've updated the link (upstream switched from bzr to git so links broke).

Does the script keep emails in place?

Currently the plan is to include the user ids in the original dump but whether they stay for the public dump depends on how they're viewed in terms of privacy and if they fall under 'removing private data' as to me they are public and acting like they aren't is unhelpful and unnecessary as we all see the email addresses anyway.

This might be a duplicate of T30339 or at least almost. Because only a sanitized db could be used in labs. I don't think actual replication is going to happen though, just this dump since now the db is not going to change anymore.

#!/bin/bash
# delete security bugs from bugzilla database for T85141
# secbugs.list is P440
# dzahn 20150407

declare -a tables=(bugs bugs_activity bugs_fulltext cc longdescs)

for bugid in $(cat secbugs.list); do

    for table in ${tables[@]}; do
        echo "DELETE from ${table} where bug_id=${bugid};"
    done

done
 ./delsecbugs.sh > delsecbugs.sql
mysql -u bugs -p -h db2030.codfw.wmnet --database bugzilla < delsecbugs.sql

@Eloquence @csteipp

I have attempted to sanitize the Bugzilla database by removing security bugs. The goal would be to be able to put a SQL dump file on dumps.wm.org.

Any thoughts on how i should have this verified before i go ahead? I briefly talked to Ori and he suggested i add you guys first and later Luis as well. I have pastebins with the table schema (as created by mysqldump, there is no official one),a list of security bug id's and could provide sample data.

@Dzahn, how did you generate secbugs.list?

In general, greping for a few of the security bug titles in the final output would be a good final check.

I'm assuming you stripped out logincookies and tokens tables too? And stripped out any hidden comments?

For emails, I'd lean towards not including them, since users can't now change the email address associated with their activity, which I think you could do when the system was live?

@Dzahn, how did you generate secbugs.list?

So we have https://static-bugzilla.wikimedia.org/ and that was created by a script which just scraped every single bug ID and it wasn't logged into Bugzilla, so we knew it could just get public stuff. It just incremented the bug ID so i could just

grep "You are not authorized to access bug" bug*

through all of the resulting HTML files.

Addtionally, @Aklapper provided me with a list of IDs (not sure how he created that) and i compared that to my list with diff and they were identical.

P440 and T85141#1155787

I'm assuming you stripped out logincookies and tokens tables too? And stripped out any hidden comments?

We did that using the "sanitizme.pl" script . -> P317 (had to apply this fix: https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-PSTE-laovvu2r7b335db/)

P320 lines 89 to 102

Addtionally, @Aklapper provided me with a list of IDs (not sure how he created that)

Query in old-bz for all tasks in the "Security" product and then exporting the IDs via CSV.

@Slaporte is the tech liaison from legal now that Luis runs community. Daniel and/or Andre, my recommendation would be that you set up a quick time with him to walk him through this data release. If he sees no flags, this is fine with me.

Still pending an approval form @Slaporte (or anyone else from legal who deals with data release).

Still pending an approval form @Slaporte (or anyone else from legal who deals with data release).

You can proceed with the data release, as it includes no non-public personal information.

Using sanitizme.pl seems like the right way to redact this. If that was run, then should be ok for security bugs and deleted comments.

@JohnLewis and @Dzahn asked me to take look over the dumped DB from a sensitive information perspective.

A few thoughts:

  • we should wipe the profile_setting table unless there is some reason to reveal it. This is user specific information and not bug specific.
  • I am not comfortable / can't find my way to a good reason to reveal all of these users emails. It seems we don't require them for the[[ https://static-bugzilla.wikimedia.org/show_bug.cgi?id=72256 | static dump ]] and every user is already uniquely identified by userid. A lot of security and responsible stewardship boils down to the ease of enumeration for user's personal data and here we are changing this factor completely. Before, it was true that a registered user could reveal the immediate user's emails for particular bugs but that is a vast distance from being able to grep all 20014 emails in seconds.

Let's truncate these emails at the @ (login_name field in the profiles table) which I see as the responsible private data stewardship thing to do.

I briefly discussed this with @MoritzMuehlenhoff and AFAIK he is in agreement and he also pointed out that if a need for revealing user emails surfaces we can handle that in a one-off fashion down the road. I am cc'ing him here for his clarification though :)

Yeah, I agree it makes sense to strip email addresses. Let's retain an unmodified copy in case someone really needs to track down/contact a bugreporter on a case-by-case basis, but I don't think someone will actually need that.

The dump will be almost totally useless without full user IDs (i.e. email address) for votes, subscribers and reports, as that's the main public data being lost in the Phabricator migration. There is no point in hiding emails which were public in the first place and keep being exposed elsewhere, e.g. gerrit queries and github.

Additionally, it's not appropriate to use the part before @ as universal "nickname", because many registered with addresses à la bugzilla@example.org.

As mentioned on IRC yesterday I don't have a strong preference either way; after all all of the information (except user profile settings) was exposed via Bugzilla in the past anyway. Maybe it's a sufficient compromise to provide people a simple opt-out; if anyone feels strongly about his/her email being part of the dump, strip it via sed

The dump will be almost totally useless without full user IDs (i.e. email address) for votes, subscribers and reports, as that's the main public data being lost in the Phabricator migration.

Can the userid field be used for that?

I ran this to obfuscate email addresses but keep them unique:

mysql> update profiles set login_name=concat(substring_index(login_name,'@',1), '@', substring(sha1(substring_index(login_name,'@',-1)),1,12));

so it splits the email address at the @, then replaces the second part with the first 12 chars of the SHA1 of the domain name. That means you can still identify who uses the same domain.

ok?

mysql> select userid,login_name,cryptpassword,realname from profiles where login_name like "dzahn%" or login_name like "aklapper%";
+--------+-----------------------+---------------+---------------+
| userid | login_name            | cryptpassword | realname      |
+--------+-----------------------+---------------+---------------+
|  15390 | aklapper@b366761409f8 | deleted       | Andre Klapper |
|  13947 | dzahn@55c824bd3934    | deleted       | Daniel Zahn   |
+--------+-----------------------+---------------+---------------+
2 rows in set (0.04 sec)

Personally, I disagree with the premise that a sanitized dump is needed. This task has already consumed a non-trivial amount of operations time and it's completely unclear to me that anyone would ever actually use this dump if it were made available. For example, are we really expecting any users to troll through a database dump in order to extract a list of bugs that they once voted for? These are pretty contrived use-cases, in my opinion.

I think static-bugzilla and old-bugzilla are sufficient. Users can grab any information they want or need right now. Beyond these two sites, I think it's time for people to move on and to accept the inevitability of death.

I think static-bugzilla and old-bugzilla are sufficient.

The plan is to remove old-bugzilla which was blocked by providing this dump. My intention is _not_ having to support old-bugzilla while keeping static-bugzilla forever.

I think static-bugzilla and old-bugzilla are sufficient.

The plan is to remove old-bugzilla which was blocked by providing this dump. My intention is _not_ having to support old-bugzilla while keeping static-bugzilla forever.

I'm fine with this. The votes and saved queries really aren't worth the trouble here. Anyone who wants this information should be given sufficient opportunity (a few months) to extract it from old-bugzilla. Sanitized dumps and/or database replication for Bugzilla would have been great, but that ship has now sailed and Bugzilla is dead.

Well the sanitized dump is basically done now, so we can release it soon anyways.

After Dzahn dropped the bugs_deleted table, I think this looks ok now.

There are still a lot of full emails around, but I think those are all from places where they would have been public in bugzilla.

Grepping for several security bugs didn't turn up results, so I think everything security related has been removed.

Anyone who wants this information should be given sufficient opportunity (a few months) to extract it from old-bugzilla.

I note this task is not tracked nor completed. If you think that's the way to go, please add to blockers for T95184.

I agree on ops time, but ops time could have been saved by making a robust plan *before* starting work on the proposal T95184 (usual discussion).