
Drop (and archive?) aft_feedback
Closed, ResolvedPublic

Description

T250701: Restart extension1 (x1) database primary master (db1120) mentions that aft_feedback is still in production, but the extension was undeployed a while ago.

Event Timeline

Marostegui triaged this task as Medium priority.Apr 20 2020, 4:40 PM
Marostegui added a subscriber: Marostegui.

And it has indeed not been written to for a long time:

root@db1120:/srv/sqldata/enwiki# ls -lh aft_feedback.ibd
-rw-rw---- 1 mysql mysql 848M Jul 25  2018 aft_feedback.ibd

@jcrespo where would you like these tables archived?

root@db1120:/srv/sqldata# find . -name aft_feedback.ibd  | xargs ls -lh
-rw-rw---- 1 mysql mysql  60M Jul 25  2018 ./dewiki/aft_feedback.ibd
-rw-rw---- 1 mysql mysql 848M Jul 25  2018 ./enwiki/aft_feedback.ibd
-rw-rw---- 1 mysql mysql 252M Jul 25  2018 ./frwiki/aft_feedback.ibd
-rw-rw---- 1 mysql mysql 272K Jul 25  2018 ./testwiki/aft_feedback.ibd

If public, on dumps; if private, on the archive backup system. If mixed, someone needs to sanitize them first, then a copy goes to each place.

I can't find the task for dropping the aft tables... only T92739: Remove AFT tables from the analytics slaves

Security-Team, can you advise on whether the content of this table can be public or whether it needs to be redacted? Thanks!

> Security-Team, can you advise on whether the content of this table can be public or whether it needs to be redacted? Thanks!

Is it possible for the Security-Team to get access to the aft_feedback dump somewhere? This will likely be reviewed at our clinic meeting next Monday, April 27th, given the holidays the rest of this week.

>> Security-Team, can you advise on whether the content of this table can be public or whether it needs to be redacted? Thanks!

> Is it possible for the Security-Team to get access to the aft_feedback dump somewhere? This will likely be reviewed at our clinic meeting next Monday, April 27th, given the holidays the rest of this week.

We haven't extracted the data; does your team have access to the table itself on x1? If not, I can probably take a few lines out and leave them somewhere (a /home directory on mwmaint1002?)

> We haven't extracted the data; does your team have access to the table itself on x1? If not, I can probably take a few lines out and leave them somewhere (a /home directory on mwmaint1002?)

Not sure I have access to x1 - maybe @Reedy or @chasemp does? But my home dir on mwmaint1002 definitely works for me, assuming no size limitations. And I'd guess that a thousand or two thousand rows might be enough for us to develop a good understanding of exactly the type of data within the table, including any likely (potentially sensitive) data.

>> We haven't extracted the data; does your team have access to the table itself on x1? If not, I can probably take a few lines out and leave them somewhere (a /home directory on mwmaint1002?)

> Not sure I have access to x1 - maybe @Reedy or @chasemp does? But my home dir on mwmaint1002 definitely works for me, assuming no size limitations. And I'd guess that a thousand or two thousand rows might be enough for us to develop a good understanding of exactly the type of data within the table, including any likely (potentially sensitive) data.

sql wikishared should do it

wikiadmin@10.64.0.97(wikishared)> show tables;
+---------------------------------------+
| Tables_in_wikishared                  |
+---------------------------------------+
| bounce_records                        |
| cx_corpora                            |
| cx_lists                              |
| cx_suggestions                        |
| cx_translations                       |
| cx_translators                        |
| echo_unread_wikis                     |
| reading_list                          |
| reading_list_entry                    |
| reading_list_project                  |
| urlshortcodes                         |
| wikimedia_editor_tasks_counts         |
| wikimedia_editor_tasks_edit_streak    |
| wikimedia_editor_tasks_keys           |
| wikimedia_editor_tasks_targets_passed |
+---------------------------------------+
15 rows in set (0.00 sec)

wikiadmin@10.64.0.97(wikishared)> use enwiki;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
wikiadmin@10.64.0.97(enwiki)> show tables;
+-------------------+
| Tables_in_enwiki  |
+-------------------+
| aft_feedback      |
| echo_email_batch  |
| echo_event        |
| echo_notification |
| echo_target_page  |
+-------------------+
5 rows in set (0.00 sec)

> Not sure I have access to x1

I checked, and you do, as you are on the deployers list. On mwmaint1002:

sql wikishared -- -e "SELECT * FROM enwiki.aft_feedback LIMIT 10"

Should give you safe, read-only access to those tables (I can still export them for you, I just thought you would appreciate dynamic access). Also, as a security person, I hope you appreciate us DBAs frowning upon local copies of potentially private data in people's home directories :-D. It can still be done if that makes things easier, as long as things are wiped later.

Regarding archiving, I think there is a misunderstanding about our capabilities and its relation with backups. Commenting here for awareness on similar tasks:

There is, at the moment, no proper way to "archive forever" historical data at WMF. While it is true that there is an archival backup, it is good only for small, long-term backups (e.g. "keeping something for 5 years in case it is needed beyond the 3-month mark"). Don't get me wrong, for 5 years we will guarantee that it survives, but there is no longer-than-5-years strategy: those are just backups, not real archival, and they have a lot of maintenance disadvantages. That system is intended for things like "service decommissioning data that may come in handy a few years later". There is no good workflow/strategy for long-term/historical archival of content, in my opinion.

Traditionally, such data has either been sent there (with no hard promise of being recoverable 5 years later) or, in the case of public data, to dumps, with the hope that better third-party archival efforts (e.g. archive.org) will take over. Or it can be written to a wiki page, where it will be kept for as long as the wikis exist. I am not saying archival is not possible, but long-term backups are not archival (they are just a recovery system), and there are, in my opinion, at the moment no long-term workflows, infrastructure, or support for such a service. Long-term archival of private data is a _hard_ task we have not solved yet (for example, technologies change over 10 years, and encryption algorithms and their underlying technology are difficult to keep working; it would also require legal oversight, etc.).

>> Not sure I have access to x1

> I checked, and you do, as you are on the deployers list. On mwmaint1002:

> sql wikishared -- -e "SELECT * FROM enwiki.aft_feedback LIMIT 10"

Great, thanks.

> Should give you safe, read-only access to those tables (I can still export them for you, I just thought you would appreciate dynamic access). Also, as a security person, I hope you appreciate us DBAs frowning upon local copies of potentially private data in people's home directories :-D. It can still be done if that makes things easier, as long as things are wiped later.

Nope, not necessary if I can get everything we need via wikishared, which it looks like I can. I don't necessarily like the idea of dropping random data files here and there when it can be avoided, but if there wasn't an easier approach, we'd have to do that.

Hi @jcrespo the Security-Team reviewed this table and we're fine with making the data public with the caveat that anything where aft_hide = 1 is removed from the data (in other words, it can be permanently deleted). In most cases that hidden data appears to be spam, but in a few instances it looks like attempts to publish private data (phone number, physical address, etc.). From a review of the extension code it appears that most of the rest of the data was already public while the extension was live, and the UX clearly indicated to users that feedback would be made public. Please advise if you have any further questions/issues related to this table.
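The redaction rule above translates into a pair of sanity-check queries per affected wiki. A dry-run sketch that just prints those queries (the table and column names are from this task; nothing here contacts production):

```shell
# Dry-run: print the per-wiki sanity-check queries implied by the
# redaction rule (rows with aft_hide = 1 must never reach the public dump).
for db in dewiki enwiki frwiki testwiki; do
  echo "-- $db"
  echo "SELECT COUNT(*) FROM ${db}.aft_feedback WHERE aft_hide = 0;  -- publishable"
  echo "SELECT COUNT(*) FROM ${db}.aft_feedback WHERE aft_hide = 1;  -- must be excluded"
done
```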

JFishback_WMF moved this task from Backlog to Completed on the Privacy Engineering board.
JFishback_WMF moved this task from Back Orders to Our Part Is Done on the Security-Team board.

Thanks, useful evaluation and notice, @JFishback_WMF, taking it from here to generate the exports without the problematic rows.

Manuel had already done a check before, but double checking again on all servers:

root@db1120.eqiad.wmnet[(none)]> select table_schema from information_schema.tables where table_name='aft_feedback';
+--------------+
| table_schema |
+--------------+
| dewiki       |
| enwiki       |
| frwiki       |
| testwiki     |
+--------------+
4 rows in set (0.03 sec)

I searched over all other sections, including m5 (wikitech), and didn't find it elsewhere. I did this because I am aware that some private wikis do not use the x1 section, and store some tables on the same servers as the rest of their data.

root@cumin1001:~$ table='aft_feedback'; mysql.py -BN -h db1120 -e "select table_schema from information_schema.tables where table_name='$table'" | while read db; do echo "Dumping $db.$table..."; mysqldump --single-transaction -h db1120.eqiad.wmnet --where="aft_hide = 0" $db $table | pv | pigz -c > $db.$table.sql.gz; done
Dumping dewiki.aft_feedback...
11.1MiB 0:00:00 [30.6MiB/s] [ <=>                                                                                      ]
Dumping enwiki.aft_feedback...
 233MiB 0:00:04 [52.3MiB/s] [        <=>                                                                               ]
Dumping frwiki.aft_feedback...
62.4MiB 0:00:01 [36.3MiB/s] [   <=>                                                                                    ]
Dumping testwiki.aft_feedback...
37.8KiB 0:00:00 [ 646KiB/s] [ <=>
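For readability, the export one-liner above breaks down as follows. This is a dry-run sketch that prints each per-database command instead of executing it (production hosts are not reachable from an arbitrary machine; host, table, and database names are as used in this task):

```shell
# Dry-run version of the export pipeline: one mysqldump per wiki database,
# keeping only public rows (aft_hide = 0), with pv for progress and pigz
# for parallel compression into one .sql.gz file per wiki.
table='aft_feedback'
for db in dewiki enwiki frwiki testwiki; do
  echo "Dumping $db.$table..."
  echo "mysqldump --single-transaction -h db1120.eqiad.wmnet --where=\"aft_hide = 0\" $db $table | pv | pigz -c > $db.$table.sql.gz"
done
```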

To double check that the export did not contain non-public data/spam, as advised by @JFishback_WMF, I imported those dumps into a separate database:

Live DB results:

root@db1120.eqiad.wmnet[enwiki]> select count(*) FROM aft_feedback where aft_hide = 0;
+----------+
| count(*) |
+----------+
|   920485 |
+----------+
1 row in set (0.28 sec)

root@db1120.eqiad.wmnet[enwiki]> select count(*) FROM aft_feedback where aft_hide <> 0;
+----------+
| count(*) |
+----------+
|    18830 |
+----------+
1 row in set (0.27 sec)

root@db1120.eqiad.wmnet[enwiki]> select * FROM aft_feedback where aft_comment like '%[SPAM LINK OMITTED]%';
...
1 row in set (1.05 sec)

Imported results:

mysql> select count(*) FROM aft_feedback;
+----------+
| count(*) |
+----------+
|   920485 |
+----------+
1 row in set (1.42 sec)

mysql> select count(*) FROM aft_feedback where aft_hide <> 0;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.42 sec)

mysql> select * FROM aft_feedback where aft_comment like '%[SPAM LINK OMITTED]%';
Empty set (1.41 sec)
jcrespo moved this task from Backlog to Blocked external/Not db team on the DBA board.
jcrespo added a subscriber: ArielGlenn.

CC @ArielGlenn https://dumps.wikimedia.org/other/aft_feedback/20200506/

Only assigning to you for you to ack the above. DBA part (filtering and dumping) done.

@Reedy Please assign it to us again when you're happy for the tables to be dropped; Manuel will probably want to take care of that part.

Let's add that to the 'other' index.html page too, or no one will know it's there. Can someone supply a phrase describing the contents, for downloaders?

> Let's add that to the 'other' index.html page too, or no one will know it's there. Can someone supply a phrase describing the contents, for downloaders?

Plenty of snarky ones I could suggest...

Something like this should be good:

"Article feedback collected using the <a href="https://www.mediawiki.org/wiki/Extension:ArticleFeedbackv5">ArticleFeedbackv5 extension</a>"

> @Reedy Please assign it to us again when you're happy for the tables to be dropped; Manuel will probably want to take care of that part.

I think the only thing missing might be a quick sanity check of the dumps; then they should be good to drop.

> I think the only thing missing might be a quick sanity check of the dumps

Note I did one at T250715#6113043 (the imported results are from reimporting the dumps into a different MySQL instance), but of course a second check by someone else wouldn't hurt.

Change 594961 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] Add link to Article Feedback dumps for downloaders

https://gerrit.wikimedia.org/r/594961

No internal links @Reedy, sorry ;-) The above can go as soon as someone gives the final thumbs up.

Soooo... any approval forthcoming or should I just merge this? @Reedy what do you think?

Change 594961 merged by ArielGlenn:
[operations/puppet@production] Add link to Article Feedback dumps for downloaders

https://gerrit.wikimedia.org/r/594961

OK, merged. Thanks for the +1s. What's next on this task?

LSobanski added a subscriber: LSobanski.

@Reedy are we ok to drop aft_feedback now?

LSobanski lowered the priority of this task from Medium to Low.Apr 26 2021, 11:51 AM
LSobanski moved this task from Refine to Blocked on the DBA board.
Reedy added a project: SecTeam-Processed.

Yeah, LGTM. Feel free to drop whenever you get round to it.

Mentioned in SAL (#wikimedia-operations) [2021-08-16T07:47:12Z] <marostegui> Rename aft_feedback tables on db2115, db2131 - T250715

To double check, the tables have not been written to again:

root@db1120:/srv/sqldata# find . -name aft_feedback.ibd | xargs ls -lh
-rw-rw---- 1 mysql mysql  60M Jul 25  2018 ./dewiki/aft_feedback.ibd
-rw-rw---- 1 mysql mysql 848M Jul 25  2018 ./enwiki/aft_feedback.ibd
-rw-rw---- 1 mysql mysql 252M Jul 25  2018 ./frwiki/aft_feedback.ibd
-rw-rw---- 1 mysql mysql 272K Jul 25  2018 ./testwiki/aft_feedback.ibd

I have renamed the tables on the active x1 slaves: db2131 and db2115, and if nothing breaks in 24h, I will drop them:

# for i in dewiki enwiki frwiki testwiki; do echo "Database=$i"; mysql.py -BN -hdb2131 $i -e "show tables like '%_aft_feedback'";done
Database=dewiki
T250715_aft_feedback
Database=enwiki
T250715_aft_feedback
Database=frwiki
T250715_aft_feedback
Database=testwiki
T250715_aft_feedback
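The rename step itself isn't captured in the log above; a dry-run sketch of the rename-before-drop pattern it describes (statements are printed, not executed; the exact command form used in production is not shown in this task):

```shell
# Dry-run: print the per-database RENAME statements. Prefixing each table
# with the task ID (T250715) keeps the data recoverable until the final
# DROP a day later, in case something still reads it.
task='T250715'
for db in dewiki enwiki frwiki testwiki; do
  echo "RENAME TABLE ${db}.aft_feedback TO ${db}.${task}_aft_feedback;"
done
```

Renaming first and waiting 24 hours is a cheap safety net: if any code still touched the tables, it would start failing loudly while the data was still intact.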

Mentioned in SAL (#wikimedia-operations) [2021-08-17T07:44:09Z] <marostegui> Drop aft_feedback tables on x1 T250715

All done - thanks everyone who worked on this task!