Add ipblocks_restrictions table to Data Lake
Closed, Resolved | Public | 3 Estimated Story Points

Description

Hello!

The Anti-Harassment Tools team recently introduced the ipblocks_restrictions table. We want to analyze the feature's usage and need a copy of the table imported into the Data Lake on a monthly basis.

Thank you!

Event Timeline

TBolliger moved this task from Untriaged to Tracking work by others on the Anti-Harassment board.

We're aware your team is working on T209031, so we politely request that this task be added to your Kanban board as soon as convenient. We are looking to perform some analysis on this data to inform next steps for our products' rollout. Thank you!

FYI, mediawiki_ipblocks is sqooped monthly, in case that data is of any help.

@TBolliger thanks for paying such close attention to our work! As part of that task, we've decided to just import all tables from the production replicas. I'm adapting our sqoop job to do that now, but I'm relying on this list to filter out columns: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/files/mariadb/filtered_tables.txt$414

@Marostegui do you have plans to add ipblocks_restrictions to filtered_tables.txt? If you have any patches along those lines, let us know and @TBolliger can take a look to see if the subset of columns marked with K (Keep) are enough for their analysis needs.
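As a rough illustration of relying on that list to filter columns, here is a minimal sketch. It assumes a hypothetical `table,column,flag` layout in which `K` means keep and `F` means filter; the real layout of filtered_tables.txt may differ, and `example_table`, `kept_columns` and `select_statement` are names invented for this example.

```python
from collections import defaultdict

# Hypothetical sample in an assumed "table,column,flag" layout.
# K = keep the column, F = filter it out. Not the real file contents.
SAMPLE = """\
# table,column,flag
example_table,ex_id,K
example_table,ex_public,K
example_table,ex_private,F
"""


def kept_columns(text):
    """Return {table: [columns flagged K]} for the assumed layout."""
    keep = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        table, column, flag = (part.strip() for part in line.split(","))
        if flag == "K":
            keep[table].append(column)
    return dict(keep)


def select_statement(table, columns):
    """Build a SELECT that only names the kept columns."""
    return f"SELECT {', '.join(columns)} FROM {table}"


if __name__ == "__main__":
    cols = kept_columns(SAMPLE)
    print(select_statement("example_table", cols["example_table"]))
```

A table with no entries in the list simply does not appear in the result, so an import job built this way would have to decide separately what to do with unlisted tables.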

I had no plans as that is not for me to decide - by default we don't expose new tables on the views unless specified otherwise.
If you feel the table needs to be added to the views, you'd first need to filter the columns that need to be filtered (with a patch to filtered_tables.txt); then we need to sanitize it on labs, and then you'd need to talk to cloud-services-team to get a view on that table.

What functionality/possibilities would we get from adding a table to filtered_tables? Sorry for the question, and thank you in advance for your answer — this is my first time dealing with the Data Lake.

None really.
filtered_tables.txt is a file we use for sanitization: it is used to put column triggers in place for those columns that should not be exposed on labs, so the trigger will sanitize the column on every write arriving to labs.
Once that is done, the table will still be unavailable on labs unless it gets a view (for the full table or for specific columns) in place.
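The two separate steps described here (a trigger that sanitizes a column on write, and a view that makes the table readable) can be sketched with Python's built-in sqlite3 module. This is only an analogy under loud assumptions: the production replicas run MariaDB, not SQLite, and the table, column, trigger and view names below are all invented for illustration.

```python
import sqlite3


def make_replica():
    """Create an in-memory stand-in for a sanitized replica (SQLite, not MariaDB)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE example_table (
            id INTEGER PRIMARY KEY,
            public_col TEXT,
            private_col TEXT
        );
        -- Sanitization step: blank the private column on every insert.
        CREATE TRIGGER sanitize_example_insert
        AFTER INSERT ON example_table
        BEGIN
            UPDATE example_table SET private_col = NULL WHERE id = NEW.id;
        END;
        """
    )
    return conn


def expose(conn):
    """Exposure step: create a view over the columns that may be read."""
    conn.execute(
        "CREATE VIEW example_view AS SELECT id, public_col FROM example_table"
    )


def view_is_readable(conn):
    """Return True if the view exists and can be queried."""
    try:
        conn.execute("SELECT * FROM example_view").fetchall()
        return True
    except sqlite3.OperationalError:
        return False


if __name__ == "__main__":
    conn = make_replica()
    conn.execute(
        "INSERT INTO example_table (public_col, private_col) VALUES ('a', 'secret')"
    )
    print(view_is_readable(conn))  # sanitized, but no view yet
    expose(conn)
    print(conn.execute("SELECT * FROM example_view").fetchall())
```

The point of the sketch is that sanitizing and exposing are independent: the trigger runs whether or not a view exists, and nothing is readable through a view until one is created.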

I think we are mixing two things here: sqooping MediaWiki tables to the Data Lake on Hadoop (private cluster), and having that data available in labs (public, sanitized data). @TBolliger's request is about having the data in Hadoop such that it is "joinable" with other data that already exists there. This does not necessarily imply that the data will be available on labs, as the sqooped data might be of a private nature.

From some discussions, I believe we'll want to add ipblocks_restrictions to filtered_tables.txt so my team can compare monthly trends. Is this work for my team to do, or is this something Analytics Engineering handles?

Thank you!

I don't really understand what you guys expect to get from adding the table to filtered_tables.txt. This file is only used to generate triggers and sanitize columns on labs.

We expect to get ipblocks_restrictions added to the replicas. It was mentioned earlier that new tables are excluded from the replicas by default and we need to explicitly “unfilter”(?) the columns we want by modifying filtered_tables.txt. Is this wrong?

That is not correct.
Tables do not get exposed by default to prevent accidental data leaks.
Note that "exposed" is different from "sanitized".

Not exposed means that there are no views on that table, so it cannot be read from the labs views. If you need a new view on that table (which already exists on labs), you need to talk to cloud-services-team. Please talk to Security beforehand to make sure it is OK to expose the table.
Sanitized means that the table has some data that SHOULD NOT be on labs (not even present). If that is the case, you need to add the columns that need to be redacted to filtered_tables.txt. If you are not sure whether the data should be present, please talk to Security.

My apologies. I didn't read all your replies on this thread. I don't see the table view so I've created T209819: Expose new ipblocks_restrictions table to Wiki Replica users to keep track of this and move this conversation away from this ticket.

Milimetric raised the priority of this task from Medium to High.
Milimetric added a project: Analytics-Kanban.
Milimetric set the point value for this task to 3.
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 493286 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] Add sqoop queries for ipblocks_restrictions table

https://gerrit.wikimedia.org/r/493286

Change 493407 had a related patch set uploaded (by Joal; owner: Joal):
[operations/puppet@production] Add ipblocks_restrictions table to labs sqoop list

https://gerrit.wikimedia.org/r/493407

Change 493407 abandoned by Joal:
Add ipblocks_restrictions table to labs sqoop list

Reason:
Already done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/493331/

https://gerrit.wikimedia.org/r/493407

Change 493286 merged by Joal:
[analytics/refinery@master] Add ipblocks_restrictions table to monthly sqoop

https://gerrit.wikimedia.org/r/493286
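
For readers unfamiliar with what a monthly sqoop of one table involves, here is a minimal sketch of assembling a `sqoop import` command line. It is not the content of the merged patch. The JDBC URL, base directory and `snapshot=` directory naming are assumptions made for illustration, and `build_sqoop_command` is a hypothetical helper; the column names are those of the MediaWiki ipblocks_restrictions schema.

```python
def build_sqoop_command(jdbc_url, table, columns, snapshot, base_dir):
    """Assemble an illustrative `sqoop import` argument list for one table.

    Free-form queries passed via --query must contain the literal
    $CONDITIONS token; a single mapper avoids needing --split-by.
    """
    query = f"SELECT {', '.join(columns)} FROM {table} WHERE $CONDITIONS"
    # Assumed layout: one directory per table and per monthly snapshot.
    target_dir = f"{base_dir}/{table}/snapshot={snapshot}"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--query", query,
        "--target-dir", target_dir,
        "--num-mappers", "1",
        "--as-avrodatafile",
    ]


if __name__ == "__main__":
    cmd = build_sqoop_command(
        "jdbc:mysql://replica.example/examplewiki",  # hypothetical host and database
        "ipblocks_restrictions",
        ["ir_ipb_id", "ir_type", "ir_value"],
        "2019-02",
        "/tmp/example",  # hypothetical base directory
    )
    print(" ".join(cmd))
```

Building the command as a list rather than a single string keeps the query intact as one argument if it is later handed to something like `subprocess.run`.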