Hide autoblocks from the globalblocks table database dump
Closed, ResolvedPublic

Description

To allow global autoblocks, we need to make public views of the globalblocks table hide the IP address associated with global autoblocks. Prior work on this was done for the public replicas in T371486: Hide the value of gb_address column in public replicas if gb_autoblock_parent_id is not null; however, the database dump appears to be taken straight from the production DBs instead of the wiki replicas.

To solve this we have a few options:

  1. Appropriately sanitise the gb_address column to remove the IP address
  2. Take the data from the wiki replicas, which is pre-sanitised
  3. Stop the dumping of the globalblocks table outright, in favour of T218592

Option 1
I'm not sure this would work. The issue is that the dump includes the unique index gb_address_autoblock_parent_id, which means that we cannot use an empty string to hide the target of an autoblock (having more than one sanitised autoblock in the table would then violate the unique constraint).
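
To illustrate the constraint problem, here is a minimal, self-contained sketch using SQLite and a simplified, hypothetical version of the schema (it assumes the unique index covers the gb_address and gb_autoblock_parent_id columns; the real production schema may differ):

import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the globalblocks table.
conn.execute("""
    CREATE TABLE globalblocks (
        gb_id INTEGER PRIMARY KEY,
        gb_address TEXT NOT NULL,
        gb_autoblock_parent_id INTEGER
    )
""")
conn.execute("""
    CREATE UNIQUE INDEX gb_address_autoblock_parent_id
        ON globalblocks (gb_address, gb_autoblock_parent_id)
""")

# Two autoblocks created by the same parent block, with their IP targets
# blanked to an empty string by the sanitiser:
conn.execute("INSERT INTO globalblocks VALUES (1, '', 42)")
try:
    conn.execute("INSERT INTO globalblocks VALUES (2, '', 42)")
except sqlite3.IntegrityError as exc:
    print("second sanitised autoblock rejected:", exc)

The second insert fails because both sanitised rows collapse onto the same ('', 42) key; with the index in place, an empty-string placeholder therefore cannot be used.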

Option 2
The issue with this solution is that the globalblocks table has changed a lot since T173468. The extension now supports global blocks on accounts, which means that there will be a not insignificant number of irrelevant blocks (because third-party wikis will not have the accounts, unless they intend to clone WMF wikis).

Furthermore, it appears that the current format of the dump would change. For example, the globalblocks table on the public wiki replicas does not have any indexes (which is why the sanitisation works there), and this would make using the dump on an actual wiki practically impossible (the indexes are needed to make queries efficient enough for use). The indexes cannot be added back because of the unique constraint violations.

We could exclude all autoblocks from the dump, but I am concerned that users may use this dump on the assumption that it is the same as the wiki replicas (which it would not be, because autoblocks would be missing).

Option 3
This is my preferred option, as I do not see a use case for this dump that could not be served via the API or by accessing the wiki replicas. Furthermore, it appears to me that either option 1 or option 2 would require a fair bit of time to update the script without breaking existing users.

The task that created this dump (T173468) mentioned using the table on third-party wikis. However, I think it would be more secure and more up to date if T218592: Allow third party wikis to use blocks from the globalblocks API as a source for applied GlobalBlocks were implemented and third-party wikis instead took the list from the API. Using the API would also allow third-party wikis to exclude non-IP blocks (I doubt their use case would involve account blocks).
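
For what that could look like in practice, here is a rough sketch of a third-party wiki pulling the current global blocks from the API rather than from a dump. It uses the list=globalblocks module provided by the GlobalBlocking extension; the exact bg* parameter names and available properties should be checked against the live API help, so treat this as an illustration rather than a tested client:

import requests

API = "https://meta.wikimedia.org/w/api.php"
params = {
    "action": "query",
    "list": "globalblocks",
    # Property names are assumptions to verify against the API documentation.
    "bgprop": "address|expiry|reason|range",
    "bglimit": "max",
    "format": "json",
    "formatversion": "2",
}

blocks = []
while True:
    data = requests.get(API, params=params, timeout=30).json()
    blocks.extend(data["query"]["globalblocks"])
    if "continue" not in data:
        break
    # Follow the standard API continuation until all rows have been fetched.
    params.update(data["continue"])

print(f"fetched {len(blocks)} global blocks")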

Event Timeline

@Legoktm do you still have a use case for this dump? If so, can this be resolved through doing T218592 or just using the API directly?

I was doing data analysis at the time (link) - I think it would be good if we could keep dumping the table. But I also don't think there are any other table dumps that are partially redacted (e.g. T173237#3525344), and implementing this feature is probably more important than retaining the dumps, unfortunately :(

Thanks for the information. Users with a need for this data can query the wiki replicas (such as through quarry.wmcloud.org) even if we remove the database dump, but obviously having a local copy of the data would be more efficient and easier to automate.
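
As an aside, a rough sketch of what such a replica query could look like from Toolforge, assuming the sanitised globalblocks table lives in the centralauth_p replica database and that the usual ~/replica.my.cnf credentials file is available (the host, database and column names below are assumptions to verify):

import os
import pymysql

conn = pymysql.connect(
    host="centralauth.analytics.db.svc.wikimedia.cloud",  # assumed replica host
    database="centralauth_p",                              # assumed replica database
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
    charset="utf8mb4",
)
with conn.cursor() as cur:
    # Autoblock rows have gb_autoblock_parent_id set and their target hidden,
    # so restrict the query to the ordinary (non-autoblock) entries.
    cur.execute(
        "SELECT gb_address, gb_timestamp, gb_expiry "
        "FROM globalblocks "
        "WHERE gb_autoblock_parent_id IS NULL "
        "LIMIT 10"
    )
    for row in cur.fetchall():
        print(row)
conn.close()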

Change #1078901 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/puppet@production] dumps: Drop the globalblocks table dump

https://gerrit.wikimedia.org/r/1078901

Change #1078913 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/puppet@production] dumps: stop running the dump_global_blocks job

https://gerrit.wikimedia.org/r/1078913

As per slack discussion, noting here that @Milimetric, @VirginiaPoundstone and I agree on moving forward with option 3: dropping this dump.

The rationale is that this dump has low value, and dropping it also helps move the Temporary accounts work forward.

Change #1078913 merged by Ladsgroup:

[operations/puppet@production] dumps: Stop running the dump_global_blocks job

https://gerrit.wikimedia.org/r/1078913

Change #1080272 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/puppet@production] dumps: Mark globalblocks dir and script as absent

https://gerrit.wikimedia.org/r/1080272

Change #1080272 merged by Ladsgroup:

[operations/puppet@production] dumps: Mark globalblocks dir and script as absent

https://gerrit.wikimedia.org/r/1080272

Change #1078901 merged by Ladsgroup:

[operations/puppet@production] dumps: Drop the globalblocks table dump

https://gerrit.wikimedia.org/r/1078901

Now that the patches are merged, I see that https://dumps.wikimedia.org/other/globalblocks/ is still present (not expected) and that the last run was on October 5 (expected). I think we can mark this task as resolved once https://dumps.wikimedia.org/other/globalblocks/ is no longer online.

@Ladsgroup @Ahoelzl do you know why https://dumps.wikimedia.org/other/globalblocks/ is still available? Could someone from Data-Engineering or SRE have a look at this please?

We may need to manually remove the globalblocks folder from the clouddumps* servers? I tried finding it myself but couldn't.

CC @BTullis

Hi, bumping this thread, so we could close out this task. @BTullis, could you please take a look? Thanks!

There is nothing more for the Trust and Safety Product Team to do here, so I'm untagging our sprint and also unlinking Temporary accounts, as the remaining work (making sure https://dumps.wikimedia.org/other/globalblocks/ is offline) is not relevant to that project.

Thanks for the ping @kostajh and apologies for the delay in responding.

Your patches above removed the globalblocksdir from the dumpsdata100[3-7] servers, where dumps are initially written (by the snapshot servers).
However, the way that the dumps system currently works is that the dumpsdata servers regularly rsync the files to the dumps distribution servers, which are currently called clouddumps100[1-2].

When this sync occurs, existing files on the distribution servers are not deleted; new files are simply added to the collection.
This setup has allowed us to have two servers with much larger storage (clouddumps) for holding many more copies of the dumps than the dumpsdata servers, which are more like a transient store.

This means that the previously dumped globalblocks files still exist on clouddumps100[1-2].

btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/other$ tree globalblocks/
globalblocks/
├── 20240713
│   └── 20240713-globalblocks.gz
├── 20240720
│   └── 20240720-globalblocks.gz
├── 20240727
│   └── 20240727-globalblocks.gz
├── 20240803
│   └── 20240803-globalblocks.gz
├── 20240810
│   └── 20240810-globalblocks.gz
├── 20240817
│   └── 20240817-globalblocks.gz
├── 20240824
│   └── 20240824-globalblocks.gz
├── 20240831
│   └── 20240831-globalblocks.gz
├── 20240907
│   └── 20240907-globalblocks.gz
├── 20240914
│   └── 20240914-globalblocks.gz
├── 20240921
│   └── 20240921-globalblocks.gz
├── 20240928
│   └── 20240928-globalblocks.gz
└── 20241005
    └── 20241005-globalblocks.gz

13 directories, 13 files

The simplest option for taking them offline is simply for me to delete this directory and everything in it, from both servers.

Is that approach acceptable to you, or should I be looking to archive these files somewhere?

Are we confident that they are not being used internally by any data pipelines?

AFAIK, there's no need to archive these.

I don't find any usages via codesearch https://codesearch.wmcloud.org/search/?q=globalblocks&files=&excludeFiles=&repos=#operations/puppet

And also we haven't updated these files since early October. We could also do nothing, since these files don't have sensitive data, and we aren't writing any new files to this directory. But that might be confusing to someone who comes across that directory in the future.

Ack.

I'm more than happy to delete on the grounds of being tidy. I just don't like making that decision myself, if it's not my data to begin with.

I think on T376726#10211856 it was concluded that we can remove it all, and that although less convenient, the data can be fetched elsewhere.

From my side, +1 to delete for less confusion on what dumps are active or inactive.

I have deleted the directories on both clouddumps servers.

btullis@clouddumps1001:/srv/dumps/xmldatadumps/public/other$ sudo rm -rf globalblocks/

btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/other$ sudo rm -rf globalblocks/

We are now getting a 404 response from: https://dumps.wikimedia.org/other/globalblocks/

So I think we're done. I'll check again in a few days to make sure that there isn't another copy being synced from anywhere.

Hmm. I'm not getting a 404 even when I disable cache.

Oh, thanks for checking. You're right. There is a sync process that has already re-created some of these files.

btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/other$ tree globalblocks/
globalblocks/
├── 20240921
│   └── 20240921-globalblocks.gz
├── 20240928
│   └── 20240928-globalblocks.gz
└── 20241005
    └── 20241005-globalblocks.gz

3 directories, 3 files

I will check all of the dumpsdata servers for stray copies and try again to delete them.

Maybe our rsync config attempts to honor last n dumps?

Yes. This is puzzling.
I deleted the directories again on clouddumps100[1-2] but they have been recreated on both servers with these three dumps again.

I also checked for /data/xmldatadumps/public/other/globalblocks on each of the dumpsdata100[3-7] servers, but didn't find anything there.

I'll keep looking.

I found the stray files that were being copied. There were copies in /data/otherdumps/globalblocks on dumpsdata1003 and dumpsdata1006.

btullis@cumin1002:~$ sudo cumin C:dumps::generation::server::dirs 'ls -l /data/otherdumps/globalblocks/'
5 hosts will be targeted:
dumpsdata[1003-1007].eqiad.wmnet
OK to proceed on 5 hosts? Enter the number of affected hosts to confirm or "q" to quit: 5
===== NODE GROUP =====                                                                                                                                                                                             
(3) dumpsdata[1004-1005,1007].eqiad.wmnet                                                                                                                                                                          
----- OUTPUT of 'ls -l /data/othe...ps/globalblocks/' -----                                                                                                                                                        
total 0                                                                                                                                                                                                            
===== NODE GROUP =====                                                                                                                                                                                             
(1) dumpsdata1006.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'ls -l /data/othe...ps/globalblocks/' -----                                                                                                                                                        
total 12                                                                                                                                                                                                           
drwxr-xr-x 2 dumpsgen dumpsgen 4096 May 18  2024 20240518                                                                                                                                                          
drwxr-xr-x 2 dumpsgen dumpsgen 4096 May 25  2024 20240525
drwxr-xr-x 2 dumpsgen dumpsgen 4096 Jun  1  2024 20240601
===== NODE GROUP =====                                                                                                                                                                                             
(1) dumpsdata1003.eqiad.wmnet                                                                                                                                                                                      
----- OUTPUT of 'ls -l /data/othe...ps/globalblocks/' -----                                                                                                                                                        
total 12                                                                                                                                                                                                           
drwxr-xr-x 2 dumpsgen dumpsgen 4096 Sep 21 08:15 20240921                                                                                                                                                          
drwxr-xr-x 2 dumpsgen dumpsgen 4096 Sep 28 08:15 20240928
drwxr-xr-x 2 dumpsgen dumpsgen 4096 Oct  5 08:15 20241005
================                                                                                                                                                                                                   
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (5/5) [00:01<00:00,  3.61hosts/s]
FAIL |                                                                                                                                                                             |   0% (0/5) [00:01<?, ?hosts/s]
100.0% (5/5) success ratio (>= 100.0% threshold) for command: 'ls -l /data/othe...ps/globalblocks/'.
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
btullis@cumin1002:~$

I have deleted these, so I think that's the last of them.

btullis@dumpsdata1006:/data/otherdumps$ sudo rm -rf globalblocks/
btullis@dumpsdata1003:/data/otherdumps$ sudo rm -rf globalblocks/