Create a read-only swift identity for backup taking
Closed, ResolvedPublic

Description

Tools created to communicate with swift work well using mediawiki credentials. However, it is dangerous to reuse mw credentials for backup taking, as a bug in the software could lead to data loss.

Ideally we would use an account that:

  • Has the same "read" (download, list, stat) privileges on the mediawiki containers
  • Doesn't have any write/drop/upload privileges on existing containers
  • Cannot create new containers
  • As of now, only has privileges on the originals (current, old and deleted) containers

For now, it doesn't need to be updated automatically by the mediawiki new-wiki process; we will monitor changes on wikis through alerts and the data persistence tickets, and create a process to add new wikis to backups.

Event Timeline

^@fgiunchedi this is the task I told you about (pinging on comment because sometimes notifications cannot be seen on creation).

(braindumping) we had a similar case in the past (namely adding an account to mw containers), i.e. thumbor. Steps off the top of my head:

  1. create the account and credentials
  2. add said account name to mw scripts that manage filebackend containers
  3. backfill permissions

Of course, I hadn't remembered that it should keep working for newly created wikis. Thanks for pointing that out. I will have a look at thumbor and try to learn what they did. I will ask for your review on any patch.

jcrespo triaged this task as Medium priority.

We will limit the scope of the task for now so we can get it done soon. I think it also makes sense to add new wikis to backups semi-automatically, the same way they are added to cloud replicas.

Change 773298 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] swift: Create a new read-only role on mw account for backup taking

https://gerrit.wikimedia.org/r/773298

@MatthewVernon Filippo said he doesn't have the bandwidth to help with the patch and recommended contacting you. Could you have a look at the task and the proposed patch/conversation?

Is this urgent? I have a number of things I'm trying to land before the end of the quarter...

Change 889806 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[labs/private@master] swift: Add dummy mediabackup passwords on the same keys as production

https://gerrit.wikimedia.org/r/889806

Change 889806 merged by Jcrespo:

[labs/private@master] swift: Add dummy mediabackup passwords on the same keys as production

https://gerrit.wikimedia.org/r/889806

Change 773298 merged by Jcrespo:

[operations/puppet@production] swift: Create a new read-only role on mw account for backup taking

https://gerrit.wikimedia.org/r/773298

Apparently (not 100% sure), the new account gives me GET permissions on the public containers, but not on the deleted ones:

✔️ root@ms-fe1009:~$ . <backup env>
✔️ root@ms-fe1009:~$ swift download wikipedia-mediawiki-local-deleted h/z/2/hz2uewnct6qmy8wnv5pe2i8i9t746oz.png
Error downloading object 'wikipedia-mediawiki-local-deleted/h/z/2/hz2uewnct6qmy8wnv5pe2i8i9t746oz.png': Object GET failed: http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-mediawiki-local-deleted/h/z/2/hz2uewnct6qmy8wnv5pe2i8i9t746oz.png 403 Forbidden  [first 60 chars of response] b'<html><h1>Forbidden</h1><p>Access was denied to this resourc'

✔️ root@ms-fe1009:~$ swift download wikipedia-commons-local-public.57 5/57/0153_Duisburg.JPG
5/57/0153_Duisburg.JPG [auth 0.014s, headers 0.140s, total 0.279s, 20.898 MB/s]

✔️ root@ms-fe1009:~$ . <mw env>

✔️ root@ms-fe1009:~$ swift download wikipedia-mediawiki-local-deleted h/z/2/hz2uewnct6qmy8wnv5pe2i8i9t746oz.png
h/z/2/hz2uewnct6qmy8wnv5pe2i8i9t746oz.png [auth 0.019s, headers 0.112s, total 0.113s, 0.900 MB/s]

Could you help me with this one, @MatthewVernon or @fgiunchedi?

jcrespo changed the task status from Open to In Progress. Feb 22 2023, 10:22 AM

I think the issue is that the deleted container has different permissions:

root@ms-fe1009:/home/mvernon# swift stat wikipedia-mediawiki-local-deleted | grep 'Read ACL'
        Read ACL: mw:thumbor-private,mw:media
root@ms-fe1009:/home/mvernon# swift stat wikipedia-commons-local-public.57 | grep 'Read ACL'
        Read ACL: mw:thumbor,mw:media,.r:*

So you can see the deleted container is only available to the mw:thumbor-private and mw:media accounts, whereas the public container is readable by anyone (the .r:* element grants read access regardless of referrer), which includes any swift account.
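To make the difference concrete, here is a minimal sketch that inspects a container Read ACL string like the two shown above. It is a simplification for illustration only, not how Swift evaluates ACLs internally: it handles just explicit account:user entries and the .r:* element. The function name explain_read_acl is hypothetical.

```python
def explain_read_acl(read_acl, user):
    """Summarise a container Read ACL string for one account:user.

    Simplified sketch: only looks for an explicit account:user entry
    and for the '.r:*' (any referrer) element seen in the output above.
    """
    entries = [e.strip() for e in read_acl.split(",") if e.strip()]
    return {
        "listed_explicitly": user in entries,
        "world_readable": ".r:*" in entries,
    }


# The two ACL strings reported by `swift stat` above.
deleted_acl = "mw:thumbor-private,mw:media"
public_acl = "mw:thumbor,mw:media,.r:*"

print(explain_read_acl(deleted_acl, "mw:backup"))
print(explain_read_acl(public_acl, "mw:backup"))
```

Under these assumptions, mw:backup is not listed in either ACL; the public download only works because of the .r:* element.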

How do we add mw:backup to the local-deleted containers? (I assume this is handled on wiki creation; I will check that on my own.) But how do we do it now, as a one-time run on the existing containers?

Yeah, that's a good question - I think there are about 21675 deleted containers.
I think there's no automation for container management (is that right @fgiunchedi ?) so the options are presumably to write a loop that adds the mw:backup account to the read ACL for each deleted container, or maybe to make a read-only account ACL so the mw:backup user has read-only access to everything the mw:media account owns?

[I've never tried tweaking ACLs, though, and there doesn't seem to be a lot of tooling in this area]
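For reference, the "write a loop" option could look roughly like the sketch below. It is untested against production and rests on several assumptions: conn is expected to behave like a python-swiftclient Connection (get_account, head_container, post_container), deleted containers are assumed to be recognisable by "local-deleted" in their name, and the helper names add_reader and grant_read_on_deleted are hypothetical. It defaults to a dry run that changes nothing.

```python
def add_reader(current_acl, new_entry):
    """Return current_acl with new_entry appended, unless already present."""
    entries = [e.strip() for e in (current_acl or "").split(",") if e.strip()]
    if new_entry not in entries:
        entries.append(new_entry)
    return ",".join(entries)


def grant_read_on_deleted(conn, new_entry="mw:backup",
                          marker="local-deleted", dry_run=True):
    """Add new_entry to the read ACL of every matching container.

    conn is assumed to behave like swiftclient.client.Connection.
    The name marker is an assumption about container naming.
    Returns a list of (container, new_acl) pairs that need changing.
    """
    _, containers = conn.get_account(full_listing=True)
    changes = []
    for item in containers:
        name = item["name"]
        if marker not in name:
            continue
        # Response header names are lower-cased by the client.
        current = conn.head_container(name).get("x-container-read", "")
        updated = add_reader(current, new_entry)
        if updated == current:
            continue  # already has the entry, nothing to do
        changes.append((name, updated))
        if not dry_run:
            conn.post_container(name, headers={"X-Container-Read": updated})
    return changes
```

Running it first with dry_run=True on a single small container would give a list of proposed changes to review before anything is written.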

> I think there's no automation for container management

Don't worry too much about details/implementation, as that is something I can solve. My worry is making sure I don't do anything without your (the swift owners') OK/oversight. I can take on the actual work/code/automation myself as part of my work, unless there is something you want to be involved in because it will be useful for your work. If that also means improving Swift's automation/docs, I can do that too.

If you point me to [Swift] documentation, I can propose something and you can review it (be it on gerrit or here on the ticket) before taking any action. Probably we should start by changing one small container (e.g. testwiki-deleted) to confirm that it solves the permissions issue and doesn't break anything else.

I wonder (but this is not a settled position) whether using an account ACL is the more elegant solution, as we do that once and it'll work for all deleted containers?
The upstream docs are a bit ... spartan, swiftstack provide slightly more information.

Mentioned in SAL (#wikimedia-operations) [2024-02-22T13:03:01Z] <Emperor> ms-eqiad set ACL {"read-only":["mw:backup"]} T269108

Mentioned in SAL (#wikimedia-operations) [2024-02-22T13:05:02Z] <Emperor> ms-codfw set ACL {"read-only":["mw:backup"]} T269108

@jcrespo can you try now, please?

I constructed the appropriate ACL string thus:

matthew@tsk:~/puppet$ python3
Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from swift.common.middleware.acl import format_acl
>>> acl_data = { 'read-only': ['mw:backup'] }
>>> acl_string = format_acl(version=2, acl_dict=acl_data)
>>> print(acl_string)
{"read-only":["mw:backup"]}

Then, when authenticated as the mw:media user, ran swift auth to populate OS_STORAGE_URL and OS_AUTH_TOKEN, and then set an ACL thus:

curl -X POST -i -H "X-Auth-Token: $OS_AUTH_TOKEN" -H 'X-Account-Access-Control: {"read-only":["mw:backup"]}' $OS_STORAGE_URL

Which returns 204
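For anyone who prefers to script this step, the same request can be sketched in Python with only the standard library. The sketch only builds the request and does not send it; the URL and token are placeholders, and build_account_acl_request is a hypothetical helper name. As far as I understand, setting this header replaces any previous account ACL, so the full desired ACL should be passed each time.

```python
import json
import urllib.request


def build_account_acl_request(storage_url, auth_token, acl):
    """Build (but do not send) a POST that sets an account ACL.

    Compact, key-sorted JSON gives the same string that format_acl
    printed above for this input.
    """
    acl_header = json.dumps(acl, separators=(",", ":"), sort_keys=True)
    return urllib.request.Request(
        storage_url,
        method="POST",
        headers={
            "X-Auth-Token": auth_token,
            "X-Account-Access-Control": acl_header,
        },
    )


req = build_account_acl_request(
    "https://swift.example.invalid/v1/AUTH_example",  # placeholder, not a real endpoint
    "example-token",  # placeholder, use the value of OS_AUTH_TOKEN
    {"read-only": ["mw:backup"]},
)
print(req.get_method(), req.full_url)
print(dict(req.header_items()))
# To actually send it: urllib.request.urlopen(req)
```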

Now if I do swift stat | grep Access, I see:

X-Account-Access-Control: {"read-only":["mw:backup"]}

And if I log in as mw:backup I can now do e.g. swift list wikipedia-mediawiki-local-deleted, so I think the mw:backup user should now have the necessary read-only permissions to everything in the mw:media account you need to run backups?
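As a small cross-check of the header shown by swift stat, the following sketch reports which account ACL level (admin, read-write or read-only) a given user has. It only parses the header value; it does not contact Swift, and account_acl_level is a hypothetical helper name.

```python
import json


def account_acl_level(acl_header, user):
    """Return the strongest account ACL level granted to user, or None.

    acl_header is the JSON value of X-Account-Access-Control.
    """
    acl = json.loads(acl_header) if acl_header else {}
    for level in ("admin", "read-write", "read-only"):
        if user in acl.get(level, []):
            return level
    return None


header = '{"read-only":["mw:backup"]}'
print(account_acl_level(header, "mw:backup"))
```

A result of read-only and nothing stronger matches the requirements in the task description: read access without write or container-creation rights.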

Thanks a lot. As I mentioned in private, I will try running the automatic downloads again with the new user; if it works, we will be very close to resolving this. Thank you again!

jcrespo closed this task as Resolved. Feb 22 2024, 5:34 PM

It took some time to confirm this live, because the number of newly deleted files doesn't grow as fast as the "latest" versions, and the algorithm backs up the non-deleted ones first, but I can confirm that the count of deleted-file backups is growing with the new user:

root@db1204.eqiad.wmnet[mediabackups]> select count(*) FROM backups join files on backups.wiki = files.wiki AND backups.sha1 = files.sha1 where files.wiki=454 and status = 3 and backup_time > '2024-02-22 17:00:00' order by backup_time;
+----------+
| count(*) |
+----------+
|       23 |
+----------+
1 row in set (0.002 sec)

root@db1204.eqiad.wmnet[mediabackups]> select count(*) FROM backups join files on backups.wiki = files.wiki AND backups.sha1 = files.sha1 where files.wiki=454 and status = 3 and backup_time > '2024-02-22 17:00:00' order by backup_time;
+----------+
| count(*) |
+----------+
|       24 |
+----------+
1 row in set (0.003 sec)

I also checked the logs and found only 2 "errors" for backups of deleted files. Those are normal, expected 404s (not found errors), which are expected for 1 out of 100 files because files change state while backups run, and they will be fixed on the next backup round. They are not the permission errors I had before.

I cannot be 100% sure this is fixed: there could be some weird edge case that makes it fail for a particular wiki that is misconfigured, or a swift bug, or something strange, but I am now quite confident this is working. This has unblocked me to productionize the swift backup user on ms-backup hosts and resolve the parent task, so I consider this resolved.

Thank you!