
Access request to deleted image files in the production Swift cluster
Closed, ResolvedPublic

Description

As part of Commons Upload Wizard's planned improvements, I'd like to request access to Commons deleted images in the backup cluster.
See T340546#9005987 for the initial conversation with @jcrespo.

Technical details

  • Download an initial dataset of all files deleted in a 1-year interval on Commons; this should roughly amount to 600k files (a sketch of such an extract follows this list)
  • no hard requirements for download concurrency. It would be great if you could let me know a reasonable value
  • store the dataset either in the Analytics Hadoop cluster or in an Analytics client machine, depending on the final size
  • use the dataset to develop experimental machine learning models
  • keep the dataset until development is over
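
For the first bullet above, a minimal sketch of how such a one-year extract might be listed from MediaWiki's filearchive table, assuming read access to a commonswiki replica and the pymysql driver; the host, credentials, and timestamp window below are placeholders, not real values:

```python
# Sketch only: list Commons files deleted within a one-year window.
# Host, credentials, and the timestamp window are placeholders.
import pymysql

conn = pymysql.connect(
    host="commonswiki.replica.example",  # placeholder replica host
    user="research",                     # placeholder user
    password="...",
    database="commonswiki",
)

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT fa_name, fa_deleted_timestamp
        FROM filearchive
        WHERE fa_deleted_timestamp BETWEEN %s AND %s
        """,
        ("20220701000000", "20230701000000"),  # MediaWiki-style binary(14) timestamps
    )
    rows = cur.fetchall()

print(f"{len(rows)} files deleted in the window")
```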

Access schedule

  • The initial dataset collection requires a one-off access
  • if we then decide to productionize the models, a quarterly access will be required (roughly)

Update: file names

@Ladsgroup, @MatthewVernon - please find below 5 attachments, each containing a list of deleted file names we'd like to access:

Can you please let me know when the files are available in stat1008? Thanks a lot!

Notes

  • 500 px thumbnails are enough, we don't need full-resolution files
  • would you be so kind as to group image files by list? One directory per list is perfect, i.e., album covers, books, logos, screenshots, and out of domain
  • file names come from deletion requests that ended in a deletion, like this one. Therefore, files that got restored/undeleted afterwards might be present in the lists
  • the last attachment is gzipped to bypass phab's limit

Event Timeline

We should discuss this a bit, as this changes not only the initial hypothesis but also the restrictions of your project:

  1. As I mentioned, backup access is possible, but I thought it was meant for a one-time study (download, use, then delete), not regular access. The backup infrastructure was built to be used just for recovery, which means resources are very limited. Reusing it for a one-off task is not a big deal, but continuous access may need rethinking and may not be enough; the infrastructure wasn't planned for that and it may not even work. For example, backup access will be unavailable every time a recovery happens. Availability of the service is towards the low 90%s (a host can be down for days due to lack of redundancy).
  2. If you plan to store files long term, and not just "download and delete", Legal has to be involved, because they require individual files to be deleted on the spot, without delay and without warning. This requires explicit planning of the storage strategy and handling of legal requests. For example, using non-public files to generate a model may mean that you have to discard that model ASAP every time Legal requests it.
  3. Concurrency will be low, because of 1, but that shouldn't be a big issue because there are comparatively few deleted files compared to public ones.
  4. "keep the dataset until development is over" is a bit vague. Regarding access (storage is its own can of worms, as mentioned in 2), I'd prefer if you set a specific time, at least for this first batch, even if you later request to extend it for any reason, no worries.
  5. What is the preferred way in which you expect access? The ideal way is for you to set up a script that reads the metadata (MySQL) and downloads using an S3-like interface (see the sketch after this list). Would that work? Please note the dataset, even the deleted one, wouldn't be static at any point in time: files will be undeleted, hard deleted and in general renamed in production (backups use a different naming system that is static and simpler, but that is independent of the latest file status in production).
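
To make point 5 concrete, here is a minimal sketch of the kind of script described there: read the metadata from MySQL, then download each object over an S3-compatible interface with boto3. The endpoint, bucket name, credentials, and key layout are all assumptions for illustration, not the actual interface of the backup cluster:

```python
# Sketch only: read deleted-file metadata from MySQL and download each object
# through an S3-compatible API. Endpoint, bucket, credentials, and key layout
# are assumptions, not the real backup cluster's values.
import pymysql
import boto3

db = pymysql.connect(host="metadata.example", user="research",
                     password="...", database="commonswiki")

s3 = boto3.client(
    "s3",
    endpoint_url="https://media-backups.example",  # hypothetical endpoint
    aws_access_key_id="...",
    aws_secret_access_key="...",
)

with db.cursor() as cur:
    cur.execute("SELECT fa_name, fa_storage_key FROM filearchive LIMIT 10")
    for name, storage_key in cur.fetchall():
        if not storage_key:
            continue  # some rows have no storage key
        # Hypothetical bucket and key naming; the real backup layout is
        # different, static, and independent of the live production naming.
        s3.download_file("media-backups", storage_key.decode(), name.decode())
```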

These are some starting topics to discuss; none seems too critical, except #2 (legal constraints). Feel free to answer those that seem reasonable, and we can connect to discuss the ones that are unclear or problematic.

For example, there is a chance that read-only access to production may be preferred, e.g. a PHP job that downloads deleted files, for which there is no current API.

  1. Initial access for a one-time study is really all we need for now, and if the data was ready for us to begin work on in early January that'd be perfect. If we succeed in creating a model of likelihood-of-deletion that has some predictive accuracy, we can worry about keeping the model up-to-date with more recent data later on.
  2. We can obviously permanently delete anything from our dataset whenever legal requests it; all we need is for the requests they already send to you to also come to us. Deleting and/or retraining a model if one or more of the images it was trained on was permanently deleted seems to me like overkill, but I guess it's up to legal. @SSpalding-WMF any thoughts?
  3. No problem
  4. 6 months?
  5. "The ideal way is for you to set up a script that reads the metadata (MySQL) and downloads using an S3-like interface" - sounds fine to me. Writing a PHP job to do the downloads seems fine too.

OK, then it seems we are fine for the most part, so I will start working on access. As this is the first time such access has been requested, please be patient: it won't be as fast as a regular access request, but it shouldn't be too hard either. I will take the opportunity to document the grant creation in case there are other requests in the future.

However, please keep Trust and Safety in the loop. We (Data Persistence) have a procedure for being notified of deletions by email and for confirming the deletion from backups (they do it from production directly), and I am making an agreement with them on such a procedure a requirement before providing the access. We involved Legal when we set up backups, and they should be involved in this case too. In particular, keeping files forever is not something I am comfortable with, even though deleted files are not really private files. I'd prefer if your policy was to reload them from time to time to avoid accidents and leaks.

I may be able to provide a list of deleted files and their locations in backups for a particular point in time and date range, for your convenience (with the exception of renamed/hard-deleted/undeleted files). Re: "use the dataset to develop experimental machine learning models", note that many of the deleted objects are legitimate objects and are not necessarily deleted because they are out of scope, vandalism, etc., but sometimes because they are duplicates, were uploaded in higher quality, or were renamed with the wrong procedure (sorry if this was already obvious and accounted for).

One last thing: Legal rarely comments in public here on Phabricator; you may want to reach them directly.

While checking what I need to apply the change, I found I need 2 additional data points:

  • The list of IPs the files will be downloaded to, so we can open a hole in the firewall
  • A user name relevant to the project you are working on. For example, existing users are: 'media-backup-generation' and 'media-backup-recovery'
jcrespo triaged this task as High priority. Nov 8 2023, 8:10 PM

Not really sure about either of these; let us talk to people doing similar image-analysis work and get back to you...

Feel free to contact other SREs who can support you (e.g. those in Data Engineering, as they may know more about Hadoop), and they can get back to me directly too, if that helps.

@jcrespo, would it be possible to use the internal reverse proxy to directly download deleted images via HTTP, like here?

No, that won't be possible. Backups will only be reachable through the dedicated backup access path, which is highly restrictive and doesn't use the Swift protocol. I wonder if what you may want is production Swift access instead, which won't come with the same limitations as backup access, and which Matthew (on CC here) advised using instead of backups.

In any case, we are still blocked on T&C ok, and we will need the IPs of the machines that will be used for downloading, for firewall access.

Thanks for the heads-up about production vs. backup. The backup access request was merely based on my understanding that deleted images were stored there. If production access is a better option, then let's definitely opt for it; CC @MatthewVernon.

In any case, we are still blocked on T&C ok

Is there anything we can do from our side to unblock?

and we will need the IPs of the machines that will be used for downloading, for firewall access.

Looping in @BTullis: can you please list the stat1005, stat1008, and stat1009 IPs? These 3 boxes look like good candidates for downloading the requested dataset. We can then decide to store it in Hadoop or keep it locally, depending on the final size.

jcrespo renamed this task from Access request to deleted image files in the backup cluster to Access request to deleted image files in the production Swift cluster. Dec 4 2023, 11:52 AM

Updating title to reflect current request.

This is currently on the clinic duty workboard, but it is outside of clinic duty's normal access requests. @jcrespo or @MatthewVernon, who do you think should approve access for this request? Also, once approved, who should implement access?

There are ongoing conversations with Legal, who don't write here. Don't worry: deployment of this should be handled by the Data Persistence team (I think), at least any changes on Swift that may be needed.

herron subscribed.

Hello! I'm removing the access request tag from this task as it doesn't appear actionable by SRE clinic duty. Please re-add it if/when clinic duty attention is needed. Thanks!

I think we've reached an agreement: we'll send a list of file names and DB admins will place an archive file in the analytics clients for us to pick up.
stat1008 would be the target machine, CC @Ladsgroup @MatthewVernon for a final confirmation.

We are in the process of extracting album covers, but I want to mention that we are not going to extract and send over the 214,000 out-of-scope images. Please sample them.

mvernon@stat1008:~$ ls -lsh
total 2.5G
877M -rw-r--r-- 1 root root 877M Jan 24 16:28 album_covers.tar.bz2
1.7G -rw-r--r-- 1 root root 1.7G Jan 24 17:44 screenshots.tar.bz2

That's the album covers and screenshots done. A few notes:

  • We needed to extract the sha1 from filearchive, roughly: select fa_name, fa_sha1 from filearchive where fa_name = '{}';
    • This of necessity skipped objects with ' in their name
    • Also, where there was more than one match we selected one arbitrarily
  • Then we used the sha1 to find the right object in Swift (P55535); you get the originals (since there might not be suitable thumbnails)
  • Then we used transferpy to copy the results to stat1008

Hopefully that's enough to be getting on with for now; books and logos to follow in due course.

I've now done books.tar.bz2; of the 2527 objects you requested, filearchive knew of 2453, and the tarball contains 2441 images.

I've now done logos.tar.bz2, which is a 4.7G file; of the 11,153 objects you requested, filearchive contained 10,770 of them, and I was able to download 10,764 of those.

For this extract we used a slightly improved process: we extracted fa_storage_key from filearchive, which simplifies finding the right object in Swift, and added quoting for ' in object names - this is P55685. That meant we could then simplify the extraction process a little, as at P55686.
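
For what it's worth, my understanding of how fa_storage_key maps to a deleted-zone object path is sketched below: MediaWiki hashes deleted files under three one-character prefix levels of the storage key. The container naming here is an assumption for illustration, not necessarily what P55685/P55686 actually do, and parameterized queries would sidestep the manual quoting of ' in object names:

```python
# Sketch only: build the likely deleted-zone path for a given fa_storage_key,
# following MediaWiki's hashed layout for deleted files ("a/ab/abc/<key>").
# The container name below is an assumption, not a confirmed production value.
def deleted_object_path(storage_key: str) -> str:
    prefix = "/".join(storage_key[:i] for i in range(1, 4))
    return f"{prefix}/{storage_key}"

def guess_deleted_container(storage_key: str) -> str:
    # Hypothetical sharded container name for Commons deleted files.
    return f"wikipedia-commons-local-deleted.{storage_key[:2]}"

key = "examplestoragekey.png"  # placeholder fa_storage_key value
print(guess_deleted_container(key), deleted_object_path(key))
```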

I'd like to draw your attention to @Ladsgroup's comment above about the out_of_domain images; we need a smaller set to extract (the frontend servers don't have the capacity to store that many images if nothing else!).

I hope the tarballs we've made and put on stat1008 are useful.

[I should say: these are all originals, because we wouldn't necessarily have thumbnails for deleted objects and couldn't straightforwardly generate them either]

Thank you @MatthewVernon for your work, much appreciated!
Since I was assuming thumbnails, I thought it wouldn't be a problem to get as many out-of-domain images as possible. I'll sample them down to 12k then. Please note that this sample is a hard requirement to enable a fair evaluation playground for logos.
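
A minimal sketch of how the 12k sample could be drawn, assuming the full out-of-domain list sits in a local text file with one file name per line; the file paths and seed are placeholders:

```python
# Sketch only: draw a reproducible 12,000-name sample from the full
# out-of-domain list. Input/output paths are placeholders.
import random

with open("out_of_domain_all.txt") as f:
    names = [line.strip() for line in f if line.strip()]

random.seed(42)  # fixed seed so the sample can be reproduced
sample = random.sample(names, 12_000)

with open("out_of_domain_sample.txt", "w") as f:
    f.write("\n".join(sample) + "\n")
```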

@MatthewVernon, please find attached the out-of-domain sample:


Looking forward to it, thanks again!

I can't access that, I think it's "Restricted" and/or not visible to me?

Full error message:
Access Denied: Restricted File
You do not have permission to view this object.
Users with the "Can View" capability:

mfossati (Marco Fossati) can take this action.
The user who uploaded a file can always view and edit it. Files attached to objects are visible to users who can view those objects. Thumbnails are visible only to users who can view the original file.

Thanks for the heads up, should be fixed now

@mfossati Of the 12,000 objects you named, I could find 11,608 in the database, and was able to download 11,596 objects.
out_of_domain.tar.bz2 is 23G, and available on stat1008 like the others.

Is that everything you need for this ticket now?

MatthewVernon claimed this task.