Page MenuHomePhabricator

Minimize file leakage by expiring objects when using swift file-backend
Closed, ResolvedPublic5 Estimated Story Points

Description

Follow up after conversation at T320675

When writing new articles or modifying IPA notation it is likely that editors will utilize preview to do trial/error until they come up with IPA that sounds correct. Each of those failed attempts will leave an orphaned file behind. Similarly, this could happen a fair amount after rollout if the original IPA was wrong and needs fixing.

Possible approach

  • Fetch audio data at parsing time through onParserFirstCallInit hook
  • Persist as a file on PageSaveComplete as a DeferredUpdate making use of the cached audio data so to avoid hitting Google's API twice.
  • Make use of X-Delete-At as discussed in T320835#8434331 to expire un-used files

Acceptance Criteria
When using switch file-backend, files should have the header X-Delete-At set to a sensible date.

Follow up
When a common solution for expiring objects have been implemented this should be revisited. See T211661.

Event Timeline

JMcLeod_WMF set the point value for this task to 5.Oct 19 2022, 2:37 PM

@Eevans we can't really do this "multi-hook" approach. It doesn't work.

As mentioned in T320675#8330640, these are generated files, and if we keep the status quo and find later that there is significant leakage we could easily remove older files manually or via deleteOldPhonosFiles without any major repercussions on our end.

Is this good enough for your team until we see how everything falls into place, or do we need a "fancier" expiration strategy before we hit production?

@Eevans we can't really do this "multi-hook" approach. It doesn't work.

That's disappointing; What changed?

As mentioned in T320675#8330640, these are generated files, and if we keep the status quo and find later that there is significant leakage we could easily remove older files manually or via deleteOldPhonosFiles without any major repercussions on our end.

Is this good enough for your team until we see how everything falls into place, or do we need a "fancier" expiration strategy before we hit production?

I'm not sure; We've been operating under the assumption that some classes of leakage would be possible, but that the preview pattern would be solved by this issue. We're already starting from a position of some ambiguity with respect to utilization, my understanding was that previews could serve as a multiplier (i.e. were the data set becomes N * base, where N is the average number of previews per term). Will that turn "many hundreds of thousands" in to many million? If so -as you point out- we can always periodically truncate using deleteOldPhonosFiles, but to what extent will we need to monitor and operationalize that for production?

We should probably move this back to T320675 (since that is about requirements).

@Eevans we can't really do this "multi-hook" approach. It doesn't work.

That's disappointing; What changed?

Is there another ticket with details? A Slack discussion you can point me to?

That's disappointing; What changed?

Is there another ticket with details? A Slack discussion you can point me to?

I'm sorry but I think there isn't. The discussion happened on a call IIRC, but before getting too far into coding we realized that on a cache-miss there was no way to generate an audio file that wouldn't also do it for previews. The multi-hook approach works fine when we are editing because we can make use of PageSaveComplete but AFAIK there is no hook that will run on a cache-miss that will not also run on a preview.

We considered more elaborated solutions like having temporary files and promoting them after some time but that was adding too much complexity and felt a bit hacky.

Another idea was to emulate LRU by overwriting the files on parse (since there is no touch support) and use deleteOldPhonosFiles to remove files that haven't been used*. We think this is doable but the problem with this approach is that we'll be overwriting the same file over and over for as long as it lives and that's probably an issue(?)

Truth be told, we were hoping we could launch and see how things evolved before optimizing for storage. Maybe truncating once a year would be enough, we don't know.

*Used here has a weird definition. It means that no page containing a <phonos /> tag for that file has been parsed. Access to the file itself is irrelevant

My 2c.

That's disappointing; What changed?

Is there another ticket with details? A Slack discussion you can point me to?

I'm sorry but I think there isn't. The discussion happened on a call IIRC, but before getting too far into coding we realized that on a cache-miss there was no way to generate an audio file that wouldn't also do it for previews. The multi-hook approach works fine when we are editing because we can make use of PageSaveComplete but AFAIK there is no hook that will run on a cache-miss that will not also run on a preview.

Might sound stupid but you can introduce a hook in core if you want to.

Truth be told, we were hoping we could launch and see how things evolved before optimizing for storage. Maybe truncating once a year would be enough, we don't know.

I agree actually, we are speculating here a lot. I'd rather have this out of the door and then collect numbers. I would want to have some commitment of addressing this in let's say six months with actual data.

(Random note, I suggested a while ago you can just put this into ParserCache, not the file but the output of Google's response, if it's not a file I don't know the details, and then build the file on the fly. Just thinking out loud. I don't know the details of the problem.)

My 2c.

Truth be told, we were hoping we could launch and see how things evolved before optimizing for storage. Maybe truncating once a year would be enough, we don't know.

I agree actually, we are speculating here a lot. I'd rather have this out of the door and then collect numbers. I would want to have some commitment of addressing this in let's say six months with actual data.

We fully commit to revisit and deal with any issues that might arise from this project. That's not gonna be a problem.

(Random note, I suggested a while ago you can just put this into ParserCache, not the file but the output of Google's response, if it's not a file I don't know the details, and then build the file on the fly. Just thinking out loud. I don't know the details of the problem.)

You did and I forgot about it. I don't recall now if we discuss this as an option but we'll look into it.

@Eevans you mentioned Swift expiry in T320675#8317876. Would it be acceptable to set X-Delete-At to something like 3 months in the future, and if the file gets used a week prior to the expiration date, use describe to reset X-Delete-At for another 3 months? (Date ranges are TBD and can be adjusted as we go)

Wouldn't this solve any potential leakage?

Here is what I mean https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Phonos/+/858423

@Eevans you mentioned Swift expiry in T320675#8317876. Would it be acceptable to set X-Delete-At to something like 3 months in the future, and if the file gets used a week prior to the expiration date, use describe to reset X-Delete-At for another 3 months? (Date ranges are TBD and can be adjusted as we go)

Wouldn't this solve any potential leakage?

Here is what I mean https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Phonos/+/858423

Interesting, this is similar to T211661. The biggest difference being that what was proposed there would handle "touching" the X-Delete-At timestamp (probabilistically) via the rewrite middleware. I think an approach like this would work; You'd end up with something similar to LRU.

One problem here is that we don't have the object expirer deployed on our clusters. If we did, and we were to take an approach like this, we'd probably want to implement it as T211661 proposes, rather than rely on each implementation to handle it themselves. That said, I'm not sure that I think what T211661 proposes is best either (for example, it's a little too tightly coupled for my tastes). I have yet to comment with my concerns (or propose any alternatives), but roughly I was thinking of a) deploying the object-expirer to provide simple TTL-based retention, and b) proposing a side-car approach (like the object-expirer) that implements an actual LRU. Use-cases (like Phonos) that used Swift for caching could choose from either of those semantics.

At any rate, perhaps there is some middle ground. For instance, we could deploy the approach you propose now (including the object-expirer), but roll it back later when a more permanent solution becomes available. Or perhaps option (a) above would work for Phonos, and we could begin by setting the X-Delete-After header now, giving us the option to enable object-expirer later (if needed).

Thoughts?

/cc @MatthewVernon @KOfori

Two thoughts:

  1. maps are also thinking about expiry (from thanos-swift rather than ms-swift) cf T307184 ; a common approach might reduce wheel-reinvention.
  2. I'm quite opposed to adding functionality to our rewrite middleware; it makes me unhappy having something non-standard in the middle of our ms swift instances (and any rewrite-based approach wouldn't work on thanos, which doesn't have the rewrite middleware).

At any rate, perhaps there is some middle ground. For instance, we could deploy the approach you propose now (including the object-expirer), but roll it back later when a more permanent solution becomes available. Or perhaps option (a) above would work for Phonos, and we could begin by setting the X-Delete-After header now, giving us the option to enable object-expirer later (if needed).

Thoughts?

I'm happy with "middle ground". Lets roll with my approach and if we notice significant leakage we can mitigate it by enabling the object-expirer. If that happens, or whenever there is support for LRU, I'll revisit this and adjust accordingly.

Change 858423 had a related patch set uploaded (by Dmaza; author: Dmaza):

[mediawiki/extensions/Phonos@master] Set expiry header to minimize leakage

https://gerrit.wikimedia.org/r/858423

dmaza renamed this task from Split IPA audio generation into multiple hooks to Minimize file leakage by expiring objects when using swift file-backend.Dec 5 2022, 9:33 PM
dmaza updated the task description. (Show Details)

Change 858423 merged by jenkins-bot:

[mediawiki/extensions/Phonos@master] Set expiry header to minimize leakage

https://gerrit.wikimedia.org/r/858423

dom_walden added a subscriber: dom_walden.

On beta I confirmed that, every time the page is parsed or saved, the X-Delete-At header is set for a random date between 24 and 30 days in the future.

To test that it still works without Swift, I tested that audio files get generated on my local wiki.

I also tested that audio files get generated on testwiki (although I don't have access to testwiki to check that the expiration date is updated).

Test environments: