
Establish Phonos production storage requirements
Open, Needs TriagePublic

Description

From T317417#8280934:

How big are the files themselves (min, max, average)? How many do we expect; What's the anticipated total storage? Read & write rates?

We aren't setting a maximum file size currently, but we do have a maximum amount of IPA that can be passed to the parser tag. That is currently set to 300 bytes (T316641). The generated MP3 in that case is somewhere in the neighborhood of 40-50KB, but on average files will probably be 3-5KB. The number of files generated depends on how well the communities adopt this feature. If fully rolled out across all wikis (we use Phonos everywhere we show IPA), we're probably looking at many hundreds of thousands of files, but I think it will be quite a while before we reach that point. Read rates are estimated at around ~1.8 million a month (T307625). Write rates are harder to estimate, again depending on how communities adopt this feature. Since most wikis have a Template:IPA or something similar, the Phonos parser tag will likely be put there. Thus, the initial rollout will see a very high number of writes (directly proportional to the number of transclusions of Template:IPA), but we are building our own type of job and making it go slower than the normal job queue rate (T318086). After the rollout, writes will likely be comparatively rare. Note, however, that some communities might prefer to opt in to using Phonos on a case-by-case basis, making the rollout longer and hence write rates lower.

Requirements (strawperson):

  • 500,000 objects (est.)
  • 5KB/50KB (avg/max) in size
  • 1.8M reads/month (est.)
  • ?? writes/month
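Taking the strawperson figures above at face value, a back-of-the-envelope total (a sketch; all inputs are the estimates listed, not measurements):

```python
# Back-of-the-envelope storage totals for the strawperson requirements above.
OBJECTS = 500_000        # estimated object count
AVG_KB, MAX_KB = 5, 50   # average / maximum file size in KB

avg_total_gb = OBJECTS * AVG_KB / 1024 / 1024
max_total_gb = OBJECTS * MAX_KB / 1024 / 1024

print(f"average case: ~{avg_total_gb:.1f} GB")  # ~2.4 GB
print(f"worst case:   ~{max_total_gb:.1f} GB")  # ~23.8 GB
```

Even the worst case is modest in byte terms, which is consistent with the later discussion that object count, not total size, is the main cost driver.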

Additional questions:

  • How will retention be managed?

Event Timeline

Eevans updated the task description.
Eevans added a subscriber: MusikAnimal.
Eevans renamed this task from "Establish Phonos production storage requirements." to "Establish Phonos production storage requirements". Oct 12 2022, 7:56 PM
Eevans updated the task description.

Thanks for the info; apologies for not following up on this sooner. I asked about utilization because we (Data-Persistence) are trying to engage with projects that need storage earlier: understand the requirements, be in a position to offer feedback, and plan accordingly. Ideally, earlier would be earlier than where we are now, but it would still be great to run through your storage requirements.

Since this seems out of scope for this ticket (sorry about that), and since I didn't find a suitable existing issue, I've stubbed out T320675 for this.

I should have linked to it in my reply at T317417#8280934, but we have T314789 and T309315 from much earlier in this project where we tried to consult with Data Persistence. We never heard anything (and we're not faulting anyone -- we know people are busy!) and Legoktm came through to clarify Swift is what we wanted, so we proceeded. What we're doing is basically identical to what MediaWiki-extensions-Score does, so we had that added confidence as well -- also noting our files are considerably smaller than Score's. We were also told storage in general isn't really a problem.

That said, we'd love to hear any feedback if you have any! We hope the route we took is technically sound and won't cause any problems.

> Thanks for the info; apologies for not following up on this sooner. I asked about utilization because we (Data-Persistence) are trying to engage with projects that need storage earlier: understand the requirements, be in a position to offer feedback, and plan accordingly. Ideally, earlier would be earlier than where we are now, but it would still be great to run through your storage requirements.
>
> Since this seems out of scope for this ticket (sorry about that), and since I didn't find a suitable existing issue, I've stubbed out T320675 for this.
>
> I should have linked to it in my reply at T317417#8280934, but we have T314789 and T309315 from much earlier in this project where we tried to consult with Data Persistence.

Oh, I missed these entirely! Would you rather we reopen and use one of these (T314789 presumably), or continue here?

> We never heard anything (and we're not faulting anyone -- we know people are busy!) and Legoktm came through to clarify Swift is what we wanted, so we proceeded. What we're doing is basically identical to what MediaWiki-extensions-Score does, so we had that added confidence as well -- also noting our files are considerably smaller than Score's. We were also told storage in general isn't really a problem.

We haven't always had the capacity to engage on things like this as much as we'd like, and we're looking to change that.

> That said, we'd love to hear any feedback if you have any! We hope the route we took is technically sound and won't cause any problems.

It seems reasonable. I think the total size here is less of a concern than the total number of files (there is a cost associated with that too). There is also some concern about how retention is managed, if we're treating this as a cache.

> Oh, I missed these entirely! Would you rather we reopen and use one of these (T314789 presumably), or continue here?

No worries! Here is fine. Those old tasks are out of date as it is. For one, we're not using WAV anymore, and:

> There is also some concern about how retention is managed, if we're treating this as a cache.

Let me be clear that this is NOT a proper caching technique (though we did envision it that way initially). What we want is persistent storage, pretty much exactly like Extension:Score. Files are shared globally, last indefinitely (but note we do have deleteOldPhonosFiles.php), etc.

> I think the total size here is less of a concern than the total number of files (there is a cost associated with that too).

Interesting! It's in the task description, but to reiterate, due to the nature of rolling this out, it will probably be a good chunk of time before we have a meaningfully high number of files. What issues do you foresee, and how might we overcome them?

> Oh, I missed these entirely! Would you rather we reopen and use one of these (T314789 presumably), or continue here?
>
> No worries! Here is fine. Those old tasks are out of date as it is. For one, we're not using WAV anymore, and:
>
> There is also some concern about how retention is managed, if we're treating this as a cache.
>
> Let me be clear that this is NOT a proper caching technique (though we did envision it that way initially). What we want is persistent storage, pretty much exactly like Extension:Score. Files are shared globally, last indefinitely (but note we do have deleteOldPhonosFiles.php), etc.

Is there no point when a pronunciation is altered or replaced, invalidating the mp3? Or removed? What happens if there is a change to the tooling used to create the mp3; Is there no scenario where we'd want to replace legacy audio files?

> I think the total size here is less of a concern than the total number of files (there is a cost associated with that too).
>
> Interesting! It's in the task description, but to reiterate, due to the nature of rolling this out, it will probably be a good chunk of time before we have a meaningfully high number of files. What issues do you foresee, and how might we overcome them?

I don't know that there will be, but the total number of files caught some by surprise (though I'm still not sure we know the total with any precision).

> Is there no point when a pronunciation is altered or replaced, invalidating the mp3? Or removed?

I suppose that's where the deleteOldPhonosFiles.php script comes into play. Seemingly this is the only thing Extension:Score does to guard against the same concern. We don't have a way to keep track of which files are orphans after the fact, but if the concern is large enough, I believe we should be able to use some hook to detect when a file is about to become orphaned and delete it. Though, now that I think about it, that file could also be used on another wiki (or even another page on the same wiki) so maybe that's not the best way to go about it. It's worth noting that if files are ever missing (which currently can only happen after running the maintenance script), the file gets re-generated the next time the page is parsed. So I guess the aforementioned scenario still isn't a deal-breaker since the file will eventually get regenerated anyway.

> What happens if there is a change to the tooling used to create the mp3; Is there no scenario where we'd want to replace legacy audio files?

If we know there's a change to the rendering engine that warrants invalidating the existing storage, we can bump Engine::CACHE_VERSION.
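To illustrate why a version bump regenerates everything: the version is presumably an input to the file-name derivation, so bumping it changes every key at once. The function below is a hypothetical sketch, not the extension's actual code:

```python
import hashlib

CACHE_VERSION = 1  # stands in for Engine::CACHE_VERSION (hypothetical sketch)

def storage_key(ipa: str, text: str, lang: str, version: int = CACHE_VERSION) -> str:
    """Deterministic file name derived from the inputs. Bumping the version
    changes every key, so new files are written while the old ones remain
    in storage until a cleanup script removes them."""
    payload = f"{version}|{lang}|{ipa}|{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest() + ".mp3"

# Same inputs give the same key; a version bump gives a different one.
assert storage_key("/həˈloʊ/", "hello", "en") == storage_key("/həˈloʊ/", "hello", "en")
assert storage_key("/həˈloʊ/", "hello", "en") != storage_key("/həˈloʊ/", "hello", "en", version=2)
```

This is exactly the "new file under a new file name" behavior discussed below: the old objects are never overwritten, only orphaned.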

> Is there no point when a pronunciation is altered or replaced, invalidating the mp3? Or removed?
>
> I suppose that's where the deleteOldPhonosFiles.php script comes into play. Seemingly this is the only thing Extension:Score does to guard against the same concern.

I'm not familiar with Extension:Score either, but it could very well be that the same concerns apply there as well.

> We don't have a way to keep track of which files are orphans after the fact, but if the concern is large enough, I believe we should be able to use some hook to detect when a file is about to become orphaned and delete it. Though, now that I think about it, that file could also be used on another wiki (or even another page on the same wiki) so maybe that's not the best way to go about it. It's worth noting that if files are ever missing (which currently can only happen after running the maintenance script), the file gets re-generated the next time the page is parsed. So I guess the aforementioned scenario still isn't a deal-breaker since the file will eventually get regenerated anyway.

What is the cost of regeneration?

> What happens if there is a change to the tooling used to create the mp3; Is there no scenario where we'd want to replace legacy audio files?
>
> If we know there's a change to the rendering engine that warrants invalidating the existing storage, we can bump Engine::CACHE_VERSION.

If I'm reading the code correctly, this would result in a new file (under a new file name) being written. Wouldn't that de facto regenerate all of the mp3s (while leaving the old ones in storage)?

It seems like there are in fact scenarios where we could "leak" storage, where a change of some kind would result in obsolete or unneeded mp3s being left behind. If we can establish that this "leakage" is sufficiently small, then we can probably safely ignore it. For example, if we knew we'd leak 50k files over the next 10 years (say ~250MB), that certainly wouldn't justify spending any amount of engineering effort to avoid. But if (for example) changing Engine::CACHE_VERSION leaks 100% of the storage set each time it's changed, we'll definitely want some kind of answer for reliably dealing with that. Maybe there are examples between those extremes too.
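The two extremes above can be put in rough numbers using the ~5KB average from the task description (a sketch; the file counts are the hypotheticals from this comment):

```python
AVG_KB = 5  # average file size from the task description

def leaked_mb(files: int, avg_kb: int = AVG_KB) -> float:
    """Storage consumed by leaked (orphaned) files, in MB."""
    return files * avg_kb / 1024

# The "50k files over 10 years (~250MB)" scenario:
print(f"{leaked_mb(50_000):.0f} MB")   # ~244 MB

# A CACHE_VERSION bump leaking the full 500k-object set in one go:
print(f"{leaked_mb(500_000):.0f} MB")  # ~2441 MB per bump
```

In byte terms even the bad case is small; the concern raised later is really the object count, which grows by the full set size on every unswept version bump.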

> What is the cost of regeneration?

We use Google TTS in production, so it is in fact not free, but for a single file it's a trivial cost. Performance-wise -- which I assume is what you meant -- I think it's relatively cheap. One external request made from the backend and then storing the file. Note that en masse (say an editor adds a <phonos> tag to a highly-transcluded template), the file generation is done through a PhonosIPAFilePersistJob that we will throttle via $wgJobBackoffThrottling. Our plan was to have this set to be within Google's API limits, which currently is 10,000 requests per minute (T316009), but we will surely make this much slower. Even at full speed, I don't see all the operations for a single file generation finishing in just 6 ms, so don't be alarmed when you read that! I don't know what rate we'll go with, but if you have recommendations please do let us know :)
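To get a feel for what a throttled job rate means for the initial rollout, a quick estimate (the 100 jobs/minute rate here is a placeholder for illustration, not a decided value):

```python
def rollout_hours(total_files: int, jobs_per_minute: float) -> float:
    """Hours needed to generate total_files at a throttled job rate."""
    return total_files / jobs_per_minute / 60

# 500k files at a hypothetical 100 jobs/minute -- far below the
# 10,000 requests/minute Google quota mentioned above:
print(f"~{rollout_hours(500_000, 100):.0f} hours")  # ~83 hours (about 3.5 days)
```

Even a rate two orders of magnitude below the API quota completes a full-scale backfill within days, which supports the plan to run the job queue well below full speed.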

> If we know there's a change to the rendering engine that warrants invalidating the existing storage, we can bump Engine::CACHE_VERSION.
>
> If I'm reading the code correctly, this would result in a new file (under a new file name) being written. Wouldn't that de facto regenerate all of the mp3s (while leaving the old ones in storage)?

Correct. If we ever were to bump the cache version, we would want to also run the deleteOldPhonosFiles.php script to delete all files created before that time. I will add a comment in the code to remind us about this.

> It seems like there are in fact scenarios where we could "leak" storage, where a change of some kind would result in obsolete or unneeded mp3s being left behind. If we can establish that this "leakage" is sufficiently small, then we can probably safely ignore it. For example, if we knew we'd leak 50k files over the next 10 years (say ~250MB), that certainly wouldn't justify spending any amount of engineering effort to avoid. But if (for example) changing Engine::CACHE_VERSION leaks 100% of the storage set each time it's changed, we'll definitely want some kind of answer for reliably dealing with that. Maybe there are examples between those extremes too.

It's hard to say at this stage how many orphaned files we will have. I suspect most of the time, editors will hear that Phonos works and not touch it ever again. What we do expect is some trial/error with previewing when someone adds a new IPA transcription to a page. I'll guesstimate this is at most 10 files for a new page, but probably on average just a few (?). We will just have to see how editors take to using the feature. We thought about disabling file generation in preview, but figured editors would actually prefer to have it, as they can tweak the IPA until it sounds correct.

I'll get a Slack thread going with the team about this, and invite you to it. I'm thinking it's probably worthwhile to do some prevention of leaking storage. If we can reliably detect that we're in preview, we can perhaps use a short-lived cache to keep track of each file name, and then on save delete any that aren't being used anymore. A similar tactic may be possible even across subsequent saves of a page (but the cache misses might be high in that case).

Change 842515 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[mediawiki/extensions/Phonos@master] Engine: add reminder to run maint script after bumping CACHE_VERSION

https://gerrit.wikimedia.org/r/842515

Change 842515 merged by jenkins-bot:

[mediawiki/extensions/Phonos@master] Engine: add reminder to run maint script after bumping CACHE_VERSION

https://gerrit.wikimedia.org/r/842515

To summarize what was discussed on Slack:

  • If each preview results in the storage of an mp3, actual storage will be a multiple of expected/live storage
    • T320835 was created to ameliorate this by storing mp3s to memcache, and persisting with a deferred update from the PageSaveComplete hook
  • Periodic runs of deleteOldPhonosFiles.php (or use of Swift expiry) could be used to limit unbounded growth, but would be heavy-handed/indiscriminate
  • LRU would be a more optimized approach to bounding growth, but would be difficult to implement without the last access time of objects

To summarize what was discussed during a Data Persistence meeting earlier today (in no particular order):

  • The total (expected) volume of data for Phonos is small enough that it should not be an issue, but there are still costs associated with the number of objects in a Swift container (because of indexing requirements). The "hundreds of thousands" discussed should be fine; millions could become a problem. There is some concern as to the provenance of the estimate ("hundreds of thousands"), and whether the potential for file leakage is great enough to inflate it substantially.
  • Whether or not this use case can be categorized as "cache" has come up numerous times, but nomenclature notwithstanding there was consensus that what we have here is secondary data. It isn't the authoritative or canonical form, it's generated (and could be regenerated if necessary). Persisting these audio files is really an optimization; We're optimizing for performance and/or cost. In this sense, it is cache-like enough to think about it in the same cost:benefit terms.
  • There was consensus that estimating costs would be a worthwhile exercise. The costs of a cache miss (monetary if the external service is fee-based, and/or latency), and the costs of storing all files in Swift, in perpetuity.
  • There was some discussion of alternate caching/storage strategies. Memcache was one suggestion, and this would make LRU possible. Another idea was to leverage the MediaWiki parser cache (storing the audio as an option). The latter is interesting because it establishes a dependent relationship between the page and any IPA terms contained. Both of these suggestions would make the trade-off of additional external requests (regenerating audio) in favor of less expensive/complex cache management. To argue that any of these (including the current implementation) is superior, though, would depend upon a cost:benefit analysis (see above).
  • There seemed to be general agreement that having in place a plan to proactively measure/monitor data set growth and usage, a contingency plan if reality differs significantly from expectation (deleteOldPhonosFiles.php presumably?), and a commitment from Community-Tech to revisit longer-term if necessary, would satisfy any concerns over unexpected/aberrant utilization.

Since T320835: Minimize file leakage by expiring objects when using swift file-backend appears to be in jeopardy (see: T320835#8361202), perhaps we can revisit some of these concerns:

>   • The total (expected) volume of data for Phonos is small enough that it should not be an issue, but there are still costs associated with the number of objects in a Swift container (because of indexing requirements). The "hundreds of thousands" discussed should be fine; millions could become a problem. There is some concern as to the provenance of the estimate ("hundreds of thousands"), and whether the potential for file leakage is great enough to inflate it substantially.
>   • Whether or not this use case can be categorized as "cache" has come up numerous times, but nomenclature notwithstanding there was consensus that what we have here is secondary data. It isn't the authoritative or canonical form, it's generated (and could be regenerated if necessary). Persisting these audio files is really an optimization; We're optimizing for performance and/or cost. In this sense, it is cache-like enough to think about it in the same cost:benefit terms.
>   • There was consensus that estimating costs would be a worthwhile exercise. The costs of a cache miss (monetary if the external service is fee-based, and/or latency), and the costs of storing all files in Swift, in perpetuity.
>   • There was some discussion of alternate caching/storage strategies. Memcache was one suggestion, and this would make LRU possible. Another idea was to leverage the MediaWiki parser cache (storing the audio as an option). The latter is interesting because it establishes a dependent relationship between the page and any IPA terms contained. Both of these suggestions would make the trade-off of additional external requests (regenerating audio) in favor of less expensive/complex cache management. To argue that any of these (including the current implementation) is superior, though, would depend upon a cost:benefit analysis (see above).

Can we take a stab at estimating these costs? Is there a monetary value we can assign to the generation of an mp3 (perhaps that's something beyond a certain quota)? What is the latency cost of a cache miss?

>   • There seemed to be general agreement that having in place a plan to proactively measure/monitor data set growth and usage, a contingency plan if reality differs significantly from expectation (deleteOldPhonosFiles.php presumably?), and a commitment from Community-Tech to revisit longer-term if necessary, would satisfy any concerns over unexpected/aberrant utilization.

If we expect to be leaning more heavily on this (i.e. if it's less contingency plan and more SOP), we might need to be that much more proactive about monitoring and alerting of utilization, and having in place a concrete (and tested) plan (or even automation).

I originally read T320675#8330640 as giving the go-ahead, but with the shared understanding there will be proper instrumentation and monitoring on our end moving forward. We can certainly commit to that. Our first pilot wikis (T316013) are comparatively very small, but will give us a better idea of the costs you're asking about, monetary or otherwise.

Every page that uses Phonos is placed in a tracking category. So, it actually shouldn't be too hard to write a maintenance script to find all the files that are being used, and with that we'd be able to identify which remaining unused files there are. Basic idea is to loop through the category on each wiki where the extension is installed (which in the end will only be a few dozen or so, I think), use the RESTBase API to quickly fetch HTML, an XML parser to identify the Phonos elements and parse out the href attribute, and finally search the file system, which given the two hash levels should be pretty efficient. I do not propose this as a long-term solution unless we're confident it's efficient enough, but perhaps it's worthwhile to have it for the initial rollout and early monitoring.
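The core of that proposed audit reduces to a set difference between the files referenced by pages and the files present in storage. A minimal sketch (the category crawl, HTML fetching, and backend listing are stubbed out; all names and paths are hypothetical):

```python
def find_orphans(referenced: set[str], stored: set[str]) -> set[str]:
    """Files that exist in storage but are no longer referenced by any page."""
    return stored - referenced

# referenced: href values parsed out of Phonos elements on pages in the
# tracking category, across all wikis where the extension is installed.
# stored: object listing from the file backend (two-level hash paths).
referenced = {"a1/b2/abc123.mp3", "c3/d4/def456.mp3"}
stored = {"a1/b2/abc123.mp3", "c3/d4/def456.mp3", "e5/f6/stale.mp3"}
assert find_orphans(referenced, stored) == {"e5/f6/stale.mp3"}
```

The expensive parts are building the two sets, not the comparison itself, so the approach scales with the number of tracked pages and stored objects rather than their cross-product.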

> What is the latency cost of a cache miss?

Another request to Google, which in our testing seems to be relatively fast. Larger IPA payloads could mean it takes longer, but I think in most cases it would be a trivial amount from a user perspective. If it's helpful, I think we can easily keep track of the response times.

> I originally read T320675#8330640 as giving the go-ahead, but with the shared understanding there will be proper instrumentation and monitoring on our end moving forward. We can certainly commit to that. Our first pilot wikis (T316013) are comparatively very small, but will give us a better idea of the costs you're asking about, monetary or otherwise.

T320675#8330640 was primarily about capturing discussion and the resulting consensus from the Data-Persistence side for posterity's sake; I think your reading was accurate. Those conclusions though were based on the understanding that the leak vector created by previews was going to be solved. T320675#8330640 would seem to indicate that this is no longer the case. As I've mentioned elsewhere, that issue sounds like it would have a multiplicative effect: it would make the dataset N times bigger, where N is the average number of previews per term.

> Every page that uses Phonos is placed in a tracking category. So, it actually shouldn't be too hard to write a maintenance script to find all the files that are being used, and with that we'd be able to identify which remaining unused files there are. Basic idea is to loop through the category on each wiki where the extension is installed (which in the end will only be a few dozen or so, I think), use the RESTBase API to quickly fetch HTML, an XML parser to identify the Phonos elements and parse out the href attribute, and finally search the file system, which given the two hash levels should be pretty efficient. I do not propose this as a long-term solution unless we're confident it's efficient enough, but perhaps it's worthwhile to have it for the initial rollout and early monitoring.

This is good information to have!

> What is the latency cost of a cache miss?
>
> Another request to Google, which in our testing seems to be relatively fast. Larger IPA payloads could mean it takes longer, but I think in most cases it would be a trivial amount from a user perspective. If it's helpful, I think we can easily keep track of the response times.

Where I was going with this is that I think we can think of this in the same terms that we do any other caching problem. You could store everything, unconditionally, in perpetuity, but you could also store nothing (as it's an optimization after all). There is quite a bit of middle ground though (assuming we can implement the retention semantics), and the right answer is presumably a function of cost v. benefit. If we could articulate those associated costs and benefits, then perhaps there are alternatives here that we've shut ourselves off from.

I'm not necessarily trying to take you back to the drawing board here (and it would be fair to drop refactoring of the extension under a list of costs too). Even if we move forward as-is though, it'd be great to document what was considered, and what we might have done differently.

Ok, so let's try to move this in a (more) constructive direction:

If I'm being honest, I think that were we having this discussion in May when T309315 was opened, I'd be recommending something other than Swift; I don't think Swift is the best choice here (at least with these semantics). May is very far in the rear-view mirror though, and we squandered our opportunity to engage when this was in the design phase. We should move forward, despite not having an answer for T320835.

I do have concerns though. We have very little concrete data to work with. We have some idea of what file sizes will be (5KB avg/50KB max), but less of an idea of how many to expect. However many we can expect (i.e. the number of terms defined), we know that there will actually be more (i.e. files that are leaked), but we don't know how many more we should expect. We also don't know the rate at which to expect the dataset to grow. So if we all agreed to look at this in say 6 months, what observations would we make, and how would we make sense of them? If we found say 500k files in the container, what would that mean? Does this represent 10% of the terms we expect to be added, 50% (i.e. where are we on the growth curve)? And if we found 500k files in the container, is that against 400k terms (100k leaked), or is that 10k terms (and 490k leaked)? We've talked about using deleteOldPhonosFiles.php as a backstop, would we run it at 500k, and if so, why? What date would we use? If we did run it, would we then continue to run it at comparable thresholds? If so why? If not, why not? :)

I worry that if we pick some arbitrary check point and decide that everything looks good (based on... what?), we'll all collectively move on and this will be forgotten. When/if it becomes a problem, it'll be some years down the road, and our future selves (if it's even us left to deal with this) will be trying to make sense of it. Our future selves (or successors, as the case may be) will be even less prepared to answer these questions.

TL;DR I think it's OK if we fly by the seat of our pants a bit, but we should definitely have a well documented Plan™, with next-steps, and it should be informed by better metrics than what we currently have; It's not enough to say we'll look at this in 6 months, if we don't know what we'll be looking at/for. In particular, some insight into the number of actual terms versus files stored would be immensely helpful.

> If I'm being honest, I think that were we having this discussion in May when T309315 was opened, I'd be recommending something other than Swift

Not to get too side-tracked, I'm just curious: what would you have recommended other than Swift? What other ways are there to expose files shared globally across all wikis?

> TL;DR I think it's OK if we fly by the seat of our pants a bit, but we should definitely have a well documented Plan™, with next-steps, and it should be informed by better metrics than what we currently have; It's not enough to say we'll look at this in 6 months, if we don't know what we'll be looking at/for. In particular, some insight into the number of actual terms versus files stored would be immensely helpful.

  • We've estimated that the number of terms will be around the 8 million mark. See T323286.
  • The plan continues to be a very slow rollout, on top of communities taking their time to vet and adopt the new parser tag, so it is unlikely that we'll see a stampede of files being created all at once.
  • T324233 is almost done and it will give us insight on terms vs. files stored.
  • It seems that we have agreement on T320835, so orphan files should not be a major problem.

> TL;DR I think it's OK if we fly by the seat of our pants a bit, but we should definitely have a well documented Plan™, with next-steps, and it should be informed by better metrics than what we currently have; It's not enough to say we'll look at this in 6 months, if we don't know what we'll be looking at/for. In particular, some insight into the number of actual terms versus files stored would be immensely helpful.
>
> We've estimated that the number of terms will be around the 8 million mark. See T323286.

Awesome; Thank you.

> The plan continues to be a very slow rollout, on top of communities taking their time to vet and adopt the new parser tag, so it is unlikely that we'll see a stampede of files being created all at once.
> T324233 is almost done and it will give us insight on terms vs. files stored.
> It seems that we have agreement on T320835, so orphan files should not be a major problem.

To be clear: setting the TTL won't do anything as of now. I'd like to install swift-object-expirer, and I think we have consensus that this is worthwhile (provided it passes some testing/evaluation), but in the spirit of not counting chickens before they hatch, I don't think we can (yet) consider this a solved problem.

Given what I perceive as the likelihood of this happening though, it's great that we're already setting that header (so thanks for taking the initiative on that).