Request increased quota for wikiwho Cloud VPS project (volume storage)
Closed, Resolved · Public

Description

Project Name: wikiwho
Type of quota increase requested: Volume storage
Amount of quota increase: 5 TB (~4656 GB minus the existing 80 GB quota = 4576 GB?)
Reason: This is a follow-up to T290768 / T295818. We now have a working WikiWho installation on VPS! The next step is to get the volume storage for the Python Pickle files. In order to accommodate the five Wikipedia languages currently supported by the old external WikiWho service (English, German, Basque, Turkish, Spanish), and to allow a little room to grow, we need a whopping 5 TB of storage. According to Amazon's AWS calculator, this will cost around 6,144.00 USD a year. We were informed the budget was approved by @marcella, Director of Engineering, and payment inquiries can be directed to her.

When creating a volume in Horizon, it asks for the amount in GB. As I understand it, 5 TB equates to about 4656 GB, and we already have an 80 GB quota, so that leaves 4576 GB as the amount of increase. But I don't think the exact numbers are important; let me know if they matter on your end. We just need roughly 5 TB. Additionally, if it wasn't clear, the long-term plan (sometime in 2022) is to migrate WikiWho to WMF production, using a much more efficient storage system through the API Platform, so the requested 5 TB volume here and its costs would be temporary. The VPS solution is just to hold us over until we do the production deployment properly, as https://www.wikiwho.net/ is slated to be retired in early 2022.
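For the record, here's the arithmetic I used, under my assumption that Horizon's "GB" field really means binary gibibytes (which would explain why 5 TB comes out to ~4656 rather than 5000):

```
# Sanity check of the figures above, assuming Horizon's "GB" field
# is actually binary gibibytes (my assumption, not confirmed).
requested_bytes = 5 * 10**12        # 5 TB, decimal
gib = requested_bytes / 2**30       # ~4656.6 GiB
print(int(gib), int(gib - 80))      # 4656 4576 (increase over the existing 80 GB)
```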

Ideally we'd have this volume storage request fulfilled before 2022 (the sooner the better), but given the size of the request I'm unsure whether it will take longer to service. If it bleeds into 2022 that should be fine, but not for very long, as we were told basically no one will be monitoring the old WikiWho service come the new year.

Thank you as always!

Event Timeline

Hello!

I'm not able to provide a clear answer about this storage quota just yet. In the meantime, though, can you tell me more about 'migrate WikiWho to WMF production, using a much more efficient storage system through the API platform'? I'm wondering specifically about what makes it more efficient, and why the API platform is an option in prod but not in cloud-vps.

The current externally hosted WikiWho relied upon by our tools is slated to become unmaintained in January 2022. We are trying to replicate that stack on VPS to hold us over until we can do a proper production deployment (see the research task at T293386). The current stack (F34639572) uses Python pickles to store the attribution data, which we believe to be less efficient than other storage systems such as Cassandra, used by the API Platform. However, in order to improve the storage footprint, we need to rewrite parts of WikiWho, which is not something we'll be able to complete before the old service is retired. That's why the massive 5 TB volume is only a temporary solution. Once the relevant bits are rewritten, it may be possible to use the API Platform in tandem with VPS, but I was not aware of that option. The ultimate goal is for WikiWho to live entirely on production, though, after which we would retire the wikiwho VPS project.
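To illustrate the storage pattern we're talking about (a hypothetical sketch only, not WikiWho's actual data model or file naming), the pickle approach amounts to one serialized blob per article that has to be read and written whole:

```
import pickle
from pathlib import Path

# Hypothetical sketch of pickle-per-article storage; the /pickles/en
# path matches our instance, but the file naming and data layout here
# are made up for illustration.
PICKLE_ROOT = Path("/pickles/en")

def save_attribution(page_id: int, attributions: dict) -> None:
    """Serialize one article's full attribution data to its own file."""
    with open(PICKLE_ROOT / f"{page_id}.p", "wb") as f:
        pickle.dump(attributions, f)

def load_attribution(page_id: int) -> dict:
    """Load the whole blob back, even if only a few revisions are needed."""
    with open(PICKLE_ROOT / f"{page_id}.p", "rb") as f:
        return pickle.load(f)
```

Because every lookup or update touches a whole per-article file, a store like Cassandra that can hold and query attribution rows individually should have a better storage and I/O profile, which is the efficiency argument above.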

From T290768, we were under the impression the primary barrier for the VPS solution was financing the hardware, and we have gotten budget approval for that. How difficult do you think it would be to acquire the hardware, and how long would it take? I would have filed this task sooner, but given the size of the request, I wanted to first prove that we could get WikiWho working in the VPS environment.

@Andrew As I am a new engineering manager of the CommTech team, I first had to figure out the process for transferring funds from our budget to the Cloud Services budget; I initiated that transfer last night by email to Nicholas.

I think we can accommodate 5 TB of storage. We have ~45 TB of storage free in our Ceph backend as of this writing.

+1 on my side.

+1 from me as well. This has been done and is now active on the project. Please note this would be the largest Cinder volume in use, and we would appreciate your feedback on how well it works, the performance you see, etc. Let this also serve as a disclaimer that we've never attempted to create such a large volume for real-world use. Step boldly! Then share with us what happens. Thanks!

Mentioned in SAL (#wikimedia-cloud) [2021-12-15T17:37:07Z] <balloons> Bumped storage quota to 5150GB T297446

Reporting back that I temporarily switched XTools to use our new instance, which makes ~5,000 requests a day, and there were no hiccups whatsoever! Testing side by side with the old external installation, our VPS installation seems to perform at least twice as fast on average. So while we still haven't tested with the full amount of traffic we'll receive once the migration is done, the initial results look very good! The read/write speeds for the 5 TB volume seem fine. The only issue I ran into was how slow it is to count files (e.g. ls /pickles/en | wc -l), check the available disk space, etc., but we fully expected that to be slow-ish given just how many files there are and their sizes.
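In case it helps anyone else with a directory this size, something like the following lazy count (just a sketch, nothing WikiWho-specific) avoids building the huge sorted listing that ls produces by default, though it still has to walk every entry:

```
import os

def count_files(path: str) -> int:
    """Count entries lazily with os.scandir rather than piping a full
    (sorted) ls listing through wc; it still walks every entry, so it
    won't be instant with millions of pickle files."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                total += 1
    return total

# e.g. count_files("/pickles/en")
```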

TL;DR: it performs great! Thank you :)

Hello!

I'm sorting out which Cinder volumes do and don't need backing up, with an extra-suspicious look at large users. So, a question for you: in the event of the pickle-storage volume being destroyed, how difficult would it be to recreate the data? I'm unclear on whether this volume serves as a cache that can be regenerated from database data, or whether it holds valuable data that cannot be recreated. Please let me know!

ty

Just a note for posterity: given the lack of response to my latest comment I am excluding this volume from backups. It's almost too big for us to manage anyway.