
"bigdisk" instance with "bigram" for Template Parameter Alignment Generation Scripts (Language team)
Open, Needs Triage · Public

Description

For tasks like T227183 (Generate template parameter alignments for the selected small wikis), we (Language Team) need a large amount of RAM and disk space, and our team's existing instances are exhausted. Currently, I'm running the scripts on my laptop, which won't scale as we add more language pairs (each language requires a 15 GB unzipped trained model plus a large amount of space for dumps and script outputs).

Current requirement: 500 GB to 1 TB of disk space available to the user, with 32 GB of RAM.
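As a rough illustration of how this requirement scales with the number of languages, here is a minimal sketch. Only the 15 GB model size comes from the description above; the function names and the per-language figure for dumps and script outputs are hypothetical placeholders, not measured values.

```python
def estimate_disk_gb(n_languages, dumps_and_outputs_gb, model_gb=15):
    """Estimate total disk use in GB for a number of languages.

    model_gb=15 is the unzipped model size stated in the task.
    dumps_and_outputs_gb is an assumption the caller must supply.
    """
    return n_languages * (model_gb + dumps_and_outputs_gb)


def max_languages(disk_gb, dumps_and_outputs_gb, model_gb=15):
    """Return how many languages fit on a disk of the given size."""
    return disk_gb // (model_gb + dumps_and_outputs_gb)


# Hypothetical example: 35 GB of dumps and outputs per language.
print(estimate_disk_gb(10, dumps_and_outputs_gb=35))   # 500
print(max_languages(1000, dumps_and_outputs_gb=35))    # 20
```

Replacing the placeholder with a measured per-language figure would show whether the lower or upper end of the requested range is needed.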

Event Timeline

Restricted Application added a subscriber: Aklapper. · Wed, Jan 8, 5:57 AM

Any updates on this?

bd808 added a subscriber: bd808. · Wed, Jan 15, 5:45 AM

Any updates on this?

It will be discussed in our 2020-01-15 team meeting (meeting day changed from Tuesday to Wednesday starting this calendar week).

bd808 added a comment. · Wed, Jan 15, 9:41 PM

@KartikMistry we discussed this request in our 2020-01-15 Cloud Services team meeting. One thing we wanted to know before making a decision is how "unique" the data you will be placing on this very large instance will be.

Our main concern is that we cannot currently provide any strong guarantees that any instance storage is backed up. Your instance's disk will live on a single cloudvirt server with only RAID 10 providing storage redundancy. Instances with very large disks also suffer a lot more downtime when we have hardware issues (and sadly we have those often). A 500 GB image will take 2-3 hours to bring back online if the cloudvirt it is hosted on suffers a hardware problem, assuming that the hardware problem is a soft failure which actually allows us hours to evacuate the instances it hosts. We are working on better solutions for this class of problem, but it will be several more months until we can make them available to Cloud VPS projects.
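To put the 2-3 hour figure in perspective, the arithmetic can be sketched as below. This assumes decimal units (1 GB = 1000 MB) and that recovery time scales linearly with image size; both are assumptions for illustration, and the function names are hypothetical.

```python
def implied_rate_mb_s(size_gb, hours):
    """Effective copy rate in MB/s implied by moving size_gb in the given hours."""
    return size_gb * 1000 / (hours * 3600)


def recovery_hours(size_gb, rate_mb_s):
    """Hours needed to move size_gb at a constant rate (linear-scaling assumption)."""
    return size_gb * 1000 / rate_mb_s / 3600


fast = implied_rate_mb_s(500, 2)   # about 69.4 MB/s
slow = implied_rate_mb_s(500, 3)   # about 46.3 MB/s

# Under the same rates, a 1 TB image would take roughly 4 to 6 hours.
print(round(recovery_hours(1000, fast), 1), round(recovery_hours(1000, slow), 1))
```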

@bd808 Thanks!

The main use of this instance will be to run the scripts (and to download Wikipedia dumps and language models quickly over a faster connection). I expect that no data will remain there for more than a few days, so downtime is not much of a concern for this.

bd808 added a comment. · Wed, Jan 22, 6:32 PM

The main use of this instance will be to run the scripts (and to download Wikipedia dumps and language models quickly over a faster connection). I expect that no data will remain there for more than a few days, so downtime is not much of a concern for this.

This workflow sounds very similar to the core use case of the video project. The solution they use instead of an extra-large local disk is the "scratch" NFS server. Scratch is not replicated, so it should not be treated as long-term storage. It is, however, a very large storage space that can be added to any Cloud VPS project.

I would like to suggest these next steps to evaluate using the scratch mount rather than jumping to extra large local instance disks:

  • WMCS folks will expose the "scratch" mount to the "language" Cloud VPS project
  • @KartikMistry can either use that mount on a bigram instance, or if needed request a project quota increase to allow creating a new bigram instance for this work
  • Try out storage on the scratch volume and see if it gives enough performance to allow the work to progress
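One way to carry out the last step is a quick sequential throughput check run against both the scratch mount and local disk. The following is a minimal sketch using only the Python standard library; the function name and defaults are hypothetical, and it measures simple sequential I/O only, which may not match the access pattern of the real alignment scripts.

```python
import os
import tempfile
import time


def measure_throughput(directory, size_mb=256, chunk_mb=4):
    """Return sequential write/read throughput (MiB/s) for a directory."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    n_chunks = max(1, size_mb // chunk_mb)
    total_mb = n_chunks * chunk_mb

    # Create the test file inside the directory being measured.
    fd, path = tempfile.mkstemp(dir=directory, prefix="throughput-")
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_chunks):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push data to the storage backend
        write_s = time.perf_counter() - start

        # Note: the read may be served partly from the page cache,
        # so the read figure is an upper bound.
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(len(chunk)):
                pass
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the test file

    return {"write_mb_s": total_mb / write_s, "read_mb_s": total_mb / read_s}
```

Calling `measure_throughput(path)` once with the scratch mount point and once with a local directory such as `/tmp` gives two sets of numbers to compare. The mount point depends on how the share is exposed to the project, so it is left as a parameter here.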

If that all works, then we are done! If performance is not acceptable, we can then revisit the idea of extra-large local instance storage as a stopgap until we have better persistent volume management options for Cloud VPS instances (something that the WMCS team is actively working towards now).

@KartikMistry does this sound like a reasonable plan to you?