
"bigdisk" instance with "bigram" for Template Parameter Alignment Generation Scripts (Language team)
Closed, ResolvedPublic

Description

For tasks like T227183: Generate template parameter alignments for the selected small wikis, we (the Language team) require very large RAM and disk space, and the team's existing instances are exhausted. Currently I'm running the scripts on my laptop, which won't scale as we add more language pairs (each language requires a ~15 GB unzipped trained model plus a large amount of space for dumps and script outputs).

Current requirement is: 500 GB to 1 TB of disk space available to the user, with 32 GB of RAM.
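As a quick way to verify that a candidate instance actually meets the requirement above, something like the following could be run on it (my own sketch, not part of any WMCS tooling; the `/srv` data-disk path is an assumption):

```shell
# Check available disk on the instance data volume (path is an assumption)
df -h --output=size,avail /srv
# Check total RAM in GiB; should report >= 32
free -g
```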

Event Timeline

Any updates on this?

It will be discussed in our 2020-01-15 team meeting (meeting day changed from Tuesday to Wednesday starting this calendar week).

@KartikMistry we discussed this request in our 2020-01-15 Cloud Services team meeting. One thing we wanted to know before making a decision is how "unique" the data you will be placing on this very large instance will be.

Our main concern is that we can not currently provide any strong guarantees that any instance storage is backed up. Your instance's disk will live on a single cloudvirt server with only RAID 10 providing storage redundancy. Instances with very large disks also suffer a lot more downtime when we have hardware issues (and sadly we have those often). A 500G image will take 2-3 hours to bring back online if the cloudvirt it is hosted on suffers a hardware problem, assuming that hardware problem is a soft failure which actually allows us hours to evacuate the instances it hosts. We are working on better solutions for this class of problem, but it will be several more months until we can make them available to Cloud VPS projects.

@bd808 Thanks!

The main usage of this instance will be to run scripts (and to download Wikipedia dumps and language models quickly on a faster connection). I expect that no data will remain there for more than a few days, so downtime is not very important for this use case.

This workflow sounds very similar to the core use case of the video project. The solution they use instead of extra large local disk is the "scratch" NFS server. Scratch is not replicated, so it should not be treated as long term storage. It is however a very large storage space that can be added to any Cloud VPS project.

I would like to suggest these next steps to evaluate using the scratch mount rather than jumping to extra large local instance disks:

  • WMCS folks will expose the "scratch" mount to the "language" Cloud VPS project
  • @KartikMistry can either use that mount on bigram instance, or if needed request a project quota increase to allow creating a new bigram instance for this work
  • Try out storage on the scratch volume and see if it gives enough performance to allow the work to progress

If that all works then we are done! If performance is not acceptable we can then revisit the idea of extra large local instance storage as a stop gap until we have better persistent volume management options for Cloud VPS instances (something that the WMCS team is actively working towards now).
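For the performance evaluation step, a rough write-throughput check could look like this; a minimal sketch, not a formal benchmark. `TARGET` defaults to the scratch path suggested in this ticket, and can be pointed at local disk to compare:

```shell
# Write 1 GiB to the target and let dd report the rate on completion;
# conv=fdatasync forces data to disk so caching does not inflate the number.
TARGET="${TARGET:-/data/scratch/language}"
dd if=/dev/zero of="$TARGET/ddtest" bs=1M count=1024 conv=fdatasync
rm -f "$TARGET/ddtest"
```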

@KartikMistry does this sound like a reasonable plan to you?

Thread bump. I am pretty sure that @KartikMistry has not had a lot of time to think about this in the past week or two. I know I haven't.

@bd808 Sorry for late reply.

Please increase project quota, so we can build new instance as required.

Thanks!

Change 571827 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] nfs-mounts: expose the scratch mount in the language project

https://gerrit.wikimedia.org/r/571827

To create a second bigram instance in the language project (8 cores, 36G RAM), an additional 1 core and 34G of RAM quota will be needed. This makes the total quota 31 cores and 102G RAM:

$ ssh cloudcontrol1003.wikimedia.org
$ sudo wmcs-openstack quota show language
+----------------------+----------+
| Field                | Value    |
+----------------------+----------+
| cores                | 30       |
| fixed-ips            | -1       |
| floating_ips         | 0        |
| floatingip           | 0        |
| injected-file-size   | 10240    |
| injected-files       | 5        |
| injected-path-size   | 255      |
| instances            | 20       |
| key-pairs            | 100      |
| networks             | 100      |
| ports                | 500      |
| project              | language |
| project_name         | language |
| properties           | 128      |
| ram                  | 69632    |
| rbac-policies        | 10       |
| routers              | 10       |
| secgroup-rules       | 100      |
| secgroups            | 40       |
| server-group-members | 10       |
| server-groups        | 10       |
| subnetpools          | -1       |
| subnets              | 100      |
+----------------------+----------+
$ sudo wmcs-openstack quota set --cores 31 --ram 104448 language
$ sudo wmcs-openstack quota show language
+----------------------+----------+
| Field                | Value    |
+----------------------+----------+
| cores                | 31       |
| fixed-ips            | -1       |
| floating_ips         | 0        |
| floatingip           | 0        |
| injected-file-size   | 10240    |
| injected-files       | 5        |
| injected-path-size   | 255      |
| instances            | 20       |
| key-pairs            | 100      |
| networks             | 100      |
| ports                | 500      |
| project              | language |
| project_name         | language |
| properties           | 128      |
| ram                  | 104448   |
| rbac-policies        | 10       |
| routers              | 10       |
| secgroup-rules       | 100      |
| secgroups            | 40       |
| server-group-members | 10       |
| server-groups        | 10       |
| subnetpools          | -1       |
| subnets              | 100      |
+----------------------+----------+
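For the record, the new `--ram` value follows from the fact that OpenStack RAM quotas are expressed in MiB, so the 34G needed for the new instance is added as 34 × 1024 MiB:

```shell
# Quota arithmetic sketch: current quota plus the extra RAM for the
# new bigram instance, both in MiB.
current_ram=69632              # MiB, from `quota show` (= 68 GiB)
extra_ram=$((34 * 1024))       # 34 GiB expressed in MiB
echo $((current_ram + extra_ram))   # -> 104448
```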

Change 571827 merged by Bstorm:
[operations/puppet@production] nfs-mounts: expose the scratch mount in the language project

https://gerrit.wikimedia.org/r/571827

@KartikMistry you should now have quota space in the language project to create a new bigram instance. All of the instances in your project should also receive an NFS mount at /data/scratch of the shared scratch volume. Current best practice would be for you to create a /data/scratch/language directory inside the NFS mount to store your transient data.
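A minimal sketch of that setup, run on any instance in the language project (the mount path comes from this ticket; `SCRATCH` is overridable for testing elsewhere):

```shell
# Confirm the shared scratch NFS volume is attached, then create the
# per-project working directory for transient data.
SCRATCH="${SCRATCH:-/data/scratch}"
findmnt "$SCRATCH" || echo "scratch mount not attached?"
mkdir -p "$SCRATCH/language"
```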

Also, if you end up being able to delete your existing bigram instance after setting up the new one, it would be greatly appreciated if you could ping this ticket to let us know so that we can reduce your quota again. It's not a problem if you are making active, good use of the resources, but we like to keep project quotas as low as possible to make capacity planning for our shared servers easier.