
Request increased quota for wikidata-query Cloud VPS project
Closed, Resolved · Public

Description

Project Name: wikidata-query
Type of quota increase requested: +32 vCPU, +132G RAM, +3.4T disk
Reason:

Baremetal servers were added to WMCS for WDQS, but we can't boot them because they are larger than our quota. Ref T206636

Instance name: t206636
Instance size: 32 vCPU, 132G RAM, 3.4T disk
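
(For context: the requested instance by itself does not fit the project's current allocation, per the quota output further down this task. Roughly: 132 GiB RAM = 132 × 1024 MB = 135168 MB requested vs. 51200 MB of quota, and 32 vCPUs requested vs. 17 in the quota.)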

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper.
Nintendofan885 renamed this task from Request increased quota for <Replace Me> Cloud VPS project to Request increased quota for wikidata-query Cloud VPS project. · Jul 7 2020, 4:24 PM
dcausse triaged this task as High priority. · Jul 8 2020, 1:32 PM

raising prio as this is blocking T251488

bd808 added a subscriber: bd808.

+1 to just do this now. The real quota for this project is their dedicated hardware.

Quota bumped.

$ sudo wmcs-openstack quota show wikidata-query
+----------------------+----------------+
| Field                | Value          |
+----------------------+----------------+
| cores                | 17             |
| fixed-ips            | -1             |
| floating-ips         | 1              |
| health_monitors      | None           |
| injected-file-size   | 10240          |
| injected-files       | 5              |
| injected-path-size   | 255            |
| instances            | 8              |
| key-pairs            | 100            |
| l7_policies          | None           |
| listeners            | None           |
| load_balancers       | None           |
| location             | None           |
| name                 | None           |
| networks             | 100            |
| pools                | None           |
| ports                | 500            |
| project              | wikidata-query |
| project_name         | wikidata-query |
| properties           | 128            |
| ram                  | 51200          |
| rbac_policies        | 10             |
| routers              | 10             |
| secgroup-rules       | 100            |
| secgroups            | 40             |
| server-group-members | 10             |
| server-groups        | 10             |
| subnet_pools         | -1             |
| subnets              | 100            |
+----------------------+----------------+
$ sudo wmcs-openstack quota set --ram 186368 --cores 64 wikidata-query
$ sudo wmcs-openstack quota show wikidata-query
+----------------------+----------------+
| Field                | Value          |
+----------------------+----------------+
| cores                | 64             |
| fixed-ips            | -1             |
| floating-ips         | 1              |
| health_monitors      | None           |
| injected-file-size   | 10240          |
| injected-files       | 5              |
| injected-path-size   | 255            |
| instances            | 8              |
| key-pairs            | 100            |
| l7_policies          | None           |
| listeners            | None           |
| load_balancers       | None           |
| location             | None           |
| name                 | None           |
| networks             | 100            |
| pools                | None           |
| ports                | 500            |
| project              | wikidata-query |
| project_name         | wikidata-query |
| properties           | 128            |
| ram                  | 186368         |
| rbac_policies        | 10             |
| routers              | 10             |
| secgroup-rules       | 100            |
| secgroups            | 40             |
| server-group-members | 10             |
| server-groups        | 10             |
| subnet_pools         | -1             |
| subnets              | 100            |
+----------------------+----------------+
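
(For reference, the new RAM quota is the old value plus the requested 132 GiB: 51200 MB + 132 × 1024 MB = 51200 + 135168 = 186368 MB. Cores were set to 64 rather than 17 + 32 = 49, presumably to leave some headroom.)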

After the quota bump @EBernhardson was able to create a new instance using the custom t206636 flavor, but it was scheduled on cloudvirt1024 rather than on one of the cloudvirt-wdqs* hypervisors, so we need to fix something about how this works via Horizon. Most likely the flavor needs a tweak so that the scheduler knows where these instances belong.
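
(A quick way to confirm which hypervisor an instance landed on, sketched here with a placeholder instance name and assuming the same admin CLI used above; the OS-EXT-SRV-ATTR:host field is only visible with admin credentials:

$ sudo wmcs-openstack server show <instance-name> -c OS-EXT-SRV-ATTR:host
)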

The hypervisors are marked with the expected host aggregates

+--------------+--------------------+
| Field        | Value              |
+--------------+--------------------+
| aggregates   | ['wdqs']           |
| service_host | cloudvirt-wdqs1001 |
+--------------+--------------------+
+--------------+--------------------+
| Field        | Value              |
+--------------+--------------------+
| aggregates   | ['wdqs']           |
| service_host | cloudvirt-wdqs1003 |
+--------------+--------------------+
+--------------+--------------------+
| Field        | Value              |
+--------------+--------------------+
| aggregates   | ['wdqs']           |
| service_host | cloudvirt-wdqs1002 |
+--------------+--------------------+
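
The per-host output above looks like field selection on hypervisor show; a sketch of how to reproduce it, and to list the whole aggregate at once (the exact commands are an assumption, not copied from the task):

$ sudo wmcs-openstack hypervisor show cloudvirt-wdqs1001 -c aggregates -c service_host
$ sudo wmcs-openstack aggregate show wdqs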

I set aggregate_instance_extra_specs:wdqs='true' on that flavor; if you try recreating it, the instance should end up on the right hardware.
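
(The exact command isn't recorded here, but setting that extra spec would look something like this, assuming admin rights on a cloudcontrol host:

$ sudo wmcs-openstack flavor set --property aggregate_instance_extra_specs:wdqs='true' t206636
)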

hmmm... the flavor is marked with the expected host aggregate as well.

$ sudo wmcs-openstack flavor show t206636
+----------------------------+--------------------------------------------+
| Field                      | Value                                      |
+----------------------------+--------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                      |
| OS-FLV-EXT-DATA:ephemeral  | 0                                          |
| access_project_ids         | wikidata-query                             |
| disk                       | 3481                                       |
| id                         | 148d82bc-dc07-481c-b717-ae0b4d78417c       |
| name                       | t206636                                    |
| os-flavor-access:is_public | False                                      |
| properties                 | aggregate_instance_extra_specs:wdqs='true' |
| ram                        | 135168                                     |
| rxtx_factor                | 1.0                                        |
| swap                       |                                            |
| vcpus                      | 32                                         |
+----------------------------+--------------------------------------------+

So how did the instance end up on cloudvirt1024, which is currently marked as 'spare'?

Maybe there is some regression from the Rocky upgrades? @Andrew I think I need your help. :)
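
(One quick sanity check would be to confirm which aggregates cloudvirt1024 actually sits in; a sketch, same admin CLI as above:

$ sudo wmcs-openstack hypervisor show cloudvirt1024 -c aggregates
$ sudo wmcs-openstack aggregate list
)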

Regarding "hmmm... the flavor is marked with the expected host aggregate as well.": that check came after the flavor change that @Andrew documented in T257336#6290470.

This flavor was broken mostly because it asked for way too many cores (as well as slightly too much RAM). I've adjusted it as needed and things should be working better now.
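
(The adjusted values aren't recorded in this task. Since Nova flavors can't have their vCPU/RAM changed in place, an adjustment like this normally means deleting and recreating the flavor under the same name, roughly as below; the angle-bracket values are placeholders, not the actual new sizing:

$ sudo wmcs-openstack flavor delete t206636
$ sudo wmcs-openstack flavor create --vcpus <new-vcpus> --ram <new-ram-mb> --disk 3481 --private --project wikidata-query t206636
$ sudo wmcs-openstack flavor set --property aggregate_instance_extra_specs:wdqs='true' t206636
)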

Note that due to a different snafu (recorded in T219078) two of the wdqs hypervisors are busy with unrelated workloads. I'll keep an eye on that and try to get them cleared out soon.

https://openstack-browser.toolforge.org/server/wcqs-beta-01.wikidata-query.eqiad.wmflabs is using the flavor and scheduled on one of the right hypervisors. Per @Andrew in T257336#6290999, you will not be able to make the other 2 instances until we push existing instances off of cloudvirt-wdqs1001 and cloudvirt-wdqs1003.

@Andrew for the moment we only need the one, so no rush on the others. I'm pretty sure there are testing plans for the others, but they aren't at the top of the stack currently.