Beta cluster has reached its quota
Open, Needs Triage, Public

Description

I can't create new instances anymore:

image.png (68×299 px, 4 KB)

It has currently reached its VCPU quota:
image.png (188×161 px, 5 KB)

I'd recommend that people go through the instances they have created and delete the ones they no longer need (I found one for chromium, two for sentry, one or two for fluorine, two mx nodes, etc.).

We can also increase the quota.

https://openstack-browser.toolforge.org/project/deployment-prep

List of instances

    • deployment-acme-chief03
    • deployment-acme-chief04
    • deployment-aqs01
    • deployment-aqs02
    • deployment-aqs03
    • deployment-cache-text06
      • Needed
    • deployment-cache-upload06
      • Needed
    • deployment-changeprop
      • Deleted
    • deployment-chromium01
      • Needed
    • deployment-chromium02
      • Deleted
    • deployment-cpjobqueue
      • Deleted
    • deployment-cumin02
    • deployment-cumin
      • Needed for all-project commands
    • deployment-db05
      • Needed
    • deployment-db06
      • Needed
    • deployment-deploy01
      • Needed
    • deployment-deploy02
    • deployment-docker-changeprop01
      • Needed
    • deployment-docker-citoid01
      • Needed
    • deployment-docker-cpjobqueue01
      • Needed
    • deployment-docker-cxserver01
      • Needed
    • deployment-docker-mathoid01
      • Needed
    • deployment-echostore01
      • Needed
    • deployment-elastic05
    • deployment-elastic06
    • deployment-elastic07
      • If you don't have three you will have more manual maintenance when things break and data is lost.
    • deployment-etcd-01
      • used by MediaWiki.
    • deployment-eventgate-3
      • Needed
    • deployment-eventlog05
      • Needed
    • deployment-eventstreams-1
      • Needed
    • deployment-fluorine02
      • Needed
    • deployment-imagescaler01
      • Seems to be unused --Urbanecm
      • Deleted
    • deployment-imagescaler02
      • Seems to be unused --Urbanecm
      • Deleted
    • deployment-imagescaler03
      • Needed; deployment-ms-fe03 talks to this
    • deployment-ircd
    • deployment-jobrunner03
      • Needed
    • deployment-kafka-jumbo-1
      • Needed
    • deployment-kafka-jumbo-2
      • Needed
    • deployment-kafka-main-1
      • Needed
    • deployment-kafka-main-2
      • Needed
    • deployment-logstash03
    • deployment-logstash2 T238707: Migrate from deployment-logstash2 (jessie) to deployment-logstash03 (stretch)
    • deployment-mailman01
      • Deleted
    • deployment-maps05
      • Deleted
    • deployment-mcs01
      • Needed
    • deployment-mdb01
      • Needed
    • deployment-mediawiki-07
      • Needed
    • deployment-mediawiki-09
      • Needed
    • deployment-memc04
      • Needed
    • deployment-memc05
      • Needed
    • deployment-memc06
      • Needed
    • deployment-memc07
      • Needed
    • deployment-memc08
      • Needed
    • deployment-ms-be05
      • Needed
    • deployment-ms-be06
      • Needed
    • deployment-ms-fe03
      • Needed
    • deployment-mwmaint01
      • Needed
    • deployment-mx02
    • deployment-ores01
    • deployment-parsoid11
      • Needed
    • deployment-poolcounter06
      • Needed
    • deployment-prometheus02
      • Needed
    • deployment-puppetdb03
      • Needed
    • deployment-puppetmaster04
      • Needed
    • deployment-push-notifications01
      • Needed
    • deployment-restbase01
      • Needed
    • deployment-restbase02
      • Needed
    • deployment-restbase03
      • Needed
    • deployment-sca01
      • Needed
    • deployment-sca02
      • Needed
    • deployment-sca04
    • deployment-schema-2
      • Needed
    • deployment-sentry01
      • Deleted
    • deployment-sessionstore03
      • Needed
    • deployment-snapshot01
    • deployment-urldownloader02
      • Needed
    • deployment-wdqs01
    • deployment-webperf11
      • Needed
    • deployment-webperf12
      • Needed
    • deployment-wikifeeds01
      • Needed
    • deployment-xhgui01
      • Already deleted
    • deployment-xhgui02
      • Already deleted
    • deployment-xhgui03
      • Needed
    • deployment-zookeeper02
      • Needed

Event Timeline

List of instances:

| Instance Name | VCPUs | RAM (MB) | Disk (GB) | Usage (Hours) | Age (Seconds) | State |
| --- | --- | --- | --- | --- | --- | --- |
| deployment-acme-chief03 | 1 | 2048 | 20 | 39.46 | 40926995 | Active |
| deployment-acme-chief04 | 1 | 2048 | 20 | 39.46 | 40924947 | Active |
| deployment-aqs01 | 2 | 4096 | 40 | 39.46 | 43385428 | Active |
| deployment-aqs02 | 2 | 4096 | 40 | 39.46 | 43372228 | Active |
| deployment-aqs03 | 2 | 4096 | 40 | 39.46 | 43372210 | Active |
| deployment-cache-text06 | 2 | 4096 | 40 | 39.46 | 7260971 | Active |
| deployment-cache-upload06 | 2 | 4096 | 40 | 39.46 | 7261651 | Active |
| deployment-changeprop | 1 | 2048 | 20 | 39.46 | 51182405 | Stopped |
| deployment-chromium01 | 1 | 2048 | 20 | 39.46 | 51215972 | Active |
| deployment-chromium02 | 1 | 2048 | 20 | 39.46 | 51224204 | Active |
| deployment-cpjobqueue | 2 | 4096 | 40 | 39.46 | 51210191 | Stopped |
| deployment-cumin02 | 1 | 2048 | 20 | 39.46 | 38848120 | Active |
| deployment-cumin | 1 | 2048 | 20 | 39.46 | 1055004 | Active |
| deployment-db05 | 8 | 16384 | 160 | 39.46 | 43792600 | Active |
| deployment-db06 | 8 | 16384 | 160 | 39.46 | 40429759 | Active |
| deployment-deploy01 | 8 | 8192 | 60 | 39.46 | 51223631 | Active |
| deployment-deploy02 | 8 | 8192 | 60 | 39.46 | 51223229 | Active |
| deployment-docker-changeprop01 | 1 | 2048 | 20 | 39.46 | 4413318 | Active |
| deployment-docker-citoid01 | 1 | 2048 | 20 | 39.46 | 36043832 | Active |
| deployment-docker-cpjobqueue01 | 1 | 2048 | 20 | 39.46 | 2071846 | Active |
| deployment-docker-cxserver01 | 1 | 2048 | 20 | 39.46 | 35988148 | Active |
| deployment-docker-mathoid01 | 1 | 2048 | 20 | 39.46 | 36472431 | Active |
| deployment-echostore01 | 1 | 2048 | 20 | 39.46 | 17170994 | Active |
| deployment-elastic05 | 4 | 8192 | 80 | 39.46 | 51154703 | Active |
| deployment-elastic06 | 4 | 8192 | 80 | 39.46 | 51155828 | Active |
| deployment-elastic07 | 4 | 8192 | 80 | 39.46 | 51152564 | Active |
| deployment-etcd-01 | 1 | 2048 | 20 | 39.46 | 51201269 | Active |
| deployment-eventgate-3 | 1 | 2048 | 20 | 39.46 | 21411224 | Active |
| deployment-eventlog05 | 4 | 8192 | 80 | 39.46 | 51208642 | Active |
| deployment-eventstreams-1 | 1 | 2048 | 20 | 39.46 | 15548061 | Active |
| deployment-fluorine02 | 1 | 2048 | 80 | 39.46 | 51193096 | Active |
| deployment-imagescaler01 | 2 | 4096 | 40 | 39.46 | 51190311 | Active |
| deployment-imagescaler02 | 2 | 4096 | 40 | 39.46 | 51202702 | Active |
| deployment-imagescaler03 | 2 | 4096 | 40 | 39.46 | 49250051 | Active |
| deployment-ircd | 1 | 2048 | 20 | 39.46 | 51194397 | Active |
| deployment-jobrunner03 | 4 | 8192 | 80 | 39.46 | 51218925 | Active |
| deployment-kafka-jumbo-1 | 1 | 2048 | 80 | 39.46 | 51203682 | Active |
| deployment-kafka-jumbo-2 | 1 | 2048 | 80 | 39.46 | 51201263 | Active |
| deployment-kafka-main-1 | 1 | 2048 | 20 | 39.46 | 51213003 | Active |
| deployment-kafka-main-2 | 1 | 2048 | 20 | 39.46 | 51218210 | Active |
| deployment-logstash03 | 8 | 16384 | 160 | 39.46 | 37060051 | Active |
| deployment-logstash2 | 8 | 16384 | 160 | 39.46 | 51182252 | Active |
| deployment-mailman01 | 1 | 2048 | 20 | 39.46 | 2419715 | Active |
| deployment-maps05 | 4 | 8192 | 80 | 39.46 | 39724237 | Active |
| deployment-mcs01 | 1 | 2048 | 20 | 39.46 | 51195371 | Active |
| deployment-mdb01 | 1 | 2048 | 80 | 39.46 | 2122503 | Active |
| deployment-mediawiki-07 | 4 | 8192 | 80 | 39.46 | 51211839 | Active |
| deployment-mediawiki-09 | 4 | 8192 | 80 | 39.46 | 51216632 | Active |
| deployment-memc04 | 2 | 4096 | 40 | 39.46 | 51191933 | Active |
| deployment-memc05 | 2 | 4096 | 40 | 39.46 | 51183858 | Active |
| deployment-memc06 | 2 | 4096 | 40 | 39.46 | 51184243 | Active |
| deployment-memc07 | 2 | 4096 | 40 | 39.46 | 51204969 | Active |
| deployment-memc08 | 2 | 4096 | 40 | 39.46 | 22661725 | Active |
| deployment-ms-be05 | 8 | 16384 | 160 | 39.46 | 38761194 | Active |
| deployment-ms-be06 | 8 | 16384 | 160 | 39.46 | 38761194 | Active |
| deployment-ms-fe03 | 1 | 2048 | 20 | 39.46 | 38232895 | Active |
| deployment-mwmaint01 | 1 | 2048 | 80 | 39.46 | 51223997 | Active |
| deployment-mx02 | 1 | 2048 | 20 | 39.46 | 51211782 | Active |
| deployment-ores01 | 4 | 8192 | 80 | 39.46 | 51219547 | Active |
| deployment-parsoid11 | 2 | 4096 | 40 | 39.46 | 10590936 | Active |
| deployment-poolcounter06 | 1 | 2048 | 20 | 39.46 | 18754786 | Active |
| deployment-prometheus02 | 4 | 8192 | 80 | 39.46 | 46161569 | Active |
| deployment-puppetdb03 | 1 | 2048 | 20 | 39.46 | 14229837 | Active |
| deployment-puppetmaster04 | 2 | 4096 | 40 | 39.46 | 12932186 | Active |
| deployment-push-notifications01 | 1 | 2048 | 20 | 39.46 | 2832947 | Active |
| deployment-restbase01 | 4 | 8192 | 80 | 39.46 | 51180967 | Active |
| deployment-restbase02 | 4 | 8192 | 80 | 39.46 | 51189611 | Active |
| deployment-restbase03 | 4 | 8192 | 80 | 39.46 | 7229633 | Active |
| deployment-sca01 | 1 | 2048 | 20 | 39.46 | 51184163 | Active |
| deployment-sca02 | 1 | 2048 | 20 | 39.46 | 51165143 | Active |
| deployment-sca04 | 2 | 4096 | 40 | 39.46 | 51198853 | Active |
| deployment-schema-2 | 1 | 2048 | 20 | 39.46 | 34200080 | Active |
| deployment-sentry01 | 2 | 4096 | 40 | 39.46 | 51183953 | Active |
| deployment-sessionstore03 | 1 | 2048 | 20 | 39.46 | 17171015 | Active |
| deployment-snapshot01 | 2 | 4096 | 40 | 39.46 | 51207377 | Active |
| deployment-urldownloader02 | 1 | 2048 | 20 | 39.46 | 51224813 | Active |
| deployment-wdqs01 | 4 | 8192 | 80 | 39.46 | 21527681 | Active |
| deployment-webperf11 | 1 | 2048 | 20 | 39.46 | 51219955 | Active |
| deployment-webperf12 | 1 | 2048 | 20 | 39.46 | 51221874 | Active |
| deployment-wikifeeds01 | 1 | 2048 | 20 | 39.46 | 33600161 | Active |
| deployment-xhgui01 | 1 | 2048 | 20 | 39.46 | 19310473 | Active |
| deployment-xhgui02 | 1 | 2048 | 20 | 39.46 | 2042306 | Active |
| deployment-zookeeper02 | 1 | 2048 | 20 | 39.46 | 17299874 | Active |

We have two appservers but five memcached nodes, that seems off.

Deleted deployment-sentry01 according to T106915#6279270

> We have two appservers but five memcached nodes, that seems off.

Not really, given the amount of traffic beta receives, and the fact we don't have load-balancing so it's hard to distribute requests more.

Most of the VMs you listed above are still in use, even if no one has logged into them or touched them for some time.

I would ask why do we have mailman VMs in deployment-prep, OTOH. It seems quite off-topic there.

The two stopped changeprop VMs can be removed, I think, but I'd ask @hnowlan to confirm that's the case

> I would ask why do we have mailman VMs in deployment-prep, OTOH. It seems quite off-topic there.

It's the Mailman v3 testing instance as per T52864.

> Most of the VMs you listed above are still in use, even if no one has logged into them or touched them for some time.

Indeed, but this project is huge and this is the list of all VMs. If we just audit and clean up 10-20%, that frees up 20-40 VCPUs. For comparison, the "meet" project has a quota of 8 VCPUs.
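The 20-40 VCPU estimate can be checked against the instance listing earlier in the thread. The sketch below is illustrative only: the per-size instance counts were tallied by hand from that listing, and the helper names (`total_vcpus`, `freed_vcpus`) are hypothetical and not part of any Cloud VPS tooling.

```python
# Instance counts per VCPU size ({vcpus_per_instance: instance_count}),
# hand-tallied from the instance listing in this task. This is an assumption
# of the sketch; recount before relying on it.
INSTANCES_PER_VCPU_SIZE = {1: 42, 2: 19, 4: 14, 8: 8}


def total_vcpus(counts):
    """Sum the VCPUs allocated across all instances."""
    return sum(size * n for size, n in counts.items())


def freed_vcpus(total, fraction):
    """VCPUs released if the given fraction of allocated VCPUs is cleaned up."""
    return round(total * fraction)


total = total_vcpus(INSTANCES_PER_VCPU_SIZE)
print(total)  # 200
print(freed_vcpus(total, 0.10), freed_vcpus(total, 0.20))  # 20 40
```

With roughly 200 VCPUs allocated, a 10-20% cleanup lands on the 20-40 VCPU range quoted above.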

> I would ask why do we have mailman VMs in deployment-prep, OTOH. It seems quite off-topic there.

Yup, that's what I'm working on: https://lists-beta.wmflabs.org I can also request a dedicated project if you think that's better.

Urbanecm updated the task description.

Mentioned in SAL (#wikimedia-releng) [2020-10-05T22:29:21Z] <Amir1> deleted deployment-imagescaler01 and deployment-imagescaler02 (T257118)

Mentioned in SAL (#wikimedia-releng) [2020-10-05T22:31:06Z] <Amir1> deleted deployment-mailman01 (T257118)

Krinkle added a subscriber: Krinkle.

deployment-poolcounter06

I've marked this as "Needed" on behalf of Platform Engineering team. This is the primary and only poolcounter node, and is indeed being used per LabsServices.

cache-text, cache-upload, etcd, ircd, and memc are most certainly also still in use and we need at least one of each.

For memc we can most likely size down, I don't think we added this many intentionally but rather added new ones during rebuilds/upgrades and maybe left old ones?

cc @RLazarus Do you know if any of these were used for tests, and whether we could e.g. get away with just one or two of the deployment-memc* instances? If so, where do we need to check for configs and update things, and is the latest one the "right" one to keep?

> cache-text, cache-upload, etcd, ircd, and memc are most certainly also still in use and we need at least one of each.
>
> For memc we can most likely size down, I don't think we added this many intentionally but rather added new ones during rebuilds/upgrades and maybe left old ones?
>
> cc @RLazarus Do you know if any of these were used for tests, and whether we could e.g. get away with just one or two of the deployment-memc* instances? If so, where do we need to check for configs and update things, and is the latest one the "right" one to keep?

I don't know much about this, but @Joe addressed the same question in T257118#6279378:

>> We have two appservers but five memcached nodes, that seems off.
>
> Not really, given the amount of traffic beta receives, and the fact we don't have load-balancing so it's hard to distribute requests more.

I suspect some or all of the Kafka hosts (it's required for changeprop to behave normally) need to be kept but I don't know which.

The situation of the memcached servers is amazingly telling of how deployment-prep is unmanaged, and we should really dedicate some resources to it, but also work better when we do stuff there.

The memcached servers are used primarily for memcached, and then for redis (mainstash) and redis (locking).

See below the usage matrix:

| server | memcached | mainstash | locks |
| --- | --- | --- | --- |
| memc04 | x | x | x |
| memc05 | x | x | x |
| memc06 | x | | |
| memc07 | x | | |
| memc08 | x | | |

My proposal would be to move everything to memc06-08, and remove memc04/05 soon, but this will need to be scheduled and done by someone. In the meantime, all those servers are currently serving traffic.
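To make the proposal concrete, the usage matrix can be written down as data and queried for what would have to move before memc04/05 are removed. This is only a sketch: reading the single "x" on memc06-08 as the memcached column is an assumption, and `roles_to_migrate` is a hypothetical helper, not existing tooling.

```python
# Roles served by each host, transcribed from the usage matrix in this comment.
# Assumption: the single "x" on memc06-08 belongs to the memcached column.
USAGE = {
    "memc04": {"memcached", "mainstash", "locks"},
    "memc05": {"memcached", "mainstash", "locks"},
    "memc06": {"memcached"},
    "memc07": {"memcached"},
    "memc08": {"memcached"},
}


def roles_to_migrate(usage, retiring, keeping):
    """Roles served by retiring hosts that no kept host serves yet."""
    kept_roles = set().union(*(usage[host] for host in keeping))
    retiring_roles = set().union(*(usage[host] for host in retiring))
    return sorted(retiring_roles - kept_roles)


print(roles_to_migrate(USAGE, ["memc04", "memc05"], ["memc06", "memc07", "memc08"]))
# ['locks', 'mainstash']
```

Under that reading, both redis roles (mainstash and locks) would need a new home on memc06-08 before the older two hosts can go.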

Joe updated the task description.

> The situation of the memcached servers is amazingly telling of how deployment-prep is unmanaged, and we should really dedicate some resources to it, but also work better when we do stuff there.

T215217: deployment-prep: Code stewardship request

Mentioned in SAL (#wikimedia-releng) [2020-11-19T00:25:33Z] <Amir1> shutting off deployment-aqs* instances (T257118)

Change 641855 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/services/recommendation-api/deploy@master] Drop deployment-sca04 from betacluster deploys

https://gerrit.wikimedia.org/r/641855

Change 641855 merged by jenkins-bot:
[mediawiki/services/recommendation-api/deploy@master] Drop deployment-sca04 from betacluster deploys

https://gerrit.wikimedia.org/r/641855

Mentioned in SAL (#wikimedia-releng) [2020-11-19T20:30:29Z] <Amir1> delete deployment-aqs* instances (T257118)

Mentioned in SAL (#wikimedia-releng) [2020-11-19T20:30:39Z] <Amir1> shut down deployment-sca04 (T257118)

The project is now down to 67 / 100 instances. Should the task now be closed?

The number of VCPUs is still at 84%, but that's not that bad.

image.png (170×224 px, 5 KB)