
Request increased quota for contintcloud labs project
Closed, Duplicate · Public

Description

Project Name: contintcloud
Type of quota increase requested: instances (from 10 to 20)

Blocking this quota increase:

  • Find a metric that tracks why this increase is required - See T139771
  • Find a way around DNS leaks, a lot of which seem to be from contintcloud - See T115194
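For illustration only, a minimal sketch of what the requested change amounts to, using openstacksdk's cloud layer. The clouds.yaml entry name "labs" is an assumption, and the real change is applied by the labs admins through their own tooling, not by a snippet like this.

```python
# Hedged sketch only: the "labs" cloud entry is an assumption; quota changes
# on Wikimedia Labs are applied by admins, not by end users running this.
import openstack

conn = openstack.connect(cloud='labs')

# Current compute quotas for the project (instances is expected to read 10).
quotas = conn.get_compute_quotas('contintcloud')
print('current instance quota:', quotas['instances'])

# The requested bump: instances from 10 to 20.
conn.set_compute_quotas('contintcloud', instances=20)
```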

Event Timeline

Restricted Application added a subscriber: Aklapper. · Aug 12 2016, 10:32 PM
tom29739 renamed this task from Request increased quota for <Replace Me> labs project to Request increased quota for contintcloud labs project. · Aug 12 2016, 10:34 PM
yuvipanda changed the task status from Open to Stalled. · Aug 12 2016, 10:36 PM
yuvipanda added subscribers: thcipriani, greg.

Copying from T139771#2549637

We had an outage for CI two nights ago, and during it we discovered that Nodepool seems to wait only 1s before declaring a build on a VM faulty, then issuing a delete, and then eventually churning on its own quota limitations. This happened because we had upped the timeout allowance for instance creation, as we have larger and larger projects with relative rule sets. During debugging we also discovered issues with quota tracking and Nodepool: Nova seems to have no clear idea of the instance count for the project, displaying greater than 32k instances, and we were also fighting DNS leaks all over the place, making it unclear what is and is not an actual CI instance. I suspect this DNS leak issue is related to the rate and tolerance of instance creation and lost messaging on the part of Nodepool/RabbitMQ.

I talked to @thcipriani and @greg briefly post-incident due to the difficult nature of debugging through this. I do not believe we can go any further with Nodepool without addressing these issues, and I believe we are in agreement about this overall. The periodic DNS leak cleanup we are doing is getting way out of hand and seems like a canary for deeper issues.
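For context, a hedged sketch (not the commands used during the incident) of how Nova's reported usage can be compared with the servers that actually exist; the "labs" cloud name is an assumption.

```python
# Hedged sketch: compare Nova's view of the project with reality.
# The clouds.yaml entry name "labs" is an assumption.
import openstack

conn = openstack.connect(cloud='labs')

# Absolute compute limits include the usage counters that were reporting
# an implausible >32k instances for contintcloud.
print(conn.get_compute_limits('contintcloud'))

# Servers that actually exist right now in the current project, for comparison.
print('servers listed by the API:', len(conn.list_servers()))
```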

hashar added a subscriber: hashar. · Aug 14 2016, 9:31 PM

The DNS leak is tracked by T115194; there are good indications that it is an old issue and that the 32k leftover entries are artifacts of an unknown problem. I would recommend dropping them entirely from the Designate MySQL database.
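Purely as an illustration of the kind of read-only check that could precede such a cleanup, a hedged sketch follows; the connection details and the Designate table/column names are assumptions about the local Liberty-era schema.

```python
# Read-only sketch only: credentials, database name, and the recordsets
# table/column names are assumptions about the local Designate schema.
import pymysql

conn = pymysql.connect(host='localhost', user='designate_ro',
                       password='<redacted>', database='designate')
try:
    with conn.cursor() as cur:
        # Count A recordsets per zone to gauge how many leftover entries exist.
        cur.execute(
            "SELECT zone_id, COUNT(*) AS n FROM recordsets "
            "WHERE type = 'A' GROUP BY zone_id ORDER BY n DESC"
        )
        for zone_id, count in cur.fetchall():
            print(zone_id, count)
finally:
    conn.close()
```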

In my experience, quota tracking worked fine before I left for vacation. OpenStack was upgraded to Liberty in the meantime; maybe the OpenStack Python module on labnodepool1001 needs to be upgraded to catch up with the new version?

As for the quota bump: it was set to 20 instances for most of 2016. It was lowered to 10 instances in an emergency on July 4th because the labs infrastructure was out of RAM (in part due to multiple projects having created x.large instances, plus Tools Labs allocating 100+ GB of RAM over June).

I would have expected the quota to be bumped back to its original 20 instances (or 80 GB of RAM). Tracking a proper metric to scale the pool is T139771: Identify metric (or metrics) that gives a useful indication of user-perceived (Wikimedia developer) service of CI; that metric would then be used to back up raising the pool to even more instances (probably up to 40), which is T133911: Bump quota of Nodepool instances (contintcloud tenant).

hashar added a comment. · Sep 9 2016, 9:04 AM

I am closing this task. The quota used to be 20 until July 4th, when it was lowered to 10 in an emergency due to wmflabs being full. We have more labvirt nodes nowadays.

After the Liberty upgrade, the quota usage started being off, so we moved most jobs back to the permanent slaves. Since then, we have slowly moved the jobs back to Nodepool and raised the quota to 15 (with Nodepool using a max of 12). That is tracked by T143938.

The main task to further bump the quota to 40 instances is T133911, filed in May and blocking the rest of the migration since then. Marking this as a duplicate of T133911.