
contintcloud instance refuses to launch due to "Maximum number of fixed ips exceeded"
Closed, Resolved · Public

Description

Since 2017-07-20 01:37:46, a bunch of Nodepool instances have refused to launch, failing with:

LaunchStatusException: Server beb9cb86-5a32-4f5b-8bea-75a883ce2030 for node id: 748489 status: ERROR

I caught one using openstack server show which reported:

Build of instance bb4dfe79-4c14-4b0e-a75a-d58633448984 aborted:
Failed to allocate the network(s) with error Maximum number of fixed ips exceeded, not rescheduling.
'code': 500, u'created': u'2017-07-20T10:50:20Z'

| Field | Value |
| --- | --- |
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | error |
| OS-SRV-USG:launched_at | None |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| config_drive | |
| created | 2017-07-20T10:50:17Z |
| fault | {u'message': u'Build of instance bb4dfe79-4c14-4b0e-a75a-d58633448984 aborted: Failed to allocate the network(s) with error Maximum number of fixed ips exceeded, not rescheduling.', u'code': 500, u'created': u'2017-07-20T10:50:20Z'} |
| flavor | m1.medium (3) |
| hostId | |
| id | bb4dfe79-4c14-4b0e-a75a-d58633448984 |
| image | snapshot-ci-jessie-1500473642 (f7967e47-8b5b-4b0b-8da2-1b75d9273a8a) |
| key_name | None |
| name | ci-jessie-wikimedia-749798 |
| os-extended-volumes:volumes_attached | [] |
| project_id | contintcloud |
| properties | |
| security_groups | [{u'name': u'default'}] |
| status | ERROR |
| updated | 2017-07-20T10:50:20Z |
| user_id | nodepoolmanager |
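When checking several failed instances, it can help to pull the message out of the fault field programmatically. The following is a minimal sketch, assuming the field is copied as the Python 2 style dict repr shown above; `parse_fault` and `is_fixed_ip_quota_error` are hypothetical helper names, not part of any OpenStack client.

```python
import ast

# The 'fault' value as reported above (a Python 2 style repr of a dict).
FAULT_REPR = (
    "{u'message': u'Build of instance bb4dfe79-4c14-4b0e-a75a-d58633448984 "
    "aborted: Failed to allocate the network(s) with error Maximum number of "
    "fixed ips exceeded, not rescheduling.', u'code': 500, "
    "u'created': u'2017-07-20T10:50:20Z'}"
)


def parse_fault(text):
    """Turn the repr of a fault dict into a real dict without using eval()."""
    fault = ast.literal_eval(text)
    if not isinstance(fault, dict):
        raise ValueError("fault field is not a dict literal")
    return fault


def is_fixed_ip_quota_error(fault):
    """True when the fault message is the fixed IP quota failure from this task."""
    return "Maximum number of fixed ips exceeded" in fault.get("message", "")


if __name__ == "__main__":
    fault = parse_fault(FAULT_REPR)
    print(fault["code"], is_fixed_ip_quota_error(fault))
```

`ast.literal_eval` only accepts literals, so it is safe to run on text pasted from command output.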

nova absolute-limits for contintcloud:

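| Name | Used | Max |
| --- | --- | --- |
| Cores | 19 | 58 |
| FloatingIps | -1 | 0 |
| ImageMeta | - | 128 |
| Instances | 9 | 29 |
| Keypairs | - | 100 |
| Personality | - | 5 |
| Personality Size | - | 10240 |
| RAM | 40959 | 118784 |
| SecurityGroupRules | - | 20 |
| SecurityGroups | -2 | 10 |
| Server Meta | - | 128 |
| ServerGroupMembers | - | 10 |
| ServerGroups | 0 | 10 |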

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 20 2017, 11:01 AM
hashar updated the task description. · Jul 20 2017, 11:01 AM

labnet1001.eqiad.wmnet has a lot of such errors in /var/log/nova/nova-network.log*

The first suspicious one:

2017-07-20 01:34:42.679 14215 WARNING nova.network.manager
[req-e87b9df8-ec96-4e72-9fc4-b2fde54d3d8f nodepoolmanager contintcloud - - -]
Error cleaning up fixed ip allocation. Manual cleanup may be required

There are also some ValueError: Circular reference detected errors.

They pile up until the quota of 200 IPs is reached:

2017-07-20 06:23:18.797 15836 WARNING nova.network.manager
[req-a8c1e602-2d89-42ea-bc86-99c3ce15b017 nodepoolmanager contintcloud - - -]
[instance: 3b6d46aa-c2b6-4347-8706-fd8b13f96bc3]
Quota exceeded for project contintcloud, tried to allocate fixed IP.
200 of 200 are in use or are already reserved.

Which raises FixedIpLimitExceeded: Maximum number of fixed ips exceeded.
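To see how the cleanup warnings build up before the quota error appears, the log can be scanned for the two messages quoted above. This is a rough sketch, assuming each log record sits on a single line on disk (the excerpts here are wrapped for readability); `summarize` is a hypothetical helper and the marker strings are copied from the excerpts.

```python
import re

# Marker strings copied from the nova-network.log excerpts in this task.
LEAK_MARKER = "Error cleaning up fixed ip allocation"
QUOTA_MARKER = "tried to allocate fixed IP"
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def summarize(lines):
    """Count leak warnings and report when the quota error first appears."""
    summary = {"leaks": 0, "first_leak": None, "first_quota_error": None}
    for line in lines:
        match = TIMESTAMP.match(line)
        stamp = match.group(0) if match else None
        if LEAK_MARKER in line:
            summary["leaks"] += 1
            if summary["first_leak"] is None:
                summary["first_leak"] = stamp
        elif QUOTA_MARKER in line and summary["first_quota_error"] is None:
            summary["first_quota_error"] = stamp
    return summary


# The two records quoted in this task, each joined onto one line
# (an assumption about the on-disk layout).
SAMPLE = [
    "2017-07-20 01:34:42.679 14215 WARNING nova.network.manager "
    "[req-e87b9df8-ec96-4e72-9fc4-b2fde54d3d8f nodepoolmanager contintcloud - - -] "
    "Error cleaning up fixed ip allocation. Manual cleanup may be required",
    "2017-07-20 06:23:18.797 15836 WARNING nova.network.manager "
    "[req-a8c1e602-2d89-42ea-bc86-99c3ce15b017 nodepoolmanager contintcloud - - -] "
    "[instance: 3b6d46aa-c2b6-4347-8706-fd8b13f96bc3] "
    "Quota exceeded for project contintcloud, tried to allocate fixed IP. "
    "200 of 200 are in use or are already reserved.",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

On the real host the input would be the lines of /var/log/nova/nova-network.log* rather than `SAMPLE`.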

That is the same issue as T158350 which got fixed via:

I restarted nova-network and it looks like nova is cleaning up those leaks now. I'll keep an eye out, but I've reduced the quota to 200 and there's some slack now.

I cleaned up about 100 leaks, like this:
update fixed_ips a, instances b set a.instance_uuid=NULL where a.instance_uuid = b.uuid and project_id='contintcloud' and b.deleted!='0';
After that, unowned IPs in contintcloud are staying in the single digits and seem to be getting cleaned up regularly.
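The effect of that cleanup query can be tried out safely on a toy database before touching production. The sketch below uses an in-memory SQLite database with only the columns the query touches; the schema subset, the sample rows, and the helper names are all assumptions for illustration. SQLite does not accept MySQL's multi-table UPDATE syntax, so the same condition is written as a subquery.

```python
import sqlite3

# Subquery selecting deleted instances of one project; it mirrors the join
# condition of the MySQL UPDATE quoted above.
DELETED_INSTANCES = (
    "SELECT uuid FROM instances WHERE project_id = ? AND deleted != 0"
)


def make_toy_db():
    """Build an in-memory stand-in with made-up rows (not real nova data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE instances (uuid TEXT PRIMARY KEY, project_id TEXT, deleted INTEGER);
        CREATE TABLE fixed_ips (address TEXT PRIMARY KEY, instance_uuid TEXT);
        INSERT INTO instances VALUES
            ('live-1', 'contintcloud', 0),
            ('gone-1', 'contintcloud', 42),
            ('gone-2', 'otherproject', 43);
        INSERT INTO fixed_ips VALUES
            ('10.0.0.1', 'live-1'),
            ('10.0.0.2', 'gone-1'),
            ('10.0.0.3', 'gone-2'),
            ('10.0.0.4', NULL);
        """
    )
    return conn


def leaked_ips(conn, project):
    """List fixed IPs still attached to deleted instances of the project."""
    rows = conn.execute(
        "SELECT address FROM fixed_ips WHERE instance_uuid IN ("
        + DELETED_INSTANCES + ") ORDER BY address",
        (project,),
    )
    return [row[0] for row in rows]


def release_leaked(conn, project):
    """Detach leaked IPs for the project and return how many rows changed."""
    cursor = conn.execute(
        "UPDATE fixed_ips SET instance_uuid = NULL WHERE instance_uuid IN ("
        + DELETED_INSTANCES + ")",
        (project,),
    )
    conn.commit()
    return cursor.rowcount


if __name__ == "__main__":
    db = make_toy_db()
    print(leaked_ips(db, "contintcloud"), release_leaked(db, "contintcloud"))
```

Running the SELECT first gives a count of leaked addresses to compare against the number of rows the UPDATE reports.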

It seems the nova database is on m5-master.eqiad.wmnet, database name nova.

Luke081515 triaged this task as High priority. · Jul 20 2017, 12:01 PM
Andrew closed this task as Resolved. · Jul 20 2017, 1:56 PM
Andrew claimed this task.

I resolved this by running the query in https://ask.openstack.org/en/question/494/how-to-reset-incorrect-quota-count/

BUT

the 'reserved' quota was also broken, so I also ran

update quota_usages set reserved='0' where project_id='contintcloud';

Reserved started to increment right after that so I think we're good.
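For anyone repeating the reserved reset, recording the old values first makes it possible to compare afterwards. Below is a minimal sketch against an in-memory SQLite stand-in; the reduced quota_usages schema, the sample numbers, and the `reset_reserved` helper are assumptions for illustration, not the production setup.

```python
import sqlite3


def make_toy_quota_db():
    """In-memory quota_usages stand-in with made-up usage numbers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE quota_usages (
            project_id TEXT, resource TEXT, in_use INTEGER, reserved INTEGER
        );
        INSERT INTO quota_usages VALUES
            ('contintcloud', 'fixed_ips', 9, 191),
            ('contintcloud', 'instances', 9, 3),
            ('otherproject', 'fixed_ips', 4, 1);
        """
    )
    return conn


def reset_reserved(conn, project):
    """Zero 'reserved' for one project and return the values it had before."""
    before = dict(
        conn.execute(
            "SELECT resource, reserved FROM quota_usages WHERE project_id = ?",
            (project,),
        )
    )
    # 'with conn' commits on success and rolls back if the UPDATE fails.
    with conn:
        conn.execute(
            "UPDATE quota_usages SET reserved = 0 WHERE project_id = ?",
            (project,),
        )
    return before


if __name__ == "__main__":
    db = make_toy_quota_db()
    print(reset_reserved(db, "contintcloud"))
```

The returned dict is the snapshot to keep; other projects' rows are left untouched because the UPDATE is filtered on project_id.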

I can confirm that resolved the issue completely. Thank you!