[cloudvirt-canary] Canaries are not going through
Closed, Resolved
Public

Assigned To

Authored By

	dcaro
	Feb 18 2021, 8:55 AM

Description

Got an email about puppet failing on one of the canary VMs. Checked the project and there's a bunch of instances; looking into it.

Related Objects

Mentioned In: T275354: Puppet failures on many canary machines

Event Timeline

dcaro triaged this task as High priority. Feb 18 2021, 8:55 AM

dcaro created this task.

Restricted Application added a subscriber: Aklapper. Feb 18 2021, 8:55 AM

Mentioned in SAL (#wikimedia-cloud) [2021-02-18T08:56:27Z] <dcaro> canary instances seem to be stuck, looking (T275111)

I was mistaking these canaries for the nova-fullstack tests. They are not leftovers; they are meant to be up continuously:

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Procedures_and_operations#Canary_VM_instance_in_every_hypervisor

Checking why puppet failed on this one.

The one that failed is canary1022-01.cloudvirt-canary.eqiad1.wikimedia.cloud

It seems to be out of memory and puppet crashes before finishing the run:

dcaro@canary1022-01:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:            481         348          18           5         115         115
Swap:             0           0           0
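For future checks, the `free -m` output above could be turned into numbers with something like the sketch below. It is only an illustration: the function name `parse_free_mem` is made up here, and the sample text is the output pasted above.

```python
def parse_free_mem(output: str) -> dict:
    """Parse the 'Mem:' row of `free -m` output into a {column: MiB} dict."""
    lines = output.strip().splitlines()
    headers = lines[0].split()  # total, used, free, shared, buff/cache, available
    for line in lines[1:]:
        if line.startswith("Mem:"):
            values = [int(v) for v in line.split()[1:]]
            return dict(zip(headers, values))
    raise ValueError("no 'Mem:' row found")


# Output copied from canary1022-01 above.
SAMPLE = """\
              total        used        free      shared  buff/cache   available
Mem:            481         348          18           5         115         115
Swap:             0           0           0
"""

mem = parse_free_mem(SAMPLE)
available_pct = 100 * mem["available"] / mem["total"]
print(f"{mem['available']} MiB of {mem['total']} MiB available ({available_pct:.0f}%)")
```

With no swap configured, the `available` column is all the headroom a puppet run gets on this VM.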

A process called 'diamond' is running and taking most of the memory. Will restart the machine, but if it happens
again it might be worth taking a closer look.
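If it does happen again, a quick way to see which processes hold the memory is to read `VmRSS` from `/proc/<pid>/status`. The sketch below assumes a Linux `/proc`; the helper names are hypothetical and this is not how the process was identified in this task.

```python
import os


def rss_kib_from_status(status_text):
    """Return the VmRSS value (KiB) from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS line


def process_name_from_status(status_text):
    """Return the process name from the 'Name:' line, or '?' if absent."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            return line.split(":", 1)[1].strip()
    return "?"


def top_memory_processes(proc_root="/proc", limit=5):
    """Return up to `limit` (rss_kib, pid, name) tuples, largest RSS first."""
    results = []
    if not os.path.isdir(proc_root):
        return results
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # process exited or is not readable
        rss = rss_kib_from_status(text)
        if rss is not None:
            results.append((rss, int(entry), process_name_from_status(text)))
    return sorted(results, reverse=True)[:limit]


if __name__ == "__main__":
    for rss, pid, name in top_memory_processes():
        print(f"{rss / 1024:8.1f} MiB  pid={pid}  {name}")
```

Run on the affected VM, this prints the five largest processes by resident memory, which would show whether 'diamond' is the culprit again.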

That seemed to do the trick. Weird that it uses diamond when we use prometheus by default...

Anyhow, will spend more time on it if it happens again.

dcaro closed this task as Resolved. Feb 18 2021, 10:08 AM

dcaro mentioned this in T275354: Puppet failures on many canary machines. Feb 22 2021, 9:04 AM
