
[cloudvirt-canary] Canaries are not going through
Closed, Resolved (Public)

Description

Got an email about puppet failing on one of the canary VMs. Checked the project and there are a bunch of instances; looking into it.

Event Timeline

dcaro triaged this task as High priority. (Feb 18 2021, 8:55 AM)
dcaro created this task.

Mentioned in SAL (#wikimedia-cloud) [2021-02-18T08:56:27Z] <dcaro> canary instances seem to be stuck, looking (T275111)

I was confusing these canaries with the nova-fullstack tests. These are not leftovers; they are meant to be up continuously:

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Procedures_and_operations#Canary_VM_instance_in_every_hypervisor

Checking why puppet failed on this one.

The one that failed is canary1022-01.cloudvirt-canary.eqiad1.wikimedia.cloud

It seems to be out of memory and puppet crashes before finishing the run:

dcaro@canary1022-01:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:            481         348          18           5         115         115
Swap:             0           0           0
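
For reference, a minimal sketch of how the check above could be automated. It parses `free -m` text and flags a host that has no swap and little available memory. The function names and the 150 MiB threshold are assumptions made for illustration, not anything deployed on the canaries.

```python
# Output copied from the `free -m` run above.
SAMPLE = """\
              total        used        free      shared  buff/cache   available
Mem:            481         348          18           5         115         115
Swap:             0           0           0
"""


def parse_free_m(output: str) -> dict:
    """Parse `free -m` text into {"Mem": {...}, "Swap": {...}}, values in MiB."""
    lines = [ln for ln in output.strip().splitlines() if ln.strip()]
    headers = lines[0].split()
    result = {}
    for line in lines[1:]:
        label, *values = line.split()
        # The Swap row has fewer columns; zip() stops at the shorter sequence.
        result[label.rstrip(":")] = dict(zip(headers, (int(v) for v in values)))
    return result


def is_memory_tight(stats: dict, min_available_mib: int = 150) -> bool:
    """Hypothetical rule: no swap and less than `min_available_mib` available."""
    no_swap = stats.get("Swap", {}).get("total", 0) == 0
    return no_swap and stats["Mem"]["available"] < min_available_mib


stats = parse_free_m(SAMPLE)
print(stats["Mem"]["available"], is_memory_tight(stats))
```

With 115 MiB available out of 481 MiB and no swap, this host is flagged under the assumed threshold.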

There's a process called 'diamond' running that takes most of the memory. Will restart the machine, but if it happens
again it might be worth taking a closer look.
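
If it does come back, something like the following could help confirm which process is holding the memory. It is only a sketch: it reads the `VmRSS` field from `/proc/<pid>/status` on Linux and lists the largest resident processes. The helper names are made up for this example, and it was not run on the canary.

```python
import os


def rss_kib_from_status(status_text: str) -> int:
    """Return VmRSS in KiB from the text of /proc/<pid>/status, or 0 if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0  # kernel threads have no VmRSS line


def name_from_status(status_text: str) -> str:
    """Return the process name from the Name: field, or '?' if missing."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            return line.split(":", 1)[1].strip()
    return "?"


def top_rss(n: int = 5, proc_root: str = "/proc") -> list:
    """Return up to n (rss_kib, pid, name) tuples, largest resident size first."""
    results = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return results  # no /proc here (not Linux, or not readable)
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status"), errors="replace") as fh:
                text = fh.read()
        except OSError:
            continue  # process exited or is not readable
        results.append((rss_kib_from_status(text), int(entry), name_from_status(text)))
    results.sort(reverse=True)
    return results[:n]


for rss_kib, pid, name in top_rss():
    print(f"{rss_kib / 1024:8.1f} MiB  {pid:>7}  {name}")
```

On a box with procps installed, `ps -eo pid,rss,comm --sort=-rss | head` should give roughly the same picture.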

That seemed to do the trick. Weird that it uses diamond when we use prometheus by default...

Anyhow, will spend more time on it if it happens again.