
CloudVPS: bogus hypervisor stats value reported by nova
Closed, Resolved · Public

Description

Simple check:

aborrero@cloudmetrics1002:~ $ curl cloudcontrol1003.wikimedia.org:12345/metrics | grep cloudvirt1009 | grep running_vms
[..]
hypervisor_running_vms{aggregate="unknown",arch="x86_64",cloud="eqiad1-r",hypervisor_hostname="cloudvirt1009",nova_service_status="enabled"} 15.0
aborrero@cloudmetrics1002:~ $ curl cloudcontrol1004.wikimedia.org:12345/metrics | grep cloudvirt1009 | grep running_vms
[..]
hypervisor_running_vms{aggregate="unknown",arch="x86_64",cloud="eqiad1-r",hypervisor_hostname="cloudvirt1009",nova_service_status="enabled"} 15.0

However:

aborrero@cloudcontrol1004:~ $ sudo wmcs-openstack server list --host cloudvirt1009 --all-projects -f value | wc -l
28

This results in grafana dashboards with wrong data, such as https://grafana.wikimedia.org/d/aJgffPPmz/wmcs-openstack-eqiad1-hypervisor?orgId=1&refresh=30s&from=now-7d&to=now&var-hypervisor=cloudvirt1009

Event Timeline

aborrero triaged this task as High priority. Jan 7 2020, 8:54 AM
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

It seems this value is indeed stored in the database, and prometheus-openstack-exporter is just reporting what it finds there.

I ran two tests:

  • write my own python script to read the API (a CLI equivalent is sketched below)
  • check what horizon is reporting

Screenshot_2020-01-07_10-19-55.png (470×2 px, 70 KB)

Both report 15 instead of the expected 28 VMs.
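
For a quick check without a custom script, the same counter can also be read through the CLI (a minimal sketch; running_vms is the field name exposed by the hypervisor API):

# read the stored counter for one hypervisor straight from the nova API
aborrero@cloudcontrol1004:~ $ sudo wmcs-openstack hypervisor show cloudvirt1009 -f value -c running_vms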

I wonder if our wmcs-cold-migrate script somehow introduces inconsistencies in the database numbers.

This is actually stored in the database!

MariaDB [nova_eqiad1]> select host,running_vms from compute_nodes where host='cloudvirt1009';
+---------------+-------------+
| host          | running_vms |
+---------------+-------------+
| cloudvirt1009 |          15 |
+---------------+-------------+
1 row in set (0.00 sec)

Some upstream links that may be related:

Also, from a quick skim of the nova resource tracker source code, it seems to me that resource stats are only updated in certain situations (VM created/destroyed, migrated, etc).
But I've seen our migration script trigger a resource stats update before, so it could be tripping some bug in that code.
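
For reference, the resource tracker also runs a periodic update_available_resource task; whether its interval is overridden can be checked on a hypervisor (standard config path and the update_resources_interval option name assumed):

# check whether the periodic resource-update interval is overridden in nova.conf
# (unset or 0 means nova-compute runs it at the default periodic interval)
aborrero@cloudvirt1009:~ $ sudo grep -E '^\s*update_resources_interval' /etc/nova/nova.conf || echo 'not set (default)'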

I'm not sure it's worth spending more time on this. Getting nova to refresh the stats could be as simple as re-creating the canary VM.

Mentioned in SAL (#wikimedia-cloud) [2020-01-07T10:02:24Z] <arturo> icinga downtime cloudvirt1009 for 30 minutes to re-create canary VM (T242078)

Mentioned in SAL (#wikimedia-cloud) [2020-01-07T10:07:52Z] <arturo> delete canary1009-01 VM to re-create it (T242078)

Mentioned in SAL (#wikimedia-cloud) [2020-01-07T10:24:33Z] <arturo> created canary1009-01 VM using horizon and the cold-migrate it to cloudvirt1009 from cloudvirt1006 (T242078)

This didn't work :-/ nova is still showing only 15 VMs running on it.

At this point my impulse is to directly update the database but I want to double check with @Andrew and @JHedden before doing so.

Out of curiosity I checked how widespread the problem is:

DB:  cloudvirt1001             20  REAL: cloudvirt1001 20
DB:  cloudvirt1002             24  REAL: cloudvirt1002 21
DB:  cloudvirt1003             23  REAL: cloudvirt1003 23
DB:  cloudvirt1004             20  REAL: cloudvirt1004 20
DB:  cloudvirt1005             24  REAL: cloudvirt1005 24
DB:  cloudvirt1006             16  REAL: cloudvirt1006 24
DB:  cloudvirt1007             19  REAL: cloudvirt1007 19
DB:  cloudvirt1008             21  REAL: cloudvirt1008 21
DB:  cloudvirt1009             15  REAL: cloudvirt1009 28
DB:  cloudvirt1012             32  REAL: cloudvirt1012 32
DB:  cloudvirt1013              1  REAL: cloudvirt1013 1
DB:  cloudvirt1014             52  REAL: cloudvirt1014 52
DB:  cloudvirt1015              0  REAL: cloudvirt1015 0
DB:  cloudvirt1016             42  REAL: cloudvirt1016 42
DB:  cloudvirt1017             66  REAL: cloudvirt1017 66
DB:  cloudvirt1018             57  REAL: cloudvirt1018 57
DB:  cloudvirt1019              3  REAL: cloudvirt1019 3
DB:  cloudvirt1020              3  REAL: cloudvirt1020 3
DB:  cloudvirt1021             45  REAL: cloudvirt1021 45
DB:  cloudvirt1022              2  REAL: cloudvirt1022 2
DB:  cloudvirt1023              1  REAL: cloudvirt1023 1
DB:  cloudvirt1024              1  REAL: cloudvirt1024 1
DB:  cloudvirt1025             39  REAL: cloudvirt1025 42
DB:  cloudvirt1026             46  REAL: cloudvirt1026 54
DB:  cloudvirt1027             45  REAL: cloudvirt1027 48
DB:  cloudvirt1028             50  REAL: cloudvirt1028 51
DB:  cloudvirt1029             36  REAL: cloudvirt1029 36
DB:  cloudvirt1030             38  REAL: cloudvirt1030 38

The only affected servers are:

DB:  cloudvirt1002             24  REAL: cloudvirt1002 21
DB:  cloudvirt1006             16  REAL: cloudvirt1006 24
DB:  cloudvirt1009             15  REAL: cloudvirt1009 28
DB:  cloudvirt1025             39  REAL: cloudvirt1025 42
DB:  cloudvirt1026             46  REAL: cloudvirt1026 54
DB:  cloudvirt1027             45  REAL: cloudvirt1027 48
DB:  cloudvirt1028             50  REAL: cloudvirt1028 51

DB is the value reported by select host,running_vms from compute_nodes; and REAL is the output of openstack server list --all-projects --host cloudvirtXXXX -f value | wc -l
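
A comparison like the above can be produced with a small loop along these lines (a sketch only; the database access method and the short-hostname handling are assumptions):

#!/bin/bash
# compare the running_vms counter stored in the nova DB ("DB") against the
# number of instances the API reports scheduled on each host ("REAL")
for fqdn in $(sudo wmcs-openstack hypervisor list -f value -c "Hypervisor Hostname"); do
    host=${fqdn%%.*}   # compute_nodes.host stores the short hostname
    db=$(sudo mysql nova_eqiad1 -BNe "select running_vms from compute_nodes where host='${host}'")
    real=$(sudo wmcs-openstack server list --all-projects --host "${host}" -f value | wc -l)
    printf 'DB:  %-15s %3s  REAL: %-15s %s\n' "${host}" "${db}" "${host}" "${real}"
done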

aborrero renamed this task from CloudVPS: prometheus-openstack-exporter producing bogus metrics values to CloudVPS: bogus hypervisor stats value reported by nova. Jan 7 2020, 10:51 AM

Mentioned in SAL (#wikimedia-cloud) [2020-01-08T10:58:27Z] <arturo> re-creating all canary VMs in all hypervisors to refresh nova quota numbers T242078

I re-created the canary VMs on every hypervisor. Unfortunately the stats are still off:

cloudvirt1001 scheduled VMs: 20 DB stats: 20 
cloudvirt1002 scheduled VMs: 21 DB stats: 24 <---
cloudvirt1003 scheduled VMs: 23 DB stats: 23 
cloudvirt1004 scheduled VMs: 20 DB stats: 20 
cloudvirt1005 scheduled VMs: 24 DB stats: 24 
cloudvirt1006 scheduled VMs: 23 DB stats: 16 <---
cloudvirt1007 scheduled VMs: 19 DB stats: 19 
cloudvirt1008 scheduled VMs: 21 DB stats: 21 
cloudvirt1009 scheduled VMs: 28 DB stats: 15 <---
cloudvirt1012 scheduled VMs: 32 DB stats: 32 
cloudvirt1013 scheduled VMs: 1 DB stats: 1 
cloudvirt1014 scheduled VMs: 52 DB stats: 52 
cloudvirt1015 scheduled VMs: 0 DB stats:  0 <---
cloudvirt1016 scheduled VMs: 42 DB stats: 42 
cloudvirt1017 scheduled VMs: 66 DB stats: 66 
cloudvirt1018 scheduled VMs: 57 DB stats: 57 
cloudvirt1019 scheduled VMs: 3 DB stats: 3 
cloudvirt1020 scheduled VMs: 3 DB stats: 3 
cloudvirt1021 scheduled VMs: 45 DB stats: 45 
cloudvirt1022 scheduled VMs: 4 DB stats: 4 
cloudvirt1023 scheduled VMs: 1 DB stats: 1 
cloudvirt1024 scheduled VMs: 1 DB stats: 1 
cloudvirt1025 scheduled VMs: 42 DB stats: 39 <---
cloudvirt1026 scheduled VMs: 55 DB stats: 46 <---
cloudvirt1027 scheduled VMs: 48 DB stats: 45 <---
cloudvirt1028 scheduled VMs: 51 DB stats: 50 <---
cloudvirt1029 scheduled VMs: 36 DB stats: 36 
cloudvirt1030 scheduled VMs: 38 DB stats: 38

I believe there's a special tool to force regeneration of the quotas... I'll try to figure that out.
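
One way to force the resource tracker to recount the instances and rewrite compute_nodes is restarting nova-compute on the affected hypervisor (a guess at a possible approach, not necessarily the tool referred to above):

# nova-compute re-runs update_available_resource on startup, which
# recounts the instances on the host and updates compute_nodes
aborrero@cloudvirt1009:~ $ sudo systemctl restart nova-compute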

This looks better now:

aborrero@cloudcontrol1004:~ $ bash wmcs-hypervisor-stats.sh 
cloudvirt1001 scheduled VMs: 21 DB stats: 21 
cloudvirt1002 scheduled VMs: 21 DB stats: 21 
cloudvirt1003 scheduled VMs: 23 DB stats: 23 
cloudvirt1004 scheduled VMs: 20 DB stats: 20 
cloudvirt1005 scheduled VMs: 23 DB stats: 23 
cloudvirt1006 scheduled VMs: 22 DB stats: 22 
cloudvirt1007 scheduled VMs: 19 DB stats: 19 
cloudvirt1008 scheduled VMs: 21 DB stats: 21 
cloudvirt1009 scheduled VMs: 27 DB stats: 27 
cloudvirt1012 scheduled VMs: 30 DB stats: 30 
cloudvirt1013 scheduled VMs: 1 DB stats: 1 
cloudvirt1014 scheduled VMs: 51 DB stats: 51 
cloudvirt1015 scheduled VMs: 0 DB stats: 0 
cloudvirt1016 scheduled VMs: 39 DB stats: 39 
cloudvirt1017 scheduled VMs: 65 DB stats: 65 
cloudvirt1018 scheduled VMs: 57 DB stats: 57 
cloudvirt1019 scheduled VMs: 3 DB stats: 3 
cloudvirt1020 scheduled VMs: 3 DB stats: 3 
cloudvirt1021 scheduled VMs: 44 DB stats: 44 
cloudvirt1022 scheduled VMs: 3 DB stats: 3 
cloudvirt1023 scheduled VMs: 1 DB stats: 1 
cloudvirt1024 scheduled VMs: 1 DB stats: 1 
cloudvirt1025 scheduled VMs: 43 DB stats: 43 
cloudvirt1026 scheduled VMs: 53 DB stats: 53 
cloudvirt1027 scheduled VMs: 48 DB stats: 48 
cloudvirt1028 scheduled VMs: 52 DB stats: 52 
cloudvirt1029 scheduled VMs: 37 DB stats: 37 
cloudvirt1030 scheduled VMs: 38 DB stats: 38 

Probably related to T241347: upgrade cloud-vps openstack to Openstack version 'Pike'.

Will reopen if I see more issues.