wikitech api list=novainstances not returning list of instances
Closed, Resolved · Public

Description

It looks like https://wikitech.wikimedia.org/w/api.php stopped returning results around Jul 20th 19:15 UTC

$ curl 'https://wikitech.wikimedia.org/w/api.php?action=query&list=novainstances&niregion=eqiad&format=json&niproject=deployment-prep'
{"batchcomplete":"","query":{"novainstances":[]}}

I noticed this because it is used by prometheus to gather a list of instances to monitor
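
A consumer of this API could make the failure easier to spot by parsing the response explicitly rather than silently accepting an empty list. This is only a minimal sketch, not the actual Prometheus discovery code: the helper names `build_url` and `parse_instances` are hypothetical, and only the URL parameters and the response shape are taken from the curl call above.

```python
import json
from urllib.parse import urlencode

API = "https://wikitech.wikimedia.org/w/api.php"


def build_url(project, region="eqiad"):
    """Build the list=novainstances query URL (same parameters as the curl call)."""
    params = {
        "action": "query",
        "list": "novainstances",
        "niregion": region,
        "format": "json",
        "niproject": project,
    }
    return API + "?" + urlencode(params)


def parse_instances(body):
    """Return the list under query.novainstances from a JSON response body."""
    return json.loads(body)["query"]["novainstances"]


if __name__ == "__main__":
    # The response body reported in this task: valid JSON, but no instances.
    broken = '{"batchcomplete":"","query":{"novainstances":[]}}'
    if not parse_instances(broken):
        print("no instances returned; the project may be empty or the backend query failed")
```

An empty list is not proof of a failure, since a project can legitimately have no instances, so a caller would still need to decide how to treat that case.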

Restricted Application added a subscriber: Aklapper. Jul 21 2017, 9:09 AM
fgiunchedi renamed this task from "wikitech api action=query not returning list of instances" to "wikitech api list=novainstances not returning list of instances". Jul 21 2017, 9:11 AM
hashar added a subscriber: hashar. Jul 21 2017, 9:42 AM

Code is in api/ApiListNovaInstances.php. Replaying it on silver:

$ mwscript eval.php --wiki=labswiki
> global $wgOpenStackManagerLDAPUsername;
> global $wgOpenStackManagerLDAPUserPassword;
> $user = new OpenStackNovaUser( $wgOpenStackManagerLDAPUsername );

> $userNova = OpenStackNovaController::newFromUser( $user );
> $userNova->authenticate( $wgOpenStackManagerLDAPUsername, $wgOpenStackManagerLDAPUserPassword );
> $userNova->setProject( 'deployment-prep' );
> $userNova->setRegion( 'eqiad' );

> $instances = $userNova->getInstances();
> var_dump( $instances );
array(0) {
}
> 

From labnet1001.eqiad.wmnet in /var/log/nova/nova-api.log:

2017-07-21 09:33:53.195 30286 INFO nova.osapi_compute.wsgi.server [-] 208.80.154.136 "GET /v2/deployment-prep/servers/detail HTTP/1.1" status: 401 len: 291 time: 0.0015159
2017-07-21 09:33:53.398 30294 INFO nova.osapi_compute.wsgi.server [-] 208.80.154.136 "GET /v2/deployment-prep/servers/detail HTTP/1.1" status: 401 len: 291 time: 0.0009911

401: not authenticated
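
To see which projects are affected, the 401 responses can be tallied per project from the log. The following is a rough sketch that assumes every relevant line has the same shape as the two quoted above; the function name `count_401_by_project` is hypothetical.

```python
import re
from collections import Counter

# Matches the request and status portion of nova-api.log lines shaped like
# the ones quoted above: "GET /v2/<project>/servers/detail HTTP/1.1" status: NNN
REQUEST_RE = re.compile(
    r'"GET /v2/(?P<project>[^/]+)/servers/detail HTTP/1\.1" status: (?P<status>\d{3})'
)


def count_401_by_project(lines):
    """Count 401 responses to /servers/detail requests, keyed by project name."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group("status") == "401":
            counts[match.group("project")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '2017-07-21 09:33:53.195 30286 INFO nova.osapi_compute.wsgi.server [-] '
        '208.80.154.136 "GET /v2/deployment-prep/servers/detail HTTP/1.1" '
        'status: 401 len: 291 time: 0.0015159',
    ]
    print(count_401_by_project(sample))
```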

There are not many details, though. It seems novaadmin either fails to authenticate with Keystone or is no longer allowed to list server details from Compute.

And in the nova logs, I also see 401 for the tools project for requests from Silver

"GET /v2/tools/servers/detail HTTP/1.1" status: 401 len: 291
fgiunchedi added a comment. Edited Jul 21 2017, 10:44 AM

There's also an Icinga alert for "novaadmin has roles in every project", which I believe is related. Asking for instances in a project that is not listed below returns correct replies.

Roles for novaadmin are not set in these projects: set([u'phabricator', u'packaging', u'analytics', u'puppet', u'netdata', u'openstack', u'newsletter', u'maps-team', u'account-creation-assistance', u'servermon', u'testlabs', u'wikidata-dev', u'ores-staging', u'contributors', u'maps', u'huggle', u'tools', u'lizenzhinweisgenerator', u'integration', u'twl', u'wmt', u'kubernetes-testing', u'bastion', u'osmit', u'etcd', u'queryrapi', u'otrs', u'wikidata-federation', u'math', u'project-proxy', u'catgraph', u'wikidataconcepts', u'shinken', u'openocr', u'puppet3-diffs', u'search', u'scrumbugz', u'mwfileimport', u'services', u'wikidata-query', u'toolsbeta', u'wikidata-build', u'language', u'wikifactmine', u'deployment-prep', u'puppet-ca-replacement', u'gerrit', u'dumps', u'librarybase', u'testproject', u'bots', u'etherpad', u'ores'])
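
Checking whether a given project appears in that alert text is easy to script. This is a small sketch that assumes the alert keeps the Python 2 style `set([u'...'])` repr shown above; the helper name `projects_missing_roles` is hypothetical and is not part of the Icinga check itself.

```python
import re


def projects_missing_roles(alert_text):
    """Extract the project names from a set([u'...', ...]) repr in the alert text."""
    return set(re.findall(r"u'([^']+)'", alert_text))


if __name__ == "__main__":
    # Shortened version of the alert output quoted above.
    alert = (
        "Roles for novaadmin are not set in these projects: "
        "set([u'tools', u'deployment-prep', u'ores'])"
    )
    missing = projects_missing_roles(alert)
    for project in ("deployment-prep", "tools"):
        print(project, "affected" if project in missing else "ok")
```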

Mentioned in SAL (#wikimedia-releng) [2017-07-21T14:12:44Z] <hashar> added novaadmin to deployment-prep as a regular user. That lets MediaWiki OpenStack API list the instances T171280

$ curl 'https://wikitech.wikimedia.org/w/api.php?action=query&list=novainstances&niregion=eqiad&format=json&niproject=deployment-prep' | jq .query.novainstances[0]
{
  "name": "deployment-eventlog02",
  "state": "ACTIVE",
  "ip": [
    "10.68.18.138"
  ],
  "id": null,
  "floatingip": [],
  "securitygroups": [
    "default"
  ],
  "imageid": "b93943b2-d8e5-48ad-b80d-a6778629d4a6"
}

The Icinga alert should probably be noisier. What remains to figure out is whether novaadmin should actually be a member.

Andrew added a subscriber: Andrew. Jul 21 2017, 2:26 PM

There was a brief period when novaadmin couldn't log in, is it possible you just caught it at a bad moment? The above curl seems ok to me now.

Yup because I have added novaadmin as a member of the deployment-prep tenant. But for tools it is still empty:

$ curl 'https://wikitech.wikimedia.org/w/api.php?action=query&list=novainstances&niregion=eqiad&format=json&niproject=tools'
{"batchcomplete":"","query":{"novainstances":[]}}

The labnet nova logs show a long history of these 401 responses for requests to /servers/detail, so this is most probably a long-standing issue.

fgiunchedi added a comment. Edited Jul 21 2017, 2:38 PM
There was a brief period when novaadmin couldn't log in, is it possible you just caught it at a bad moment?  The above curl seems ok to me now.

It looks like this stopped working yesterday at 19:20 UTC and resumed for beta when @hashar restored the access, at around 14:20 UTC

See also the gap in metrics here: https://grafana-labs.wikimedia.org/dashboard/db/project-health?orgId=1&from=1500573898444&to=1500647815316

Tools looks like it is still broken though: https://grafana-labs.wikimedia.org/dashboard/db/project-health?orgId=1&from=1500644109341&to=1500647979194&var-datasource=Tools%20Prometheus&var-instance=All
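
The `from` and `to` parameters in those dashboard links are epoch timestamps in milliseconds, so the time window a link covers can be decoded without opening it. This is a minimal sketch; the function name `dashboard_window` is hypothetical, and it truncates the timestamps to whole seconds.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def dashboard_window(url):
    """Return (start, end) as UTC datetimes from a URL's from/to epoch-ms parameters."""
    query = parse_qs(urlparse(url).query)

    def to_utc(milliseconds):
        # Integer division drops the millisecond part and avoids float rounding.
        return datetime.fromtimestamp(int(milliseconds) // 1000, tz=timezone.utc)

    return to_utc(query["from"][0]), to_utc(query["to"][0])


if __name__ == "__main__":
    url = (
        "https://grafana-labs.wikimedia.org/dashboard/db/project-health"
        "?orgId=1&from=1500573898444&to=1500647815316"
    )
    start, end = dashboard_window(url)
    print(start.isoformat(), end.isoformat())
```

For the first link this gives a window from 2017-07-20 18:04:58 UTC to 2017-07-21 14:36:55 UTC, which brackets the outage times mentioned above.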

I just can't think of any reason why those roles would've been removed :( investigating

Andrew closed this task as Resolved. Jul 21 2017, 10:16 PM
Andrew claimed this task.

I have a fix to prevent this from happening again... in the meantime I've added novaadmin back to everything.