Page MenuHomePhabricator

More verbose messages from service-checker-swagger
Closed, ResolvedPublic

Description

Today service-checker-swagger failed on all restbase nodes in icinga with

/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage responds with malformed body: 'NoneType' object has no attribute 'get'

There should be a debug/verbose mode to further understand what service-checker-swagger is doing (e.g. requested URL and response from the service) to ease debugging such as this. Turns out in this case the root cause was vandalism on Barak Obama deleting the image and therefore the thumbnail wasn't there (thanks @Pchelolo !)

Related incident: https://wikitech.wikimedia.org/wiki/Incident_documentation/2017-09-22_REST-API-alert-pageimage-update

Event Timeline

At first look looks like it raised the exception in swagger.py:255 calling self._check_json_chunk(data, self.body) that is recursive.

And I think the underlying failure is at line 329, at some point in the recursion k was not anymore in data and d was None so at the next round it tried to execute get() on None:

d = data.get(k, None)
self._check_json_chunk(d, v, prefix=p)

I'm not familiar with the tool but if you (@fgiunchedi) think I should take care of it (due to the SRE-tools tag) just let me know 😉

Thanks @Volans for taking a look! If you have time I think that'd be great to have fixed, OTOH it is one of "nice to have" things more than anything else. Also I don't really know how much work it'd be :(

Change 321714 had a related patch set uploaded (by Volans):
add additional information on malformed responses

https://gerrit.wikimedia.org/r/321714

Volans moved this task from In Progress to In Code Review on the SRE-tools board.
Volans triaged this task as Medium priority.Nov 22 2016, 12:02 PM

Change 321714 merged by Giuseppe Lavagetto:
add additional information on malformed responses

https://gerrit.wikimedia.org/r/321714

There was a reoccurence of this today with mobileapps

scb1001:~$ /usr/bin/service-checker-swagger  -t 5 10.64.0.16 http://10.64.0.16:8888
/{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: 'NoneType' object has no attribute 'get'

BTW, we're going to remove that problematic endpoint today since we don't need it anymore. T152135: Remove obsolete mobile-summary endpoint in MCS

I took another look at this and it seems related to an edit since no deploys happened. Despite the fact that mobile-summary is deprecated I suspect edits breaking swagger specifications might happen regardless :(

The issue is not directly in Swagger. It's just that a Swagger spec is used to run a test which exercises the endpoint. The issue is probably in the code of this endpoint, which was not well maintained. So, it's good to get rid of it.

This happened to me today:

mobrovac@xenon:~$ check-restbase 
Generic error: 'NoneType' object has no attribute '__getitem__'

It turned out the example spec had wrong indentation (cf PR #724), but it took me a while to figure that out. It would be nice to have more descriptive errors. In this instance, the response object was inside the request object, so a message of the type Error for <example-title>: No reponse object found would have saved me a lot of debugging time.

The issue is not directly in Swagger. It's just that a Swagger spec is used to run a test which exercises the endpoint. The issue is probably in the code of this endpoint, which was not well maintained. So, it's good to get rid of it.

What I meant is that user edits breaking what the spec expects will be always bound to happen, regardless of the code. Otherwise I can't explain why the check would fail and recover by itself without deployment.

Same thing happened today for one of the mobile-apps checks:

/{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: 'NoneType' object has no attribute '__getitem__'):
{}

It could also be helpful in following up on those issues if the script could output the request and response sections from the x-ample in an event of failure. Would be much easier to understand what exactly have happened. The fact that the Cat page was referenced in the error message is because the test case is properly named. Many other test cases don't mention the page name, so dumping the request/response pair would be very helpful.

Even more neat feature would be to convert the request to a curl statement right away, so that you could repeat the request manually in no time.

I'm gonna move the ticket from Done back to Backlog if you don't mind.

This comment was removed by elukey.

Since icinga space for reporting the check output is limited anyway I think in this case additional information / debugging should be written to syslog too, including the curl suggestion from @Pchelolo

[ ... ]

Even more neat feature would be to convert the request to a curl statement right away, so that you could repeat the request manually in no time.

+1

We've had an instance of this vandalism-related alert again and the service-checker-swagger wasn't very descriptive again, it posted AttributeError: NoneType object has no attribute get

https://wikitech.wikimedia.org/wiki/Incident_documentation/20170922-REST-API-alert-pageimage-update

Change 386116 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/software/service-checker@master] [WIP] Improve the checking procedure and emit better messages

https://gerrit.wikimedia.org/r/386116

Change 386116 merged by Giuseppe Lavagetto:
[operations/software/service-checker@master] Improve the checking procedure and emit better messages; v0.1.4

https://gerrit.wikimedia.org/r/386116

Removing SRE-tools for now, feel free to re-add if you need anything from us on this.

Krinkle updated the task description. (Show Details)
Krinkle removed a subscriber: bearND.
Krinkle subscribed.

@Joe Can this be resolved, or is there more to be done? (Haven't looked, triaging a large number of tasks. asking you as the one last active on the patch.)