Page MenuHomePhabricator

statograph_post service fail on alert hosts
Closed, ResolvedPublic

Description

statograph has been failing on alert1001 Friday, this is the first failure (ongoing)

Sep  3 23:28:08 alert1001 systemd[1]: Started Runs statograph to publish data to statuspage.io.
Sep  3 23:28:09 alert1001 systemd[1]: statograph_post.service: Succeeded.
Sep  3 23:29:01 alert1001 systemd[1]: Started Runs statograph to publish data to statuspage.io.
Sep  3 23:29:04 alert1001 statograph[796]: ERROR:statograph.statuspage:Error response from Statuspage: {"error":"undefined method `new' for BigDecimal:Class"}
Sep  3 23:29:04 alert1001 statograph[796]: ERROR:statograph.__main__:Exception encountered when processing h5mvbny28713; continuing
Sep  3 23:29:04 alert1001 statograph[796]: Traceback (most recent call last):
Sep  3 23:29:04 alert1001 statograph[796]:   File "/usr/lib/python3/dist-packages/statograph/__main__.py", line 116, in main
Sep  3 23:29:04 alert1001 statograph[796]:     base_url, s)
Sep  3 23:29:04 alert1001 statograph[796]:   File "/usr/lib/python3/dist-packages/statograph/uploader.py", line 86, in upload_one_batch
Sep  3 23:29:04 alert1001 statograph[796]:     statuspage.add_datapoints({self.metric.id: data})
Sep  3 23:29:04 alert1001 statograph[796]:   File "/usr/lib/python3/dist-packages/statograph/statuspage.py", line 119, in add_datapoints
Sep  3 23:29:04 alert1001 statograph[796]:     return self._request('POST', self._url("metrics/data"), json=body)
Sep  3 23:29:04 alert1001 statograph[796]:   File "/usr/lib/python3/dist-packages/statograph/statuspage.py", line 78, in _request
Sep  3 23:29:04 alert1001 statograph[796]:     resp.raise_for_status()
Sep  3 23:29:04 alert1001 statograph[796]:   File "/usr/lib/python3/dist-packages/requests/models.py", line 940, in raise_for_status
Sep  3 23:29:04 alert1001 statograph[796]:     raise HTTPError(http_error_msg, response=self)
Sep  3 23:29:04 alert1001 statograph[796]: requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://api.statuspage.io/v1/pages/nnqjzz7cd4tj/metrics/data
Sep  3 23:29:04 alert1001 systemd[1]: statograph_post.service: Main process exited, code=exited, status=1/FAILURE
Sep  3 23:29:04 alert1001 systemd[1]: statograph_post.service: Failed with result 'exit-code'.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-09-06T12:06:03Z] <godog> silence statograph until thurs on alert1001 - T290425

Thanks for opening this. I am not an expert on this at all, but was involved in the deployment so had a little look.

The errors are odd, I've tested here from my laptop and get the same thing. But it seems to be inconsistent, sometimes the upload gets a 202 response, sometimes a 400 as seen in the systemd output. Example output from running statograph manually (in verbose mode, and after adjusting the code very slightly to only upload 1 datapoint per metric to keep output sane):

send: b'POST /v1/pages/<page id>/metrics/data HTTP/1.1\r\nHost: api.statuspage.io\r\nUser-Agent: statograph/0.1.0 (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/statograph) python-requests/2.22.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\nAuthorization: OAuth <api key>\r\nContent-Length: 84\r\nContent-Type: application/json\r\n\r\n'
send: b'{"data": {"<err response metric>": [{"timestamp": 1631018040, "value": 0.2583333333333333}]}}'
reply: 'HTTP/1.1 202 Accepted\r\n'
send: b'POST /v1/pages/<page id>/metrics/data HTTP/1.1\r\nHost: api.statuspage.io\r\nUser-Agent: statograph/0.1.0 (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/statograph) python-requests/2.22.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\nAuthorization: OAuth <api key>\r\nContent-Length: 85\r\nContent-Type: application/json\r\n\r\n'
send: b'{"data": {"<err response metric>": [{"timestamp": 1631018280, "value": 0.17083333333333334}]}}'
reply: 'HTTP/1.1 400 Bad Request\r\n'

I also get the 400 response / same error message using curl:

cmooney@wikilap:~/statograph_test$ curl -v https://api.statuspage.io/v1/pages/<page id>/metrics/data \
>   --header "Authorization: OAuth <api key>" \
>   --header "Content-Type: application/json" \
>   --request POST \
>   --data '{"data": {"<err response metric>": [{"timestamp": 1631019360, "value": 0.28750000000000003}]}}'
{"error":"undefined method `new' for BigDecimal:Class"}

So my feeling is that Statograph itself is ok. I can't see that the API has changed or what it's sending is invalid. Given Chris is still out I will try to find our login details for the statuspage portal and see if I can see anything else there, or open a ticket with them to see if they can spot what is wrong.

@cmooney thanks for looking into this! I'm no Java expert but the reference to BigDecimal:Class seems to be a Java one from their backend (see https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html for example).
I noticed that the failed points have a decimal more than the others:

# failing above
0.28750000000000003
0.17083333333333334
# success above
0.2583333333333333

So I'm wondering if we're just going over some precision limit on this data structure. If that would be the case a simple fix would be to round to the max allowed decimals, I don't think we need this precision anyway.

@cmooney thanks for looking into this! I'm no Java expert but the reference to BigDecimal:Class seems to be a Java one from their backend (see https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html for example).

As suggested on IRC by @Majavah this could be as well Ruby. From a random search on the internet: https://stackoverflow.com/a/60491254 seems a possible match. In that case it could also be some misconfiguration on the backends of the API, where some work fine and some fail. Or still depends on the precision, and this failing path is called only when the precision is over some limit.

@cmooney given that you have already it setup locally, could you try to pass rounded values and see if it ever fails?

Hey @Volans nice catch! Let me see how it goes with rounded values.

Spot on @Volans works fine now:

cmooney@wikilap:~/statograph_test$ statograph -v -c config.yaml upload_metrics
INFO:statograph.uploader:Querying data for metric name 'Error responses' from Tue, 07 Sep 2021 14:29:00 +0000 to Tue, 07 Sep 2021 14:31:32 +0000
INFO:statograph.statuspage:Uploading 3 datapoints
INFO:statograph.uploader:Querying data for metric name 'Successful edits' from Tue, 07 Sep 2021 14:30:00 +0000 to Tue, 07 Sep 2021 14:31:34 +0000
INFO:statograph.statuspage:Uploading 1 datapoints
INFO:statograph.uploader:Querying data for metric name 'Wiki response time' from Tue, 07 Sep 2021 14:30:00 +0000 to Tue, 07 Sep 2021 14:31:35 +0000
INFO:statograph.statuspage:Uploading 2 datapoints
INFO:statograph.uploader:Querying data for metric name 'Total request volume' from Mon, 06 Sep 2021 08:17:00 +0000 to Tue, 07 Sep 2021 09:17:00 +0000
INFO:statograph.statuspage:Uploading 1501 datapoints

Change 719281 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/software/statograph@master] Added __post_init__ function to Datapoint class to round values to 9 decimal places. This is required to avoid apparent limit on what the statuspage.io API will accept.

https://gerrit.wikimedia.org/r/719281

Change 719281 merged by jenkins-bot:

[operations/software/statograph@master] Round float values to a fixed precision

https://gerrit.wikimedia.org/r/719281

Change 719473 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/software/statograph@master] 0.1.2: prepare release

https://gerrit.wikimedia.org/r/719473

Change 719473 merged by Jbond:

[operations/software/statograph@master] 0.1.2: prepare release

https://gerrit.wikimedia.org/r/719473

jbond assigned this task to cmooney.

I have deployed @cmooney fix will resolve

Because the only data available in the API is the timestamp of the most recent successfully-processed datapoint, this unfortunately left a gap in the data for two metrics:

Screenshot_20210914_155619.png (905×796 px, 85 KB)

This is pretty unavoidable when an issue like this happens given the current logic, but it is easy to fix manually (and with some more work, we could probably detect it automatically).

I'm going to clear Statuspage's history for the "Error responses" and the "Wiki response time" graphs, which will cause statograph to automatically re-upload them as it runs normally. This will mean more data is temporarily missing for users, but only temporarily. (If we wanted to prevent this, we could create new, hidden metrics, add them to statograph's configuration, and once uploads had finished, make them visible and delete the two old metrics.)

I'll also make sure to document how to do this, should it ever be necessary again.

Mentioned in SAL (#wikimedia-operations) [2021-09-14T20:06:36Z] <cdanis> T290425 ✔️ cdanis@alert1001.wikimedia.org ~ 🕓🍵 sudo /usr/bin/statograph -c /etc/statograph/config.yml erase_metric_data h5mvbny28713

Mentioned in SAL (#wikimedia-operations) [2021-09-14T20:06:51Z] <cdanis> T290425 ✔️ cdanis@alert1001.wikimedia.org ~ 🕓🍵 sudo /usr/bin/statograph -c /etc/statograph/config.yml erase_metric_data lyfcttm2lhw4