thanos u/i gives errors if left idle for a few hours
Open, Medium, Public

Description

Example url: https://thanos.wikimedia.org/graph?g0.range_input=2d&g0.max_source_resolution=0s&g0.expr=sum_over_time(mysql_exporter_last_scrape_error%5B5m%5D)%20%3E%201&g0.tab=1

Steps to reproduce:

  • Open the url in a tab, execute the query, see that it works
  • Leave it for a few hours
  • Click 'execute' again

Expected result:

  • Results get refreshed

Actual result:

  • Thanos displays Error executing query: error.
  • Browser log shows a 403 from idp
https://idp.wikimedia.org/login?service=https://thanos.wikimedia.org/api/v1/query?query=sum_over_time(mysql_exporter_last_scrape_error%5B5m%5D)%20%3E%201&dedup=true&partial_response=false&time=1605794115.998&_=1605784494152
Status: 403 Forbidden
Version: HTTP/1.1
Transferred: 235 B (0 B size)

Response headers:

    HTTP/1.1 403 Forbidden
    vary: Origin,Access-Control-Request-Method,Access-Control-Request-Headers
    date: Thu, 19 Nov 2020 13:55:16 GMT
    x-envoy-upstream-service-time: 3
    server: envoy
    transfer-encoding: chunked

Request headers:

    Accept: */*
    Accept-Encoding: gzip, deflate, br
    Accept-Language: en-GB,en;q=0.5
    Access-Control-Request-Headers: x-requested-with
    Access-Control-Request-Method: GET
    Connection: keep-alive
    DNT: 1
    Host: idp.wikimedia.org
    Origin: https://thanos.wikimedia.org
    Referer: https://thanos.wikimedia.org/graph?g0.range_input=2d&g0.max_source_resolution=0s&g0.expr=sum_over_time(mysql_exporter_last_scrape_error%5B5m%5D)%20%3E%201&g0.tab=1
    User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0

Event Timeline

Just reproduced this in chrome, and got this message:

Access to XMLHttpRequest at 'https://idp.wikimedia.org/login?service=https%3a%2f%2fthanos.wikimedia.org%2fapi%2fv1%2fquery%3fquery%3dsum_over_time(mysql_exporter_last_scrape_error%255B5m%255D)%2520%253E%25201%26dedup%3dtrue%26partial_response%3dtrue%26time%3d1605799406.496%26_%3d1605795670420' (redirected from 'https://thanos.wikimedia.org/api/v1/query?query=sum_over_time(mysql_exporter_last_scrape_error%5B5m%5D)%20%3E%201&dedup=true&partial_response=true&time=1605799406.496&_=1605795670420') from origin 'https://thanos.wikimedia.org' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource.
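
To make the redirect in that message easier to read, here is a small sketch that unwraps the `service` parameter of the idp login URL. It only uses the URL quoted above; the helper name `unwrap_service` is made up for illustration. The point it shows is that the login URL is simply the original Thanos API call, percent-encoded once more, so the XHR is being bounced to the SSO login page rather than failing inside Thanos itself.

```python
from urllib.parse import parse_qs, urlsplit

# The idp URL from the Chrome message above, copied verbatim.
idp_url = (
    "https://idp.wikimedia.org/login?service=https%3a%2f%2fthanos.wikimedia.org"
    "%2fapi%2fv1%2fquery%3fquery%3dsum_over_time(mysql_exporter_last_scrape_error"
    "%255B5m%255D)%2520%253E%25201%26dedup%3dtrue%26partial_response%3dtrue"
    "%26time%3d1605799406.496%26_%3d1605795670420"
)


def unwrap_service(login_url: str) -> str:
    """Return the decoded `service` parameter of a login URL (hypothetical helper)."""
    return parse_qs(urlsplit(login_url).query)["service"][0]


# First decode: the URL the XHR was "redirected from".
service = unwrap_service(idp_url)
# Second decode: the PromQL expression inside that API call.
promql = parse_qs(urlsplit(service).query)["query"][0]

print(service)
print(promql)  # sum_over_time(mysql_exporter_last_scrape_error[5m]) > 1
```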

Change 642020 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] O:idp: add https://thanos.wikimedia.org to cors acl

https://gerrit.wikimedia.org/r/642020

Change 642020 merged by Jbond:
[operations/puppet@production] O:idp: add https://thanos.wikimedia.org to cors acl

https://gerrit.wikimedia.org/r/642020

OK, I did wonder whether Chrome was reporting the error a bit better; good to know. I have deployed the change above to add thanos to the list of origins allowed to do CORS. Hopefully this will fix the issue.

herron triaged this task as Medium priority. Nov 20 2020, 3:10 PM

This is still occurring for me. From firefox:

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://idp.wikimedia.org/login?service=https%3a%2f%2fthanos.wikimedia.org%2fapi%2fv1%2fquery%3fquery%3dsum_over_time(mysql_exporter_last_scrape_error%255B5m%255D)%2520%253E%25201%26dedup%3dtrue%26partial_response%3dfalse%26time%3d1606292354.478%26_%3d1606228074853. (Reason: CORS preflight response did not succeed).

From chrome:

Access to XMLHttpRequest at 'https://idp.wikimedia.org/login?service=https%3a%2f%2fthanos.wikimedia.org%2fapi%2fv1%2fquery%3fquery%3dsum_over_time(mysql_exporter_last_scrape_error%255B5m%255D)%2520%253E%25201%26dedup%3dtrue%26partial_response%3dtrue%26time%3d1606292667.372%26_%3d1606237352855' (redirected from 'https://thanos.wikimedia.org/api/v1/query?query=sum_over_time(mysql_exporter_last_scrape_error%5B5m%5D)%20%3E%201&dedup=true&partial_response=true&time=1606292667.372&_=1606237352855') from origin 'https://thanos.wikimedia.org' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: It does not have HTTP ok status.
graph.js?v=non-git:613 Uncaught TypeError: Cannot read property 'data' of undefined
    at Object.complete (graph.js?v=non-git:613)
    at c (jquery-3.5.0.min.js?v=non-git:2)
    at Object.fireWith (jquery-3.5.0.min.js?v=non-git:2)
    at l (jquery-3.5.0.min.js?v=non-git:2)
    at XMLHttpRequest.<anonymous> (jquery-3.5.0.min.js?v=non-git:2)
complete @ graph.js?v=non-git:613
c @ jquery-3.5.0.min.js?v=non-git:2
fireWith @ jquery-3.5.0.min.js?v=non-git:2
l @ jquery-3.5.0.min.js?v=non-git:2
(anonymous) @ jquery-3.5.0.min.js?v=non-git:2
error (async)
send @ jquery-3.5.0.min.js?v=non-git:2
ajax @ jquery-3.5.0.min.js?v=non-git:2
Prometheus.Graph.submitQuery @ graph.js?v=non-git:579
(anonymous) @ graph.js?v=non-git:261
dispatch @ jquery-3.5.0.min.js?v=non-git:2
v.handle @ jquery-3.5.0.min.js?v=non-git:2
jquery-3.5.0.min.js?v=non-git:2 GET https://idp.wikimedia.org/login?service=https%3a%2f%2fthanos.wikimedia.org%2fapi%2fv1%2fquery%3fquery%3dsum_over_time(mysql_exporter_last_scrape_error%255B5m%255D)%2520%253E%25201%26dedup%3dtrue%26partial_response%3dtrue%26time%3d1606292667.372%26_%3d1606237352855 net::ERR_FAILED
send @ jquery-3.5.0.min.js?v=non-git:2
ajax @ jquery-3.5.0.min.js?v=non-git:2
Prometheus.Graph.submitQuery @ graph.js?v=non-git:579
(anonymous) @ graph.js?v=non-git:261
dispatch @ jquery-3.5.0.min.js?v=non-git:2
v.handle @ jquery-3.5.0.min.js?v=non-git:2

I'm not that familiar with the error logging; however, net::ERR_FAILED looks like it may be a more generic network error rather than something specific to CORS, which should have given a clear HTTP/1.1 403 Forbidden. I think the next thing to try is to trigger the pre-flight check manually from the JavaScript console, as it seems to work fine with curl*:

$ curl -I -X OPTIONS -H "Origin: https://thanos.wikimedia.org" -H 'Access-Control-Request-Method: GET' https://idp.wikimedia.org/login\?service\=https%3a%2f%2fthanos.wikimedia.org%2fapi%2fv1%2fquery%3fquery%3dsum_over_time\(mysql_exporter_last_scrape_error%255B5m%255D\)%2520%253E%25201%26dedup%3dtrue%26partial_response%3dfalse%26time%3d1606292354.478%26_%3d1606228074853.
HTTP/1.1 200 OK
vary: Origin,Access-Control-Request-Method,Access-Control-Request-Headers
access-control-allow-origin: https://thanos.wikimedia.org
access-control-allow-methods: GET
access-control-allow-credentials: true
access-control-max-age: 3600
date: Wed, 25 Nov 2020 11:11:39 GMT
x-envoy-upstream-service-time: 3
server: envoy
transfer-encoding: chunked
* I should say that when I first tested this with curl I got a 405 error for thanos and alerts. After failing over the idp and upgrading CAS on idp1001, I now receive HTTP 200 from both, so it's possible that there was a config mismatch which has now been resolved.
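
One difference between the two captures may be worth checking: the browser preflight in the task description carried `Access-Control-Request-Headers: x-requested-with`, while the curl command above did not send that header. The sketch below is a simplified version of the checks a browser applies to a preflight response (ok status, allowed origin, allowed method, allowed request headers), fed with the header sets quoted in this task. The function name `preflight_problems` is made up for illustration, it assumes `request_headers` lists only headers that are not CORS-safelisted, and it says nothing about what idp would actually return if curl were re-run with the extra request header.

```python
from typing import Iterable, List, Mapping

# Methods that never need to appear in Access-Control-Allow-Methods.
SAFELISTED_METHODS = {"GET", "HEAD", "POST"}


def _tokens(value: str) -> set:
    """Split a comma-separated header value into lower-cased tokens."""
    return {t.strip().lower() for t in value.split(",") if t.strip()}


def preflight_problems(
    status: int,
    headers: Mapping[str, str],
    origin: str,
    method: str,
    request_headers: Iterable[str] = (),
    with_credentials: bool = False,
) -> List[str]:
    """Return the reasons a browser would reject this preflight response."""
    h = {k.lower(): v.strip() for k, v in headers.items()}
    wildcard_ok = not with_credentials  # "*" is only honoured without credentials
    problems = []

    if not 200 <= status <= 299:
        problems.append(f"status {status} is not an ok status")

    allow_origin = h.get("access-control-allow-origin")
    if allow_origin is None:
        problems.append("no Access-Control-Allow-Origin header")
    elif allow_origin != origin and not (allow_origin == "*" and wildcard_ok):
        problems.append(f"origin {origin} not allowed (got {allow_origin})")

    if with_credentials and h.get("access-control-allow-credentials") != "true":
        problems.append("Access-Control-Allow-Credentials is not 'true'")

    methods = _tokens(h.get("access-control-allow-methods", ""))
    if not (
        method.upper() in SAFELISTED_METHODS
        or method.lower() in methods
        or ("*" in methods and wildcard_ok)
    ):
        problems.append(f"method {method} not allowed")

    allowed = _tokens(h.get("access-control-allow-headers", ""))
    if not ("*" in allowed and wildcard_ok):
        for name in request_headers:
            if name.lower() not in allowed:
                problems.append(f"request header {name} not allowed")

    return problems


ORIGIN = "https://thanos.wikimedia.org"

# Response headers of the 403 in the task description.
forbidden = {"vary": "Origin,Access-Control-Request-Method,Access-Control-Request-Headers"}
# Response headers of the curl run above.
curl_ok = {
    "access-control-allow-origin": ORIGIN,
    "access-control-allow-methods": "GET",
    "access-control-allow-credentials": "true",
    "access-control-max-age": "3600",
}

print(preflight_problems(403, forbidden, ORIGIN, "GET", ["x-requested-with"]))
print(preflight_problems(200, curl_ok, ORIGIN, "GET"))  # []
print(preflight_problems(200, curl_ok, ORIGIN, "GET", ["x-requested-with"]))
```

The 403 response fails on both points Chrome reported (no ok status, no `Access-Control-Allow-Origin`). The curl response passes as captured, but would not satisfy a preflight that asks for `x-requested-with` unless idp also returns a matching `Access-Control-Allow-Headers`, so re-running curl with `-H 'Access-Control-Request-Headers: x-requested-with'` could be a useful comparison.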

I have not been able to recreate this. Is this still causing an issue?

I'm still getting failures, but it's not clear where the issue is.

Firefox:

Chrome:

Do you get this error on all expressions, a specific expression, or sporadically? I have also tagged observability in case there is something other than CORS in play.

In fact observability is already tagged. @fgiunchedi, I wonder if this could be a more general issue?

I've only been trying with a single expression. It happens consistently once I've left the tab open for a few hours (i.e. if I open the URL, run the query, then leave it for a few hours, it will always fail).
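
The failing URLs quoted in this task give a rough idea of how old the tab was at each failure. The sketch below rests on two assumptions that the task does not confirm: that `time` is the query evaluation time in seconds, and that `_` is jQuery's cache-busting counter, seeded from the clock in milliseconds when the page loaded and incremented once per request. The helper name `tab_age_hours` is made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def tab_age_hours(api_url: str) -> float:
    """Estimate hours between page load and the failing query (hypothetical helper)."""
    params = parse_qs(urlsplit(api_url).query)
    evaluated_at = float(params["time"][0])   # assumed: seconds since epoch
    loaded_at = int(params["_"][0]) / 1000.0  # assumed: page-load time in ms
    return (evaluated_at - loaded_at) / 3600.0


# API URL from the task description (Firefox, 19 Nov).
first_report = (
    "https://thanos.wikimedia.org/api/v1/query?query=sum_over_time("
    "mysql_exporter_last_scrape_error%5B5m%5D)%20%3E%201&dedup=true"
    "&partial_response=false&time=1605794115.998&_=1605784494152"
)
# API URL from the later Chrome report (25 Nov).
later_report = (
    "https://thanos.wikimedia.org/api/v1/query?query=sum_over_time("
    "mysql_exporter_last_scrape_error%5B5m%5D)%20%3E%201&dedup=true"
    "&partial_response=true&time=1606292667.372&_=1606237352855"
)

print(round(tab_age_hours(first_report), 1))  # roughly 2.7
print(round(tab_age_hours(later_report), 1))  # roughly 15.4
```

Under those assumptions the failures were observed with tabs between a few hours and most of a day old, which fits the "left idle for a few hours" description.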

I've been running into this issue on Grafana as well: specifically, on SSO session refresh the XHRs issued by the browser start failing. Can we experiment with extending the session refresh? I'm not sure if there are any other mitigations available (reloading the page of course works).