We have an RRDP status alert firing, where Grafana shows https://rrdp.ripe.net/notification.xml=-1
I think this means that the query to that URL times out.
As it completes properly from codfw I'm wondering if it's not an issue with the webproxies (overloaded or similar).
Any idea who can help look into it?
akosiaris@cumin2001:~$ curl -x webproxy.codfw.wmnet:8080 https://rrdp.ripe.net/notification.xml=-1
<html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
akosiaris@cumin2001:~$ curl -x webproxy.eqiad.wmnet:8080 https://rrdp.ripe.net/notification.xml=-1
<html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
This 404s as well for me currently. So I guess the webproxies are fine?
The "=-1" is not part of the URL. Grafana reports Routinator pulling data from https://rrdp.ripe.net/notification.xml as a -1 on its graph, and I think -1 means timeout.
So the correct URL is https://rrdp.ripe.net/notification.xml indeed.
See https://grafana.wikimedia.org/d/UwUa77GZk/rpki?orgId=1&from=now-24h&to=now&fullscreen&panelId=56
vs. codfw:
https://grafana.wikimedia.org/d/UwUa77GZk/rpki?orgId=1&from=now-24h&to=now&fullscreen&panelId=55
As it keeps flapping I (temporarily) disabled the alert in eqiad, and we can rely on codfw if there is an actual issue.
- Routinator was upgraded in T252010, which helped remove the "dubious" targets.
- Since this task was opened, the proxies have been moved to new hosts and performance has increased.
- Alerting has been tuned to only trigger on HTTP codes > 399. As it's not possible to control the repositories we connect to, there will always be a risk of alerts.
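To illustrate the tuned rule, here is a minimal Python sketch, not the actual alert definition. It assumes the status is shown as "URL=status", as in this task, and that -1 means a timeout, as guessed above. The function names `split_legend_entry` and `should_alert` are hypothetical.

```python
def split_legend_entry(entry: str) -> tuple[str, int]:
    """Split a 'URL=status' entry (format assumed from this task) into (url, status)."""
    # Split on the last '=' so an '=' inside the URL itself is preserved.
    url, _, status = entry.rpartition("=")
    return url, int(status)


def should_alert(status: int) -> bool:
    """Alert only on HTTP codes above 399; -1 (assumed timeout) does not alert."""
    return status > 399


if __name__ == "__main__":
    url, status = split_legend_entry("https://rrdp.ripe.net/notification.xml=-1")
    print(url, status, should_alert(status))
```

With this rule a flapping timeout (-1) stays quiet, while a repository answering with a 4xx or 5xx code still alerts.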