
WebPageTest and WebPageReplay data unreachable in Grafana
Closed, Resolved (Public)

Description

When accessing https://grafana.wikimedia.org/d/000000210/webpagetest?orgId=1, all graphs get a 503. All graphs from our proxied Graphite instance get a 503.

Screen Shot 2020-03-13 at 8.21.46 AM.png (894×1 px, 137 KB)

The console gives me:

Screen Shot 2020-03-13 at 8.21.29 AM.png (362×2 px, 125 KB)

I logged into the Graphite instance and everything looks OK in the log. By OK I mean that I could see metrics coming in from our instances and I couldn't see any errors in the logs.

My guess is that something has happened with how we proxy the instance. @CDanis, do you know if something changed with that in the last day?

Event Timeline

Peter added a subscriber: dpifke.

Also including @dpifke: I think I've missed talking about our own Graphite instance and how it's set up?

Mentioned in SAL (#wikimedia-operations) [2020-03-13T14:57:24Z] <cdanis> T247586 ✔️ cdanis@grafana1002.eqiad.wmnet ~ 🕥☕ sudo systemctl restart apache2.service

A guess for what changed: the webproxy servers in eqiad and codfw were recently replaced with new servers, in eqiad's case install1003.wikimedia.org.

I don't think Apache knows to re-resolve a CNAME like webproxy.eqiad.wmnet (the docs for mod_proxy seem to indicate that name resolution happens once per worker process), so I did a restart, but that still wasn't sufficient to get it to work.
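To illustrate the suspected failure mode, here is a minimal Python sketch of a worker that resolves a name once and keeps reusing the answer until it is restarted. This is not Apache's actual code: `StickyResolver`, the `lookup` callable, the hostname `webproxy.example`, and the addresses (taken from a documentation range) are all hypothetical.

```python
from typing import Callable, Dict


class StickyResolver:
    """Resolves each hostname once and reuses the answer until restart()."""

    def __init__(self, lookup: Callable[[str], str]) -> None:
        self._lookup = lookup              # hypothetical stand-in for a DNS query
        self._cache: Dict[str, str] = {}

    def resolve(self, host: str) -> str:
        # Only the first call for a host reaches the lookup function.
        if host not in self._cache:
            self._cache[host] = self._lookup(host)
        return self._cache[host]

    def restart(self) -> None:
        # A process restart throws away the cached answers.
        self._cache.clear()


# Fake DNS table standing in for the real resolver (illustrative addresses).
dns = {"webproxy.example": "192.0.2.10"}
worker = StickyResolver(lambda host: dns[host])

before = worker.resolve("webproxy.example")
dns["webproxy.example"] = "192.0.2.20"     # the name now points at a new server
stale = worker.resolve("webproxy.example")  # still the old address
worker.restart()
fresh = worker.resolve("webproxy.example")  # picks up the new address
```

In this sketch a restart is enough to pick up the new address, which is why the restart was worth trying. Since it didn't help here, the cause had to be somewhere else.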

Now it looks like the AWS host is only accepting connections from install1002, and not from install1003.

It might make sense to white-list our entire IP space (or all of eqiad and codfw?) for future-proofing.
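As a rough sketch of why a range-based allow-list is more future-proof than a per-host rule, the Python below checks a client address against a list of CIDR blocks. The helper `is_allowed` and all addresses are assumptions for illustration (documentation ranges, not our real IP space), and the actual rule on the AWS side would live in whatever firewall or security-group mechanism that host uses.

```python
import ipaddress
from typing import Iterable


def is_allowed(client_ip: str, allowed_cidrs: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any of the allowed CIDR blocks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Hypothetical addresses standing in for the old and new proxy hosts.
old_proxy = "198.51.100.12"
new_proxy = "198.51.100.13"

SINGLE_HOST = ["198.51.100.12/32"]   # rule that names only the old proxy
WHOLE_RANGE = ["198.51.100.0/24"]    # rule that covers the whole range

print(is_allowed(old_proxy, SINGLE_HOST))  # old proxy matches the host rule
print(is_allowed(new_proxy, SINGLE_HOST))  # replacement host is rejected
print(is_allowed(new_proxy, WHOLE_RANGE))  # range rule survives the replacement
```

With the per-host rule, replacing the proxy server silently breaks access. With the range rule, any host inside the block keeps working.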

Thank you @CDanis, I have opened up for install1003 for now, and I can open up for our IP space next week (what are the correct IPs?).

Thank you @CDanis!

After I changed the blocking, our alerts started to fire because of timeouts:

Screen Shot 2020-03-15 at 3.04.53 PM.png (320×606 px, 69 KB)

Also, when using Grafana against that Graphite instance, the graphs take longer to load (at least that is my feeling). Was there another change? I'm pretty sure that started to happen at the same time as the change to install1003.

Ah, forget that. When I opened up for our range, it started to work as before :)

Re-opening: alerts are still timing out, which they didn't do before. I can also see that the graphs load slower than before.

I temporarily opened up to make sure nothing was wrong on the AWS side but still get those timeouts. I can see that some data comes through but not everything, every time. @CDanis can you see something on the proxy side?

Seems like there are some as-yet undiagnosed issues with the squid proxy in eqiad.

For now since it isn't much traffic I think I'm going to repoint grafana cross-cluster to the codfw proxy.

CDanis added a subscriber: fgiunchedi.

OK, @fgiunchedi fixed the squid issue on install1003 (thanks!). The WebPageTest alerts dashboard feels fast now. I think this is resolved, please reopen if not.