Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Gerrit: add basic robots.txt for proxy | operations/puppet | production | +9 -0
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | thcipriani | T209456 Gerrit is down "502 Proxy Error"
Resolved | | thcipriani | T209526 Add javamelody to gerrit
Event Timeline
I was still quite asleep, but I saw a series of broken pipes from sockets and Jetty refusing to accept any new connections in the logs, so I just restarted Gerrit. It is now working, so we can lower the priority of this ticket to "High", but the correlation with the new release seems clear to me.
[21:45:43] <legoktm> I didn't realize it was upgraded today, I had been encountering weird behavior a few hours ago that I wasn't sure about
...
[21:58:11] <legoktm> I couldn't get https://gerrit.wikimedia.org/r/admin/groups/11 to load (works now)
[21:58:18] <legoktm> also libraryupgrader wasn't able to push patches
[21:59:32] <legoktm> libraryupgrader seems to be working now though
That might be related to the Gerrit 2.15.6 upgrade (T205784). I am not familiar with Jetty, but we can at least dig into the logs on the cobalt server.
We had a lot of stack overflow errors on random pages such as Gitiles ones; that is due to some pages taking a very long time to prettify. Then at 05:36:43 UTC we had a lot of:
```
[2018-11-14 05:36:43,643] [HTTP-8409] WARN org.eclipse.jetty.util.thread.QueuedThreadPool : HTTP{STARTED,10<=60<=60,i=0,q=500} rejected org.eclipse.jetty.io.ManagedSelector$CreateEndPoint@3e829144
[2018-11-14 05:36:43,643] [HTTP-8409] WARN org.eclipse.jetty.util.thread.strategy.ExecutingExecutionStrategy : Rejected execution of org.eclipse.jetty.io.ManagedSelector$CreateEndPoint@3e829144
[2018-11-14 05:36:43,666] [HTTP-8401] WARN org.eclipse.jetty.util.thread.QueuedThreadPool : HTTP{STARTED,10<=60<=60,i=0,q=500} rejected org.eclipse.jetty.io.ManagedSelector$CreateEndPoint@4a74a19e
[2018-11-14 05:36:43,669] [HTTP-8401] WARN org.eclipse.jetty.util.thread.strategy.ExecutingExecutionStrategy : Rejected execution of org.eclipse.jetty.io.ManagedSelector$CreateEndPoint@4a74a19e
```
Until it got shut down:
```
[2018-11-14 05:46:58,742] [ShutdownCallback] INFO org.eclipse.jetty.server.AbstractConnector : Stopped ServerConnector@55bafbc1{HTTP/1.1,[http/1.1]}{127.0.0.1:8080}
[2018-11-14 05:46:58,752] [ShutdownCallback] INFO org.eclipse.jetty.server.handler.ContextHandler : Stopped o.e.j.s.ServletContextHandler@487ee2a{/r,null,UNAVAILABLE}
```
Then followed by a lot of:
```
[2018-11-14 05:47:28,753] [ShutdownCallback] WARN org.eclipse.jetty.util.thread.QueuedThreadPool : HTTP{STOPPING,10<=60<=60,i=0,q=196} Couldn't stop Thread[HTTP-8315,5,main]
```
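For what it's worth, the pool status string in those warnings, `10<=60<=60,i=0,q=500`, appears to decode as: minimum 10 threads, all 60 of the 60-thread maximum in use, 0 idle, and 500 jobs queued, i.e. the pool was fully saturated and rejecting new work. A quick way to see when the rejections started is to bucket the warnings by minute. A sketch, assuming the log format shown above; the function name and log location are illustrative:

```shell
# Tally QueuedThreadPool rejections per minute from a Gerrit error log
# read on stdin (log format assumed from the snippets above).
rejections_per_minute() {
  grep 'QueuedThreadPool.*rejected' \
    | awk '{gsub(/\[/, "", $1); print $1, substr($2, 1, 5)}' \
    | sort | uniq -c
}

# Real usage would be: rejections_per_minute < /path/to/gerrit/error_log
# Demo with two sample lines:
rejections_per_minute <<'EOF'
[2018-11-14 05:36:43,643] [HTTP-8409] WARN org.eclipse.jetty.util.thread.QueuedThreadPool : HTTP{STARTED,10<=60<=60,i=0,q=500} rejected a
[2018-11-14 05:36:43,666] [HTTP-8401] WARN org.eclipse.jetty.util.thread.QueuedThreadPool : HTTP{STARTED,10<=60<=60,i=0,q=500} rejected b
EOF
```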
Seems to me a web crawler has hit Gerrit / Gitiles and ended up exhausting the Jetty thread pool. We do have some logs in Apache2, though. By user agent:
```
$ grep tiles /var/log/apache2/gerrit.wikimedia.org.https.access.log.1 | cut -f 12 | sort | uniq -c | sort -n | tail -n4
   1963 Mozilla/5.0 (compatible; SputnikBot/2.3; +http://corp.sputnik.ru/webmaster)
   3517 facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
   3710 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
  22448 Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)
```
Those have happened with some frequency in the past.
For the short term we could add a robots.txt for Gerrit (which we have for https://gerrit.wikimedia.org/r/robots.txt but not for https://gerrit.wikimedia.org/g/robots.txt ), but I don't know if that is the core of the problem.
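As a sketch of what such a file might look like (this is a hypothetical example; the actual contents of the eventual patch may differ), blocking crawlers from the expensive Gitiles paths while leaving the review UI alone:

```
# Hypothetical robots.txt for the Gerrit proxy; actual paths may differ.
User-agent: *
Disallow: /g/
Disallow: /r/plugins/gitiles/
```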
I am also unfamiliar with Jetty -- do we have any monitoring for it? I know there are plugins for Gerrit to send Dropwizard metrics to various dashboards, but I don't know if we already have something set up.
The interesting thing I found was that we started serving 5xx around 03:59, which jibes with a drop in CPU on the Grafana server board for cobalt: https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=cobalt&var-network=eth0&from=1542159111813&to=1542193696517&orgId=1
Seems like all threads in the pool were busy doing something and started rejecting new work. sshd seemed fine during this period, FWIW. It's hard to know what was tying up the server without greater insight into Jetty.
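One low-tech way to get that insight next time it wedges: take a thread dump of the Gerrit JVM with `jstack <pid>` (a standard JDK tool) and summarize what the pool threads are doing. A sketch; the sample dump lines below are fabricated, and real output will vary:

```shell
# Summarize thread states from a `jstack` dump read on stdin.
thread_states() {
  grep 'java.lang.Thread.State:' | awk '{print $2}' | sort | uniq -c | sort -rn
}

# Real usage would be: jstack <gerrit-pid> | thread_states
# Demo with fabricated dump lines:
thread_states <<'EOF'
   java.lang.Thread.State: BLOCKED (on object monitor)
   java.lang.Thread.State: RUNNABLE
EOF
```

A pile of threads stuck in `BLOCKED` or `WAITING` on the same monitor would point at what was tying up the pool.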
So the problem seems to have started between 02:00 and 02:12 UTC. There was a fairly large spike in outgoing traffic on eth0 between 02:10 and 02:12, at which point CPU load gradually falls off as an increasing proportion of requests are met with 5xx errors.
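The 5xx onset can also be pinned down from the Apache access log itself. A sketch, assuming the Apache combined log format with the status code in field 9 (the Wikimedia log format may put fields elsewhere, so adjust the field numbers as needed):

```shell
# Count 5xx responses per hour from an Apache combined-format access log
# read on stdin (field positions assumed; adjust for the actual log format).
errors_per_hour() {
  awk '$9 ~ /^5/ {print substr($4, 2, 14)}' | sort | uniq -c
}

# Real usage: errors_per_hour < /var/log/apache2/gerrit.wikimedia.org.https.access.log.1
# Demo with fabricated log lines:
errors_per_hour <<'EOF'
127.0.0.1 - - [14/Nov/2018:03:59:02 +0000] "GET /g/foo HTTP/1.1" 502 341 "-" "bot"
127.0.0.1 - - [14/Nov/2018:03:59:10 +0000] "GET /g/bar HTTP/1.1" 502 341 "-" "bot"
127.0.0.1 - - [14/Nov/2018:03:58:00 +0000] "GET /r/ HTTP/1.1" 200 512 "-" "bot"
EOF
```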
@thcipriani is going to try installing https://gerrit-review.googlesource.com/admin/repos/plugins/javamelody to hopefully collect some more useful data about the state of the JVM. At this point we are operating under the assumption that there is likely a new bug in Gerrit 2.15, since Gerrit was just upgraded yesterday. If that is the case, then we need to collect more useful information to track down the cause. It's either that or cosmic rays twiddled our bits.
For monitoring, it seems we watch the JVM via JMX. There is a task for Gerrit at T184086.
For the CI Jenkins, the JavaMelody interface is at https://integration.wikimedia.org/ci/monitoring
Change 473638 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Gerrit: add basic robots.txt for proxy
Change 473638 merged by Giuseppe Lavagetto:
[operations/puppet@production] Gerrit: add basic robots.txt for proxy
Two days later -- after adding in some better monitoring and blocking bots from indexing git blame for giant files with a lot of history -- we seem to be in a stable place.
I'm going to close this as resolved.