
Gerrit is down "502 Proxy Error"
Closed, Resolved · Public

Description

Screenshot_2018-11-13 502 Proxy Error.png (attached screenshot)

Event Timeline

Legoktm triaged this task as Unbreak Now! priority. Nov 14 2018, 5:41 AM
Legoktm created this task.
Joe added a project: SRE.

I was still quite asleep, but I saw a series of broken pipes from sockets and Jetty refusing to accept any new connections in the logs, so I just restarted Gerrit. It is now working, so we can lower the priority of this ticket to "High", but the correlation with the new release seems clear to me.

Joe lowered the priority of this task from Unbreak Now! to High. Nov 14 2018, 5:49 AM

[21:45:43] <legoktm> I didn't realize it was upgraded today, I had been encountering weird behavior a few hours ago that I wasn't sure about
...
[21:58:11] <legoktm> I couldn't get https://gerrit.wikimedia.org/r/admin/groups/11 to load (works now)
[21:58:18] <legoktm> also libraryupgrader wasn't able to push patches
[21:59:32] <legoktm> libraryupgrader seems to be working now though

Joe removed Joe as the assignee of this task. Nov 14 2018, 8:23 AM
Joe subscribed.

That might be related to the Gerrit 2.15.6 upgrade (T205784). I am not familiar with Jetty, but we can at least dig into the logs on the cobalt server.
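For the log digging, a rough sketch of where to start looking (the Gerrit site path here is illustrative; the Apache log path is the one used further down):

# Gerrit / Jetty warnings and errors end up in the site's error_log (site path illustrative)
less /var/lib/gerrit/logs/error_log
# the Apache reverse-proxy access log on the same host
less /var/log/apache2/gerrit.wikimedia.org.https.access.log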

We had a lot of stack overflow errors on random pages such as Gitiles ones, caused by some very long pages being run through the prettifier. Then at 05:36:43 UTC we had a lot of:

[2018-11-14 05:36:43,643] [HTTP-8409] WARN  org.eclipse.jetty.util.thread.QueuedThreadPool : HTTP{STARTED,10<=60<=60,i=0,q=500} rejected org.eclipse.jetty.io.ManagedSelector$CreateEndPoint@3e829144
[2018-11-14 05:36:43,643] [HTTP-8409] WARN  org.eclipse.jetty.util.thread.strategy.ExecutingExecutionStrategy : Rejected execution of org.eclipse.jetty.io.ManagedSelector$CreateEndPoint@3e829144
[2018-11-14 05:36:43,666] [HTTP-8401] WARN  org.eclipse.jetty.util.thread.QueuedThreadPool : HTTP{STARTED,10<=60<=60,i=0,q=500} rejected org.eclipse.jetty.io.ManagedSelector$CreateEndPoint@4a74a19e
[2018-11-14 05:36:43,669] [HTTP-8401] WARN  org.eclipse.jetty.util.thread.strategy.ExecutingExecutionStrategy : Rejected execution of org.eclipse.jetty.io.ManagedSelector$CreateEndPoint@4a74a19e

Until it got shut down:

[2018-11-14 05:46:58,742] [ShutdownCallback] INFO  org.eclipse.jetty.server.AbstractConnector : Stopped ServerConnector@55bafbc1{HTTP/1.1,[http/1.1]}{127.0.0.1:8080}
[2018-11-14 05:46:58,752] [ShutdownCallback] INFO  org.eclipse.jetty.server.handler.ContextHandler : Stopped o.e.j.s.ServletContextHandler@487ee2a{/r,null,UNAVAILABLE}

That was then followed by a lot of:

[2018-11-14 05:47:28,753] [ShutdownCallback] WARN  org.eclipse.jetty.util.thread.QueuedThreadPool : HTTP{STOPPING,10<=60<=60,i=0,q=196} Couldn't stop Thread[HTTP-8315,5,main]
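For reference, the numbers in those QueuedThreadPool warnings read as minimum <= active <= maximum threads plus the idle count and current queue length, so HTTP{STARTED,10<=60<=60,i=0,q=500} means all 60 HTTP worker threads were busy, none idle, and 500 requests were already queued, at which point Jetty starts rejecting new connections. A quick way to check what limits Gerrit is configured with, assuming the httpd.maxThreads / httpd.maxQueued options govern this pool (site path illustrative):

# gerrit.config uses git-config syntax, so git config can read it directly
git config -f /var/lib/gerrit/etc/gerrit.config httpd.maxThreads   # upper bound on HTTP worker threads (the "60" above)
git config -f /var/lib/gerrit/etc/gerrit.config httpd.maxQueued    # how many requests may wait for a free thread (the "q=500" above)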

Seems to me a web crawler hit Gerrit / Gitiles and ended up exhausting the Jetty thread pool. We do have some logs in Apache2 though. By user agent:

grep  tiles /var/log/apache2/gerrit.wikimedia.org.https.access.log.1|cut -f 12|sort|uniq -c|sort -n|tail -n4
   1963 Mozilla/5.0 (compatible; SputnikBot/2.3; +http://corp.sputnik.ru/webmaster)
   3517 facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
   3710 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
  22448 Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)

Those stack overflow errors on very long Gitiles pages have happened with some frequency in the past.

For the short term we could add a robots.txt for Gerrit (we have one at https://gerrit.wikimedia.org/r/robots.txt but not at https://gerrit.wikimedia.org/g/robots.txt), but I don't know if that is the core of the problem.
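For illustration, a minimal robots.txt along those lines might simply keep crawlers away from the Gitiles paths entirely; the paths below are only examples, not necessarily what we would actually deploy:

User-agent: *
Disallow: /g/
Disallow: /r/plugins/gitiles/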

I am also unfamiliar with Jetty -- do we have any monitoring for it? I know there are plugins for Gerrit to send Dropwizard metrics to various dashboards, but I don't know if we already have something set up.

The interesting thing I found was that we started serving 5xx around 3:59 AM, which jibes with a drop in CPU on the Grafana server board for cobalt: https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=cobalt&var-network=eth0&from=1542159111813&to=1542193696517&orgId=1

Seems like all threads in the pool were busy doing something, so Jetty started rejecting new connections. sshd seemed fine during this period, FWIW. It's hard to know what was tying up the server without greater insight into Jetty.
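Next time it wedges, a JVM thread dump would at least show what all the HTTP-* threads are stuck on; a sketch (the user name and output path are illustrative):

# capture a thread dump from the running Gerrit JVM (substitute the actual java pid)
sudo -u gerrit2 jstack <gerrit-java-pid> > /tmp/gerrit-threads.$(date +%s).txt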

So the problem seems to have started between 02:00 and 02:12 UTC. There was a fairly large spike in outgoing traffic on eth0 between 02:10 and 02:12, at which point CPU load gradually fell off as an increasing proportion of requests were met with 5xx errors.
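For the record, this is roughly how I would bucket the proxy's 5xx responses by minute to pin down the start time. It assumes the standard combined log format (timestamp in field 4, status code in field 9); our access log is tab-separated (hence the cut -f 12 above), so the field numbers would need adjusting:

awk '$9 ~ /^5/ { print substr($4, 2, 17) }' /var/log/apache2/gerrit.wikimedia.org.https.access.log.1 | sort | uniq -c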

@thcipriani is going to try installing https://gerrit-review.googlesource.com/admin/repos/plugins/javamelody to hopefully collect some more useful data about the state of the JVM. At this point we are operating under the assumption that there is likely a new bug in Gerrit 2.15, since Gerrit was just upgraded yesterday. If that is the case, then we need to collect more useful information to track down the cause. It's either that or cosmic rays twiddled our bits.
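A rough sketch of what installing the plugin could look like (paths are illustrative; Gerrit picks up jars dropped into the site's plugins/ directory, and the dashboard is then normally served under the canonical web URL at /monitoring):

# copy a built javamelody.jar into the site's plugins directory
cp javamelody.jar /var/lib/gerrit/plugins/
# confirm Gerrit loaded it (needs administrative rights over the ssh API)
ssh -p 29418 gerrit.wikimedia.org gerrit plugin ls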

As for monitoring, it seems we already monitor the JVM with JMX. There is a task for Gerrit at T184086.
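For reference, JMX on the Gerrit JVM is typically switched on through container.javaOptions in gerrit.config; a sketch, with the port and site path illustrative (and the no-auth, no-ssl flags only acceptable if the port is not reachable from outside the host):

git config -f /var/lib/gerrit/etc/gerrit.config --add container.javaOptions "-Dcom.sun.management.jmxremote.port=9010"
git config -f /var/lib/gerrit/etc/gerrit.config --add container.javaOptions "-Dcom.sun.management.jmxremote.authenticate=false"
git config -f /var/lib/gerrit/etc/gerrit.config --add container.javaOptions "-Dcom.sun.management.jmxremote.ssl=false"
# Gerrit needs a restart for the new JVM options to take effect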

For the CI Jenkins, the JavaMelody interface is at https://integration.wikimedia.org/ci/monitoring

Change 473638 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Gerrit: add basic robots.txt for proxy

https://gerrit.wikimedia.org/r/473638

Change 473638 merged by Giuseppe Lavagetto:
[operations/puppet@production] Gerrit: add basic robots.txt for proxy

https://gerrit.wikimedia.org/r/473638

thcipriani claimed this task.

Two days later -- after adding in some better monitoring and blocking bots from indexing git blame for giant files with a lot of history -- we seem to be in a stable place.
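For reference, a sketch of how that kind of bot blocking can be expressed at the Apache proxy; the user agents and the URL pattern below are illustrative, not the exact rules that were deployed:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} (SemrushBot|YandexBot|SputnikBot) [NC]
RewriteCond %{REQUEST_URI} /\+blame/
RewriteRule ^ - [F]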

I'm going to close this as resolved.