Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Gerrit: add basic robots.txt for proxy | operations/puppet | production | +9 -0
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | thcipriani | T209456 Gerrit is down "502 Proxy Error"
Resolved | | thcipriani | T209526 Add javamelody to gerrit
Event Timeline
I was still quite asleep, but I saw a series of broken pipes from sockets and Jetty refusing to accept any new connections in the logs, so I just restarted Gerrit. It is now working, so we can lower the priority of this ticket to "High", but the correlation with the new release seems clear to me.
[21:45:43] <legoktm> I didn't realize it was upgraded today, I had been encountering weird behavior a few hours ago that I wasn't sure about
...
[21:58:11] <legoktm> I couldn't get https://gerrit.wikimedia.org/r/admin/groups/11 to load (works now)
[21:58:18] <legoktm> also libraryupgrader wasn't able to push patches
[21:59:32] <legoktm> libraryupgrader seems to be working now though
That might be related to the Gerrit 2.15.6 upgrade (T205784). I am not familiar with Jetty, but we can at least dig into the logs on the cobalt server.
We had a lot of stack overflow errors on random pages such as Gitiles ones; that is due to some pages taking a very long time to prettify. Then at 05:36:43 UTC we had a lot of:
```
[2018-11-14 05:36:43,643] [HTTP-8409] WARN org.eclipse.jetty.util.thread.QueuedThreadPool : HTTP{STARTED,10<=60<=60,i=0,q=500} rejected org.eclipse.jetty.io.ManagedSelector$CreateEndPoint@3e829144
[2018-11-14 05:36:43,643] [HTTP-8409] WARN org.eclipse.jetty.util.thread.strategy.ExecutingExecutionStrategy : Rejected execution of org.eclipse.jetty.io.ManagedSelector$CreateEndPoint@3e829144
[2018-11-14 05:36:43,666] [HTTP-8401] WARN org.eclipse.jetty.util.thread.QueuedThreadPool : HTTP{STARTED,10<=60<=60,i=0,q=500} rejected org.eclipse.jetty.io.ManagedSelector$CreateEndPoint@4a74a19e
[2018-11-14 05:36:43,669] [HTTP-8401] WARN org.eclipse.jetty.util.thread.strategy.ExecutingExecutionStrategy : Rejected execution of org.eclipse.jetty.io.ManagedSelector$CreateEndPoint@4a74a19e
```
Until it got shut down:
```
[2018-11-14 05:46:58,742] [ShutdownCallback] INFO org.eclipse.jetty.server.AbstractConnector : Stopped ServerConnector@55bafbc1{HTTP/1.1,[http/1.1]}{127.0.0.1:8080}
[2018-11-14 05:46:58,752] [ShutdownCallback] INFO org.eclipse.jetty.server.handler.ContextHandler : Stopped o.e.j.s.ServletContextHandler@487ee2a{/r,null,UNAVAILABLE}
```
Then followed by a lot of:
```
[2018-11-14 05:47:28,753] [ShutdownCallback] WARN org.eclipse.jetty.util.thread.QueuedThreadPool : HTTP{STOPPING,10<=60<=60,i=0,q=196} Couldn't stop Thread[HTTP-8315,5,main]
```
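For what it's worth, the pool status string in those warnings, `10<=60<=60,i=0,q=500`, appears to decode as: minimum 10 threads, all 60 of the 60-thread maximum in use, 0 idle, and 500 jobs queued, i.e. the pool was fully saturated and rejecting new work. A quick way to see when the rejections started is to bucket the warnings by minute. A sketch, assuming the log format shown above; the function name and log location are illustrative:

```shell
# Tally QueuedThreadPool rejections per minute from a Gerrit error log
# read on stdin (log format assumed from the snippets above).
rejections_per_minute() {
  grep 'QueuedThreadPool.*rejected' \
    | awk '{gsub(/\[/, "", $1); print $1, substr($2, 1, 5)}' \
    | sort | uniq -c
}

# Real usage would be: rejections_per_minute < /path/to/gerrit/error_log
# Demo with two sample lines:
rejections_per_minute <<'EOF'
[2018-11-14 05:36:43,643] [HTTP-8409] WARN org.eclipse.jetty.util.thread.QueuedThreadPool : HTTP{STARTED,10<=60<=60,i=0,q=500} rejected a
[2018-11-14 05:36:43,666] [HTTP-8401] WARN org.eclipse.jetty.util.thread.QueuedThreadPool : HTTP{STARTED,10<=60<=60,i=0,q=500} rejected b
EOF
```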
Seems to me a web crawler has hit Gerrit / Gitiles and ended up exhausting the Jetty thread pool. We do have some logs in Apache2, though. By user agent:
```
$ grep tiles /var/log/apache2/gerrit.wikimedia.org.https.access.log.1 | cut -f 12 | sort | uniq -c | sort -n | tail -n4
   1963 Mozilla/5.0 (compatible; SputnikBot/2.3; +http://corp.sputnik.ru/webmaster)
   3517 facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
   3710 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
  22448 Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)
```
Those have happened with some frequency in the past.
For the short term we could add a robots.txt for Gerrit (which we have for https://gerrit.wikimedia.org/r/robots.txt but not for https://gerrit.wikimedia.org/g/robots.txt ), but I don't know if that is the core of the problem.
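As a sketch of what such a file might look like (this is a hypothetical example; the actual contents of the eventual patch may differ), blocking crawlers from the expensive Gitiles paths while leaving the review UI alone:

```
# Hypothetical robots.txt for the Gerrit proxy; actual paths may differ.
User-agent: *
Disallow: /g/
Disallow: /r/plugins/gitiles/
```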
I am also unfamiliar with Jetty -- do we have any monitoring for it? I know there are plugins for Gerrit to send Dropwizard metrics to various dashboards, but I don't know if we already have something set up.
The interesting thing I found was that we started serving 5xx around 03:59, which jibes with a drop in CPU on the Grafana server board for cobalt: https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=cobalt&var-network=eth0&from=1542159111813&to=1542193696517&orgId=1
Seems like all threads in the pool were busy doing something and started rejecting new work. sshd seemed fine during this period, FWIW. It's hard to know what was tying up the server without greater insight into Jetty.
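One low-tech way to get that insight next time it wedges: take a thread dump of the Gerrit JVM with `jstack <pid>` (a standard JDK tool) and summarize what the pool threads are doing. A sketch; the sample dump lines below are fabricated, and real output will vary:

```shell
# Summarize thread states from a `jstack` dump read on stdin.
thread_states() {
  grep 'java.lang.Thread.State:' | awk '{print $2}' | sort | uniq -c | sort -rn
}

# Real usage would be: jstack <gerrit-pid> | thread_states
# Demo with fabricated dump lines:
thread_states <<'EOF'
   java.lang.Thread.State: BLOCKED (on object monitor)
   java.lang.Thread.State: RUNNABLE
EOF
```

A pile of threads stuck in `BLOCKED` or `WAITING` on the same monitor would point at what was tying up the pool.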
So the problem seems to have started between 02:00 and 02:12 UTC. There was a fairly large spike in outgoing traffic on eth0 between 02:10 and 02:12, at which point CPU load gradually falls off as an increasing proportion of requests are met with 5xx errors.
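The 5xx onset can also be pinned down from the Apache access log itself. A sketch, assuming the Apache combined log format with the status code in field 9 (the Wikimedia log format may put fields elsewhere, so adjust the field numbers as needed):

```shell
# Count 5xx responses per hour from an Apache combined-format access log
# read on stdin (field positions assumed; adjust for the actual log format).
errors_per_hour() {
  awk '$9 ~ /^5/ {print substr($4, 2, 14)}' | sort | uniq -c
}

# Real usage: errors_per_hour < /var/log/apache2/gerrit.wikimedia.org.https.access.log.1
# Demo with fabricated log lines:
errors_per_hour <<'EOF'
127.0.0.1 - - [14/Nov/2018:03:59:02 +0000] "GET /g/foo HTTP/1.1" 502 341 "-" "bot"
127.0.0.1 - - [14/Nov/2018:03:59:10 +0000] "GET /g/bar HTTP/1.1" 502 341 "-" "bot"
127.0.0.1 - - [14/Nov/2018:03:58:00 +0000] "GET /r/ HTTP/1.1" 200 512 "-" "bot"
EOF
```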
@thcipriani is going to try installing https://gerrit-review.googlesource.com/admin/repos/plugins/javamelody to hopefully collect some more useful data about the state of the JVM. At this point we are operating under the assumption that there is likely a new bug in Gerrit 2.15, since Gerrit was just upgraded yesterday. If that is the case, then we need to collect more useful information to track down the cause. It's either that or cosmic rays twiddled our bits.
For monitoring, it seems we watch the JVM via JMX. There is a task for Gerrit at T184086.
For the CI Jenkins, the JavaMelody interface is at https://integration.wikimedia.org/ci/monitoring
Change 473638 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Gerrit: add basic robots.txt for proxy
Change 473638 merged by Giuseppe Lavagetto:
[operations/puppet@production] Gerrit: add basic robots.txt for proxy
Two days later -- after adding in some better monitoring and blocking bots from indexing git blame for giant files with a lot of history -- we seem to be in a stable place.
I'm going to close this as resolved.