Paste P8313

gerrit crash 20190329

Authored by Dzahn on Mar 29 2019, 12:16 PM.
problem starts, icinga and users notice:
07:37 < _joe_> fatal: unable to access 'https://gerrit.wikimedia.org/r/operations/puppet/': The requested URL returned error: 502
07:38 < _joe_> ok gerrit seems to be down again
07:38 < arturo> https://www.irccloud.com/pastebin/OYOfqnzN/
07:38 < marostegui> confirmed down yep
--
07:39 <+icinga-wm> PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
07:40 <+icinga-wm> PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
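note: the two icinga alerts above amount to an HTTP probe with a 10-second socket timeout. A minimal sketch of such a probe follows, assuming plain Python stdlib; the URL and the 10 seconds come from the alert text, while `classify` and `probe` are hypothetical helpers and not the actual icinga check.

```python
import socket
import urllib.error
import urllib.request

# URL copied from the icinga alert; the 10-second default mirrors
# "Socket timeout after 10 seconds". Everything else is illustrative.
HEALTHCHECK_URL = "https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus"


def classify(status_code, error=None):
    """Map a probe outcome to an icinga-style state string (hypothetical helper)."""
    if error is not None:
        return f"CRITICAL - {error}"
    if status_code == 200:
        return "OK - HTTP 200"
    return f"CRITICAL - HTTP {status_code}"


def probe(url=HEALTHCHECK_URL, timeout=10.0):
    """Fetch the URL once and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as exc:
        # HTTP-level failure, e.g. the 502 reported at 07:37.
        return classify(exc.code)
    except (urllib.error.URLError, socket.timeout, TimeoutError) as exc:
        # No usable HTTP response: timeout, refused connection, DNS failure.
        return classify(None, error=f"no response within {timeout:g}s: {exc}")
```

Calling `probe()` performs one real request, so it is left out of the block; `classify(502)` shows how the 502 from 07:37 would be reported.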
analysis:
07:38 < _joe_> and it was determined this is an issue with the latest version AIUI?
07:43 < mutante> status is active (running). error log shows something about NoteDbBatchUpdate.execute. did not execute anything. waiting
07:44 < mutante> Caused by: com.google.gerrit.server.git.LockFailureException: Update aborted with one or more lock failures: ?
07:46 < _joe_> [2019-03-29 11:46:12,388] [ShutdownCallback] WARN org.eclipse.jetty.util.thread.QueuedThreadPool : HTTP{STOPPING,10<=60<=60,i=0,q=58} Couldn't stop Thread[HTTP-8495,5,main]
07:46 < _joe_> hundreds of these
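note: the `HTTP{STOPPING,10<=60<=60,i=0,q=58}` part of the Jetty warning can be read mechanically. The sketch below assumes the fields mean min<=current<=max threads, idle threads and queued jobs, which is how Jetty's QueuedThreadPool appears to print its state; the helper names are hypothetical.

```python
import re

# Log line copied from the 07:46 message above.
SAMPLE = (
    "[2019-03-29 11:46:12,388] [ShutdownCallback] WARN "
    "org.eclipse.jetty.util.thread.QueuedThreadPool : "
    "HTTP{STOPPING,10<=60<=60,i=0,q=58} Couldn't stop Thread[HTTP-8495,5,main]"
)

# Assumed layout: NAME{STATE,min<=current<=max,i=idle,q=queued}
POOL_RE = re.compile(
    r"(?P<name>\w+)\{(?P<state>\w+),"
    r"(?P<min>\d+)<=(?P<threads>\d+)<=(?P<max>\d+),"
    r"i=(?P<idle>\d+),q=(?P<queued>\d+)\}"
)


def parse_pool_state(line):
    """Return the pool fields as a dict, or None if the line has no pool state."""
    match = POOL_RE.search(line)
    if match is None:
        return None
    # Numeric fields become ints; name and state stay strings.
    return {k: (int(v) if v.isdigit() else v) for k, v in match.groupdict().items()}


def is_saturated(pool):
    """True when every allowed thread exists and none of them is idle."""
    return pool["threads"] >= pool["max"] and pool["idle"] == 0


pool = parse_pool_state(SAMPLE)
print(pool, is_saturated(pool))
```

Under that reading the pool was at its maximum of 60 threads with none idle and 58 jobs still queued, which would fit the "hanging threads" remark later in the log.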
07:47 < _joe_> org.eclipse.jgit.errors.ConfigInvalidException: Invalid external ID config for note 'da39a3ee5e6b4b0d3255bfef95601890afd80709': Expected exactly 1 'externalId' section, found 0
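note: the note ID in the ConfigInvalidException above is identical to the SHA-1 digest of zero bytes of input. The sketch below only verifies that equality. Whether it means Gerrit was looking at an empty external ID key or an empty note is not established by this log.

```python
import hashlib

# Note ID copied from the exception message above.
NOTE_ID = "da39a3ee5e6b4b0d3255bfef95601890afd80709"


def sha1_hex(data: bytes) -> str:
    """Hex SHA-1 digest of the given bytes."""
    return hashlib.sha1(data).hexdigest()


# Compare the digest of empty input with the note ID from the log.
matches_empty_input = sha1_hex(b"") == NOTE_ID
print(matches_empty_input)
```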
restarting service:
07:44 < _joe_> mutante: restart it, it's the same error as tonight
07:45 < mutante> !log cobalt - systemctl restart gerrit
07:47 < mutante> i ran systemctl restart gerrit and finally it finished
07:47 < mutante> after quite some waiting
07:47 < _joe_> mutante: hanging threads that wouldn't terminate
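note: the restart above blocked for a while on threads that would not stop. A minimal sketch of wrapping such a command with a deadline follows, assuming Python's subprocess module; `run_with_deadline` is a hypothetical helper and the 300 seconds in the comment is an arbitrary illustration, not a value from the log.

```python
import subprocess


def run_with_deadline(cmd, deadline_s):
    """Run cmd and return (finished, returncode); finished is False on timeout."""
    try:
        done = subprocess.run(cmd, timeout=deadline_s, check=False)
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return (False, None)
    return (True, done.returncode)


# On the Gerrit host, the command from the !log entry would be wrapped as:
#   run_with_deadline(["systemctl", "restart", "gerrit"], deadline_s=300)
```

This only bounds how long the caller waits. It does not by itself change what systemd does with the unit, so a `systemctl status gerrit` afterwards is still needed to see where the service ended up.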
recovery:
07:47 < mutante> gerrit back for me
07:47 <+icinga-wm> RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 950 bytes in 0.090 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
outcomes:
- improved docs:
07:44 < marostegui> mutante: Just asking, restarting gerrit is just like any other service restart, right, via systemctl, right?
07:44 < mutante> marostegui: yes, systemctl status gerrit etc
07:45 < marostegui> mutante: I will add that to https://wikitech.wikimedia.org/wiki/Gerrit just to have it more complete :)
- downgrade?
07:46 < apergos> (02:10:29 AM) thcipriani: I think I'm going to downgrade back to gerrit 2.15.11 tomorrow. I think there may be some kind of threadlock issue in the latest jgit version :\
questions:
07:45 < _joe_> can we downgrade gerrit?
07:45 < apergos> it is planned for today iirc
07:47 < _joe_> apergos: yeah, I'm not convinced this isn't creating all kinds of inconsistencies in our git repos