Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working
Closed, ResolvedPublic

Description

Gerrit has been failing to start on gerrit2001 since the service was restarted. It writes nothing to its logs and does not start sshd.

(Note: gerrit2001 is not serving gerrit.wikimedia.org yet; cobalt is running fine.)

We tried

  • bin/gerrit.sh start
  • bin/gerrit.sh run
  • restarting the server.
  • Running init.

@demon ran init and found it was stalling on the db. We then found that the firewall is blocking gerrit2001 from connecting. We need to fix the firewall and then retry init. (Requires a DBA to fix the firewall.)
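
A quick way to confirm this kind of firewall block from the affected host is a plain TCP connection attempt to the database port. The following is a minimal sketch, not something that was run as part of this task; the function name `can_connect` is made up for illustration, and the hostname and port in the usage note are taken from later comments here (m2-master.codfw.wmnet, port 3306).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Illustrative helper (hypothetical name), not part of Gerrit or Puppet.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers refused connections, timeouts (e.g. a firewall
        # silently dropping packets) and DNS resolution failures.
        return False
```

Running `can_connect("m2-master.codfw.wmnet", 3306)` on gerrit2001 would be expected to return False while the firewall blocks the connection. It only tests TCP reachability, not MySQL authentication.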

Event Timeline

Dzahn renamed this task from Gerrit is failing to start on gerrit2001 to Gerrit is failing to start gerrit-ssh on gerrit2001.Sep 23 2017, 12:50 AM

But it's not writing to the logs, so that is a clue there's a bigger problem.

Dzahn added a comment.Sep 23 2017, 2:38 AM

and..after a while it dies again just by itself..

Dzahn added a comment.Sep 25 2017, 6:58 PM

We'll do T168562 and reinstall this box with stretch to, ideally, kill 2 birds with one stone: confirm whether this issue goes away and see whether there are any other stretch blockers.

I think we found the likely culprit; Chad found this: P6046. It's not connecting to the db because a firewall is blocking the connection.

Dzahn added a comment.Sep 26 2017, 7:39 PM

Here is a change to add firewall rules to mariadb::misc to fix that. https://gerrit.wikimedia.org/r/#/c/380827/

Paladox added a comment.EditedSep 26 2017, 9:27 PM

@Dzahn we are going to have to manually open the port per jynus on https://gerrit.wikimedia.org/r/#/c/380827/ for now

Change 379420 had a related patch set uploaded (by Dzahn; owner: Paladox):
[operations/puppet@production] Gerrit: Enable ui for slaves

https://gerrit.wikimedia.org/r/379420

Change 379420 merged by Dzahn:
[operations/puppet@production] Gerrit: Enable ui for slaves

https://gerrit.wikimedia.org/r/379420

Well, that change above has "--enable-sshd" as option when it is a slave, and it wasn't merged.. and we were wondering why gerrit-ssh didn't come up? :) :p merged that

Paladox added a comment.EditedSep 27 2017, 6:28 AM

Gerrit ssh won't start because it can't connect to the mysql db, since Chad couldn't get init to work.

Or did you open the port to allow init to run?

Dzahn added a comment.Sep 27 2017, 3:14 PM

No, I did not create a hole. I just think these are 2 unrelated issues. The gerrit service doesn't start because of the db, but when it does start it doesn't start gerrit-ssh because it didn't have --enable-sshd until now.

@Dzahn Ssh should start regardless of whether we specify that option or not, since it is there just for consistency. The problem is most likely that it is not connecting to the db, which would prevent it from either starting at all or starting ssh.

If gerrit can't start then ssh won't start.

Dzahn changed the task status from Open to Stalled.Sep 28 2017, 8:14 PM

stalled by firewall on DB

Adding DBA, as we need the firewall to allow gerrit2001 to connect to m2-master.codfw.wmnet. eqiad either has no firewall or at least allows gerrit (cobalt) to connect.

Paladox renamed this task from Gerrit is failing to start gerrit-ssh on gerrit2001 to Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working.Sep 28 2017, 8:22 PM
Paladox updated the task description. (Show Details)

Bump.

Hey Paladox

Check the last comment from Daniel here: https://gerrit.wikimedia.org/r/#/c/380827/

@Marostegui oh thanks. Is there a way we can fix this please? As it was working before but then we restarted and connecting to the db then failed.

"As it was working before"

It wasn't working before: there was a security hole and a lack of infrastructure on codfw. Proxies have to be bought and set up to have proper support. If there is an emergency and we have to use gerrit on codfw, we can do a quick exception, but while that doesn't happen, we should set things up properly, mirroring eqiad's setup in terms of HA, and enable TLS for potential cross-dc traffic, as that is the internal Wikimedia policy.

jcrespo moved this task from Triage to Backlog on the DBA board.Oct 4 2017, 2:34 PM
Dzahn added a comment.Oct 4 2017, 5:50 PM

Yea, this should just wait for the proper setup in codfw. I don't see an emergency here.

demon triaged this task as Low priority.Oct 4 2017, 8:22 PM
Paladox moved this task from Bugs & stuff to Local hacks on the Gerrit board.Oct 15 2017, 5:25 PM

Mentioned in SAL (#wikimedia-operations) [2017-12-03T03:57:02Z] <no_justification> gerrit2001: icinga is flapping on the gerrit process/systemd check, but this is kind of known (not sure why it's doing this all of a sudden). It's not letting me acknowledge it, but it's fine/harmless. Cf T176532

elukey added a subscriber: elukey.Mar 5 2018, 7:05 AM

Just added a week of downtime to gerrit2001 since icinga was spamming.

Joe added a subscriber: Joe.Mar 12 2018, 7:32 AM

Is anyone working on this issue? @Dzahn @jcrespo if neither of you is working on this or plans to work on it soon, I'd rather move gerrit2001 to use role::spare::system for now, since we're not really able to use it and it's just spamming icinga. What do you think?

The proxies for codfw have been budgeted but not yet ordered, so no, right now we are not working on this task.

Dzahn added a comment.Mar 12 2018, 5:06 PM

Let's not set it to role::spare::system please. That would mean actively going back and removing Gerrit as a warm stand-by. I would like to keep it and instead just fix it so that "if on in-active server then skip Icinga checks". I'll look into that.

Change 419080 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: skip gerrit process monitoring if on slave

https://gerrit.wikimedia.org/r/419080

Change 419080 merged by Dzahn:
[operations/puppet@production] gerrit: skip gerrit process monitoring if on slave

https://gerrit.wikimedia.org/r/419080

Dzahn raised the priority of this task from Low to Normal.Mar 12 2018, 11:52 PM

We still want this just as before. We were just asked to wait unless it's an emergency and we can't call it an emergency.

Change 419084 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] base/icinga: add Hiera override to skip systemd monitoring

https://gerrit.wikimedia.org/r/419084

Change 419086 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: skip systemd monitoring on gerrit2001

https://gerrit.wikimedia.org/r/419086

Change 419084 merged by Dzahn:
[operations/puppet@production] base/icinga: add Hiera override to skip systemd monitoring

https://gerrit.wikimedia.org/r/419084

Change 419086 merged by Dzahn:
[operations/puppet@production] gerrit: skip systemd monitoring on gerrit2001

https://gerrit.wikimedia.org/r/419086

  • added parameter to base monitoring class to allow disabling of systemd Icinga monitoring
  • applied to gerrit2001, works. It's removed from Icinga now.
  • removing hieradata/hosts/gerrit2001.yaml will re-enable it again, let's not forget that once this ticket is resolved
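
The host-level override described above can be pictured as a lookup that prefers per-host data and falls back to a common default. The following is only a rough sketch of that precedence idea, not Hiera's actual implementation; the key name `monitor_systemd` and the data dictionaries are hypothetical, since the real key is not quoted in this task.

```python
from typing import Any, Dict

# Hypothetical data: the key name "monitor_systemd" is an assumption for
# illustration; the real Hiera key is not quoted in this task.
COMMON_DEFAULTS: Dict[str, Any] = {"monitor_systemd": True}
HOST_OVERRIDES: Dict[str, Dict[str, Any]] = {
    # Stands in for hieradata/hosts/gerrit2001.yaml
    "gerrit2001": {"monitor_systemd": False},
}


def lookup(key: str, host: str,
           overrides: Dict[str, Dict[str, Any]] = HOST_OVERRIDES,
           defaults: Dict[str, Any] = COMMON_DEFAULTS) -> Any:
    """Return the host-specific value for key if present, else the default."""
    host_data = overrides.get(host, {})
    if key in host_data:
        return host_data[key]
    return defaults[key]


print(lookup("monitor_systemd", "gerrit2001"))  # False (host override wins)
print(lookup("monitor_systemd", "cobalt"))      # True (falls back to default)
```

Deleting the `gerrit2001` entry from the overrides makes the lookup fall back to the default again, which mirrors why removing the host file re-enables monitoring.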

With the migration to NoteDB, accounts and changes have been removed from the db and moved into a git-repo-based store.

Upstream have moved groups to notedb.

And now upstream is planning 3.0, with a blocker stating that the db must be dropped as a requirement for running gerrit.

You are saying we won't need any mysql/mariadb for Gerrit anymore?

Yep, but currently 2.x still requires a db; it's just that 2.15 does not read changes (or accounts) from the db, so we could do cross-dc queries?

Gerrit is now 100% NoteDB as of 2.16 (see https://twitter.com/GerritReview/status/1052922157712465922), though it still needs to connect to the DB. In 3.0 it will drop DB support, so this task will be fixed by upstream removing db support.

See https://www.gerritcodereview.com/3.0.html

In theory we could fix this with the upgrade to 2.16 (as nothing uses the db anymore but it's still required for connections)

So tagging the upgrade to 2.16 task even if we wait till 3.0 to do this.

ReviewDB has now been removed upstream.

Paladox raised the priority of this task from Normal to High.Mar 19 2019, 12:38 AM
Paladox closed this task as Resolved.Apr 4 2019, 1:24 PM

Closing this as resolved since this will be resolved with T200739

Dzahn reopened this task as Open.Apr 5 2019, 6:41 AM

let's only resolve stuff that is actually resolved, not what will be resolved in the future

Marostegui added a comment.EditedAug 1 2019, 2:03 PM

I have set up the proxy for m2 in codfw.
I know gerrit won't be using the database anymore in the future release, but I thought I would mention it here:

root@cumin1001:/home/marostegui# mysql --skip-ssl -hdbproxy2002.codfw.wmnet reviewdb -e "show tables"
+-----------------------------+
| Tables_in_reviewdb          |
+-----------------------------+
| account_external_ids        |
| account_group_by_id         |
| account_group_by_id_aud     |
| account_group_id            |
| account_group_members       |
| account_group_members_audit |
| account_group_names         |
| account_groups              |
| account_id                  |
| accounts                    |
| change_id                   |
| change_messages             |
| changes                     |
| patch_comments              |
| patch_set_approvals         |
| patch_sets                  |
| schema_version              |
| system_config               |
+-----------------------------+

Please note that this proxy points to the codfw DBs, which are obviously in read-only mode (as they are slaves of the active primary master in eqiad).

Change 527114 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Point m2-master.codfw to dbproxy2002

https://gerrit.wikimedia.org/r/527114

Change 527114 merged by Marostegui:
[operations/dns@master] wmnet: Point m2-master.codfw to dbproxy2002

https://gerrit.wikimedia.org/r/527114

Marostegui added a subscriber: MoritzMuehlenhoff.EditedAug 2 2019, 6:26 AM

Change 527114 merged by Marostegui:
[operations/dns@master] wmnet: Point m2-master.codfw to dbproxy2002
https://gerrit.wikimedia.org/r/527114

I have merged this, but it won't make gerrit2001 able to connect, as the firewall is allowing cobalt and 10.0.0.0/8 and gerrit2001 isn't in either of those.
So I guess modules/profile/manifests/mariadb/ferm_misc.pp needs to include gerrit2001 if we really want to have this working (subscribing Moritz).

Not sure if it is worth doing that if at some point gerrit won't need MySQL anymore: T176532#5085020
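
To illustrate why a host outside 10.0.0.0/8 is rejected by such a rule, here is a minimal sketch using Python's standard `ipaddress` module. The helper name `is_allowed` and both example addresses are hypothetical; they are not the real addresses of cobalt or gerrit2001, and this is not how ferm evaluates its rules.

```python
import ipaddress
from typing import Iterable


def is_allowed(addr: str, allowed_networks: Iterable[str]) -> bool:
    """Return True if addr falls inside any of the allowed CIDR networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in ipaddress.ip_network(net) for net in allowed_networks)


ALLOWED = ["10.0.0.0/8"]

# Hypothetical internal address: inside 10.0.0.0/8, so it matches.
print(is_allowed("10.64.32.10", ALLOWED))   # True
# Hypothetical address from a documentation range, standing in for a host
# that is not on the internal network: no match.
print(is_allowed("203.0.113.10", ALLOWED))  # False
```

A host in the second situation needs an explicit entry of its own in the firewall rules, which is what adding gerrit2001 to ferm_misc.pp would do.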

Dzahn claimed this task.Aug 2 2019, 3:30 PM

I'll add it. Thanks Manuel!

Change 527595 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] profile::mariadb::ferm_misc: Add gerrit2001.wikimedia.org to the firewall for port 3306

https://gerrit.wikimedia.org/r/527595

Change 527595 merged by Dzahn:
[operations/puppet@production] profile::mariadb::ferm_misc: Add gerrit2001.wikimedia.org to the firewall

https://gerrit.wikimedia.org/r/527595

Mentioned in SAL (#wikimedia-operations) [2019-08-02T19:24:56Z] <mutante> gerrit2001 - re-enabling puppet, starting as slave for the first time ever, thanks to codfw dbproxy, gerrit service running (T176532)

Change 527638 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: fix sshd listen address if on a slave

https://gerrit.wikimedia.org/r/527638

Change 527638 merged by Dzahn:
[operations/puppet@production] gerrit: fix sshd listen address if on a slave

https://gerrit.wikimedia.org/r/527638

Dzahn closed this task as Resolved.Aug 2 2019, 9:02 PM

gerrit, gerrit's httpd and gerrit's sshd are now all running and listening on the right address on gerrit2001.

it's up and running as a slave. thanks DBA:)