Page MenuHomePhabricator

Gerrit: Restarting gerrit could lead to data loss + maybe accounts
Closed, ResolvedPublic

Description

Steps to reproduce

  1. Create some changes in the gerrit ui or through your pc and git push it to gerrit.
  1. restart gerrit
  1. try accessing those changes from gerrit's ui

Actual results

  • You get data loss and currently there are no known workarounds to prevent this data loss.

Expected results

  • I expect that gerrit should restart without any data loss.

Reported here https://bugs.chromium.org/p/gerrit/issues/detail?id=5200

Note: See these threads https://groups.google.com/forum/#!topic/repo-discuss/o80gRbwziS8 and https://groups.google.com/forum/#!topic/repo-discuss/bUmnCnQ9A20

I managed to reproduce this on a test install.

Please do not lower the priority until someone has investigated this impact please. I have no idea what affects it has to accounts, but to open / merged patches it will remove them from display thus leading to data loss, could cause repo corruption as it will be missing the changes.

Also it seems some other large organisation has halted the update due to the problem since they doint want to reindex which will take 4+ hours every time they restart.

Event Timeline

Paladox created this task.Dec 27 2016, 10:28 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 27 2016, 10:28 PM
Paladox triaged this task as Unbreak Now! priority.EditedDec 27 2016, 10:30 PM

Changing priority to unbreak as someone needs to investigate it's impact on prod, as i managed to reproduce it in labs with a simple restart and i lost all my changes i created before the last restart.

I found this problem a week ago but as it was christmas not many people were around so i was waiting to ask under which tag i should report this, security or open.

Please do not lower the priority without accessing weather there is any impact on prod.

Restricted Application added subscribers: Jay8g, TerraCodes. · View Herald TranscriptDec 27 2016, 10:30 PM

This bug may have caused T153079 since that problem happened around the time after we upgraded to gerrit 2.13.

Paladox moved this task from Backlog to Reported Upstream on the Upstream board.Dec 27 2016, 11:23 PM

That thread says online reindex works until the next restart so that could be a workaround. But I doint know what other sides affects of doing the online reindex could cause.

But the online reindex dosent seem to work for some users

According to the thread the index for accounts is affected too. Though I have no idea if it affects ldap but it does store account details in All-Users so I guess this affects ldap too.

Paladox renamed this task from Gerrit: Restarting gerrit could lead to data loss to Gerrit: Restarting gerrit could lead to data loss + accounts.Dec 28 2016, 2:31 PM
Paladox updated the task description. (Show Details)Dec 28 2016, 5:45 PM
Paladox updated the task description. (Show Details)
Paladox updated the task description. (Show Details)Dec 28 2016, 5:48 PM
Paladox updated the task description. (Show Details)Dec 28 2016, 6:09 PM

Adding Security as it affects accounts, though no one can steal accounts as far as i can tell, account data could be lost.

Paladox added a comment.EditedDec 28 2016, 6:40 PM

This may be the reason why we had this T152640 problem as accounts are stored in the index now. All the fix did was prevent deletion of the external id, thus allowing users to login. But if we restart before gerrit has a chance to reindex the changes then we lose accounts + patches from the index. But because someone could upload or create an account at any time even just after it did an reindex the problem will still be there.

Note: Reason i haven't tried reproducing accounts yet is because the test install i have uses the basic gerrit login system, i.e. no passwords as it is a test install.

How often has Wikimedia Gerrit been restarted in the last, say, 3 months?

Adding Security as it affects accounts, though no one can steal accounts as far as i can tell, account data could be lost.

So what is the actual security problem? Did you manage to locally reproduce "account data" loss by restarting?

@Aklapper i haven't tried yet reproducing data loss for accounts, but i managed with patches without even trying as i was doing something else which then got my attention when i found after restarting the patches just would not load.

Also gerrit was restarted a few weeks ago for the upgrade.

Also gerrit has to be restarted for config changes when ever there are config changes

Peachey88 added a subscriber: Peachey88.

Do we actually have account_data that can be lost (and cause issues)? since all our account data is synced from a external source (LDAP) and we don't allow local accounts to be created.

Removing Security for the mean time as no actual security issues have been floated yet

Paladox added a comment.EditedDec 29 2016, 2:54 AM

@Peachey88 even though we use ldap, gerrit stores all accounts in the index. Which is what caused T152640 that problem in the first place. It stores users preferences + ssh key in a git repo.

Paladox added a comment.EditedDec 29 2016, 4:50 PM

This is the commit that broke it https://gerrit-review.googlesource.com/#/c/86804/

we can revert it and apply it locally just to fix the index when we restart gerrit.

User at https://bugs.chromium.org/p/gerrit/issues/detail?id=5200 (bottom) found the broken commit (User remy.boh...@gmail.com)

@demon ^^

I reverted that commit locally and it worked for me when i restarted gerrit.

it will require an index after we upgrade gerrit to include ^^ just so we doint lose any commits.

this https://gerrit-review.googlesource.com/#/c/86804/ also broke it where you merge a change, restart gerrit, look at dashboard or open changes and you see the patch is still open but clicking on it and you will receive an error. You may also get into the patch and it's displayed as merged.

Upstream have figured out a fix that will fix both online reindex and clean shutdown (@Luca fixed it :)) See https://gerrit-review.googlesource.com/#/c/93479/

I'm going to test it now.

Paladox renamed this task from Gerrit: Restarting gerrit could lead to data loss + accounts to Gerrit: Restarting gerrit could lead to data loss + maybe accounts .Dec 30 2016, 4:23 PM

I have now tested https://gerrit-review.googlesource.com/#/c/93479/ and it works.

@demon could we cherry-pick this https://gerrit-review.googlesource.com/#/c/93479/ and do the reindex please otherwise we will find changes missing? We will only need to do the reindex once more as this https://gerrit-review.googlesource.com/#/c/93479/ should fix clean shutdown.

This deftly looks like it caused T153079 as he managed to force merge without an object error but gerrit was suggesting there was a object missing, if that was true then he would have got the same error when force merging.

Paladox added a comment.EditedDec 31 2016, 8:47 PM

Actually T153079 had nothing to do with the index but was the submodule that was the problem.

But other users noticed that when we did the cherry pick of those commits and then restarted, they noticed that merged changes weren't being showed as being merged on the dashboard but in reality they were mered.

Anyways upstream merged the change, and should now be fixed in gerrit 2.13.5. We should just pull in the upstreams stable-2.13 branch.

Paladox moved this task from Bugs & stuff to Maybe fixed? on the Gerrit board.Jan 2 2017, 2:55 PM
Paladox moved this task from Reported Upstream to Patch merged upstream on the Upstream board.

Apparently the way they fixed it was the wrong way, but they are keeping it as is on the stable branch as it works.

They will try a different fix for the master branch.

Per luca at https://gerrit-review.googlesource.com/#/c/93479/ but we should still backport the fix as it actually works and prevents data loss at shutdown.

Change 330255 had a related patch set uploaded (by Chad):
gerrit (2.13.4-wmf.1) jessie-wikimedia; urgency=low

https://gerrit.wikimedia.org/r/330255

Apparently the way they fixed it was the wrong way, but they are keeping it as is on the stable branch as it works.
They will try a different fix for the master branch.

I may sound like a broken record, but please provide links / proof for any such statements. In general. Thanks.

Change 330255 merged by Filippo Giunchedi:
gerrit (2.13.4-wmf.1) jessie-wikimedia; urgency=low

https://gerrit.wikimedia.org/r/330255

demon closed this task as Resolved.Jan 6 2017, 4:12 AM
demon claimed this task.

This rolled out today