
gerrit-reviewer-bot got stuck due to non-existing account
Closed, Resolved (Public)

Description

Git/Reviewers had a non-existing account for the "labs/tools/wikibugs2" project. Apparently, this caused gerrit-reviewer-bot to get stuck, and every time it tried to process new changes it would fail with an exception, as could be seen in https://gerrit-reviewer-bot.toolforge.org/ (though I guess it may no longer be in the last 50 times by the time someone looks at it). This is a problem because if someone adds the wrong username to that list (which can totally happen), the bot will just get stuck and not add any reviewers to any patches.

On top of that, when I realized the issue and removed the username, the bot started processing the backlog. If the "1878 e-mails to process" line is accurate, that means a lot of patches getting reviewers added retroactively, potentially flooding a lot of inboxes. All this for just a non-existing username. This should be handled more gracefully.

Event Timeline

And speaking of non-existing accounts, it looks like the bot should also trim the usernames, because it just failed on another one which had spaces around it (diff).
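A minimal sketch of what trimming could look like, assuming the bot ends up with a plain list of reviewer names parsed from the wiki page. The function name `normalize_reviewers` is hypothetical and not taken from the bot's source:

```python
def normalize_reviewers(raw_names):
    """Strip surrounding whitespace, drop empty entries, and de-duplicate
    while keeping the original order."""
    seen = set()
    cleaned = []
    for raw in raw_names:
        name = raw.strip()
        # Skip entries that were only whitespace, and names already seen.
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned
```

With this, an entry such as `' Xcollazo '` would be passed to Gerrit as `Xcollazo`. Inner spaces are left alone, since some names legitimately contain them.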

it would fail with an exception, as could be seen in https://gerrit-reviewer-bot.toolforge.org/ (though I guess it may no longer be in the last 50 times by the time someone looks at it).

Great finding Daimona. As a suggestion for next time: copy-paste the error to the task so we get a permanent record attached to it. Then I guess the logs are available to the tool.

In the grand scheme of things we should phase out the bot in favor of the Gerrit reviewers plugin https://gerrit.wikimedia.org/r/plugins/reviewers/Documentation/index.html, which probably deserves its own task.

Typical error log entry:

{'id': 'operations%2Fpuppet~production~I5017ee944346ac8bb797b4ef862156418f6a3eb1', 'project': 'operations/puppet', 'branch': 'production', 'attention_set': {'9922': {'account': {'_account_id': 9922, 'name': 'Hokwelum', 'email': 'hokwelum@wikimedia.org', 'username': 'hokwelum'}, 'last_update': '2023-07-17 22:51:15.000000000', 'reason': 'Reviewer was added'}, '10626': {'account': {'_account_id': 10626, 'name': 'Jennifer Ebe', 'email': 'jebe@wikimedia.org', 'username': 'jebe'}, 'last_update': '2023-07-17 22:51:15.000000000', 'reason': 'Reviewer was added'}, '9379': {'account': {'_account_id': 9379, 'name': 'Btullis', 'email': 'btullis@wikimedia.org', 'username': 'btullis'}, 'last_update': '2023-07-17 17:52:43.000000000', 'reason': '<GERRIT_ACCOUNT_6> replied on the change', 'reason_account': {'_account_id': 6, 'name': 'ArielGlenn', 'email': 'ariel@wikimedia.org', 'username': 'ariel'}}}, 'hashtags': [], 'change_id': 'I5017ee944346ac8bb797b4ef862156418f6a3eb1', 'subject': 'make sure certain systemd jobs run only on the primary xml dumps NFS shares', 'status': 'NEW', 'created': '2023-07-17 10:11:43.000000000', 'updated': '2023-07-17 22:51:15.000000000', 'submit_type': 'REBASE_IF_NECESSARY', 'mergeable': True, 'insertions': 35, 'deletions': 3, 'total_comment_count': 3, 'unresolved_comment_count': 1, 'has_review_started': True, 'meta_rev_id': 'd740927c470fc97cec7a9889f2fb9540bbd882c5', '_number': 938816, 'owner': {'_account_id': 6, 'name': 'ArielGlenn', 'email': 'ariel@wikimedia.org', 'username': 'ariel'}, 'current_revision': 'a6c30abf227160608c960199bb99b273bc6ff146', 'revisions': {'a6c30abf227160608c960199bb99b273bc6ff146': {'kind': 'REWORK', '_number': 6, 'created': '2023-07-17 18:30:11.000000000', 'uploader': {'_account_id': 6, 'name': 'ArielGlenn', 'email': 'ariel@wikimedia.org', 'username': 'ariel'}, 'ref': 'refs/changes/16/938816/6', 'fetch': {'anonymous http': {'url': 'https://gerrit.wikimedia.org/r/operations/puppet', 'ref': 'refs/changes/16/938816/6'}}, 'files': 
{'modules/role/manifests/dumps/generation/server/xmlfallback.pp': {'lines_inserted': 2, 'size_delta': 129, 'size': 791}, 'modules/profile/manifests/dumps/generation/server/exceptionchecker.pp': {'lines_inserted': 2, 'size_delta': 111, 'size': 434}, 'modules/role/manifests/dumps/generation/server/misccrons.pp': {'lines_inserted': 2, 'size_delta': 129, 'size': 669}, 'hieradata/common/profile/dumps/generation/server.yaml': {'status': 'A', 'lines_inserted': 8, 'size_delta': 314, 'size': 314}, 'hieradata/common/profile/dumps/generation/server/xmldumps.yaml': {'status': 'A', 'lines_inserted': 4, 'size_delta': 220, 'size': 220}, 'hieradata/common/profile/dumps.yaml': {'lines_inserted': 2, 'size_delta': 102, 'size': 1761}, 'hieradata/hosts/dumpsdata1006.yaml': {'lines_deleted': 1, 'size_delta': -65, 'size': 591}, 'modules/dumps/manifests/generation/server/exceptionchecker.pp': {'lines_inserted': 3, 'lines_deleted': 1, 'size_delta': 67, 'size': 1160}, 'modules/role/manifests/dumps/generation/server/spare.pp': {'lines_inserted': 3, 'size_delta': 167, 'size': 684}, 'hieradata/common/profile/dumps/generation/server/alldumps.yaml': {'status': 'A', 'lines_inserted': 4, 'size_delta': 220, 'size': 220}, 'modules/profile/manifests/dumps/generation/server/jobswatcher.pp': {'lines_inserted': 2, 'size_delta': 106, 'size': 534}, 'modules/dumps/manifests/generation/server/jobswatcher.pp': {'lines_inserted': 3, 'lines_deleted': 1, 'size_delta': 67, 'size': 1022}}}}, 'requirements': [], 'submit_records': [{'rule_name': 'gerrit~DefaultSubmitRule', 'status': 'NOT_READY', 'labels': [{'label': 'Verified', 'status': 'OK', 'applied_by': {'_account_id': 75, 'name': 'jenkins-bot', 'username': 'jenkins-bot', 'tags': ['SERVICE_USER']}}, {'label': 'Code-Review', 'status': 'NEED'}]}]}
 caused exception:Traceback (most recent call last):
  File "/data/project/gerrit-reviewer-bot/src/gerrit-reviewer-bot/pop3bot.py", line 134, in <module>
    main()
  File "/data/project/gerrit-reviewer-bot/src/gerrit-reviewer-bot/pop3bot.py", line 122, in main
    add_reviewers(changeset['id'], reviewers)
  File "/data/project/gerrit-reviewer-bot/src/gerrit-reviewer-bot/add_reviewer.py", line 150, in add_reviewers
    raise Exception(command + ' was not executed successfully (code %i)' % retval)
Exception: gerrit set-reviewers --add ' Xcollazo ' --add jebe --add Hokwelum 'operations%2Fpuppet~production~I5017ee944346ac8bb797b4ef862156418f6a3eb1' was not executed successfully (code 1)

Interestingly, the original issue was triggered by a username that _does_ exist (namely, my test account). Not sure if some Gerrit backend database was reset, or something along those lines.

In terms of whether something should change: These kinds of issues occur, albeit so rarely (maybe once a year over the last 10 years) that I can't really be bothered to find a solution (how do you know whether a username is valid? "Merlijn van Deen - alternative" exists in LDAP so I really don't know why Gerrit doesn't consider it valid).

It certainly could be changed. If you're bothered enough by it to find a solution (one that isn't 'just ignore all errors from the Gerrit backend' -- I very much prefer the bot to hang and queue than to irreparably skip adding reviewers) then I'd be happy to merge it.

In the grand scheme of things... this will all be Gitlab, and we'll live happily ever after. See T289712.

I can see the ssh commands on Gerrit side:

sshd_log.2023-07-17.gz:
[2023-07-17T22:40:21.473Z] 2f327894
[SSH gerrit set-reviewers --add Merlijn van Deen --add Merlijn van Deen - alternative labs%2Ftools%2Fwikibugs2~master~I4a5e9e60d87b4de0b52e3313dbb417f470168080 (reviewer-bot)]
reviewer-bot a/578 gerrit.set-reviewers.--add.Merlijn van Deen.--add.Merlijn van Deen - alternative.labs%2Ftools%2Fwikibugs2~master~I4a5e9e60d87b4de0b52e3313dbb417f470168080 4ms 37ms - 1 - 35ms 30ms 9435352

If I try it locally:

$ ssh -p 29418 gerrit.wikimedia.org gerrit set-reviewers I4a5e9e60d87b4de0b52e3313dbb417f470168080 --add 'Merlijn van Deen - alternative'
fatal: "van" no such change

With double quoting (since ssh passes the remote command as a single string, which the server side then splits into arguments again):

$ ssh -p 29418 hashar@gerrit.wikimedia.org gerrit set-reviewers I4a5e9e60d87b4de0b52e3313dbb417f470168080 --add "'Merlijn van Deen - alternative'"
error: Account 'Merlijn van Deen - alternative' not found
Merlijn van Deen - alternative does not identify a registered user or group
fatal: one or more updates failed; review output above

That is given to standard error which I guess the bot could capture and log?
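A rough sketch of how the bot could capture that output, assuming it shells out to ssh through `subprocess`. All names here (`SetReviewersError`, `build_remote_command`, `unknown_accounts`, `set_reviewers`, the `ssh_prefix` parameter) are hypothetical and not the bot's actual code. The marker string is copied from the output above and may differ between Gerrit versions:

```python
import shlex
import subprocess

# Wording copied from the Gerrit output quoted in this task; treat the
# exact text as an assumption that may vary between Gerrit versions.
UNKNOWN_ACCOUNT_MARKER = "does not identify a registered user or group"


class SetReviewersError(Exception):
    """Raised when the remote command exits non-zero; carries its stderr."""

    def __init__(self, command, returncode, stderr):
        super().__init__(
            "%s was not executed successfully (code %i): %s"
            % (command, returncode, stderr.strip())
        )
        self.returncode = returncode
        self.stderr = stderr


def build_remote_command(change_id, reviewers):
    """Quote each argument so a name containing spaces stays one argument."""
    parts = ["gerrit", "set-reviewers"]
    for name in reviewers:
        parts += ["--add", shlex.quote(name)]
    parts.append(shlex.quote(change_id))
    return " ".join(parts)


def unknown_accounts(stderr):
    """Return the names that Gerrit reported as unknown, one per matching line."""
    names = []
    for line in stderr.splitlines():
        if line.endswith(UNKNOWN_ACCOUNT_MARKER):
            names.append(line[: -len(UNKNOWN_ACCOUNT_MARKER)].strip())
    return names


def set_reviewers(change_id, reviewers, ssh_prefix, run=subprocess.run):
    """Run the command over ssh and keep stderr instead of discarding it."""
    command = build_remote_command(change_id, reviewers)
    result = run(ssh_prefix + [command], capture_output=True, text=True)
    if result.returncode != 0:
        raise SetReviewersError(command, result.returncode, result.stderr)
    return result
```

Here `ssh_prefix` would be something like `["ssh", "-p", "29418", "gerrit.wikimedia.org"]`. The exception message then contains the "not found" line, and `unknown_accounts(err.stderr)` tells an unknown-account failure apart from other backend errors, so the bot could log the offending name while still queueing on anything else.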

I could not find an entry for Merlijn van Deen - alternative in the Gerrit user database. Which LDAP account is it associated with? If that alternate account reuses the same email address as another account (which is permitted by Wikitech), Gerrit refuses to create the user in its database, since it requires emails to be unique. That might be an explanation.

Merlijn van Deen - alternative (valhallasw-alt), from LDAP:

uid: 14791
shell name: valhallasw-alt
wikitech user name: Merlijn van Deen - alternative
phabricator user:
member of groups: project-bastion, project-tools

Indeed I think the issue is that the email address is the same. I'm not entirely sure why it worked before, but indeed no big deal if it doesn't anymore.

That explains it. I double checked and there is no such user in Gerrit for sure.

Gerrit has an sshd log that records the username and commands. Grepping it for "- alternative":

$ zgrep -c ' \- alternative' sshd_log.20*.gz
sshd_log.2023-06-18.gz:0
sshd_log.2023-06-19.gz:64
sshd_log.2023-06-20.gz:0
sshd_log.2023-06-21.gz:0
sshd_log.2023-06-22.gz:0
sshd_log.2023-06-23.gz:0
sshd_log.2023-06-24.gz:0
sshd_log.2023-06-25.gz:0
sshd_log.2023-06-26.gz:0
sshd_log.2023-06-27.gz:0
sshd_log.2023-06-28.gz:0
sshd_log.2023-06-29.gz:0
sshd_log.2023-06-30.gz:584
sshd_log.2023-07-01.gz:556
sshd_log.2023-07-02.gz:0
sshd_log.2023-07-03.gz:0
sshd_log.2023-07-04.gz:0
sshd_log.2023-07-05.gz:4
sshd_log.2023-07-06.gz:0
sshd_log.2023-07-07.gz:0
sshd_log.2023-07-08.gz:0
sshd_log.2023-07-09.gz:0
sshd_log.2023-07-10.gz:0
sshd_log.2023-07-11.gz:0
sshd_log.2023-07-12.gz:80
sshd_log.2023-07-13.gz:1148
sshd_log.2023-07-14.gz:1144
sshd_log.2023-07-15.gz:1152
sshd_log.2023-07-16.gz:1148
sshd_log.2023-07-17.gz:1088
sshd_log.2023-07-18.gz:0
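
For anyone without `zgrep` at hand, the same per-file count can be sketched in Python. This assumes the logs are gzip-compressed text files named as above. The function name `count_matching_lines` is hypothetical:

```python
import gzip
from pathlib import Path


def count_matching_lines(log_dir, needle, pattern="sshd_log.20*.gz"):
    """Count, per gzipped log file, the lines containing `needle`
    (the same per-file line count that `zgrep -c` prints)."""
    counts = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        # Decode leniently: access logs can contain stray bytes.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            counts[path.name] = sum(1 for line in fh if needle in line)
    return counts
```

Like `zgrep -c`, this counts matching lines rather than occurrences, so a log entry that mentions the name twice on one line counts once.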

https://gerrit.wikimedia.org/r/c/labs/tools/wikibugs2/+/935728 was uploaded on July 5th and the bot added you there. And here are the four ssh commands:

[2023-07-05T12:15:11.623Z] 91b0108c [SSH gerrit set-reviewers --add Merlijn van Deen --add Merlijn van Deen - alternative labs%2Ftools%2Fwikibugs2~master~Ief9e50031137849a12be2c95816c7996a6a9fe9e (reviewer-bot)] reviewer-bot a/578 gerrit.set-reviewers.--add.Merlijn van Deen.--add.Merlijn van Deen - alternative.labs%2Ftools%2Fwikibugs2~master~Ief9e50031137849a12be2c95816c7996a6a9fe9e 4ms 139ms - 1 - 57ms 60ms 14708240
[2023-07-05T12:15:11.773Z] 9114906b [SSH gerrit set-reviewers --add Merlijn van Deen --add Merlijn van Deen - alternative labs%2Ftools%2Fwikibugs2~master~Ief9e50031137849a12be2c95816c7996a6a9fe9e (reviewer-bot)] reviewer-bot a/578 gerrit.set-reviewers.--add.Merlijn van Deen.--add.Merlijn van Deen - alternative.labs%2Ftools%2Fwikibugs2~master~Ief9e50031137849a12be2c95816c7996a6a9fe9e 3ms 32ms - 1 - 30ms 30ms 9406656
[2023-07-05T12:15:19.333Z] c36d5e67 [SSH gerrit set-reviewers --add Merlijn van Deen --add Merlijn van Deen - alternative labs%2Ftools%2Fwikibugs2~master~Ief9e50031137849a12be2c95816c7996a6a9fe9e (reviewer-bot)] reviewer-bot a/578 gerrit.set-reviewers.--add.Merlijn van Deen.--add.Merlijn van Deen - alternative.labs%2Ftools%2Fwikibugs2~master~Ief9e50031137849a12be2c95816c7996a6a9fe9e 4ms 39ms - 1 - 37ms 40ms 9440920
[2023-07-05T12:15:19.503Z] 83b2062f [SSH gerrit set-reviewers --add Merlijn van Deen --add Merlijn van Deen - alternative labs%2Ftools%2Fwikibugs2~master~Ief9e50031137849a12be2c95816c7996a6a9fe9e (reviewer-bot)] reviewer-bot a/578 gerrit.set-reviewers.--add.Merlijn van Deen.--add.Merlijn van Deen - alternative.labs%2Ftools%2Fwikibugs2~master~Ief9e50031137849a12be2c95816c7996a6a9fe9e 2ms 39ms - 1 - 37ms 30ms 9440656

That one at least did not go in a loop.

On July 12th, I see the same pattern of four ssh connections in a row, repeating every five minutes. The first one was at 2023-07-12T22:20:15 UTC, and the requests were for https://gerrit.wikimedia.org/r/c/labs/tools/wikibugs2/+/937538.

That is all I got :/

The long-term fix would be to phase out the bot and migrate the logic to the reviewers plugin ( https://gerrit.wikimedia.org/r/plugins/reviewers/Documentation/config.md ). I am not sure whether the feature sets match, though. If they do, the migration might not be that complicated:

  • Parse the Wiki page
  • Generate reviewers.config files to be uploaded to the projects refs/meta/config
  • Write end-user documentation explaining how to edit it (it is in the repo settings under Commands > Reviewers Config (example for mediawiki/core), but that requires being an owner of the repo, and there is no GUI to propose a modification).
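
The "generate reviewers.config" step could be sketched roughly as below. This assumes the plugin's config uses git-config style `[filter "<query>"]` sections with `reviewer = <account>` lines, which is my understanding of the documentation linked above and should be checked against it. The input shape (a list of query/reviewers pairs already extracted from the wiki page) and the name `render_reviewers_config` are hypothetical:

```python
def render_reviewers_config(filters):
    """Render `filters`, a list of (query, reviewers) pairs, as git-config text.

    Assumed layout: one [filter "<query>"] section per pair, with one
    `reviewer = <account>` line per reviewer.
    """
    sections = []
    for query, reviewers in filters:
        # Backslashes and double quotes must be escaped in git-config
        # subsection names.
        escaped = query.replace("\\", "\\\\").replace('"', '\\"')
        lines = ['[filter "%s"]' % escaped]
        lines += ["\treviewer = %s" % name for name in reviewers]
        sections.append("\n".join(lines))
    return "\n\n".join(sections) + "\n"
```

For example, `render_reviewers_config([("*", ["valhallasw"])])` yields a single `[filter "*"]` section with one reviewer line. The sketch does not verify that each name resolves to a Gerrit account, so the failure mode from this task would have to be handled separately.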

I guess we can mark this solved for now since the root cause got found?

hashar assigned this task to valhallasw.

Marking that one closed since the source of the issue has been found (wrong entry) and worked around (remove the wrong entry).

The long-term fix would be to convert the Git/Reviewers entries to Gerrit reviewers.config and let the bot retire peacefully :]