scap in job beta-scap-eqiad has random failures: eval.php cannot find group 50062
Closed, Duplicate · Public

Description

The Jenkins job beta-scap-eqiad randomly fails with the following error:

Call to mwscript eval.php stderr: groups: cannot find name for group ID 50062:

13:10:38 Checking for new runtime errors locally
13:10:42 Scap failed!: Call to mwscript eval.php stderr: not empty
13:10:42 Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 342, in run
    exit_status = app.main(app.extra_arguments)
  File "/usr/lib/python2.7/dist-packages/scap/main.py", line 603, in main
    return super(Scap, self).main(*extra_args)
  File "/usr/lib/python2.7/dist-packages/scap/main.py", line 80, in main
    self._check_fatals()
  File "/usr/lib/python2.7/dist-packages/scap/main.py", line 214, in _check_fatals
    raise RuntimeError(errmsg.format("stderr", errout))
RuntimeError: Scap failed!: Call to mwscript eval.php stderr: groups: cannot find name for group ID 50062
13:10:42 scap failed: RuntimeError Scap failed!: Call to mwscript eval.php stderr: groups: cannot find name for group ID 50062 (duration: 00m 03s)

GID 50062 on WMCS is group project-bastion.
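
To check from an affected host whether a GID such as 50062 resolves, and whether the failure is transient, a small retry probe might help. This is only a sketch: `resolve_gid` and its parameters are hypothetical names and not part of scap or mwscript. It relies on Python's standard `grp` module, which raises `KeyError` when the name service cannot map the GID.

```python
import grp
import time


def resolve_gid(gid, lookup=grp.getgrgid, attempts=3, delay=1.0):
    """Return the group name for gid, or None if every attempt fails.

    `lookup` is injectable (hypothetical parameter) so the retry logic
    can be exercised without a real LDAP/NSS backend.
    """
    for attempt in range(attempts):
        try:
            return lookup(gid).gr_name
        except KeyError:
            # grp.getgrgid raises KeyError when the GID has no entry,
            # which is also what a failed LDAP lookup looks like here.
            if attempt + 1 < attempts:
                time.sleep(delay)
    return None
```

Calling `resolve_gid(50062)` in a loop on the deployment host would show whether the lookup fails only intermittently (pointing at LDAP availability) or consistently (pointing at a missing group).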

See also: T217587: Class 'Memcached' not found for php7 in beta

Event Timeline

hashar created this task. Mar 4 2019, 9:28 AM
Restricted Application added a subscriber: Aklapper. Mar 4 2019, 9:28 AM
hashar updated the task description. Mar 4 2019, 9:29 AM
hashar renamed this task from scap in job beta-scap-eqiad has random failures (eval.php cannot find group 50062 or returns None) to scap in job beta-scap-eqiad has random failures: eval.php cannot find group 50062. Mar 4 2019, 6:09 PM
hashar updated the task description.
hashar added a comment. Mar 4 2019, 6:13 PM

I suspected an LDAP failure, but I could not find any evidence of one in /var/log on deployment-deploy01 :-(

hashar added a comment. Mar 4 2019, 6:23 PM

Ditto on https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/32180/console

Traceback (most recent call last):
  File "/usr/local/bin/wmf-beta-update-databases.py", line 101, in <module>
    sys.exit(main())
  File "/usr/local/bin/wmf-beta-update-databases.py", line 97, in main
    run_updates(dblist, args.batch)
  File "/usr/local/bin/wmf-beta-update-databases.py", line 53, in run_updates
    do_wait(procs)
  File "/usr/local/bin/wmf-beta-update-databases.py", line 33, in do_wait
    raise Exception("command: ", cmd, "output: ", f.read())
Exception: ('command: ', "echo 'jawiki'; /usr/local/bin/mwscript update.php --wiki=jawiki --quick", 'output: ', 'jawiki\ngroups: cannot find name for group ID 50120\n\nWe trust you have received the usual lecture from the local System\nAdministrator. It usually boils down to these three things:\n\n    #1) Respect the privacy of others.\n    #2) Think before you type.\n    #3) With great power comes great responsibility.\n\nsudo: no tty present and no askpass program specified\n')

I.e., mwscript update.php dies from a group lookup failure: groups: cannot find name for group ID 50120.

It seems both LDAP servers had trouble reporting metrics to Grafana. They also show higher than usual CPU usage, load, and process counts.

cloud-services-team: are you aware of any ongoing issue with the WMCS LDAP servers? We seem to have random lookup failures for groups. On at least a couple of occasions I was also refused SSH access (though I forgot to dig into the logs to look up the error).

There are a few errors on deployment-deploy01 that might be related:

Mar  4 06:52:01 deployment-deploy01 nslcd[565]: [d8b661] <group/member="prometheus"> ldap_result() failed: Can't contact LDAP server
Mar  4 06:55:18 deployment-deploy01 nslcd[565]: [3abb1a] <group/member="puppet"> ldap_result() failed: Can't contact LDAP server
Mar  4 07:54:01 deployment-deploy01 nslcd[565]: [049a49] <group/member="helm"> ldap_result() failed: Can't contact LDAP server
Mar  4 07:54:16 deployment-deploy01 nslcd[565]: [95abb6] <group/member="root"> ldap_result() failed: Can't contact LDAP server
Mar  4 07:55:32 deployment-deploy01 nslcd[565]: [722a39] <group=50120> ldap_result() failed: Can't contact LDAP server
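
To compare the timing of these nslcd errors with the Jenkins job failures, the log lines can be tallied per hour. The following is a rough sketch that assumes the syslog line layout shown above; `NSLCD_FAILURE`, `ldap_failures`, and `failures_per_hour` are hypothetical names introduced for illustration.

```python
import re
from collections import Counter

# Assumed layout, taken from the lines quoted above:
# <Mon> <day> <HH:MM:SS> <host> nslcd[<pid>]: [<id>] <query> <call>() failed: <reason>
NSLCD_FAILURE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\snslcd\[\d+\]:\s"
    r"\[\w+\]\s<(?P<query>[^>]+)>\s(?P<call>\w+)\(\) failed: (?P<reason>.+)$"
)


def ldap_failures(lines):
    """Yield (timestamp, query, reason) for each nslcd failure line."""
    for line in lines:
        match = NSLCD_FAILURE.match(line.strip())
        if match:
            yield match.group("when"), match.group("query"), match.group("reason")


def failures_per_hour(lines):
    """Count failures per hour by dropping ':MM:SS' from the timestamp."""
    return Counter(when[:-6] for when, _query, _reason in ldap_failures(lines))
```

Feeding the host's syslog through `failures_per_hour` gives hourly buckets that can be lined up against the timestamps of failed beta-scap-eqiad builds.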

Seems likely to be the same root cause as T217280: LDAP server running out of memory frequently and disrupting Cloud VPS clients. Should we merge these together?

Yes, that looks like the same issue, and the timing matches the beta-scap-eqiad job failures I have seen.