[Cloud VPS alert][wikisource] Puppet failure on kraken-ocr.wikisource.eqiad1.wikimedia.cloud
Closed, Resolved (Public)

Description

Puppet is having issues on the "kraken-ocr.wikisource.eqiad1.wikimedia.cloud (172.16.5.149)" instance in project
wikisource in Wikimedia Cloud VPS.

Puppet is running with failures.

Working Puppet runs are needed to maintain instance security and logins.
As long as Puppet continues to fail, this system is in danger of becoming
unreachable.

You are receiving this email because you are listed as a member of the
project that contains this instance.  Please take steps to repair
this instance or contact a Cloud VPS admin for assistance.

If your host is expected to fail Puppet runs and you want to disable this
alert, you can create a file at /.no-puppet-checks, which will skip the checks.

You might find some help here:
    https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Cloud_VPS_alert_Puppet_failure_on

For further support, visit #wikimedia-cloud on libera.chat or
<https://wikitech.wikimedia.org>

Some extra info follows:
---- Last run summary:
application:
  converged_environment: production
  initial_environment: production
  run_mode: agent
changes:
  total: 0
events:
  failure: 1
  success: 0
  total: 1
resources:
  changed: 0
  corrective_change: 1
  failed: 1
  failed_to_restart: 0
  out_of_sync: 1
  restarted: 0
  scheduled: 0
  skipped: 0
  total: 535
time:
  augeas: 0.004748811
  catalog_application: 6.164173318000394
  concat_file: 0.002344187
  concat_fragment: 0.0017932980000000002
  config_retrieval: 4.284946191997733
  convert_catalog: 0.3030783879949013
  exec: 0.09367868200000001
  fact_generation: 2.2667200500000035
  file: 1.9587983389999994
  file_line: 0.001505548
  filebucket: 0.000159436
  group: 0.000449444
  last_run: 1777450486
  package: 0.5152029549999999
  plugin_sync: 0.6136276790057309
  schedule: 0.0011646690000000001
  service: 1.8913452989999997
  startup_time: 1.235490566
  tidy: 0.000100415
  total: 15.524327943
  transaction_evaluation: 6.103091213997686
  user: 0.00065793
version:
  config: '(6931773bd6) Elukey - restbase: migrate envoy TLS proxy services to new
    intermediate'
  puppet: 7.23.0


---- Failed resources if any:

  * Service[sssd]

--- Last run log:

ERR: Systemd start for sssd failed!
journalctl log for sssd:
Apr 29 08:14:44 kraken-ocr systemd[1]: Starting sssd.service - System Security Services Daemon...
Apr 29 08:14:44 kraken-ocr sssd[62408]: SSSD couldn't load the configuration database [1432158322]: File ownership and permissions check failed
Apr 29 08:14:44 kraken-ocr systemd[1]: sssd.service: Main process exited, code=exited, status=4/NOPERMISSION
Apr 29 08:14:44 kraken-ocr systemd[1]: sssd.service: Failed with result 'exit-code'.
Apr 29 08:14:44 kraken-ocr systemd[1]: Failed to start sssd.service - System Security Services Daemon.

ERR: change from 'stopped' to 'running' failed: Systemd start for sssd failed!
journalctl log for sssd:
Apr 29 08:14:44 kraken-ocr systemd[1]: Starting sssd.service - System Security Services Daemon...
Apr 29 08:14:44 kraken-ocr sssd[62408]: SSSD couldn't load the configuration database [1432158322]: File ownership and permissions check failed
Apr 29 08:14:44 kraken-ocr systemd[1]: sssd.service: Main process exited, code=exited, status=4/NOPERMISSION
Apr 29 08:14:44 kraken-ocr systemd[1]: sssd.service: Failed with result 'exit-code'.
Apr 29 08:14:44 kraken-ocr systemd[1]: Failed to start sssd.service - System Security Services Daemon.
 (corrective)
NOTICE: Applied catalog in 6.16 seconds

---- Exceptions that happened when running the script if any:
  No exceptions happened.
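
For anyone triaging alerts like this one, the "Last run summary" above can be checked mechanically. The following is only a minimal sketch: it assumes the summary keeps the simple two-level `section:` / `  key: value` layout shown in this alert, and the names `parse_summary` and `run_failed` are hypothetical helpers, not part of Puppet or the alerting script. A real tool would more likely use a proper YAML parser.

```python
def parse_summary(text):
    """Parse the two-level 'section:' / '  key: value' layout into nested dicts.

    Assumption: every section header starts in column 0 and every entry is
    indented; continuation lines without a colon are ignored.
    """
    summary = {}
    section = None
    for line in text.splitlines():
        if not line.strip() or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if not line.startswith(" "):
            # A top-level line such as 'events:' opens a new section.
            section = summary.setdefault(key.strip(), {})
        elif section is not None:
            # Values are kept as strings; callers convert as needed.
            section[key.strip()] = value.strip()
    return summary


def run_failed(summary):
    """Return True if the summary reports failed events or failed resources."""
    failed_events = int(summary.get("events", {}).get("failure", 0))
    failed_resources = int(summary.get("resources", {}).get("failed", 0))
    return failed_events > 0 or failed_resources > 0


# Excerpt of the summary from this alert, used as sample input.
SAMPLE = """\
events:
  failure: 1
  success: 0
  total: 1
resources:
  changed: 0
  failed: 1
  total: 535
"""

if __name__ == "__main__":
    print(run_failed(parse_summary(SAMPLE)))
```

With the excerpt from this alert, both `events.failure` and `resources.failed` are 1, so the run is classified as failed.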

Event Timeline

I'm afraid that I caused this issue with an upgrade to Debian trixie. Maybe Puppet was uninstalled accidentally. I cannot fix it because the VM no longer accepts my SSH keys, and I have no other access, such as a VNC console.

I also can't log in. I guess SSH keys aren't being copied onto the server.

Should we recreate it with debian-13.0-trixie, given that it's currently on debian-12.0-bookworm (deprecated 2023-11-27)?

A recreate with Debian trixie would be fine for me. Maybe this is easier than fixing the current installation.

taavi claimed this task.
taavi subscribed.

The machine was in the middle of a dist-upgrade, with some packages upgraded and others still on the old release, which caused sssd (the authentication daemon) to fail. Upgrading all of its components made it work again. We still recommend re-creating instances rather than upgrading them in place: it lets you check that the instances are still re-creatable, it ensures all the metadata is correct (e.g. this instance will still show up as bookworm when we start nagging for its removal), and it avoids issues like this.
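
One possible way to notice a half-finished dist-upgrade like the one described above is to look at the summary line that `apt-get -s dist-upgrade` (simulation mode) prints. This is only a sketch and not necessarily how the problem was diagnosed here: the helper names are hypothetical, the regular expression assumes apt's English-language output (hence `LC_ALL=C`), and the sample line in the usage note is illustrative rather than taken from this instance.

```python
import os
import re
import subprocess

# Matches apt-get's English summary line, e.g.
# "12 upgraded, 3 newly installed, 1 to remove and 4 not upgraded."
SUMMARY_RE = re.compile(
    r"(\d+) upgraded, (\d+) newly installed, (\d+) to remove and (\d+) not upgraded"
)


def parse_apt_summary(output):
    """Extract the four counters from apt-get output, or None if absent."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    upgraded, installed, removed, not_upgraded = (int(n) for n in match.groups())
    return {
        "upgraded": upgraded,
        "newly_installed": installed,
        "to_remove": removed,
        "not_upgraded": not_upgraded,
    }


def has_pending_changes(counts):
    """True if a dist-upgrade would still change or hold back any package."""
    return any(value > 0 for value in counts.values())


def simulate_dist_upgrade():
    """Run 'apt-get -s dist-upgrade' (simulation only) and return its output."""
    env = {**os.environ, "LC_ALL": "C"}  # force English output for the regex
    result = subprocess.run(
        ["apt-get", "-s", "dist-upgrade"],
        capture_output=True, text=True, check=True, env=env,
    )
    return result.stdout
```

For example, `parse_apt_summary("12 upgraded, 3 newly installed, 1 to remove and 4 not upgraded.")` yields non-zero counters, so `has_pending_changes` would report that the upgrade is incomplete. On a live host, the input would come from `simulate_dist_upgrade()`.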

Thanks @taavi! Next time we'll destroy and re-create with a clean image.