Puppet is having issues on the "kraken-ocr.wikisource.eqiad1.wikimedia.cloud (172.16.5.149)" instance in project
wikisource in Wikimedia Cloud VPS.
Puppet is running with failures.
Working Puppet runs are needed to maintain instance security and logins.
As long as Puppet continues to fail, this system is in danger of becoming
unreachable.
You are receiving this email because you are listed as a member of the
project that contains this instance. Please take steps to repair
this instance or contact a Cloud VPS admin for assistance.
If your host is expected to fail Puppet runs and you want to disable this
alert, you can create a file at /.no-puppet-checks, which will skip the checks.
You might find some help here:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Cloud_VPS_alert_Puppet_failure_on
For further support, visit #wikimedia-cloud on libera.chat or
<https://wikitech.wikimedia.org>
Some extra info follows:
---- Last run summary:
application:
  converged_environment: production
  initial_environment: production
  run_mode: agent
changes:
  total: 0
events:
  failure: 1
  success: 0
  total: 1
resources:
  changed: 0
  corrective_change: 1
  failed: 1
  failed_to_restart: 0
  out_of_sync: 1
  restarted: 0
  scheduled: 0
  skipped: 0
  total: 535
time:
  augeas: 0.004748811
  catalog_application: 6.164173318000394
  concat_file: 0.002344187
  concat_fragment: 0.0017932980000000002
  config_retrieval: 4.284946191997733
  convert_catalog: 0.3030783879949013
  exec: 0.09367868200000001
  fact_generation: 2.2667200500000035
  file: 1.9587983389999994
  file_line: 0.001505548
  filebucket: 0.000159436
  group: 0.000449444
  last_run: 1777450486
  package: 0.5152029549999999
  plugin_sync: 0.6136276790057309
  schedule: 0.0011646690000000001
  service: 1.8913452989999997
  startup_time: 1.235490566
  tidy: 0.000100415
  total: 15.524327943
  transaction_evaluation: 6.103091213997686
  user: 0.00065793
version:
  config: '(6931773bd6) Elukey - restbase: migrate envoy TLS proxy services to new
    intermediate'
  puppet: 7.23.0
---- Failed resources if any:
* Service[sssd]
--- Last run log:
ERR: Systemd start for sssd failed!
journalctl log for sssd:
Apr 29 08:14:44 kraken-ocr systemd[1]: Starting sssd.service - System Security Services Daemon...
Apr 29 08:14:44 kraken-ocr sssd[62408]: SSSD couldn't load the configuration database [1432158322]: File ownership and permissions check failed
Apr 29 08:14:44 kraken-ocr systemd[1]: sssd.service: Main process exited, code=exited, status=4/NOPERMISSION
Apr 29 08:14:44 kraken-ocr systemd[1]: sssd.service: Failed with result 'exit-code'.
Apr 29 08:14:44 kraken-ocr systemd[1]: Failed to start sssd.service - System Security Services Daemon.
ERR: change from 'stopped' to 'running' failed: Systemd start for sssd failed!
journalctl log for sssd:
Apr 29 08:14:44 kraken-ocr systemd[1]: Starting sssd.service - System Security Services Daemon...
Apr 29 08:14:44 kraken-ocr sssd[62408]: SSSD couldn't load the configuration database [1432158322]: File ownership and permissions check failed
Apr 29 08:14:44 kraken-ocr systemd[1]: sssd.service: Main process exited, code=exited, status=4/NOPERMISSION
Apr 29 08:14:44 kraken-ocr systemd[1]: sssd.service: Failed with result 'exit-code'.
Apr 29 08:14:44 kraken-ocr systemd[1]: Failed to start sssd.service - System Security Services Daemon.
(corrective)
NOTICE: Applied catalog in 6.16 seconds
---- Exceptions that happened when running the script if any:
No exceptions happened.
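
The counters in the run summary above are enough to tell whether a run failed. The following is a minimal sketch, not the script that generated this alert: the function name `puppet_run_failed` is hypothetical, and it assumes the summary has already been loaded into a plain dictionary with the same keys as shown above.

```python
def puppet_run_failed(summary: dict) -> bool:
    """Return True if the summary reports any failed resource or event.

    Hypothetical helper: `summary` is assumed to be the run summary
    already parsed into nested dicts (e.g. from its YAML form).
    """
    resources = summary.get("resources", {})
    events = summary.get("events", {})
    return (
        resources.get("failed", 0) > 0
        or resources.get("failed_to_restart", 0) > 0
        or events.get("failure", 0) > 0
    )


# Counters copied from the summary in this alert.
summary = {
    "events": {"failure": 1, "success": 0, "total": 1},
    "resources": {
        "failed": 1,
        "failed_to_restart": 0,
        "out_of_sync": 1,
        "total": 535,
    },
}

print(puppet_run_failed(summary))  # True: Service[sssd] failed
```

A missing section is treated as zero failures here, which is a design choice for the sketch; a stricter check might instead treat an incomplete summary as a failure.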
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | Feature | None | T345055 Add kraken OCR engine to Wikimedia OCR |
| Resolved | | taavi | T424818 [Cloud VPS alert][wikisource] Puppet failure on kraken-ocr.wikisource.eqiad1.wikimedia.cloud |
Event Timeline
I'm afraid that I caused this issue with an upgrade to Debian trixie. Maybe Puppet was uninstalled accidentally. I cannot fix it, because the VM no longer accepts my SSH keys, and I have no other access, such as a VNC console.
I also can't log in. I guess SSH keys aren't being copied onto the server.
Should we recreate it with debian-13.0-trixie, given that it's currently on debian-12.0-bookworm (deprecated 2023-11-27)?
A recreate with Debian trixie would be fine for me. Maybe this is easier than fixing the current installation.
The machine was in the middle of a dist-upgrade, with some packages upgraded and some still on old releases, which caused sssd (the authentication daemon) to fail. Upgrading all of its components made it work again. We still recommend re-creating instances rather than upgrading them in place: it lets you check that the instances are still re-creatable, it ensures all the metadata is correct (e.g. this instance will still show up as bookworm when we start nagging for its removal), and it avoids issues like this.
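
One way to spot a half-finished upgrade like the one described above is to list the packages that still have an upgrade pending. The following is a rough sketch under stated assumptions: it parses text in the shape printed by `apt list --upgradable`, the helper name `parse_upgradable` is hypothetical, and the sample package names and versions are made up for illustration. Note that apt itself warns that its command-line output is not a stable interface, so treat this as illustrative only.

```python
import re

# Matches lines such as:
#   name/suite new-version arch [upgradable from: old-version]
_UPGRADABLE = re.compile(
    r"^(?P<name>[^/\s]+)/\S+\s+(?P<new>\S+)\s+\S+\s+"
    r"\[upgradable from: (?P<old>[^\]]+)\]$"
)


def parse_upgradable(output: str) -> list[tuple[str, str, str]]:
    """Return (package, installed_version, candidate_version) tuples.

    Hypothetical helper: `output` is assumed to be the text printed by
    `apt list --upgradable`; lines that do not match are skipped.
    """
    pending = []
    for line in output.splitlines():
        match = _UPGRADABLE.match(line.strip())
        if match:
            pending.append((match["name"], match["old"], match["new"]))
    return pending


# Made-up sample output; real package names and versions will differ.
sample = """Listing...
sssd-common/stable 2.0.0-1 amd64 [upgradable from: 1.0.0-1]
libexample1/stable 3.1-2 amd64 [upgradable from: 3.0-1]
"""

for name, old, new in parse_upgradable(sample):
    print(f"{name}: {old} -> {new}")
```

A non-empty result after an interrupted dist-upgrade suggests that some components are still on the old release, which is the situation that broke sssd here.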