Overview
We recently bootstrapped a new ceph cluster in T330149.
Doing so highlighted a couple of problems with the way that puppet behaves when bootstrapping monitor nodes.
This ticket records the investigation and fixing of these problems.
It will be useful to carry out this work on the new ceph cluster, before it goes into production.
However, we must be mindful that the ceph module is also in use by the cloudceph clusters that are in production and are managed by the cloud-services-team.
Whilst the puppet module is shared, the cloudceph cluster currently uses its own puppet profiles and a different version of Ceph (15 vs 17).
Any of these factors might cause the behaviour of bootstrapping a cluster, a monitor daemon, or a manager daemon to differ between the two cases.
Observations
1: Bootstrapping a new monitor fails with a named monitor key
When bootstrapping a monitor server, whether for a new cluster or an existing cluster, puppet runs the following command on each monitor:
/usr/bin/ceph-mon --mkfs -i ${::hostname} --fsid ${fsid} --keyring ${temp_keyring}
This exec is defined here.
Note that we are using the method of expanding-with-initial-members defined here, since we already have an /etc/ceph/ceph.conf present, which contains the mon initial members option.
The $temp_keyring contains two keys, concatenated into one text file: /var/lib/ceph/tmp/ceph.mon.keyring
- The contents of the monitor authentication key: mon.${::hostname}
- The contents of the client admin key
The following screenshot shows this temporary file, with keydata redacted. It also highlights the named section of the monitor key.
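As a minimal sketch, the section names in the temporary keyring can be listed without printing any key data. The function name list_keyring_sections is hypothetical, and the sketch assumes the usual keyring layout in which each section header is a bracketed name at the start of a line.

```shell
#!/bin/sh
# Hypothetical helper: print only the section headers of a keyring file,
# e.g. "[mon.<hostname>]" and "[client.admin]", leaving key data unprinted.
# Assumes each section header starts its line with "[".
list_keyring_sections() {
    # $1: path to the keyring, e.g. /var/lib/ceph/tmp/ceph.mon.keyring
    grep '^\[' "$1"
}
```

For example, `list_keyring_sections /var/lib/ceph/tmp/ceph.mon.keyring` would be expected to show one monitor section and one client.admin section.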
When this command is executed on a mon node running ceph quincy, the file /var/lib/ceph/mon/ceph-$hostname/keyring is not created. Attempting to start the mon service then results in the following errors:
auth: error reading file: /var/lib/ceph/mon/ceph-cephosd1001/keyring: can't open /var/lib/ceph/mon/ceph-cephosd1001/keyring: (2) No such file or directory
mon.cephosd1001@-1(???) e0 unable to load initial keyring /var/lib/ceph/mon/ceph-cephosd1001/keyring
The documentation (https://docs.ceph.com/en/quincy/dev/mon-bootstrap/#secret-keys) refers to a secret key named mon. instead of a key named mon.$hostname.
The mon. secret key is stored in a keyring file in the mon data directory.
In order to bootstrap the cluster, I manually modified /var/lib/ceph/tmp/ceph.mon.keyring on each of the servers, removing the hostname from the monitor key, leaving [mon.] in its place. I then executed the command manually.
Upon doing so, the keyring was created in the correct location and the mon process started successfully.
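The manual workaround described above can be sketched roughly as follows. This is an illustration rather than the exact commands that were run: the function name rename_mon_section is hypothetical, and it assumes the section header appears on its own line as [mon.<hostname>] with a short hostname.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the named monitor section header
# "[mon.<hostname>]" to "[mon.]" in a keyring file, leaving all other
# sections (such as [client.admin]) untouched.
rename_mon_section() {
    # $1: path to the keyring file
    # $2: short hostname used in the named section
    sed "s/^\[mon\.$2\]/[mon.]/" "$1" > "$1.new" && mv "$1.new" "$1"
}

# Intended use on a mon host (not run here; fsid must be set by the operator):
#   rename_mon_section /var/lib/ceph/tmp/ceph.mon.keyring "$(hostname -s)"
#   /usr/bin/ceph-mon --mkfs -i "$(hostname -s)" --fsid "$fsid" \
#       --keyring /var/lib/ceph/tmp/ceph.mon.keyring
```

Writing to a new file and renaming it avoids depending on an in-place edit option, which differs between sed implementations.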
2: Timeouts running puppet (ceph auth) before the cluster is bootstrapped
TODO explain observation
3: The mgr keys were created instead of being imported so the keydata did not match
TODO explain observation