
wmcs.ceph.osd.bootstrap_and_add failing to connect to rados
Closed, Resolved · Public

Description

(spicerack) andrew@buster:~$ cookbook -c ~/.config/spicerack/cookbook_config.yaml  wmcs.ceph.osd.bootstrap_and_add --new-osd-fqdn cloudcephosd1018.eqiad.wmnet --controlling-node-fqdn cloudcephmon1001.eqiad.wmnet
START - Cookbook wmcs.ceph.osd.bootstrap_and_add
Adding new OSDs ['cloudcephosd1018.eqiad.wmnet'] to the cluster
----- OUTPUT of 'sudo -i ceph osd set norebalance' -----                                                                                            
norebalance is set                                                                                                                                  
================                                                                                                                                    
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:02<00:00,  2.78s/hosts]
FAIL |                                                                                                              |   0% (0/1) [00:02<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i ceph osd set norebalance'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
  Adding OSD cloudcephosd1018.eqiad.wmnet... (1/1)
Running Puppet with args  on 1 hosts: cloudcephosd1018.eqiad.wmnet
----- OUTPUT of 'sudo -i run-puppet-agent ' -----                                                                                                   
Info: Using configured environment 'production'                                                                                                     
Info: Retrieving pluginfacts                                                                                                                        
Info: Retrieving plugin                                                                                                                             
Info: Retrieving locales                                                                                                                            
Info: Loading facts                                                                                                                                 
Info: Caching catalog for cloudcephosd1018.eqiad.wmnet                                                                                              
Info: Applying configuration version '(f21b24e85c) Jbond - wmflib::role_hosts: fix typos'                                                           
Notice: Applied catalog in 13.80 seconds                                                                                                            
================                                                                                                                                    
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:28<00:00, 28.43s/hosts]
FAIL |                                                                                                              |   0% (0/1) [00:28<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i run-puppet-agent '.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Rebooting node cloudcephosd1018.eqiad.wmnet
----- OUTPUT of 'sudo -i ceph status -f json' -----                                                                                                 
                                                                                                                                                    
{"fsid":"5917e6d9-06a0-4928-827a-f489384975b1","health":{"status":"HEALTH_WARN","checks":{"OSDMAP_FLAGS":{"severity":"HEALTH_WARN","summary":{"message":"norebalance flag(s) set","count":11},"muted":false}},"mutes":[]},"election_epoch":502,"quorum":[0,1,2],"quorum_names":["cloudcephmon1003","cloudcephmon1002","cloudcephmon1001"],"quorum_age":2298457,"monmap":{"epoch":8,"min_mon_release_name":"octopus","num_mons":3},"osdmap":{"epoch":3788672,"num_osds":152,"num_up_osds":152,"osd_up_since":1629214787,"num_in_osds":152,"osd_in_since":1625220736,"num_remapped_pgs":0},"pgmap":{"pgs_by_state":[{"state_name":"active+clean","count":6144},{"state_name":"active+clean+scrubbing+deep","count":1}],"num_pgs":6145,"num_pools":4,"num_objects":13679379,"data_bytes":57827911614345,"bytes_used":174059778916352,"bytes_avail":117774390476800,"bytes_total":291834169393152,"read_bytes_sec":931326635,"write_bytes_sec":623458551,"read_op_per_sec":3604,"write_op_per_sec":7687},"fsmap":{"epoch":1,"by_rank":[],"up:standby":0},"mgrmap":{"available":true,"num_standbys":2,"modules":["dashboard","iostat","pg_autoscaler","prometheus","restful"],"services":{"prometheus":"http://cloudcephmon1002.eqiad.wmnet:9283/"}},"servicemap":{"epoch":404644,"modified":"2021-08-25T14:40:12.526032+0000","services":{}},"progress_events":{}}
================                                                                                                                                    
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.81s/hosts]
FAIL |                                                                                                              |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i ceph status -f json'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of 'sudo -i grep -P ...cinga/icinga.cfg' -----                                                                                         
command_file=/var/lib/icinga/rw/icinga.cmd                                                                                                          
================                                                                                                                                    
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.47s/hosts]
FAIL |                                                                                                              |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i grep -P ...cinga/icinga.cfg'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of 'sudo -i /usr/loc...loudcephosd1018"' -----                                                                                         
{"cloudcephosd1018": {"name": "cloudcephosd1018", "state": "UP", "optimal": true, "downtimed": false, "notifications_enabled": true, "failed_services": []}}                                                                                                                                            
================                                                                                                                                    
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.78s/hosts]
FAIL |                                                                                                              |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i /usr/loc...loudcephosd1018"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: cloudcephosd1018
----- OUTPUT of 'sudo -i bash -c .../rw/icinga.cmd '' -----                                                                                         
================                                                                                                                                    
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.22s/hosts]
FAIL |                                                                                                              |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i bash -c .../rw/icinga.cmd ''.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of 'sudo -i bash -c .../rw/icinga.cmd '' -----                                                                                         
================                                                                                                                                    
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.26s/hosts]
FAIL |                                                                                                              |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i bash -c .../rw/icinga.cmd ''.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Rebooting 1 hosts in batches of 1 with 0.0s of sleep in between: cloudcephosd1018.eqiad.wmnet
----- OUTPUT of 'sudo -i reboot-host' -----                                                                                                         
================                                                                                                                                    
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.16s/hosts]
FAIL |                                                                                                              |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i reboot-host'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of 'sudo -i cat /proc/uptime' -----                                                                                                    
channel 0: open failed: connect failed: Connection refused                                                                                          
stdio forwarding failed                                                                                                                             
ssh_exchange_identification: Connection closed by remote host                                                                                       
================                                                                                                                                    
PASS |                                                                                                              |   0% (0/1) [00:00<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.59hosts/s]
100.0% (1/1) of nodes failed to execute command 'sudo -i cat /proc/uptime': cloudcephosd1018.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'sudo -i cat /proc/uptime'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
[1/360, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Cumin execution failed (exit_code=2)
----- OUTPUT of 'sudo -i cat /proc/uptime' -----                                                                                                    
================                                                                                                                                    
PASS |                                                                                                              |   0% (0/1) [00:10<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:10<00:00, 10.02s/hosts]
100.0% (1/1) of nodes timeout to execute command 'sudo -i cat /proc/uptime': cloudcephosd1018.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'sudo -i cat /proc/uptime'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
[2/360, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Cumin execution failed (exit_code=2)
----- OUTPUT of 'sudo -i cat /proc/uptime' -----                                                                                                    
================                                                                                                                                    
PASS |                                                                                                              |   0% (0/1) [00:10<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:10<00:00, 10.03s/hosts]
100.0% (1/1) of nodes timeout to execute command 'sudo -i cat /proc/uptime': cloudcephosd1018.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'sudo -i cat /proc/uptime'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
[3/360, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Cumin execution failed (exit_code=2)
----- OUTPUT of 'sudo -i cat /proc/uptime' -----                                                                                                    
================                                                                                                                                    
PASS |                                                                                                              |   0% (0/1) [00:10<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:10<00:00, 10.05s/hosts]
100.0% (1/1) of nodes timeout to execute command 'sudo -i cat /proc/uptime': cloudcephosd1018.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'sudo -i cat /proc/uptime'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
[4/360, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Cumin execution failed (exit_code=2)
----- OUTPUT of 'sudo -i cat /proc/uptime' -----                                                                                                    
================                                                                                                                                    
PASS |                                                                                                              |   0% (0/1) [00:10<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:10<00:00, 10.03s/hosts]
100.0% (1/1) of nodes timeout to execute command 'sudo -i cat /proc/uptime': cloudcephosd1018.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'sudo -i cat /proc/uptime'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
[5/360, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Cumin execution failed (exit_code=2)
----- OUTPUT of 'sudo -i cat /proc/uptime' -----                                                                                                    
================                                                                                                                                    
PASS |                                                                                                              |   0% (0/1) [00:10<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:10<00:00, 10.02s/hosts]
100.0% (1/1) of nodes timeout to execute command 'sudo -i cat /proc/uptime': cloudcephosd1018.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'sudo -i cat /proc/uptime'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
[6/360, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Cumin execution failed (exit_code=2)
----- OUTPUT of 'sudo -i cat /proc/uptime' -----                                                                                                    
================                                                                                                                                    
PASS |                                                                                                              |   0% (0/1) [00:10<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:10<00:00, 10.02s/hosts]
100.0% (1/1) of nodes timeout to execute command 'sudo -i cat /proc/uptime': cloudcephosd1018.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'sudo -i cat /proc/uptime'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
[7/360, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Cumin execution failed (exit_code=2)
----- OUTPUT of 'sudo -i cat /proc/uptime' -----                                                                                                    
================                                                                                                                                    
PASS |                                                                                                              |   0% (0/1) [00:10<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:10<00:00, 10.04s/hosts]
100.0% (1/1) of nodes timeout to execute command 'sudo -i cat /proc/uptime': cloudcephosd1018.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'sudo -i cat /proc/uptime'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
[8/360, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Cumin execution failed (exit_code=2)
----- OUTPUT of 'sudo -i cat /proc/uptime' -----                                                                                                    
26.83 1197.54                                                                                                                                       
================                                                                                                                                    
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:02<00:00,  2.27s/hosts]
FAIL |                                                                                                              |   0% (0/1) [00:02<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i cat /proc/uptime'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Found reboot since 2021-08-25 14:43:06.066544 for hosts cloudcephosd1018.eqiad.wmnet
Rebooted node cloudcephosd1018.eqiad.wmnet, waiting for cluster to stabilize...
----- OUTPUT of 'sudo -i ceph status -f json' -----                                                                                                 
                                                                                                                                                    
{"fsid":"5917e6d9-06a0-4928-827a-f489384975b1","health":{"status":"HEALTH_WARN","checks":{"OSDMAP_FLAGS":{"severity":"HEALTH_WARN","summary":{"message":"norebalance flag(s) set","count":11},"muted":false}},"mutes":[]},"election_epoch":502,"quorum":[0,1,2],"quorum_names":["cloudcephmon1003","cloudcephmon1002","cloudcephmon1001"],"quorum_age":2298619,"monmap":{"epoch":8,"min_mon_release_name":"octopus","num_mons":3},"osdmap":{"epoch":3788749,"num_osds":152,"num_up_osds":152,"osd_up_since":1629214787,"num_in_osds":152,"osd_in_since":1625220736,"num_remapped_pgs":0},"pgmap":{"pgs_by_state":[{"state_name":"active+clean","count":6144},{"state_name":"active+clean+scrubbing+deep","count":1}],"num_pgs":6145,"num_pools":4,"num_objects":13677410,"data_bytes":57825754645973,"bytes_used":174052726198272,"bytes_avail":117781443194880,"bytes_total":291834169393152,"read_bytes_sec":740958192,"write_bytes_sec":260664495,"read_op_per_sec":2116,"write_op_per_sec":3856},"fsmap":{"epoch":1,"by_rank":[],"up:standby":0},"mgrmap":{"available":true,"num_standbys":2,"modules":["dashboard","iostat","pg_autoscaler","prometheus","restful"],"services":{"prometheus":"http://cloudcephmon1002.eqiad.wmnet:9283/"}},"servicemap":{"epoch":404646,"modified":"2021-08-25T14:44:29.495150+0000","services":{}},"progress_events":{}}
================                                                                                                                                    
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.70s/hosts]
FAIL |                                                                                                              |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i ceph status -f json'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Cluster stable, continuing
----- OUTPUT of 'sudo -i bash -c .../rw/icinga.cmd '' -----                                                                                         
================                                                                                                                                    
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.21s/hosts]
FAIL |                                                                                                              |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i bash -c .../rw/icinga.cmd ''.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Finished rebooting node cloudcephosd1018.eqiad.wmnet
----- OUTPUT of 'sudo -i lsblk --json' -----                                                                                                        
{                                                                                                                                                   
   "blockdevices": [                                                                                                                                
      {"name":"sda", "maj:min":"8:0", "rm":false, "size":"223.6G", "ro":false, "type":"disk", "mountpoint":null,                                    
         "children": [                                                                                                                              
            {"name":"sda1", "maj:min":"8:1", "rm":false, "size":"285M", "ro":false, "type":"part", "mountpoint":null},                              
            {"name":"sda2", "maj:min":"8:2", "rm":false, "size":"223.3G", "ro":false, "type":"part", "mountpoint":null,                             
               "children": [                                                                                                                        
                  {"name":"md0", "maj:min":"9:0", "rm":false, "size":"223.2G", "ro":false, "type":"raid1", "mountpoint":null,                       
                     "children": [                                                                                                                  
                        {"name":"vg0-root", "maj:min":"253:0", "rm":false, "size":"74.5G", "ro":false, "type":"lvm", "mountpoint":"/"},             
                        {"name":"vg0-swap", "maj:min":"253:1", "rm":false, "size":"976M", "ro":false, "type":"lvm", "mountpoint":"[SWAP]"},         
                        {"name":"vg0-srv", "maj:min":"253:2", "rm":false, "size":"103.1G", "ro":false, "type":"lvm", "mountpoint":"/srv"}           
                     ]                                                                                                                              
                  }                                                                                                                                 
               ]                                                                                                                                    
            }                                                                                                                                       
         ]                                                                                                                                          
      },                                                                                                                                            
      {"name":"sdb", "maj:min":"8:16", "rm":false, "size":"223.6G", "ro":false, "type":"disk", "mountpoint":null,                                   
         "children": [                                                                                                                              
            {"name":"sdb1", "maj:min":"8:17", "rm":false, "size":"285M", "ro":false, "type":"part", "mountpoint":null},                             
            {"name":"sdb2", "maj:min":"8:18", "rm":false, "size":"223.3G", "ro":false, "type":"part", "mountpoint":null,                            
               "children": [                                                                                                                        
                  {"name":"md0", "maj:min":"9:0", "rm":false, "size":"223.2G", "ro":false, "type":"raid1", "mountpoint":null,                       
                     "children": [                                                                                                                  
                        {"name":"vg0-root", "maj:min":"253:0", "rm":false, "size":"74.5G", "ro":false, "type":"lvm", "mountpoint":"/"},             
                        {"name":"vg0-swap", "maj:min":"253:1", "rm":false, "size":"976M", "ro":false, "type":"lvm", "mountpoint":"[SWAP]"},         
                        {"name":"vg0-srv", "maj:min":"253:2", "rm":false, "size":"103.1G", "ro":false, "type":"lvm", "mountpoint":"/srv"}           
                     ]                                                                                                                              
                  }                                                                                                                                 
               ]                                                                                                                                    
            }                                                                                                                                       
         ]                                                                                                                                          
      },                                                                                                                                            
      {"name":"sdc", "maj:min":"8:32", "rm":false, "size":"1.8T", "ro":false, "type":"disk", "mountpoint":null},                                    
      {"name":"sdd", "maj:min":"8:48", "rm":false, "size":"1.8T", "ro":false, "type":"disk", "mountpoint":null},                                    
      {"name":"sde", "maj:min":"8:64", "rm":false, "size":"1.8T", "ro":false, "type":"disk", "mountpoint":null},                                    
      {"name":"sdf", "maj:min":"8:80", "rm":false, "size":"1.8T", "ro":false, "type":"disk", "mountpoint":null},                                    
      {"name":"sdg", "maj:min":"8:96", "rm":false, "size":"1.8T", "ro":false, "type":"disk", "mountpoint":null},                                    
      {"name":"sdh", "maj:min":"8:112", "rm":false, "size":"1.8T", "ro":false, "type":"disk", "mountpoint":null},                                   
      {"name":"sdi", "maj:min":"8:128", "rm":false, "size":"1.8T", "ro":false, "type":"disk", "mountpoint":null},                                   
      {"name":"sdj", "maj:min":"8:144", "rm":false, "size":"1.8T", "ro":false, "type":"disk", "mountpoint":null}                                    
   ]                                                                                                                                                
}                                                                                                                                                   
================                                                                                                                                    
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.11s/hosts]
FAIL |                                                                                                              |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i lsblk --json'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> I'm going to destroy and create a new OSD on cloudcephosd1018.eqiad.wmnet:/dev/sdc.
Type "go" to proceed or "abort" to interrupt the execution
> go
----- OUTPUT of 'sudo -i ceph-vol...lvm zap /dev/sdc' -----                                                                                         
--> Zapping: /dev/sdc                                                                                                                               
--> --destroy was not specified, but zapping a whole device will remove the partition table                                                         
Running command: /usr/bin/dd if=/dev/zero of=/dev/sdc bs=1M count=10 conv=fsync                                                                     
 stderr: 10+0 records in                                                                                                                            
10+0 records out                                                                                                                                    
 stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.0286299 s, 366 MB/s                                                                               
--> Zapping successful for: <Raw Device: /dev/sdc>                                                                                                  
================                                                                                                                                    
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.76s/hosts]
FAIL |                                                                                                              |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i ceph-vol...lvm zap /dev/sdc'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of 'sudo -i ceph-vol... --data /dev/sdc' -----                                                                                         
Running command: /usr/bin/ceph-authtool --gen-print-key                                                                                             
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new f23684e6-db0e-4fa7-bfe9-3a3516e0a852                                                                                                                          
 stderr: [errno 13] RADOS permission denied (error connecting to the cluster)                                                                       
-->  RuntimeError: Unable to create a new OSD id                                                                                                    
================                                                                                                                                    
PASS |                                                                                                              |   0% (0/1) [00:01<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.71s/hosts]
100.0% (1/1) of nodes failed to execute command 'sudo -i ceph-vol... --data /dev/sdc': cloudcephosd1018.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'sudo -i ceph-vol... --data /dev/sdc'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Exception raised while executing cookbook wmcs.ceph.osd.bootstrap_and_add:
Traceback (most recent call last):
  File "/home/andrew/.virtualenvs/spicerack/lib/python3.7/site-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/home/andrew/cookbooks/cookbooks/wmcs/ceph/osd/bootstrap_and_add.py", line 159, in run
    interactive=(not self.yes_i_know)
  File "/home/andrew/cookbooks/cookbooks/wmcs/__init__.py", line 547, in add_all_available_devices
    self.initialize_and_start_osd(device_path=device_path)
  File "/home/andrew/cookbooks/cookbooks/wmcs/__init__.py", line 538, in initialize_and_start_osd
    self._node.run_sync(f"ceph-volume lvm create --bluestore --data {device_path}")
  File "/home/andrew/.virtualenvs/spicerack/lib/python3.7/site-packages/spicerack/remote.py", line 477, in run_sync
    is_safe=is_safe,
  File "/home/andrew/.virtualenvs/spicerack/lib/python3.7/site-packages/spicerack/remote.py", line 668, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed")
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)

Event Timeline

For some reason the keyring is empty:

root@cloudcephosd1018:~# wc /var/lib/ceph/bootstrap-osd/ceph.keyring
0 0 0 /var/lib/ceph/bootstrap-osd/ceph.keyring

Indeed there's something going on with Puppet and the creation of the keyring.

When running Puppet with --test --debug --verbose --tag /var/lib/ceph/bootstrap-osd/ceph.keyring, you can see it skips creating the keyring because the file already exists:

Debug: /Stage[main]/Profile::Ceph::Osd/Ceph::Keyring[client.bootstrap-osd]/Exec[ceph-keyring-client.bootstrap-osd]: '/usr/bin/ceph-authtool --create-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -n client.bootstrap-osd --add-key=... ' won't be executed because of failed check 'creates'
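
Context: Puppet's creates parameter makes an exec a no-op whenever the named file already exists, regardless of its contents. A minimal sketch of that guard, with an illustrative resource name and the key elided (not the real ceph::keyring code):

exec { 'ceph-keyring-client.bootstrap-osd':
  # Writes the keyring; the actual key material is elided.
  command => '/usr/bin/ceph-authtool --create-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -n client.bootstrap-osd --add-key=<secret>',
  # Skips the command whenever this path exists, even as a zero-byte file.
  creates => '/var/lib/ceph/bootstrap-osd/ceph.keyring',
}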

I tried removing the keyring file and rerunning Puppet; the exec ran this time, but the file it created was still empty.

Running the command manually worked and the file had contents, so it seems to be related to the environment Puppet runs the command in. Looking...

Hmm... yep, I think I know what's going on: when creating the keyring we don't declare a dependency between the exec that populates the file and the file resource itself. Puppet can therefore create the file empty first, and when the exec comes along it sees the file already there and skips it.
Will send a patch.
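
A rough sketch of the shape of the fix, assuming it amounts to declaring an ordering so the exec writes the keyring before Puppet manages the file resource (illustrative only; the actual change is the Gerrit patch below):

exec { 'ceph-keyring-client.bootstrap-osd':
  command => '/usr/bin/ceph-authtool --create-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -n client.bootstrap-osd --add-key=<secret>',
  creates => '/var/lib/ceph/bootstrap-osd/ceph.keyring',
}

file { '/var/lib/ceph/bootstrap-osd/ceph.keyring':
  owner   => 'ceph',
  group   => 'ceph',
  mode    => '0600',
  # Make sure the exec has written the keyring before Puppet touches the
  # file resource, so the file is never created empty first.
  require => Exec['ceph-keyring-client.bootstrap-osd'],
}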

Change 715455 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] ceph: fix keyring race condition

https://gerrit.wikimedia.org/r/715455

Change 715455 merged by Andrew Bogott:

[operations/puppet@production] ceph: fix keyring race condition

https://gerrit.wikimedia.org/r/715455