
openstack: codfw1dev: fullstack tests failing
Closed, Resolved (Public)

Description

New fullstack VMs in codfw1dev aren't getting the correct hostname:

hostname -f

buildvm-c94b3261-fc09-4c3b-9386-8e0503446325.admin.codfw1dev.wikimedia.cloud

That is the original hostname of the base VM. This could be a defect in the base image, but the same image seems to work when I use it in other contexts, so my guess is that this is some kind of DHCP failure related to the new virtual network.
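A minimal sketch of how the symptom above could be checked automatically: it tests whether an FQDN still carries the base image's `buildvm-<uuid>` name. The function name and the second hostname in the usage lines are hypothetical and only illustrate the idea; the pattern is inferred from the single hostname quoted above.

```python
import re

# Pattern inferred from the one hostname in this task: "buildvm-" followed by a UUID.
# Assumption: base-image build VMs are always named this way.
BASE_IMAGE_HOSTNAME = re.compile(
    r"^buildvm-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\."
)


def is_base_image_hostname(fqdn: str) -> bool:
    """Return True if the FQDN still looks like the base image's build hostname."""
    return BASE_IMAGE_HOSTNAME.match(fqdn.strip().lower()) is not None


if __name__ == "__main__":
    # Hostname reported by `hostname -f` on the failing VM.
    print(is_base_image_hostname(
        "buildvm-c94b3261-fc09-4c3b-9386-8e0503446325.admin.codfw1dev.wikimedia.cloud"
    ))
    # Hypothetical hostname of a VM that was renamed correctly.
    print(is_base_image_hostname("example-vm.admin.codfw1dev.wikimedia.cloud"))
```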

Event Timeline

aborrero changed the task status from Open to In Progress. (Nov 19 2024, 9:30 AM)
aborrero triaged this task as Medium priority.
aborrero moved this task from Backlog to Doing on the User-aborrero board.

Mentioned in SAL (#wikimedia-cloud) [2024-11-19T10:24:25Z] <arturo> [codfw1dev] restart rabbitmq and nova/neutron services for T380208

It seems the problem is that nova is not finding any hypervisors:

'No valid host was found. There are not enough hosts available.',

aborrero@cloudcontrol2005-dev:~ $ sudo wmcs-openstack hypervisor list
+--------------------------------------+-------------------------------+-----------------+--------------+-------+
| ID                                   | Hypervisor Hostname           | Hypervisor Type | Host IP      | State |
+--------------------------------------+-------------------------------+-----------------+--------------+-------+
| 88eb7d37-8ea1-4bd4-97b3-3be5a9e16ede | cloudvirt2004-dev.codfw.wmnet | QEMU            | 10.192.20.31 | down  |
| 7005e8be-9987-4c7f-94d1-98879d1760cd | cloudvirt2005-dev.codfw.wmnet | QEMU            | 10.192.20.32 | down  |
| 79cba5df-5f34-4f04-b0bb-99d650b5fbc4 | cloudvirt2006-dev.codfw.wmnet | QEMU            | 10.192.20.33 | down  |
+--------------------------------------+-------------------------------+-----------------+--------------+-------+
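All three hypervisors are reported as `down`, which matches the "No valid host was found" error. As a rough sketch, the table above can be parsed to list the hosts that are not `up`. The helper names are hypothetical, and the sample text is copied from the output above; for real use, a machine-readable output format from the client would likely be a better input than scraping the ASCII table.

```python
def parse_hypervisor_table(text: str) -> list[dict[str, str]]:
    """Parse the ASCII table printed by `hypervisor list` into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first data line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def down_hypervisors(rows: list[dict[str, str]]) -> list[str]:
    """Return hostnames of hypervisors whose State column is anything but 'up'."""
    return [r["Hypervisor Hostname"] for r in rows if r["State"].lower() != "up"]


SAMPLE = """
+--------------------------------------+-------------------------------+-----------------+--------------+-------+
| ID                                   | Hypervisor Hostname           | Hypervisor Type | Host IP      | State |
+--------------------------------------+-------------------------------+-----------------+--------------+-------+
| 88eb7d37-8ea1-4bd4-97b3-3be5a9e16ede | cloudvirt2004-dev.codfw.wmnet | QEMU            | 10.192.20.31 | down  |
| 7005e8be-9987-4c7f-94d1-98879d1760cd | cloudvirt2005-dev.codfw.wmnet | QEMU            | 10.192.20.32 | down  |
| 79cba5df-5f34-4f04-b0bb-99d650b5fbc4 | cloudvirt2006-dev.codfw.wmnet | QEMU            | 10.192.20.33 | down  |
+--------------------------------------+-------------------------------+-----------------+--------------+-------+
"""

if __name__ == "__main__":
    for host in down_hypervisors(parse_hypervisor_table(SAMPLE)):
        print(host)
```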

I see this RabbitMQ crash in the log:

aborrero@cloudcontrol2005-dev:~$ sudo tail -f /var/log/rabbitmq/rabbit@rabbitmq03.codfw1dev.wikimediacloud.org.log

2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0> handle_leader err {'EXIT',
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                       {{case_clause,undefined},
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                        [{ra_server,make_rpc_effect,4,
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                             [{file,"src/ra_server.erl"},{line,1677}]},
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                         {ra_server,'-make_pipelined_rpc_effects/3-fun-0-',8,
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                             [{file,"src/ra_server.erl"},{line,1633}]},
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                         {maps,fold_1,3,[{file,"maps.erl"},{line,411}]},
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                         {ra_server,handle_leader,2,
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                             [{file,"src/ra_server.erl"},{line,536}]},
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                         {ra_server_proc,handle_leader,2,
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                             [{file,"src/ra_server_proc.erl"},{line,1003}]},
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                         {ra_server_proc,leader,3,
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                             [{file,"src/ra_server_proc.erl"},{line,467}]},
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                         {gen_statem,loop_state_callback,11,
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                             [{file,"gen_statem.erl"},{line,1426}]},
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                         {proc_lib,init_p_do_apply,3,
2024-11-19 10:48:12.714735+00:00 [error] <0.28225.0>                             [{file,"proc_lib.erl"},{line,240}]}]}}
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0> ** State machine '%2F_compute.cloudvirt2005-dev' terminating
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0> ** Last event = {info,{ra_log_event,{written,{15,15,33}}}}
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0> ** When server state  = [{id,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                           {'%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                            'rabbit@rabbitmq03.codfw1dev.wikimediacloud.org'}},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                          {opt,terminate},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                          {raft_state,leader},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                          {leader_last_seen,undefined},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                          {num_pending_commands,0},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                          {num_delayed_commands,0},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                          {num_pending_applied_notifications,0},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                          {election_timeout_set,false},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                          {ra_server_state,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                           #{aux =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                              {aux_v2,'%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                               {empty,true},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                               {inactive,-576460589304070,1,1.0},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                               {aux_gc,0},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                               undefined,undefined},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             cluster =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                              #{{'%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                 'rabbit@rabbitmq01.codfw1dev.wikimediacloud.org'} =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                 #{commit_index_sent => 14,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   match_index => 334245,next_index => 334246,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   query_index => 0,status => normal},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                {'%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                 'rabbit@rabbitmq02.codfw1dev.wikimediacloud.org'} =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                 #{commit_index_sent => 14,match_index => 14,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   next_index => 16,query_index => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   status => normal},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                {'%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                 'rabbit@rabbitmq03.codfw1dev.wikimediacloud.org'} =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                 #{commit_index_sent => 0,match_index => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   next_index => 1,query_index => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   status => normal}},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             commit_index => 14,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             counter =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                              {write_concurrency,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                               #Ref<0.830461912.3533307912.35833>},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             current_term => 33,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             effective_machine_module => rabbit_fifo,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             effective_machine_version => 2,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             id =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                              {'%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                               'rabbit@rabbitmq03.codfw1dev.wikimediacloud.org'},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             last_applied => 14,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             leader_id =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                              {'%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                               'rabbit@rabbitmq03.codfw1dev.wikimediacloud.org'},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             log =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                              #{cache_size => 1,first_index => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                last_index => 15,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                last_written_index_term => {14,33},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                num_segments => 1,open_segments => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                snapshot_index => undefined,type => ra_log},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             log_id =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                              "queue 'compute.cloudvirt2005-dev' in vhost '/'",
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             machine =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                              #{checkout_message_bytes => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                config =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                 #{consumer_strategy => competing,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   dead_lettering_enabled => false,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   delivery_limit => undefined,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   expires => undefined,max_bytes => undefined,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   max_length => undefined,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   msg_ttl => undefined,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   name => '%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   release_cursor_interval => {2048,2048},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   resource =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                    {resource,<<"/">>,queue,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                     <<"compute.cloudvirt2005-dev">>}},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                discard_checkout_message_bytes => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                discard_message_bytes => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                enqueue_message_bytes => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                in_memory_message_bytes => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                num_checked_out => 0,num_consumers => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                num_discard_checked_out => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                num_discarded => 0,num_enqueuers => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                num_in_memory_ready_messages => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                num_messages => 0,num_ready_messages => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                num_release_cursors => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                release_cursor_enqueue_counter => 0,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                release_cursors => [],
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                smallest_raft_index => undefined,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                type => rabbit_fifo},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             machine_version => 2,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             machine_versions => [{1,2},{0,0}],
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             max_pipeline_count => 4096,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             metrics_key =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                              {resource,<<"/">>,queue,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                               <<"compute.cloudvirt2005-dev">>},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             system_config =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                              #{data_dir =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                 "/var/lib/rabbitmq/mnesia/rabbit@rabbitmq03.codfw1dev.wikimediacloud.org/quorum/rabbit@rabbitmq03.codfw1dev.wikimediacloud.org",
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                name => quorum_queues,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                names =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                 #{closed_mem_tbls => ra_log_closed_mem_tables,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   directory => ra_directory,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   directory_rev => ra_directory_reverse,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   log_ets => ra_log_ets,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   log_meta => ra_log_meta,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   log_sup => ra_log_sup,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   open_mem_tbls => ra_log_open_mem_tables,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   segment_writer => ra_log_segment_writer,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   server_sup => ra_server_sup_sup,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   wal => ra_log_wal,wal_sup => ra_log_wal_sup},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                segment_compute_checksums => true,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                segment_max_entries => 4096,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                wal_compute_checksums => true,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                wal_data_dir =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                 "/var/lib/rabbitmq/mnesia/rabbit@rabbitmq03.codfw1dev.wikimediacloud.org/quorum/rabbit@rabbitmq03.codfw1dev.wikimediacloud.org",
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                wal_garbage_collect => false,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                wal_max_batch_size => 4096,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                wal_max_entries => undefined,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                wal_max_size_bytes => 536870912,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                wal_pre_allocate => false,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                wal_sync_method => datasync,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                wal_write_strategy => default},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             uid => <<"2F_COMCC4T8HY3Q67O">>,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                             voted_for =>
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                              {'%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                               'rabbit@rabbitmq03.codfw1dev.wikimediacloud.org'}}}]
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0> ** Reason for termination = exit:{'EXIT',
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                   {{case_clause,undefined},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                    [{ra_server,make_rpc_effect,4,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                      [{file,"src/ra_server.erl"},{line,1677}]},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                     {ra_server,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                      '-make_pipelined_rpc_effects/3-fun-0-',8,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                      [{file,"src/ra_server.erl"},{line,1633}]},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                     {maps,fold_1,3,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                      [{file,"maps.erl"},{line,411}]},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                     {ra_server,handle_leader,2,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                      [{file,"src/ra_server.erl"},{line,536}]},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                     {ra_server_proc,handle_leader,2,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                      [{file,"src/ra_server_proc.erl"},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                       {line,1003}]},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                     {ra_server_proc,leader,3,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                      [{file,"src/ra_server_proc.erl"},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                       {line,467}]},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                     {gen_statem,loop_state_callback,11,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                      [{file,"gen_statem.erl"},{line,1426}]},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                     {proc_lib,init_p_do_apply,3,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                                      [{file,"proc_lib.erl"},{line,240}]}]}}
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0> ** Callback modules = [ra_server_proc]
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0> ** Callback mode = [state_functions,state_enter]
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0> ** Stacktrace =
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0> **  [{ra_server_proc,handle_leader,2,
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>                      [{file,"src/ra_server_proc.erl"},{line,1010}]},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>      {ra_server_proc,leader,3,[{file,"src/ra_server_proc.erl"},{line,467}]},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>      {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1426}]},
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0>      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0> ** Time-outs: {1,[{{timeout,tick},tick_timeout}]}
2024-11-19 10:48:12.715396+00:00 [error] <0.28225.0> 
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>   crasher:
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>     initial call: ra_server_proc:init/1
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>     pid: <0.28225.0>
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>     registered_name: '%2F_compute.cloudvirt2005-dev'
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>     exception exit: {'EXIT',
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                         {{case_clause,undefined},
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                          [{ra_server,make_rpc_effect,4,
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                               [{file,"src/ra_server.erl"},{line,1677}]},
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                           {ra_server,'-make_pipelined_rpc_effects/3-fun-0-',
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                               8,
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                               [{file,"src/ra_server.erl"},{line,1633}]},
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                           {maps,fold_1,3,[{file,"maps.erl"},{line,411}]},
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                           {ra_server,handle_leader,2,
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                               [{file,"src/ra_server.erl"},{line,536}]},
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                           {ra_server_proc,handle_leader,2,
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                               [{file,"src/ra_server_proc.erl"},{line,1003}]},
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                           {ra_server_proc,leader,3,
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                               [{file,"src/ra_server_proc.erl"},{line,467}]},
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                           {gen_statem,loop_state_callback,11,
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                               [{file,"gen_statem.erl"},{line,1426}]},
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                           {proc_lib,init_p_do_apply,3,
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                               [{file,"proc_lib.erl"},{line,240}]}]}}
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>       in function  ra_server_proc:handle_leader/2 (src/ra_server_proc.erl, line 1010)
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>       in call from ra_server_proc:leader/3 (src/ra_server_proc.erl, line 467)
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>       in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 1426)
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>     ancestors: [<0.1171.0>,ra_server_sup_sup,<0.457.0>,ra_systems_sup,
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                   ra_sup,<0.193.0>]
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>     message_queue_len: 0
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>     messages: []
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>     links: [<0.1171.0>]
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>     dictionary: [{rand_seed,{#{bits => 58,jump => #Fun<rand.3.34006561>,
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                                 next => #Fun<rand.0.34006561>,type => exsss,
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                                 uniform => #Fun<rand.1.34006561>,
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                                 uniform_n => #Fun<rand.2.34006561>},
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>                               [195694929694019114|246777113946586207]}}]
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>     trap_exit: true
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>     status: running
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>     heap_size: 46422
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>     stack_size: 28
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>     reductions: 244326
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0>   neighbours:
2024-11-19 10:48:12.719902+00:00 [error] <0.28225.0> 
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>     supervisor: {<0.1171.0>,ra_server_sup}
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>     errorContext: child_terminated
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>     reason: {'EXIT',{{case_clause,undefined},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                      [{ra_server,make_rpc_effect,4,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                                  [{file,"src/ra_server.erl"},{line,1677}]},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                       {ra_server,'-make_pipelined_rpc_effects/3-fun-0-',8,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                                  [{file,"src/ra_server.erl"},{line,1633}]},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                       {maps,fold_1,3,[{file,"maps.erl"},{line,411}]},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                       {ra_server,handle_leader,2,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                                  [{file,"src/ra_server.erl"},{line,536}]},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                       {ra_server_proc,handle_leader,2,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                                       [{file,"src/ra_server_proc.erl"},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                                        {line,1003}]},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                       {ra_server_proc,leader,3,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                                       [{file,"src/ra_server_proc.erl"},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                                        {line,467}]},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                       {gen_statem,loop_state_callback,11,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                                   [{file,"gen_statem.erl"},{line,1426}]},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                       {proc_lib,init_p_do_apply,3,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                                 [{file,"proc_lib.erl"},{line,240}]}]}}
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>     offender: [{pid,<0.28225.0>},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                {id,'%2F_compute.cloudvirt2005-dev'},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                {mfargs,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                 {ra_server_proc,start_link,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                  [#{await_condition_timeout => 30000,broadcast_time => 100,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                     cluster_name => '%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                     friendly_name =>
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                      "queue 'compute.cloudvirt2005-dev' in vhost '/'",
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                     has_changed => false,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                     id =>
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                      {'%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                       'rabbit@rabbitmq03.codfw1dev.wikimediacloud.org'},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                     initial_members =>
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                      [{'%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        'rabbit@rabbitmq02.codfw1dev.wikimediacloud.org'},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                       {'%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        'rabbit@rabbitmq03.codfw1dev.wikimediacloud.org'},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                       {'%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        'rabbit@rabbitmq01.codfw1dev.wikimediacloud.org'}],
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                     install_snap_rpc_timeout => 120000,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                     log_init_args =>
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                      #{snapshot_interval => 8192,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        uid => <<"2F_COMCC4T8HY3Q67O">>},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                     machine =>
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                      {module,rabbit_fifo,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                       #{become_leader_handler =>
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                          {rabbit_quorum_queue,become_leader,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                           [{resource,<<"/">>,queue,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                             <<"compute.cloudvirt2005-dev">>}]},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         created => 1732011857975,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         dead_letter_handler => undefined,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         delivery_limit => undefined,expires => undefined,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         max_bytes => undefined,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         max_in_memory_bytes => undefined,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         max_in_memory_length => undefined,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         max_length => undefined,msg_ttl => undefined,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         name => '%2F_compute.cloudvirt2005-dev',
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         overflow_strategy => drop_head,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         queue_resource =>
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                          {resource,<<"/">>,queue,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                           <<"compute.cloudvirt2005-dev">>},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         single_active_consumer_on => false}},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                     metrics_key =>
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                      {resource,<<"/">>,queue,<<"compute.cloudvirt2005-dev">>},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                     parent => <0.1171.0>,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                     ra_event_formatter =>
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                      {rabbit_quorum_queue,format_ra_event,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                       [{resource,<<"/">>,queue,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         <<"compute.cloudvirt2005-dev">>}]},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                     system_config =>
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                      #{data_dir =>
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         "/var/lib/rabbitmq/mnesia/rabbit@rabbitmq03.codfw1dev.wikimediacloud.org/quorum/rabbit@rabbitmq03.codfw1dev.wikimediacloud.org",
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        name => quorum_queues,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        names =>
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         #{closed_mem_tbls => ra_log_closed_mem_tables,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                           directory => ra_directory,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                           directory_rev => ra_directory_reverse,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                           log_ets => ra_log_ets,log_meta => ra_log_meta,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                           log_sup => ra_log_sup,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                           open_mem_tbls => ra_log_open_mem_tables,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                           segment_writer => ra_log_segment_writer,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                           server_sup => ra_server_sup_sup,wal => ra_log_wal,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                           wal_sup => ra_log_wal_sup},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        segment_compute_checksums => true,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        segment_max_entries => 4096,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        wal_compute_checksums => true,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        wal_data_dir =>
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                         "/var/lib/rabbitmq/mnesia/rabbit@rabbitmq03.codfw1dev.wikimediacloud.org/quorum/rabbit@rabbitmq03.codfw1dev.wikimediacloud.org",
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        wal_garbage_collect => false,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        wal_max_batch_size => 4096,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        wal_max_entries => undefined,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        wal_max_size_bytes => 536870912,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        wal_pre_allocate => false,wal_sync_method => datasync,
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                        wal_write_strategy => default},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                     tick_timeout => 5000,uid => <<"2F_COMCC4T8HY3Q67O">>}]}},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                {restart_type,transient},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                {significant,false},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                {shutdown,5000},
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>                {child_type,worker}]
2024-11-19 10:48:12.721485+00:00 [error] <0.1171.0>
aborrero renamed this task from openstack: nova-fullstack failing in codfw1dev to openstack: codfw1dev: rabbitmq is crashing.Nov 19 2024, 10:58 AM
aborrero@cloudcontrol2005-dev:~ $ sudo rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
2024-11-19 11:13:37 There are 5900 processes.
2024-11-19 11:13:37 Investigated 3 processes this round, 5000ms to go.
2024-11-19 11:13:37 Investigated 3 processes this round, 4500ms to go.
2024-11-19 11:13:38 Investigated 3 processes this round, 4000ms to go.
2024-11-19 11:13:38 Investigated 3 processes this round, 3500ms to go.
2024-11-19 11:13:39 Investigated 3 processes this round, 3000ms to go.
2024-11-19 11:13:39 Investigated 3 processes this round, 2500ms to go.
2024-11-19 11:13:40 Investigated 3 processes this round, 2000ms to go.
2024-11-19 11:13:40 Investigated 3 processes this round, 1500ms to go.
2024-11-19 11:13:41 Investigated 3 processes this round, 1000ms to go.
2024-11-19 11:13:41 Investigated 2 processes this round, 500ms to go.
2024-11-19 11:13:42 Found 2 suspicious processes.
2024-11-19 11:13:42 [{pid,<12307.59.0>},
                     {registered_name,global_group_check},
                     {current_stacktrace,
                         [{global_group,global_group_check_dispatcher,0,
                              [{file,"global_group.erl"},{line,1419}]}]},
                     {initial_call,{erlang,apply,2}},
                     {message_queue_len,0},
                     {links,[<12307.58.0>]},
                     {monitors,[]},
                     {monitored_by,[]},
                     {heap_size,233}]
2024-11-19 11:13:42 [{pid,<12307.23732.15>},
                     {registered_name,[]},
                     {current_stacktrace,
                         [{gen,do_call,4,[{file,"gen.erl"},{line,256}]},
                          {gen_statem,call_dirty,4,
                              [{file,"gen_statem.erl"},{line,900}]},
                          {ra_server_proc,read_chunks_and_send_rpc,7,
                              [{file,"src/ra_server_proc.erl"},{line,1540}]},
                          {ra_server_proc,send_snapshots,8,
                              [{file,"src/ra_server_proc.erl"},{line,1523}]},
                          {ra_server_proc,'-handle_effect/5-fun-1-',8,
                              [{file,"src/ra_server_proc.erl"},{line,1185}]}]},
                     {initial_call,{erlang,apply,2}},
                     {message_queue_len,0},
                     {links,[]},
                     {monitors,
                         [{process,
                              {'%2F_q-plugin',
                                  'rabbit@rabbitmq01.codfw1dev.wikimediacloud.org'}}]},
                     {monitored_by,[<12307.1549.0>]},
                     {heap_size,2586}]
ok

rabbitmq-server on cloudcontrol2006-dev had severe problems:

aborrero@cloudcontrol2006-dev:~ $ sudo rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().' | wc -l
12599

the cluster is just in split brain mode:

aborrero@cloudcumin2001:~ $ sudo cumin cloudcontrol200[4-6]* 'rabbitmqctl cluster_status | grep -C 4 "Running Nodes"'
3 hosts will be targeted:
cloudcontrol[2004-2006]-dev.codfw.wmnet
OK to proceed on 3 hosts? Enter the number of affected hosts to confirm or "q" to quit: 3
===== NODE GROUP =====                                                                                                                                                                                                                                                                      
(1) cloudcontrol2006-dev.codfw.wmnet                                                                                                                                                                                                                                                        
----- OUTPUT of 'rabbitmqctl clus... "Running Nodes"' -----                                                                                                                                                                                                                                 
Disk Nodes                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                            
rabbit@rabbitmq01.codfw1dev.wikimediacloud.org

Running Nodes

rabbit@rabbitmq01.codfw1dev.wikimediacloud.org

Versions
===== NODE GROUP =====                                                                                                                                                                                                                                                                      
(2) cloudcontrol[2004-2005]-dev.codfw.wmnet                                                                                                                                                                                                                                                 
----- OUTPUT of 'rabbitmqctl clus... "Running Nodes"' -----                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                            
rabbit@rabbitmq02.codfw1dev.wikimediacloud.org                                                                                                                                                                                                                                              
rabbit@rabbitmq03.codfw1dev.wikimediacloud.org

Running Nodes

rabbit@rabbitmq02.codfw1dev.wikimediacloud.org
rabbit@rabbitmq03.codfw1dev.wikimediacloud.org
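For future checks, the comparison above can be done mechanically. This is a minimal sketch, not an existing tool: the `views` dictionary is typed in by hand from the cumin output above, and `find_partitions` is a hypothetical helper name. It groups the hosts by the set of running nodes each one reports; more than one group means the hosts disagree about cluster membership.

```python
from collections import defaultdict

# Running-node views, copied by hand from the cumin output above and
# keyed by the cloudcontrol host that reported them.
views = {
    "cloudcontrol2004-dev": {
        "rabbit@rabbitmq02.codfw1dev.wikimediacloud.org",
        "rabbit@rabbitmq03.codfw1dev.wikimediacloud.org",
    },
    "cloudcontrol2005-dev": {
        "rabbit@rabbitmq02.codfw1dev.wikimediacloud.org",
        "rabbit@rabbitmq03.codfw1dev.wikimediacloud.org",
    },
    "cloudcontrol2006-dev": {
        "rabbit@rabbitmq01.codfw1dev.wikimediacloud.org",
    },
}


def find_partitions(views):
    """Group hosts by the exact set of running nodes they report."""
    groups = defaultdict(list)
    for host, nodes in views.items():
        groups[frozenset(nodes)].append(host)
    # Sort for stable, readable output.
    return sorted(sorted(hosts) for hosts in groups.values())


partitions = find_partitions(views)
if len(partitions) > 1:
    print("hosts disagree about running nodes:", partitions)
```

With the data above this reports two groups: 2004/2005 on one side and 2006 alone on the other.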

Mentioned in SAL (#wikimedia-cloud) [2024-11-19T11:39:15Z] <arturo> [codfw1dev] performing rabbit full reset T380208

Despite the full reset, which corrected the split brain, I still see nova-compute complaining:

AMQP server on rabbitmq01.codfw1dev.wikimediacloud.org:5671 is unreachable: . Trying again in 0 seconds.: amqp.exceptions.MessageNacked

I have checked TCP ports are open:

aborrero@cloudcumin2001:~$ sudo cumin cloudvirt2* 'for i in $(seq 1 3) ; do nc -zv rabbitmq0${i}.codfw1dev.wikimediacloud.org 5671 ; done'
3 hosts will be targeted:
cloudvirt[2004-2006]-dev.codfw.wmnet
OK to proceed on 3 hosts? Enter the number of affected hosts to confirm or "q" to quit: 3
===== NODE GROUP =====                                                                                                                       
(3) cloudvirt[2004-2006]-dev.codfw.wmnet                                                                                                     
----- OUTPUT of 'for i in $(seq 1....org 5671 ; done' -----                                                                                  
Connection to rabbitmq01.codfw1dev.wikimediacloud.org (172.20.5.22) 5671 port [tcp/amqps] succeeded!                                         
Connection to rabbitmq02.codfw1dev.wikimediacloud.org (172.20.5.6) 5671 port [tcp/amqps] succeeded!                                          
Connection to rabbitmq03.codfw1dev.wikimediacloud.org (172.20.5.7) 5671 port [tcp/amqps] succeeded!
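The same reachability check can be scripted without nc. This is a rough sketch using only the Python standard library; `port_open` and `check_brokers` are hypothetical helper names, not part of any WMCS tooling. Like `nc -z`, it only proves that the TCP handshake completes. It says nothing about TLS or about whether the broker accepts published messages, so a success here does not rule out the MessageNacked error above.

```python
import socket


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def check_brokers(hosts, port=5671):
    """Map each broker hostname to its TCP reachability on the given port."""
    return {host: port_open(host, port) for host in hosts}
```

It could be called as `check_brokers(f"rabbitmq0{i}.codfw1dev.wikimediacloud.org" for i in range(1, 4))` from a cloudvirt host.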
Andrew renamed this task from openstack: codfw1dev: rabbitmq is crashing to openstack: codfw1dev: fullstack tests failing.Nov 19 2024, 5:38 PM

I'm not sure how rabbitmq got into this state, but I explicitly had it forget a pool member and reset everything, and it seems to be working properly now.

The current error I'm seeing is with DNS, somehow. The test agent reports a missing PTR record, and indeed I can see that too:

labtestandrew@bastion-codfw1dev-04:~$ dig +short AAAA fullstackd-20241120002121.admin-monitoring.codfw1dev.wikimedia.cloud
2a02:ec80:a100:1::82

labtestandrew@bastion-codfw1dev-04:~$ dig +short -x AAAA 2a02:ec80:a100:1::82
<no response>

and yet, designate can see it:

| 0e432768-3451-4431-8e84-528136cc48c5 |                      | 2.8.0.0.0.0.0.0.0.0.0.0.0.0.0.0.1.0.0. | PTR  | fullstackd-20241120002121.admin-         | ACTIVE | NONE   |
|                                      |                      | 0.0.0.1.a.0.8.c.e.2.0.a.2.ip6.arpa.    |      | monitoring.codfw1dev.wikimedia.cloud.    |        |        |

(ugly wrapping but it's there.)
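Because the record name is wrapped across two table rows, it is easy to misread. As a quick sanity check (a sketch using the standard `ipaddress` module; the two string fragments are copied from the designate table rows above), the expected ip6.arpa name can be derived from the AAAA answer and compared with the name designate holds:

```python
import ipaddress

# Address returned by the AAAA lookup above.
addr = ipaddress.ip_address("2a02:ec80:a100:1::82")

# reverse_pointer yields the ip6.arpa name without the trailing dot.
expected_ptr = addr.reverse_pointer + "."

# The two halves of the wrapped record name from the designate table.
wrapped = (
    "2.8.0.0.0.0.0.0.0.0.0.0.0.0.0.0.1.0.0.",
    "0.0.0.1.a.0.8.c.e.2.0.a.2.ip6.arpa.",
)
designate_name = "".join(wrapped)

print(expected_ptr)
print("matches designate record:", expected_ptr == designate_name)
```

The names match, so the record in designate is the right one for this address; that is consistent with the problem being downstream of designate rather than in the record itself.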

So... maybe something with designate<->pdns ?

there was a problem with the bastion ssh key:

aborrero@cloudcontrol2005-dev:~$ sudo /usr/bin/ssh -o ConnectTimeout=5 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o NumberOfPasswordPrompts=0 -o LogLevel=ERROR -o ProxyCommand="ssh -o StrictHostKeyChecking=no -i /var/lib/osstackcanary/osstackcanary_id -W %h:%p osstackcanary@185.15.57.2" -i /var/lib/osstackcanary/osstackcanary_id osstackcanary@172.16.129.231 /bin/true
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:aHxjyrXw4wygzJKA/ObYOtyam/iNBa0S6B5mOd6kVgo.
Please contact your system administrator.
Add correct host key in /root/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /root/.ssh/known_hosts:3
  remove with:
  ssh-keygen -f "/root/.ssh/known_hosts" -R "185.15.57.2"
Password authentication is disabled to avoid man-in-the-middle attacks.
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
UpdateHostkeys is disabled because the host key is not trusted.
Error: forwarding disabled due to host key check failure
kex_exchange_identification: Connection closed by remote host

another problem, the puppetmaster is not accepting the SSH connection from designate wmfsink for cleanup:

aborrero@cloudcontrol2005-dev:~ $ sudo /usr/bin/ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -lcertmanager 185.15.57.4 sudo /usr/bin/puppetserver ca clean --certname fullstackd-20241120092500.admin-monitoring.codfw1dev.wikimedia.cloud
Warning: Permanently added '185.15.57.4' (ED25519) to the list of known hosts.
certmanager@185.15.57.4: Permission denied (publickey).

server side logs:

2024-11-20T11:52:00.958571+00:00 cloudinfra-cloudvps-puppetserver-1 sshd[3821940]: Connection from 172.20.5.7 port 54050 on 172.16.128.65 port 22 rdomain ""
2024-11-20T11:52:01.041589+00:00 cloudinfra-cloudvps-puppetserver-1 sshd[3821940]: Invalid user certmanager from 172.20.5.7 port 54050
2024-11-20T11:52:01.051657+00:00 cloudinfra-cloudvps-puppetserver-1 sshd[3821940]: Connection closed by invalid user certmanager 172.20.5.7 port 54050 [preauth]

There could be problems with the puppet role.

  • In eqiad1, project cloudinfra, puppet prefix cloudinfra-cloudvps-puppetserver has role role::puppetserver::cloud_vps_global, which includes profile::openstack::base::puppetserver::cert_cleaning (which includes the certmanager user)
  • In codfw1dev, project cloudinfra-codfw1dev, puppet prefix cloudinfra-cloudvps-puppetserver has role role::puppetserver::cloud_vps_project which does not include the above cert_cleaning profile, thus it doesn't have the certmanager user.

Corrected discrepancy in the puppet role by using role::puppetserver::cloud_vps_global on the codfw1dev prefix.

The puppetserver still rejects the ssh connection from wmfsink.

My theory is that wmfsink failing to do this cleanup is somehow preventing the whole designate-sink pipeline from succeeding (and thus from creating PTR records), which is the actual failure in the nova-fullstack process.

This works now after the puppet role change; I was looking at stale logs.

the VMs get the right hostname, and can run puppet correctly:

root@fullstackd-20241120124330:~# run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for fullstackd-20241120124330.admin-monitoring.codfw1dev.wikimedia.cloud
Info: Applying configuration version '(5a5516b137) Gerrit Code Review - Revert "haproxykafka: working on TLS client authentication to kafka"'
Notice: Applied catalog in 4.13 seconds

I have observed the following:

image.png (2,551×318 px, 109 KB)

  • in the next fullstack loop, with a new VM, our PTR designate-sink code won't be called again, and verify_dns_reverse will fail

I noticed the other plugin wmf_sink may be trying to remove the same DNS records, so there could be conflicts?

I have the suspicion that only cloudcontrol2004-dev is executing the designate sink plugin code.

Per the logs in logstash this is confirmed to be true.

The reasons could be:

  • rabbitmq misbehaving, so notifications are only correctly arriving at one node
  • designate misconfiguration, so somehow we are only telling the system to use cloudcontrol2004-dev to process events

Change #1093867 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: designate: fix installation path for designate-sink plugin

https://gerrit.wikimedia.org/r/1093867

Change #1093867 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: designate: fix installation path for designate-sink plugin

https://gerrit.wikimedia.org/r/1093867

nova-fullstack works normally now.