Multiple unassigned shards on the Elasticsearch relforge cluster
Closed, Resolved, Public

Description

Icinga reports multiple unassigned shards in the relforge cluster (see below)

AC:

  • relforge cluster is in a green state, with all shards assigned

Icinga messages:

  • CRITICAL - elasticsearch inactive shards 159 threshold >=0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 159, active_shards: 159, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 159, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0
  • CRITICAL - elasticsearch inactive shards 5 threshold >=0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 5, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0
  • CRITICAL - queries_01022021[1](2022-12-08T19:59:50.738Z), queries_01022021[3](2022-12-08T19:59:50.738Z), queries_01022021[4](2022-12-08T19:59:50.738Z), queries_01022021[2](2022-12-08T19:59:50.738Z), queries_01022021[0](2022-12-08T19:59:50.738Z), ebernhardson_test[0](2022-12-08T19:59:50.734Z), queries_24012021[1](2022-12-08T19:59:50.732Z), queries_24012021[3](2022-12-08T19:59:50.732Z), queries_24012021[4](2022-12-08T19:59:50.732Z), queries_24012021[2](2022-12-08T19:59:50.732Z), queries_24012021[0](2022-12-08T19:59:50.732Z), joined_queries-202201[1](2022-12-08T19:59:50.734Z), joined_queries-202201[2](2022-12-08T19:59:50.734Z), joined_queries-202201[0](2022-12-08T19:59:50.734Z), joined_queries-202212[1](2022-12-08T19:59:50.733Z), joined_queries-202212[2](2022-12-08T19:59:50.733Z), joined_queries-202212[0](2022-12-08T19:59:50.733Z), .ltrstore[0](2022-12-08T20:09:47.858Z), .kibana_1[0](2022-12-08T20:09:47.858Z), joined_queries-202204[1](2022-12-08T19:59:50.737Z), joined_queries-202204[2](2022-12-08T19:59:50.737Z)
  • CRITICAL - mw_cirrus_metastore[4](2022-12-08T19:59:50.729Z), mw_cirrus_metastore[3](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[2](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[1](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[0](2022-12-08T19:59:50.731Z)
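The first two alerts flatten the cluster health document into `key: value` pairs. As a rough sketch of why they breach the 0.15 threshold, the snippet below parses one of those messages and computes the fraction of inactive shards. The function names are hypothetical, and the formula (unassigned plus initializing shards divided by all shards) is an assumption about how the check derives its value, not the check's actual code.

```python
# Sample taken from the relforge-eqiad-small-alpha alert above.
SAMPLE = (
    "CRITICAL - elasticsearch inactive shards 5 threshold >=0.15 breach: "
    "cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, "
    "number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, "
    "active_shards: 5, relocating_shards: 0, initializing_shards: 0, "
    "unassigned_shards: 5, delayed_unassigned_shards: 0, "
    "number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, "
    "task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0"
)


def parse_health(message: str) -> dict:
    """Parse the 'key: value' pairs that follow 'breach: ' in an alert."""
    _, _, payload = message.partition("breach: ")
    health = {}
    for pair in payload.split(", "):
        key, _, value = pair.partition(": ")
        health[key.strip()] = value.strip()
    return health


def inactive_fraction(health: dict) -> float:
    """Assumed formula: (unassigned + initializing) / all shards."""
    active = int(health["active_shards"])
    inactive = int(health["unassigned_shards"]) + int(health["initializing_shards"])
    total = active + inactive
    return inactive / total if total else 0.0


health = parse_health(SAMPLE)
print(health["cluster_name"], inactive_fraction(health))
```

With two data nodes and every replica unassigned, half of all shards are inactive, which matches the reported `active_shards_percent_as_number: 50.0` and the `yellow` status (all primaries active, replicas missing).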

Event Timeline

bking claimed this task.
bking moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

Ryan or I should have caught this after the relforge rolling operation we did on Thursday; my apologies.

In the meantime, I've addressed it by removing the transient cluster allocation settings as described here.
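For readers unfamiliar with this kind of fix, here is a minimal sketch of how a transient setting can be cleared through the Elasticsearch cluster settings API: a `PUT` to `_cluster/settings` with the setting's value set to `null` removes the transient override. The comment does not say which settings were removed, so the setting name `cluster.routing.allocation.enable`, the base URL, and the helper names below are all assumptions made for illustration.

```python
import json
import urllib.request


def build_reset_body(setting_names):
    """Map each transient setting to None (JSON null), which clears it."""
    return {"transient": {name: None for name in setting_names}}


def build_reset_request(base_url, setting_names):
    """Build (but do not send) the PUT request against _cluster/settings."""
    body = json.dumps(build_reset_body(setting_names)).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumed values: neither the host nor the setting name comes from this task.
request = build_reset_request(
    "http://localhost:9200",
    ["cluster.routing.allocation.enable"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it. Once the override is gone, replicas should start allocating again, and `_cluster/health` can be polled until the status returns to green.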

I've also created T324982 to address this gap in our rolling operation cookbook.