ElasticSearch Curator forbidden to set replica count on apifeatureusage indices
Closed, Resolved | Public | 3 Estimated Story Points

Description

After upgrading elasticsearch-curator to a version capable of managing the search cluster, the curator management job was forbidden from completing:

cwhite@apifeatureusage1001:~$ sudo /usr/bin/curator --config /etc/curator/production-search-codfw.yaml /etc/curator/apifeatureusage_codfw_actions.yaml
2022-01-25 17:02:13,707 INFO      Preparing Action ID: 01, "delete_indices"
2022-01-25 17:02:13,707 INFO      Creating client object and testing connection
2022-01-25 17:02:13,711 INFO      Instantiating client object
2022-01-25 17:02:13,712 INFO      Testing client connectivity
2022-01-25 17:02:13,784 INFO      Successfully created Elasticsearch client object with provided settings
2022-01-25 17:02:13,820 INFO      Trying Action ID: 01, "delete_indices": apifeatureusage: delete older than 91 days
2022-01-25 17:02:18,672 INFO      Skipping action "delete_indices" due to empty list: <class 'curator.exceptions.NoIndices'>
2022-01-25 17:02:18,676 INFO      Action ID: 01, "delete_indices" completed.
2022-01-25 17:02:18,680 INFO      Preparing Action ID: 02, "replicas"
2022-01-25 17:02:18,680 INFO      Creating client object and testing connection
2022-01-25 17:02:18,680 INFO      Instantiating client object
2022-01-25 17:02:18,681 INFO      Testing client connectivity
2022-01-25 17:02:18,750 INFO      Successfully created Elasticsearch client object with provided settings
2022-01-25 17:02:18,786 INFO      Trying Action ID: 02, "replicas": apifeatureusage: set replicas to 1 after 31 days
2022-01-25 17:02:23,736 INFO      Setting the replica count to 1 for 60 indices: ['apifeatureusage-2021.11.29', 'apifeatureusage-2021.12.05', 'apifeatureusage-2021.10.29', 'apifeatureusage-2021.11.26', 'apifeatureusage-2021.11.13', 'apifeatureusage-2021.10.31', 'apifeatureusage-2021.11.22', 'apifeatureusage-2021.12.07', 'apifeatureusage-2021.11.20', 'apifeatureusage-2021.11.02', 'apifeatureusage-2021.12.13', 'apifeatureusage-2021.12.04', 'apifeatureusage-2021.11.09', 'apifeatureusage-2021.11.06', 'apifeatureusage-2021.12.14', 'apifeatureusage-2021.12.24', 'apifeatureusage-2021.11.15', 'apifeatureusage-2021.10.27', 'apifeatureusage-2021.12.25', 'apifeatureusage-2021.11.10', 'apifeatureusage-2021.12.22', 'apifeatureusage-2021.12.01', 'apifeatureusage-2021.11.18', 'apifeatureusage-2021.11.03', 'apifeatureusage-2021.12.02', 'apifeatureusage-2021.11.19', 'apifeatureusage-2021.11.17', 'apifeatureusage-2021.10.28', 'apifeatureusage-2021.10.30', 'apifeatureusage-2021.12.16', 'apifeatureusage-2021.12.11', 'apifeatureusage-2021.12.06', 'apifeatureusage-2021.12.18', 'apifeatureusage-2021.12.17', 'apifeatureusage-2021.12.09', 'apifeatureusage-2021.12.15', 'apifeatureusage-2021.12.08', 'apifeatureusage-2021.12.19', 'apifeatureusage-2021.11.16', 'apifeatureusage-2021.11.08', 'apifeatureusage-2021.11.14', 'apifeatureusage-2021.11.12', 'apifeatureusage-2021.11.28', 'apifeatureusage-2021.11.30', 'apifeatureusage-2021.12.20', 'apifeatureusage-2021.11.23', 'apifeatureusage-2021.11.24', 'apifeatureusage-2021.12.10', 'apifeatureusage-2021.11.25', 'apifeatureusage-2021.11.04', 'apifeatureusage-2021.11.01', 'apifeatureusage-2021.12.23', 'apifeatureusage-2021.12.12', 'apifeatureusage-2021.11.05', 'apifeatureusage-2021.12.03', 'apifeatureusage-2021.12.21', 'apifeatureusage-2021.11.27', 'apifeatureusage-2021.11.21', 'apifeatureusage-2021.11.11', 'apifeatureusage-2021.11.07']
2022-01-25 17:02:23,773 ERROR     Failed to complete action: replicas.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: AuthorizationException(403, 'cluster_block_exception', 'blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];')
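
The block named in the exception above can be pulled out of the message programmatically, for example to alert on it specifically. The following is a minimal sketch, not part of curator: the helper name `parse_cluster_block` and the regular expression are illustrative assumptions. In Elasticsearch, block ID 12 with this description is, as far as I know, the one applied through the `index.blocks.read_only_allow_delete` index setting.

```python
import re

# Matches the block description Elasticsearch embeds in the error text, e.g.
# "blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"
BLOCK_RE = re.compile(
    r"blocked by: \[(?P<level>[A-Z_]+)/(?P<id>\d+)/(?P<description>[^\]]+)\]"
)


def parse_cluster_block(message):
    """Return (level, block_id, description), or None if no block is mentioned.

    Hypothetical helper for illustration; curator does not provide it.
    """
    match = BLOCK_RE.search(message)
    if match is None:
        return None
    return match.group("level"), int(match.group("id")), match.group("description")


# Exception text copied from the curator log above.
error = (
    "AuthorizationException(403, 'cluster_block_exception', "
    "'blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];')"
)
block = parse_cluster_block(error)
print(block)
```

A `None` result means the message did not mention a cluster block at all, which keeps the check safe to run over ordinary log lines.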

Event Timeline

@colewhite ACK, checking with my team on what needs to be done here.

Gehel triaged this task as High priority. Jan 31 2022, 4:12 PM

Hi folks, I have acked the alert in Icinga for apifeatureusage1001 :)

EBernhardson subscribed.

In T300944 the codfw cluster entered a low-disk situation (not available ~= low-disk). Unknown to us, in that case Elasticsearch set many (all?) indices that were on the failing node to a special read-only state. Last Thursday, Feb 10, we identified this and put all the indices back into their expected state. The alerts in Icinga are no longer firing, so there should be nothing more to do here.
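
The comment above does not say how the indices were put back into their expected state. One documented way to clear this kind of block is to reset the `index.blocks.read_only_allow_delete` index setting to `null` through the update-settings API. The sketch below only builds such a request and does not send it. The helper name `build_unblock_request`, the `http://localhost:9200` endpoint, and the `apifeatureusage-*` pattern are assumptions for illustration, not details taken from this task.

```python
import json
import urllib.request


def build_unblock_request(base_url, index_pattern):
    """Build (but do not send) a PUT request that resets the read-only block.

    Hypothetical helper: setting the value to null restores the default,
    which removes the block from every index matching the pattern.
    """
    body = json.dumps({"index.blocks.read_only_allow_delete": None}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_pattern}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumed endpoint and index pattern; the real cluster address is not given here.
request = build_unblock_request("http://localhost:9200", "apifeatureusage-*")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request (for example with `urllib.request.urlopen(request)`) is left out on purpose, since it changes cluster state and only makes sense once disk space has actually been recovered.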

herron claimed this task.
herron subscribed.

Looks much better now, resolving!

Feb 17 00:46:13 apifeatureusage1001 curator[7845]: 2022-02-17 00:46:13,938 INFO      Job completed.