
Curator fails to complete regularly
Open, Medium, Public, BUG REPORT

Description

Logs since late April:

logstash2026

$ sudo journalctl -e -n 100000 -u curator_actions_cluster_wide | grep 'Failed to complete action'
Apr 29 00:42:53 logstash2026 curator[1203410]: 2024-04-29 00:42:53,671 ERROR     Failed to complete action: delete_indices.  <class 'KeyError'>: 'store'
Apr 29 14:29:34 logstash2026 curator[1322941]: 2024-04-29 14:29:34,263 ERROR     Failed to complete action: forcemerge.  <class 'KeyError'>: 'store'
May 01 00:42:34 logstash2026 curator[1631611]: 2024-05-01 00:42:34,934 ERROR     Failed to complete action: delete_indices.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: NotFoundError(404, 'index_not_found_exception', 'no such index [logstash-default-1-7.0.0-1-2024.01.31]', logstash-default-1-7.0.0-1-2024.01.31, index_or_alias)
May 02 17:03:36 logstash2026 curator[1840558]: 2024-05-02 17:03:36,497 ERROR     Failed to complete action: forcemerge.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeout(HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out. (read timeout=21600))
May 03 00:43:40 logstash2026 curator[2044019]: 2024-05-03 00:43:40,457 ERROR     Failed to complete action: replicas.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeout(HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out. (read timeout=30))
May 04 00:43:35 logstash2026 curator[2244850]: 2024-05-04 00:43:35,224 ERROR     Failed to complete action: replicas.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeout(HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out. (read timeout=30))
May 09 08:59:56 logstash2026 curator[3378863]: 2024-05-09 08:59:56,177 ERROR     Failed to complete action: forcemerge.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeout(HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out. (read timeout=21600))
May 15 00:44:04 logstash2026 curator[462616]: 2024-05-15 00:44:04,782 ERROR     Failed to complete action: forcemerge.  <class 'KeyError'>: 'store'
May 17 00:42:49 logstash2026 curator[921785]: 2024-05-17 00:42:49,442 ERROR     Failed to complete action: delete_indices.  <class 'KeyError'>: 'store'
May 17 14:09:23 logstash2026 curator[1052851]: 2024-05-17 14:09:23,551 ERROR     Failed to complete action: replicas.  <class 'KeyError'>: 'store'

logstash1026

$ sudo journalctl -e -n 100000 -u curator_actions_cluster_wide | grep 'Failed to complete action'
May 02 12:47:43 logstash1026 curator[2788572]: 2024-05-02 12:47:43,213 ERROR     Failed to complete action: forcemerge.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeout(HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out. (read timeout=21600))
May 09 12:18:08 logstash1026 curator[88855]: 2024-05-09 12:18:08,629 ERROR     Failed to complete action: forcemerge.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeout(HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out. (read timeout=21600))

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-05-04T03:06:11Z] <denisse> Enable log level DEBUG for curator on logstash2026 - T364190

Mentioned in SAL (#wikimedia-operations) [2024-05-04T03:07:02Z] <denisse> Restarting curator_actions_cluster_wide.service to log with DEBUG level on logstash2026 - T364190

colewhite renamed this task from Curator Failed to complete action: replicas to Curator fails to complete regularly. Thu, May 9, 4:30 PM
colewhite triaged this task as Medium priority.
colewhite updated the task description.

Curator is failing for a number of reasons, but most failures are timeouts. These are usually caused by delays from other cluster operations running at the same time, though some (e.g. forcemerge) may be due to overly large indexes.
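For reference, a quick way to check whether other cluster operations are in flight before or during a run is the cluster health API. This is only a sketch; it assumes the same local client endpoint the errors above reference:

$ curl -s 'http://localhost:9200/_cluster/health?pretty' | grep -E 'relocating_shards|initializing_shards|number_of_pending_tasks'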

One possibility for addressing forcemerge timeouts is to identify the large indexes and try to split them into smaller ones; forcemerge happens only once, early in the index lifecycle.
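As a starting point, something like the following should list the largest indexes by on-disk size. The endpoint and column selection are assumptions for illustration, not what our tooling currently runs:

$ curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc' | head -n 15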


I wonder if it would be worthwhile to split the main curator run into smaller units of work. A single command can keep running long after the request was accepted and the cluster operation has begun (a rough sketch of separate invocations follows the list below).

The job split could look like:

  • delete indices
  • set replicas
  • disktype reassignment
  • forcemerge
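
A minimal sketch of what that split could look like, assuming one action file per unit; the config and action-file paths below are hypothetical placeholders, not what is deployed on the logstash hosts:

$ curator --config /etc/curator/config.yaml /etc/curator/delete_indices.yaml
$ curator --config /etc/curator/config.yaml /etc/curator/set_replicas.yaml
$ curator --config /etc/curator/config.yaml /etc/curator/disktype_reassignment.yaml
$ curator --config /etc/curator/config.yaml /etc/curator/forcemerge.yaml

Each invocation could then get its own systemd timer, schedule, and (where useful) allocation-lock wrapper.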

By splitting into separate units, we can:

  • schedule them at different times
  • apply allocation locks to ones that benefit from them (e.g. delete indices and set replicas)
  • wait for cluster operations to finish before beginning the run

The biggest concern is taking an allocation lock and failing to release it. If we choose this route, we'll want to make sure the release is robust (a minimal sketch follows).
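
A minimal sketch of how that could be guarded, assuming the "allocation lock" means toggling cluster.routing.allocation.enable and that the local client endpoint is the one shown in the errors above:

#!/bin/bash
# Sketch only: wrap allocation-sensitive curator actions so the lock is always released.
set -euo pipefail

ES='http://localhost:9200'

release_allocation_lock() {
  # Clearing the setting restores the default ("all").
  curl -s -XPUT "$ES/_cluster/settings" -H 'Content-Type: application/json' \
    -d '{"persistent": {"cluster.routing.allocation.enable": null}}'
}
trap release_allocation_lock EXIT

# Wait (up to 30m) for in-flight relocations to finish and the cluster to go green.
curl -s "$ES/_cluster/health?wait_for_status=green&wait_for_no_relocating_shards=true&timeout=30m"

# Take the lock: restrict allocation to primaries for the duration of the run.
curl -s -XPUT "$ES/_cluster/settings" -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": "primaries"}}'

# ... run the curator actions that benefit from the lock here ...

The trap on EXIT keeps the setting from being left in place if the run dies partway through; a real implementation would also need to handle the case where the health wait itself times out.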