
Curator failed to delete indices in codfw
Closed, Resolved · Public

Description

I noticed curator failed on logstash2026 last night, apparently due to a race between the DELETE and the follow-up attempt to find the indices again and retry the deletion:

Oct 10 00:42:07 logstash2026 curator[3694024]: 2023-10-10 00:42:07,175 INFO      Deleting 6 selected indices: ['logstash-webrequest-1-7.0.0-1-2023.07.11', 'logstash-deploy-1-7.0.0-1-2023.07.11', 'logstash-syslog-1-7.0.0-1-2023.07.11', 'logstash-mediawiki-1-7.0.0-1-2023.07.11', 'logstash-k8s-1-7.0.0-1-2023.07.11', 'logstash-default-1-7.0.0-1-2023.07.11']
Oct 10 00:42:07 logstash2026 curator[3694024]: 2023-10-10 00:42:07,175 INFO      ---deleting index logstash-webrequest-1-7.0.0-1-2023.07.11
Oct 10 00:42:07 logstash2026 curator[3694024]: 2023-10-10 00:42:07,176 INFO      ---deleting index logstash-deploy-1-7.0.0-1-2023.07.11
Oct 10 00:42:07 logstash2026 curator[3694024]: 2023-10-10 00:42:07,176 INFO      ---deleting index logstash-syslog-1-7.0.0-1-2023.07.11
Oct 10 00:42:07 logstash2026 curator[3694024]: 2023-10-10 00:42:07,176 INFO      ---deleting index logstash-mediawiki-1-7.0.0-1-2023.07.11
Oct 10 00:42:07 logstash2026 curator[3694024]: 2023-10-10 00:42:07,176 INFO      ---deleting index logstash-k8s-1-7.0.0-1-2023.07.11
Oct 10 00:42:07 logstash2026 curator[3694024]: 2023-10-10 00:42:07,176 INFO      ---deleting index logstash-default-1-7.0.0-1-2023.07.11
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,079 INFO      DELETE http://127.0.0.1:9200/logstash-default-1-7.0.0-1-2023.07.11,logstash-deploy-1-7.0.0-1-2023.07.11,logstash-k8s-1-7.0.0-1-2023.07.11,logstash-mediawiki-1-7.0.0-1-2023.07.11,logstash-syslog-1-7.0.0-1-2023.07.11,logstash-webrequest-1-7.0.0-1-2023.07.11?master_timeout=120s [status:200 request:29.903s]
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,181 INFO      GET http://127.0.0.1:9200/_all/_settings?expand_wildcards=open%2Cclosed [status:200 request:0.100s]
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,200 INFO      GET http://127.0.0.1:9200/ [status:200 request:0.003s]
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,303 INFO      GET http://127.0.0.1:9200/_all/_settings?expand_wildcards=open%2Cclosed [status:200 request:0.102s]
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,316 INFO      GET http://127.0.0.1:9200/ [status:200 request:0.002s]
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,416 INFO      GET http://127.0.0.1:9200/_all/_settings?expand_wildcards=open%2Cclosed [status:200 request:0.100s]
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,440 INFO      GET http://127.0.0.1:9200/ [status:200 request:0.002s]
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,539 INFO      GET http://127.0.0.1:9200/_all/_settings?expand_wildcards=open%2Cclosed [status:200 request:0.098s]
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,556 INFO      GET http://127.0.0.1:9200/ [status:200 request:0.003s]
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,652 INFO      GET http://127.0.0.1:9200/_all/_settings?expand_wildcards=open%2Cclosed [status:200 request:0.095s]
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,664 INFO      GET http://127.0.0.1:9200/ [status:200 request:0.002s]
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,755 INFO      GET http://127.0.0.1:9200/_all/_settings?expand_wildcards=open%2Cclosed [status:200 request:0.090s]
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,764 INFO      GET http://127.0.0.1:9200/ [status:200 request:0.001s]
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,764 ERROR     The following indices failed to delete on try #1:
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,764 ERROR     ---logstash-webrequest-1-7.0.0-1-2023.07.11
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,764 ERROR     ---logstash-deploy-1-7.0.0-1-2023.07.11
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,764 INFO      ---deleting index logstash-webrequest-1-7.0.0-1-2023.07.11
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,764 INFO      ---deleting index logstash-deploy-1-7.0.0-1-2023.07.11
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,768 WARNING   DELETE http://127.0.0.1:9200/logstash-deploy-1-7.0.0-1-2023.07.11,logstash-webrequest-1-7.0.0-1-2023.07.11?master_timeout=120s [status:404 request:0.004s]
Oct 10 00:42:37 logstash2026 curator[3694024]: 2023-10-10 00:42:37,768 ERROR     Failed to complete action: delete_indices.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: NotFoundError(404, 'index_not_found_exception', 'no such index [logstash-deploy-1-7.0.0-1-2023.07.11]', logstash-deploy-1-7.0.0-1-2023.07.11, index_or_alias)
Oct 10 00:42:37 logstash2026 systemd[1]: curator_actions_cluster_wide.service: Main process exited, code=exited, status=1/FAILURE
Oct 10 00:42:37 logstash2026 systemd[1]: curator_actions_cluster_wide.service: Failed with result 'exit-code'.
Oct 10 00:42:37 logstash2026 systemd[1]: Failed to start OpenSearch Curator action cluster_wide.
Oct 10 00:42:37 logstash2026 systemd[1]: curator_actions_cluster_wide.service: Consumed 1.542s CPU time.

Filing this for tracking and to see if it comes back; I've kicked off curator manually for now. cc @colewhite

Event Timeline

Same story today

Oct 11 00:42:04 logstash2026 curator[3937314]: 2023-10-11 00:42:04,129 INFO      GET http://127.0.0.1:9200/logstash-default-1-7.0.0-1-2023.07.12,logstash-deploy-1-7.0.0-1-2023.07.12,logstash-k8s-1-7.0.0-1-2023.07.12,logstash-mediawiki-1-7.0.0-1-2023.07.12,logstash-syslog-1-7.0.0-1-2023.07.12,logstash-webrequest-1-7.0.0-1-2023.07.12/_settings [status:200 request:0.004s]
Oct 11 00:42:04 logstash2026 curator[3937314]: 2023-10-11 00:42:04,129 INFO      Deleting 6 selected indices: ['logstash-k8s-1-7.0.0-1-2023.07.12', 'logstash-deploy-1-7.0.0-1-2023.07.12', 'logstash-webrequest-1-7.0.0-1-2023.07.12', 'logstash-syslog-1-7.0.0-1-2023.07.12', 'logstash-default-1-7.0.0-1-2023.07.12', 'logstash-mediawiki-1-7.0.0-1-2023.07.12']
Oct 11 00:42:04 logstash2026 curator[3937314]: 2023-10-11 00:42:04,129 INFO      ---deleting index logstash-k8s-1-7.0.0-1-2023.07.12
Oct 11 00:42:04 logstash2026 curator[3937314]: 2023-10-11 00:42:04,129 INFO      ---deleting index logstash-deploy-1-7.0.0-1-2023.07.12
Oct 11 00:42:04 logstash2026 curator[3937314]: 2023-10-11 00:42:04,129 INFO      ---deleting index logstash-webrequest-1-7.0.0-1-2023.07.12
Oct 11 00:42:04 logstash2026 curator[3937314]: 2023-10-11 00:42:04,129 INFO      ---deleting index logstash-syslog-1-7.0.0-1-2023.07.12
Oct 11 00:42:04 logstash2026 curator[3937314]: 2023-10-11 00:42:04,129 INFO      ---deleting index logstash-default-1-7.0.0-1-2023.07.12
Oct 11 00:42:04 logstash2026 curator[3937314]: 2023-10-11 00:42:04,129 INFO      ---deleting index logstash-mediawiki-1-7.0.0-1-2023.07.12
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,050 INFO      DELETE http://127.0.0.1:9200/logstash-default-1-7.0.0-1-2023.07.12,logstash-deploy-1-7.0.0-1-2023.07.12,logstash-k8s-1-7.0.0-1-2023.07.12,logstash-mediawiki-1-7.0.0-1-2023.07.12,logstash-syslog-1-7.0.0-1-2023.07.12,logstash-webrequest-1-7.0.0-1-2023.07.12?master_timeout=120s [status:200 request:29.921s]
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,155 INFO      GET http://127.0.0.1:9200/_all/_settings?expand_wildcards=open%2Cclosed [status:200 request:0.104s]
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,174 INFO      GET http://127.0.0.1:9200/ [status:200 request:0.003s]
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,263 INFO      GET http://127.0.0.1:9200/_all/_settings?expand_wildcards=open%2Cclosed [status:200 request:0.089s]
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,273 INFO      GET http://127.0.0.1:9200/ [status:200 request:0.002s]
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,368 INFO      GET http://127.0.0.1:9200/_all/_settings?expand_wildcards=open%2Cclosed [status:200 request:0.094s]
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,399 INFO      GET http://127.0.0.1:9200/ [status:200 request:0.002s]
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,507 INFO      GET http://127.0.0.1:9200/_all/_settings?expand_wildcards=open%2Cclosed [status:200 request:0.107s]
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,524 INFO      GET http://127.0.0.1:9200/ [status:200 request:0.003s]
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,619 INFO      GET http://127.0.0.1:9200/_all/_settings?expand_wildcards=open%2Cclosed [status:200 request:0.094s]
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,631 INFO      GET http://127.0.0.1:9200/ [status:200 request:0.002s]
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,724 INFO      GET http://127.0.0.1:9200/_all/_settings?expand_wildcards=open%2Cclosed [status:200 request:0.092s]
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,736 INFO      GET http://127.0.0.1:9200/ [status:200 request:0.003s]
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,737 ERROR     The following indices failed to delete on try #1:
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,737 ERROR     ---logstash-k8s-1-7.0.0-1-2023.07.12
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,737 ERROR     ---logstash-deploy-1-7.0.0-1-2023.07.12
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,737 INFO      ---deleting index logstash-k8s-1-7.0.0-1-2023.07.12
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,737 INFO      ---deleting index logstash-deploy-1-7.0.0-1-2023.07.12
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,744 WARNING   DELETE http://127.0.0.1:9200/logstash-deploy-1-7.0.0-1-2023.07.12,logstash-k8s-1-7.0.0-1-2023.07.12?master_timeout=120s [status:404 request:0.006s]
Oct 11 00:42:34 logstash2026 curator[3937314]: 2023-10-11 00:42:34,744 ERROR     Failed to complete action: delete_indices.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: NotFoundError(404, 'index_not_found_exception', 'no such index [logstash-deploy-1-7.0.0-1-2023.07.12]', logstash-deploy-1-7.0.0-1-2023.07.12, index_or_alias)
Oct 11 00:42:34 logstash2026 systemd[1]: curator_actions_cluster_wide.service: Main process exited, code=exited, status=1/FAILURE
Oct 11 00:42:34 logstash2026 systemd[1]: curator_actions_cluster_wide.service: Failed with result 'exit-code'.
Oct 11 00:42:34 logstash2026 systemd[1]: Failed to start OpenSearch Curator action cluster_wide.
Oct 11 00:42:34 logstash2026 systemd[1]: curator_actions_cluster_wide.service: Consumed 1.678s CPU time.

How long the operation took is suspiciously close to 30s, which is the default timeout parameter value; maybe we should bump that in addition to (or instead of) master_timeout? (upstream docs)
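
For reference, the delete-index API accepts both knobs as query parameters. A minimal sketch of a call carrying both, using plain requests rather than curator (index name and values are illustrative):

import requests

# Illustrative only: 'timeout' bounds the cluster-state update itself
# (default 30s), while 'master_timeout' bounds waiting for a connection
# to the master node. Both are query parameters on the delete-index API.
resp = requests.delete(
    "http://127.0.0.1:9200/logstash-deploy-1-7.0.0-1-2023.07.11",
    params={"timeout": "120s", "master_timeout": "120s"},
)
print(resp.status_code, resp.json())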

colewhite changed the task status from Open to In Progress. Oct 11 2023, 5:18 PM
colewhite claimed this task.

How long the operation took is suspiciously close to 30s, which is the default timeout parameter value; maybe we should bump that in addition to (or instead of) master_timeout? (upstream docs)

Possibly. The behavior is as if OpenSearch responds to the delete request prior to actually purging the entries from internal state. I'd expect this action to occur atomically. I've prepared a debug invocation to inspect the requests to and responses from OpenSearch.

When the delete request comes back (actions.py:647-649), curator immediately asks for the list of indices again and checks that the deleted indices are no longer in the response. If any still exist, the working list is overwritten with those indices and another delete request is issued. This can retry up to 3 times (hardcoded).
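
A rough sketch of that verify-and-retry flow, paraphrased from the description above and the log output (this is not curator's actual actions.py code; client is assumed to be an opensearch-py/elasticsearch-py style client):

def delete_with_verification(client, indices, max_tries=3):
    # Paraphrase of curator's delete_indices retry behaviour, not its real code.
    working = list(indices)
    for _ in range(max_tries):
        # Bulk DELETE of the remaining indices.
        client.indices.delete(index=",".join(working), master_timeout="120s")
        # Immediately re-list all indices and check which of them survived.
        existing = client.indices.get_settings(index="_all", expand_wildcards="open,closed")
        survivors = [i for i in working if i in existing]
        if not survivors:
            return
        # Overwrite the working list with the survivors and go around again;
        # if the cluster state catches up in between, the retry DELETE hits
        # a 404, which is the race visible in the logs above.
        working = survivors
    raise RuntimeError(f"indices failed to delete after {max_tries} tries: {working}")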

It would be ideal if this issue could be addressed in configuration rather than by modifying how curator determines that the operation was successful.

colewhite triaged this task as Medium priority. Oct 11 2023, 5:19 PM

OpenSearch replies with HTTP 200 {"acknowledged":false}, indicating the operation hasn't failed but has hit the 30s "explicit operation timeout". This is different from master_timeout (which curator is providing), which specifies the "timeout for connection to master".

{
  "type": "server",
  "timestamp": "2023-10-12T00:03:43,967Z",
  "level": "WARN",
  "component": "o.o.c.c.C.CoordinatorPublication",
  "cluster.name": "production-elk7-codfw",
  "node.name": "logstash2034-production-elk7-codfw",
  "message": "after [30s] publication of cluster state version [709747] is still waiting for {logstash2001-production-elk7-codfw}{b34abAmfSXW3YnXDTj7aIA}{ULO0iAwvTB-2FA-NANmusA}{10.192.0.112}{10.192.0.112:9300}{dimr}{hostname=logstash2001, shard_indexing_pressure_enabled=true, disktype=hdd, fqdn=logstash2001.codfw.wmnet} [SENT_APPLY_COMMIT], {logstash2003-production-elk7-codfw}{r-edGeI6RNWPR4GoEhWrKQ}{2onquNZ2T1qcg-SqdAaOsw}{10.192.48.131}{10.192.48.131:9300}{dimr}{hostname=logstash2003, shard_indexing_pressure_enabled=true, disktype=hdd, fqdn=logstash2003.codfw.wmnet} [SENT_APPLY_COMMIT]",
  "cluster.uuid": "NhSB5TLnT5yi3WDf4K6Qgg",
  "node.id": "BSIDfmU2Qk2qzRwFXpJwgg"
}
{
  "type": "server",
  "timestamp": "2023-10-12T00:03:45,394Z",
  "level": "WARN",
  "component": "o.o.c.s.ClusterApplierService",
  "cluster.name": "production-elk7-codfw",
  "node.name": "logstash2001-production-elk7-codfw",
  "message": "cluster state applier task [ApplyCommitRequest{term=344, version=709747, sourceNode={logstash2034-production-elk7-codfw}{BSIDfmU2Qk2qzRwFXpJwgg}{Yw5rcpqGRyKgBYoe6jknjw}{10.192.16.30}{10.192.16.30:9300}{dimr}{hostname=logstash2034, shard_indexing_pressure_enabled=true, disktype=ssd, fqdn=logstash2034.codfw.wmnet}}] took [31s] which is above the warn threshold of [30s]: [running task [ApplyCommitRequest{term=344, version=709747, sourceNode={logstash2034-production-elk7-codfw}{BSIDfmU2Qk2qzRwFXpJwgg}{Yw5rcpqGRyKgBYoe6jknjw}{10.192.16.30}{10.192.16.30:9300}{dimr}{hostname=logstash2034, shard_indexing_pressure_enabled=true, disktype=ssd, fqdn=logstash2034.codfw.wmnet}}]] took [0ms],
[connecting to new nodes] took [0ms],
[applying settings] took [0ms],
[running applier [org.opensearch.repositories.RepositoriesService@22dd2ada]] took [0ms],
[running applier [org.opensearch.indices.cluster.IndicesClusterStateService@48afbd93]] took [31186ms],
[running applier [org.opensearch.script.ScriptService@1b69e3d2]] took [0ms],
[running applier [org.opensearch.snapshots.RestoreService@10824caf]] took [0ms],
[running applier [org.opensearch.ingest.IngestService@2ba4a1b4]] took [0ms],
[running applier [org.opensearch.search.pipeline.SearchPipelineService@646b1e88]] took [0ms],
[running applier [org.opensearch.action.ingest.IngestActionForwarder@68a965fa]] took [0ms],
[running applier [org.opensearch.action.admin.cluster.repositories.cleanup.TransportCleanupRepositoryAction$$Lambda$4132/0x0000000840e02040@18f05d53]] took [0ms],
[running applier [org.opensearch.tasks.TaskManager@24539f9b]] took [0ms],
[running applier [org.opensearch.snapshots.SnapshotsService@34c128be]] took [0ms],
[notifying listener [org.opensearch.cluster.InternalClusterInfoService@d64d003]] took [0ms],
[notifying listener [org.opensearch.snapshots.InternalSnapshotsInfoService@35733973]] took [0ms],
[notifying listener [org.opensearch.jobscheduler.sweeper.JobSweeper@8ca73bb]] took [0ms],
[notifying listener [org.opensearch.ad.indices.AnomalyDetectionIndices@382908fe]] took [0ms],
[notifying listener [org.opensearch.ad.cluster.ADClusterEventListener@63a5f3eb]] took [0ms],
[notifying listener [org.opensearch.ad.cluster.ClusterManagerEventListener@67a734b4]] took [0ms],
[notifying listener [org.opensearch.alerting.alerts.AlertIndices@393b1091]] took [0ms],
[notifying listener [org.opensearch.alerting.core.JobSweeper@34ecb092]] took [0ms],
[notifying listener [org.opensearch.alerting.util.destinationmigration.DestinationMigrationCoordinator@5495f53]] took [0ms],
[notifying listener [org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementHistory@52dabc67]] took [0ms],
[notifying listener [org.opensearch.indexmanagement.indexstatemanagement.ManagedIndexCoordinator@99bcb65]] took [0ms],
[notifying listener [org.opensearch.indexmanagement.indexstatemanagement.PluginVersionSweepCoordinator@67e1fc6d]] took [0ms],
[notifying listener [org.opensearch.ml.cluster.MLCommonsClusterEventListener@45f53ac]] took [0ms],
[notifying listener [org.opensearch.ml.cluster.MLCommonsClusterManagerEventListener@49804966]] took [0ms],
[notifying listener [org.opensearch.sql.legacy.esdomain.LocalClusterState$$Lambda$2500/0x0000000840abf840@403d4492]] took [0ms],
[notifying listener [org.opensearch.cluster.metadata.SystemIndexMetadataUpgradeService@64fcfa24]] took [0ms],
[notifying listener [org.opensearch.cluster.metadata.TemplateUpgradeService@2e9d474a]] took [0ms],
[notifying listener [org.opensearch.node.ResponseCollectorService@3d34988b]] took [0ms],
[notifying listener [org.opensearch.snapshots.SnapshotShardsService@1787526b]] took [0ms],
[notifying listener [org.opensearch.persistent.PersistentTasksClusterService@14a9b765]] took [0ms],
[notifying listener [org.opensearch.cluster.routing.DelayedAllocationService@504a15ad]] took [0ms],
[notifying listener [org.opensearch.indices.store.IndicesStore@49233fef]] took [0ms],
[notifying listener [org.opensearch.persistent.PersistentTasksNodeService@687f37ba]] took [0ms],
[notifying listener [org.opensearch.securityanalytics.indexmanagment.DetectorIndexManagementService@6460c5e0]] took [0ms],
[notifying listener [org.opensearch.search.asynchronous.management.AsynchronousSearchManagementService@5ca726cb]] took [0ms],
[notifying listener [org.opensearch.gateway.GatewayService@439690f]] took [0ms],
[notifying listener [org.opensearch.indices.recovery.PeerRecoverySourceService@156a4cd5]] took [0ms],
[notifying listener [org.opensearch.indices.replication.SegmentReplicationSourceService@7747d45d]] took [1ms]",
  "cluster.uuid": "NhSB5TLnT5yi3WDf4K6Qgg",
  "node.id": "b34abAmfSXW3YnXDTj7aIA"
}

It doesn't look like there is an option to change the timeout parameter. We'll need to patch curator. 😕
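
One possible shape for such a patch, sketched only as an illustration (not necessarily what the eventual deployed build does): treat an HTTP 200 with acknowledged: false as "still in progress" and poll until the indices are actually gone, rather than immediately re-issuing the DELETE. Again, client is assumed to be an opensearch-py style client and the parameter values are hypothetical.

import time

def delete_and_wait(client, indices, poll_seconds=5, max_wait_seconds=300):
    # Illustrative workaround, not curator's current behaviour.
    resp = client.indices.delete(index=",".join(indices), master_timeout="120s")
    if resp.get("acknowledged"):
        return
    # The delete was accepted but the cluster-state publication exceeded the
    # 30s operation timeout; wait for it to converge instead of retrying.
    deadline = time.monotonic() + max_wait_seconds
    while time.monotonic() < deadline:
        existing = client.indices.get_settings(index="_all", expand_wildcards="open,closed")
        if not any(i in existing for i in indices):
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"delete not acknowledged and indices still present: {indices}")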

That's indeed not ideal. I'm wondering, though, whether upstream would be interested in the patch, since it seems generally applicable.

Summarizing highlights from my IRC conversation with @fgiunchedi:

  • Upstream seems uninterested in a patch, given that its answers to others have been "stop overloading the cluster" or "break up the job into smaller operations"
  • Contributing upstream requires signing a CLA
  • The "correct" solution is to use ILM rather than curator - Non-starter until there is a path to Puppetization
  • We already operate our own OpenSearch-compatible fork - patching/building/deploying is no problem
  • The fork source needs to be migrated to Gitlab

I've silenced the alert related to this for 60d

Patch deployed on beta for smoketest. Planning to roll ~wmf4 to the new hosts on bookworm.

Post-patch, the smoketest is now successful in beta - testing now on codfw.

Change #1018417 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] opensearch: bump curator version to wmf4

https://gerrit.wikimedia.org/r/1018417

Change #1018417 merged by Cwhite:

[operations/puppet@production] opensearch: bump curator version to wmf4

https://gerrit.wikimedia.org/r/1018417

Optimistically resolving. We will watch for recurrence.