
Elasticsearch index creation fails for new wikis
Closed, Resolved · Public

Description

See: https://console.cloud.google.com/errors/detail/COzs4uqBqeePvgE

Looks like this job failed for an unknown reason. This is happening on Elasticsearch 7 and also appears to have happened in the past on ES 6.

We noticed this seems to have happened for two consecutive wiki IDs. We want to keep an eye on this: if it is happening for more wikis, index creation may actually be broken for everyone.

Patches

Event Timeline

The root cause behind the index failures is that we have reached our cluster's shard limit:

⧼Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [2000]/[2000] maximum shards open;⧽

Our temporary workaround was to increase cluster.max_shards_per_node to 1200 and rerun the indexing jobs for any instances that failed.

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/size-your-shards.html#_this_action_would_add_x_total_shards_but_this_cluster_currently_has_yz_maximum_shards_open
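
As a minimal sketch, an override like this can be applied as a persistent cluster setting ({CLUSTER} standing in for the cluster URL; a transient setting would also work, but is cleared on restart):

PUT {CLUSTER}/_cluster/settings
{
    "persistent": {
        "cluster.max_shards_per_node": 1200
    }
}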

A better long-term solution might be to add an additional data node on production.

Steps:

  • Configure MediaWiki and CirrusSearch to start creating indices with a limit of one replica
  • Update all existing indices to scale to at most one replica (a verification sketch follows this list) using:
PUT {CLUSTER}/mwdb_*/_settings
{
    "index": {
        "auto_expand_replicas": "0-1"
    }
}
  • Deploy an additional data node on production
  • Manually run index creation jobs for instances where it failed

Andrew-WMDE renamed this task from ElasticsearchInit failed for a new wiki to Elasticsearch index creation fails for new wikis. Nov 8 2023, 11:55 AM
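
After updating the replica settings, the change and the remaining shard headroom can be verified with something like the following (filter_path only trims the responses):

GET {CLUSTER}/mwdb_*/_settings?filter_path=*.settings.index.auto_expand_replicas

GET {CLUSTER}/_cluster/health?filter_path=status,active_shards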

Reverted cluster.max_shards_per_node back to 1000
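
A sketch of that revert, assuming the override was applied as a persistent setting; setting it to null falls back to the default of 1000:

PUT {CLUSTER}/_cluster/settings
{
    "persistent": {
        "cluster.max_shards_per_node": null
    }
}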

Due to critical heap usage, I'll be limiting cluster.max_shards_per_node to 800. This is still well above Elastic's recommendation of at most 640 shards per node for 32 GB of heap. When we run out of shards in the future, we can incrementally increase the limit as long as heap usage remains within reason. We need to add additional data nodes once we can no longer increase the limit or the heap size.

  • 20 shards or fewer per GB of heap memory; see https://www.elastic.co/guide/en/elasticsearch/reference/7.17/size-your-shards.html#shard-count-recommendation
  • Heap usage should not exceed 85%; see https://www.elastic.co/guide/en/elasticsearch/reference/current/high-jvm-memory-pressure.html
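
Heap pressure against that 85% threshold can be spot-checked per node with the standard node stats API, for example:

GET {CLUSTER}/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent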

Evelien_WMDE claimed this task.