User Details
- User Since: Jun 22 2015, 1:16 PM
- Availability: Available
- LDAP User: Andrew-WMDE
- MediaWiki User: Andrew Kostka (WMDE)
Fri, Mar 15
Thu, Mar 14
- Create any missing indices
- Re-index all wikis starting from 2024-02-01
Wed, Mar 13
Waiting for cluster to stabilize again after expansion
Tue, Mar 12
Mon, Mar 11
Waiting on T359791
We split Elasticsearch's master and data nodes into their own GKE node pools and configured those pools to use a blue-green upgrade strategy. That way, when a GKE node upgrade runs, only one Elasticsearch node is taken down at a time. Since our Elasticsearch shards have node-level redundancy, search should continue to operate normally even with a slightly degraded cluster.
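Both assumptions, separated node roles and shard redundancy, can be sanity-checked against the cluster's REST API. The snippet below is a minimal illustrative sketch rather than part of our tooling; ES_URL and direct API access are assumptions.

```python
import requests

ES_URL = "http://localhost:9200"  # placeholder; point this at the cluster

# 1. Node roles: with dedicated pools, no node should hold both the
#    master ("m") and data ("d") roles.
nodes = requests.get(
    f"{ES_URL}/_cat/nodes",
    params={"format": "json", "h": "name,node.role"},
).json()
mixed = [n["name"] for n in nodes if "m" in n["node.role"] and "d" in n["node.role"]]
print("nodes holding both master and data roles:", mixed or "none")

# 2. Replica redundancy: an index with zero replicas becomes unavailable
#    while the node holding its only shard copy is being replaced.
settings = requests.get(f"{ES_URL}/_all/_settings/index.number_of_replicas").json()
no_replicas = [
    index
    for index, s in settings.items()
    if int(s["settings"]["index"]["number_of_replicas"]) == 0
]
print("indices without replicas:", no_replicas or "none")
```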
Feb 2 2024
Feb 1 2024
Waiting on T350394
Jan 31 2024
Jan 30 2024
Jan 23 2024
Jan 22 2024
Jan 19 2024
Jan 16 2024
Jan 12 2024
Elasticsearch shards can be rebalanced using:
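The exact command isn't preserved in this log. As a hedged sketch of one common approach (ES_URL is a placeholder, not our actual endpoint), rebalancing can be re-enabled through the cluster settings API and a reroute pass requested from the master:

```python
import requests

ES_URL = "http://localhost:9200"  # placeholder; point this at the cluster

# Make sure automatic shard rebalancing is enabled for all indices.
requests.put(
    f"{ES_URL}/_cluster/settings",
    json={"transient": {"cluster.routing.rebalance.enable": "all"}},
).raise_for_status()

# An empty reroute request asks the master to recompute shard allocation;
# retry_failed also retries shards whose allocation previously failed.
requests.post(
    f"{ES_URL}/_cluster/reroute",
    params={"retry_failed": "true"},
).raise_for_status()
```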
Steps taken to stabilize the cluster:
- Added an additional Elasticsearch data node in production
- Reduced max shards per node back down to 800
- Increased the startup probe timeout to 6 hours (it now takes slightly more than 4 hours for a data node to fully rejoin)
- Rebalanced all shards
Jan 9 2024
Recap: We hit the shard limit set in T350404#9340256 on the 25th of December. The limit was then bumped to 850, as we still had enough heap left on our data nodes to accommodate the extra shards. Then, on the 28th, the master nodes started to OOM.
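For context, the pressure described above can be watched with a couple of read-only API calls. The sketch below is illustrative only and assumes direct access to the cluster (ES_URL is a placeholder):

```python
import requests

ES_URL = "http://localhost:9200"  # placeholder; point this at the cluster

# Active shards vs. the cluster-wide cap derived from max_shards_per_node.
health = requests.get(f"{ES_URL}/_cluster/health").json()
settings = requests.get(
    f"{ES_URL}/_cluster/settings",
    params={"flat_settings": "true", "include_defaults": "true"},
).json()
limit = int(
    settings.get("transient", {}).get("cluster.max_shards_per_node")
    or settings.get("persistent", {}).get("cluster.max_shards_per_node")
    or settings.get("defaults", {}).get("cluster.max_shards_per_node", 1000)
)
cap = limit * health["number_of_data_nodes"]
print(f'{health["active_shards"]} active shards, cluster-wide cap {cap}')

# Heap usage per node; sustained values near 100% on the masters are what
# preceded the OOMs described above.
for node in requests.get(
    f"{ES_URL}/_cat/nodes",
    params={"format": "json", "h": "name,node.role,heap.percent"},
).json():
    print(node["name"], node["node.role"], node["heap.percent"] + "%")
```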
Dec 21 2023
Dec 19 2023
Dec 15 2023
Dec 6 2023
Nov 21 2023
Nov 20 2023
Nov 17 2023
Due to critical heap usage, I'll be limiting cluster.max_shards_per_node to 800. This is still well above Elastic's recommendation of 640 shards for 32 GB of heap. When we run out of shards in the future, we can incrementally increase the limit as long as heap usage remains within reason. We will need to add additional data nodes once we can no longer increase the limit or the heap size.
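For reference, a minimal sketch of applying that limit through the cluster settings API; the change may equally well be made via deployment config, and ES_URL is a placeholder:

```python
import requests

ES_URL = "http://localhost:9200"  # placeholder; point this at the cluster

# Cap the allowed number of open shards per data node at 800.
resp = requests.put(
    f"{ES_URL}/_cluster/settings",
    json={"persistent": {"cluster.max_shards_per_node": 800}},
)
resp.raise_for_status()
print(resp.json())  # echoes the accepted setting back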
Nov 16 2023
Nov 15 2023
Nov 14 2023
Waiting on T350404
- Manually run index creation jobs for the instances where index creation failed