Check for indices that are not compatible with elastic 7.x in production clusters
Closed, Resolved | Public | 3 Estimated Story Points

Description

Elasticsearch requires that every index was created at or after a minimum version. Look up the appropriate versions and compare them to the cluster state. Recreate indices as needed so all indices are 7.x compatible. Check the production and beta clusters.

Event Timeline

EBernhardson set the point value for this task to 3.

This should report the version that every index in the cluster was created with:

curl somecluster:9243/_all/_settings | jq '. | with_entries({key: .key, value: .value.settings.index.version.created})'
{
  "arzwiki_general_1627451433": "6050499",
  "eowiki_general_1627722895": "6050499",
  "enwikinews_general_1617197848": "6050499",
  "metawiki_general_1617725795": "6050499",
  ...
}
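
For readability, the numeric IDs can be turned back into dotted versions. This is a minimal sketch, assuming the usual Elasticsearch encoding of major * 1000000 + minor * 10000 + revision * 100 + build, where a build of 99 marks a release. The function name decode_es_version is made up for illustration and is not part of any tool used here.

```shell
# Hypothetical helper: decode an index.version.created ID into a dotted version.
# Assumes the encoding major*1000000 + minor*10000 + revision*100 + build.
decode_es_version() {
  id="$1"
  major=$(( id / 1000000 ))
  minor=$(( (id / 10000) % 100 ))
  revision=$(( (id / 100) % 100 ))
  build=$(( id % 100 ))
  if [ "$build" -eq 99 ]; then
    # Build 99 is assumed to mean a final release.
    echo "${major}.${minor}.${revision}"
  else
    echo "${major}.${minor}.${revision} (pre-release build ${build})"
  fi
}

decode_es_version 6050499   # prints 6.5.4
```

Under that assumption the indices listed above were all created with 6.5.4.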

To find the minimum acceptable version we check the cluster banner. We will need a 7.10 instance to check against; the output below is from a 6.8 instance.

curl -s somehost:9200 | jq .version.minimum_index_compatibility_version
"5.0.0"

Banner from 7.10.2; the minimum compatibility is any released version of elastic 6.

{
  "name" : "fcf49ce902e0",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "CH1vnGEcTbmucUIIKnwEkg",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "oss",
    "build_type" : "docker",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

We can then run this to get a list of indices that are not compatible:

for cluster in search.svc.{eqiad,codfw}.wmnet cloudelastic.wikimedia.org; do
  for port in 9{2,4,6}43; do
    echo "$cluster:$port"
    curl -s "https://$cluster:$port/_all/_settings" | \
      jq '. | with_entries(.value = .value.settings.index.version.created | select(.value < "6000000"))'
  done
done
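
Note that the jq filter above compares the version IDs as strings, which works here because every ID has seven digits. A numeric comparison is a little more robust. The sketch below assumes the settings have already been flattened to "index_name created_id" lines; the helper names needs_recreate and list_incompatible, the default threshold argument, and the sample index names are made up for illustration.

```shell
# Hypothetical helper: succeed when an index was created before the minimum ID.
# The default of 6000000 mirrors the threshold used in the jq filter above.
needs_recreate() {
  created="$1"
  minimum="${2:-6000000}"
  [ "$created" -lt "$minimum" ]
}

# Read "name created_id" lines on stdin and print the names that need recreating.
list_incompatible() {
  while read -r name created; do
    if needs_recreate "$created"; then
      echo "$name"
    fi
  done
}

# Made-up sample data; in practice the lines could come from something like:
#   curl -s "https://$cluster:$port/_all/_settings" |
#     jq -r 'to_entries[] | "\(.key) \(.value.settings.index.version.created)"'
printf '%s\n' 'example_old_index 5060299' 'example_new_index 6050499' | list_incompatible
```

With the sample input only example_old_index is printed.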

Notable in the current list are plenty of quite old titlesuggest indices; the internal .ltrstore and .tasks indices are also old. I will have to test, but I expect that elastic will manage .tasks on its own. I suspect the titlesuggest indices are old because we try to reuse the daily index where possible. I will check into it and recreate where necessary.

Upgraded the metastore across all clusters with a handpicked set of wikis (reviewed cirrusDumpQuery on zh.wikipedia.org and chose sister sites on different clusters). Various clusters had a metastore created in 5.x:

for cluster in eqiad codfw; do
  for wiki in enwiki zhwikiversity zhwikivoyage; do
    mwscript extensions/CirrusSearch/maintenance/Metastore.php --wiki $wiki --cluster=$cluster --upgrade
  done
done

Mentioned in SAL (#wikimedia-operations) [2022-05-03T18:04:46Z] <ebernhardson> start ttmserver-export.php from Translate against codfw search cluster for T306811

Change 788773 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] translate: Move ttmserver queries to codfw

https://gerrit.wikimedia.org/r/788773

Change 788773 merged by jenkins-bot:

[operations/mediawiki-config@master] translate: Move ttmserver queries to codfw

https://gerrit.wikimedia.org/r/788773

Mentioned in SAL (#wikimedia-operations) [2022-05-03T20:17:39Z] <cjming@deploy1002> Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:788773|translate: Move ttmserver queries to codfw (T306811)]] (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2022-05-03T20:20:32Z] <ebernhardson> start ttmserver-export.php from Translate against eqiad search cluster for T306811

Change 788820 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] Revert "translate: Move ttmserver queries to codfw"

https://gerrit.wikimedia.org/r/788820

Change 788822 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Move query traffic to codfw for maintenance

https://gerrit.wikimedia.org/r/788822

Change 788820 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "translate: Move ttmserver queries to codfw"

https://gerrit.wikimedia.org/r/788820

Change 788822 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Move query traffic to codfw for maintenance

https://gerrit.wikimedia.org/r/788822

Mentioned in SAL (#wikimedia-operations) [2022-05-03T22:55:47Z] <ebernhardson@deploy1002> Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:788820|Revert "translate: Move ttmserver queries to codfw" (T306811)]] (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2022-05-03T22:57:58Z] <ebernhardson@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:788822|cirrus: Move query traffic to codfw for maintenance (T306811)]] (duration: 00m 49s)

Change 788869 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] Revert "cirrus: Move query traffic to codfw for maintenance"

https://gerrit.wikimedia.org/r/788869

Change 788869 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "cirrus: Move query traffic to codfw for maintenance"

https://gerrit.wikimedia.org/r/788869

Mentioned in SAL (#wikimedia-operations) [2022-05-03T23:23:16Z] <ebernhardson@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:788869|Revert "cirrus: Move query traffic to codfw for maintenance" (T306811)]] (duration: 00m 56s)

The command above to check clusters for old indices now reports all clear, except for the .tasks indices, which are internal to elasticsearch. I'll try to set up a test that installs elastic 5.x and migrates up to 7.x, but I'm expecting that elastic must manage this index itself. Without looking, I suspect their solution is to delete and recreate the index, since it's only used for tracking recent history.

Based on https://discuss.elastic.co/t/deleted-tasks-index-will-that-create-a-problem/170598/2 I've deleted the .tasks indices that were triggering the check and let them recreate themselves. The check now reports all clusters clear of pre-elastic 6 indices.
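
For reference, a deletion like this can be issued through the standard delete index API. The exact command that was run is not recorded here, so the following is only a sketch: the helper name delete_tasks_index and the DRY_RUN switch are made up for illustration, and by default it only prints the request instead of sending it.

```shell
# Hypothetical helper: print (or, with DRY_RUN=0, send) a delete request for
# the .tasks index on one endpoint given as host:port.
delete_tasks_index() {
  endpoint="$1"
  url="https://${endpoint}/.tasks"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run: show what would be executed.
    echo "curl -s -XDELETE ${url}"
  else
    curl -s -XDELETE "${url}"
  fi
}

# Dry run against a placeholder endpoint (not a real cluster).
delete_tasks_index example.invalid:9243
```

Keeping the dry run as the default makes it easy to review the list of requests across every cluster and port before deleting anything.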

Gehel claimed this task.