
Upgrade Datahub to v0.10.0
Open, Medium, Public, 3 Estimated Story Points

Description

ToDo

Follow: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/DataHub/Upgrading

  • Clone the wmf datahub repository (a sketch of these git steps is shown after this list)
  • Check it out locally and add the upstream remote if it does not already exist
  • Pull the master branch from the upstream remote
  • Push the master branch from the upstream repository to our gerrit repository
  • Push the tags to the remote gerrit repository
  • Make the needed changes
  • Rebase the current branch against the tag of the new version, v0.10.0
  • Fix merge conflicts
  • Force-push the branch to gerrit
  • Create a feature branch in the deployment-charts repository and update the image version in the helm charts
  • Create a feature branch in the packaged-environments repository and update the datahub version
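A minimal sketch of the git steps above, assuming the remote names and clone URL used here; the wikitech Upgrading guide remains the canonical reference. The wmf branch and the analytics/datahub repository are taken from the gerrit changes on this task.

# Clone our gerrit fork (URL is an assumption; see the wikitech guide).
git clone https://gerrit.wikimedia.org/r/analytics/datahub && cd datahub

# Add the upstream DataHub project as a remote if it is not already present.
git remote add upstream https://github.com/datahub-project/datahub.git

# Pull upstream master, then push it (plus tags) to our gerrit repository.
git checkout master
git pull upstream master
git push origin master
git push origin --tags

# Rebase our wmf branch onto the new release tag, resolve conflicts, then force-push.
git checkout wmf
git rebase v0.10.0
git push --force origin wmf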

Event Timeline

After evaluating the DataHub versions available to upgrade to, we settled on the latest version, 0.10.0, with the following as the main breaking changes.

Some notes as well on potential downtime caused by the search improvements, which require reindexing the indices. A system-update job will run which sets the indices to read-only and creates a backup/clone of each index. During the reindexing, new components will be prevented from starting up until the reindex completes. The logs of this job indicate a percentage complete per index. Depending on index sizes and infrastructure, this process can take anywhere from 5 minutes to several hours; as a rough estimate, allow 1 hour for every 2.3 million entities.
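A hedged sketch of how that reindex progress could be followed in Kubernetes; the namespace and job name below are guesses based on the upstream chart, not confirmed values from our own charts.

kubectl -n datahub get jobs
# Follow the system-update job logs and watch for the per-index "% complete" messages.
kubectl -n datahub logs -f job/datahub-system-update-job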

Re: the breaking changes:

#7103 This should only impact users who have configured explicit non-default names for DataHub's Kafka topics. The environment variables used to configure Kafka topics for DataHub in the kafka-setup docker image have been updated to be in line with other DataHub components; for more info see the DataHub docs on Configuring Kafka in DataHub. Previously these variables were suffixed with _TOPIC, whereas now the correct suffix is _TOPIC_NAME. This change should not affect any user who is using the default Kafka topic names.

We do use the default names, so hopefully this won't affect us.
However, we don't use their standard build mechanism, so we need to check whether they have introduced new environment variables or significantly changed the contents of the containers; if so, we will have to backport or otherwise bring those changes into our build process.
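As a sanity check, a quick grep over our deployment-charts checkout can confirm that we are not setting any of the old-style topic variables; the repository path here is illustrative.

# Look for old-style topic variables (suffix _TOPIC) that would need renaming to _TOPIC_NAME.
# No output means we rely on the defaults and are unaffected by this breaking change.
grep -rnE '_TOPIC([^_A-Z]|$)' deployment-charts/charts/datahub/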

For example, with the search index rebuild operation, they look to have introduced two new variables: ELASTICSEARCH_INDEX_BUILDER_SETTINGS_REINDEX and ELASTICSEARCH_INDEX_BUILDER_MAPPINGS_REINDEX
https://github.com/datahub-project/datahub/pull/6983

These appear to have been added to the docker.env file for their GMS server container: https://github.com/datahub-project/datahub/pull/6983/files#diff-060a794fa916b432ceb90ae56ea028a77510ae3659f4046b59bfddf2697e04c4

We might need to add support for the same variables to our helm chart for the datahub-gms component a bit like this: https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/datahub/charts/datahub-gms/templates/_containers.tpl#L76-L78
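Once the chart passes these variables through, a hedged way to confirm they actually reach the running GMS container would be something like the following; the namespace and deployment names are assumptions.

# Check the environment of the datahub-gms deployment for the new reindex variables.
kubectl -n datahub exec deploy/datahub-gms -- env | grep ELASTICSEARCH_INDEX_BUILDER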

I wouldn't expect our search index rebuild to take hours at all, as we only have a few thousand entities.

There are some updates to the default kafka topic creation here https://github.com/datahub-project/datahub/blob/master/docker/kafka-setup/kafka-setup.sh#L56, mainly optimizing the process and introducing the METADATA_CHANGE_PROPOSAL_TOPIC_NAME variable (referenced as $METADATA_CHANGE_PROPOSAL_TOPIC_NAME in the script) at this stage.

A new infinite-retention upgrade topic is also created, with its name taken from DATAHUB_UPGRADE_HISTORY_TOPIC_NAME.
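For reference, a rough sketch of how that infinite-retention topic gets created, based on the upstream kafka-setup.sh; the partition count, replication factor and bootstrap server here are assumptions rather than the exact upstream defaults.

# retention.ms=-1 is what gives the topic infinite retention.
kafka-topics.sh --create --if-not-exists \
  --bootstrap-server "$KAFKA_BOOTSTRAP_SERVER" \
  --partitions 1 --replication-factor 1 \
  --config retention.ms=-1 \
  --topic "$DATAHUB_UPGRADE_HISTORY_TOPIC_NAME"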

Stevemunene renamed this task from Upgrade Datahub to >= 0.95 to Upgrade Datahub to v0.10.0. Mar 8 2023, 11:35 AM
Stevemunene updated the task description.
Stevemunene removed a subscriber: EChetty.

More details on the search index rebuild operation: it is required to enable the Stemming and Synonyms Support feature.
This feature extends our current search implementation in an effort to make search results more relevant. The included improvements are:

  • Stemming - A multi-language stemmer allows better partial matching based on lexicographical roots, i.e. "log" resolves from logs, logging, logger, etc.
  • Urn matching - Both partial and full URNs previously did not give desirable behaviour in search results; these are now properly indexed and queried to give better matching results.
  • Word breaks across special characters - Previously, when typing a query like "logging_events", autocomplete would fail to resolve results after typing the underscore until at least "logging_eve" had been typed, and the same would occur with spaces. This has been resolved (tokenization can be inspected with the sketch after this list).
  • Synonyms - A static list of synonyms that will match across search results has been added. This list will evolve over time to improve matching jargon versions of words to their full-word equivalents. For example, typing "staging" in a query can resolve datasets with "stg" in their name.
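Once the new indices are in place, tokenization behaviour such as the word-break improvement above can be inspected with the Elasticsearch _analyze API. This is a hedged example: the analyzer name "word_delimited" is taken from our existing index mapping (shown later in this task), and the analyzers shipped with 0.10.0 may differ.

curl -s -H 'Content-Type: application/json' \
  'http://localhost:9200/datahubpolicyindex_v2/_analyze' \
  -d '{"analyzer": "word_delimited", "text": "logging_events"}' | jq .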

Detailed below are the relevant changes to the docker env variables and the helm values.
The helm values will be updated from the current settings to the following, which starts the system upgrade job:

global:
  elasticsearch:
    host: "elasticsearch-master"
    port: "9200"
    index:
      enableMappingsReindex: false
      enableSettingsReindex: false
      ## The following options control settings for datahub-upgrade job when creating or reindexing indices
      upgrade:
        enabled: true
        ## When reindexing is required, this option will clone the existing index as a backup
        cloneIndices: true
        ## This setting allows continuing if and only if the cloneIndices setting is also enabled which
        ## ensures a complete backup of the original index is preserved.
        allowDocCountMismatch: false

Docker env variables are listed as:

  • ELASTICSEARCH_INDEX_BUILDER_MAPPINGS_REINDEX - Controls whether to perform a reindex for mappings mismatches.
  • ELASTICSEARCH_INDEX_BUILDER_SETTINGS_REINDEX - Controls whether to perform a reindex for settings mismatches.
  • ELASTICSEARCH_BUILD_INDICES_ALLOW_DOC_COUNT_MISMATCH - Used in conjunction with ELASTICSEARCH_BUILD_INDICES_CLONE_INDICES to allow users to skip past document count mismatches when reindexing. Count mismatches may indicate dropped records during the reindex, so to prevent data loss this is only allowed if cloning is enabled.
  • ELASTICSEARCH_BUILD_INDICES_CLONE_INDICES - Enables creating a clone of the current index to prevent data loss; defaults to true.
  • ELASTICSEARCH_BUILD_INDICES_INITIAL_BACK_OFF_MILLIS - Controls the GMS and MCL consumer backoff for checking whether the reindex process has completed during start-up. It is recommended to leave the defaults, which will result in waiting up to ~5 minutes before killing the start-up process, allowing a new pod to attempt to start up in orchestrated deployments.
  • ELASTICSEARCH_BUILD_INDICES_MAX_BACK_OFFS
  • ELASTICSEARCH_BUILD_INDICES_BACK_OFF_FACTOR
  • ELASTICSEARCH_BUILD_INDICES_WAIT_FOR_BUILD_INDICES - Controls whether to require waiting for the Build Indices job to finish. Defaults to true. It is not recommended to change this, as doing so will allow GMS and MCL consumers to start up in an error state.
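For clarity, this is roughly how the helm values above map onto those variables for our upgrade; the mapping is my reading of the docs rather than something verified against the chart templates.

ELASTICSEARCH_INDEX_BUILDER_MAPPINGS_REINDEX=false
ELASTICSEARCH_INDEX_BUILDER_SETTINGS_REINDEX=false
ELASTICSEARCH_BUILD_INDICES_CLONE_INDICES=true
ELASTICSEARCH_BUILD_INDICES_ALLOW_DOC_COUNT_MISMATCH=false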

More Stemming and Synonyms Support

Change 898956 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[analytics/datahub@wmf] Build datahub v0.10.0 containers

https://gerrit.wikimedia.org/r/898956

Change 900310 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Experimental refactor of the datahub container build process

https://gerrit.wikimedia.org/r/900310

I have made a little change that fixes the build process for datahub-frontend.
https://gerrit.wikimedia.org/r/c/analytics/datahub/+/903262

It also fixes the ./build_containers_locally.sh script, so it is once again possible to build the containers on a workstation and inspect the intermediate build artifacts.
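For anyone wanting to reproduce this, a typical local run looks something like the following; the image names that appear in the output are illustrative.

./build_containers_locally.sh
docker image ls | grep -i datahub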

In this case, the problem with the frontend was that they had started adding a version number to the zip file that they create, so I was able to fix it with this change.

Containers for version v0.10.0 have been built and published with the tag: 10aa8d44603f61ddfdb0863834a14f5dbc6f534d-production, so we can now try running these in staging to see whether they work, or whether further changes to the helm charts or containers are required.

I suspect that we may need to work on adding the datahub-upgrade and datahub-actions containers to our build process before long, but I'm not sure yet whether or not it's a pre-requisite for this upgrade.

Change 904487 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump datahub version to 0.10.0 and re-enable standalone consumers

https://gerrit.wikimedia.org/r/904487

Change 904487 merged by jenkins-bot:

[operations/deployment-charts@master] Bump datahub version to 0.10.0 and re-enable standalone consumers

https://gerrit.wikimedia.org/r/904487

I did a test deploy of this to staging, but the mae-consumer and mce-consumer pods failed to start.

The mae-consumer didn't start because of a missing jar.

Error: Unable to access jarfile /datahub/datahub-mae-consumer/bin/mae-consumer-job.jar

The mce-consumer was a bit more complicated:

Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ebeanServer' defined in class path resource [com/linkedin/gms/factory/entity/EbeanServerFactory.class]: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [io.ebean.EbeanServer]: Factory method 'createServer' threw exception; nested exception is java.lang.NullPointerException

I think I might run these as part of the main GMS process again.

Change 904517 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Run the datahub consumers in the GMS context

https://gerrit.wikimedia.org/r/904517

Change 904517 merged by jenkins-bot:

[operations/deployment-charts@master] Run the datahub consumers in the GMS context

https://gerrit.wikimedia.org/r/904517

The gms and frontend containers run with 0.10.0 but there appears to be an issue, because the users and groups do not show up.


This is presumably something to do with https://datahubproject.io/docs/releases/#potential-downtime-1

However, we have a bit of an issue here, because I'm not sure that our staging instance is using discrete elasticsearch indices.

According to the deployment charts, we're supposed to be using a prefix of staging- for the elasticsearch indices.
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/datahub/values-staging.yaml#48

However, the elasticsearch cluster itself doesn't appear to have any indices with this named prefix.

btullis@datahubsearch1001:~$ curl -s http://localhost:9200/_cat/indices|grep staging
btullis@datahubsearch1001:~$

This upgrade has been blocked by T333580: The staging and production deployments of datahub share an Opensearch cluster
With the merging of this, the staging deployment will start out with empty indices again.

We may be able to regenerate the indices manually, or we may need to take the same path that the upstream project has, which is to create a new container named datahub-upgrade.
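For reference, upstream runs index regeneration through the datahub-upgrade image, roughly as follows. This is a sketch based on the upstream docs; the image tag, env file and upgrade arguments may not match whatever we end up building ourselves.

# Rebuild the search indices from the SQL store using the upstream upgrade image.
docker run --env-file docker.env acryldata/datahub-upgrade:v0.10.0 -u RestoreIndices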

Change 904820 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Remove the hyphen from the datahub staging elasticsearch prefix

https://gerrit.wikimedia.org/r/904820

Change 904820 merged by jenkins-bot:

[operations/deployment-charts@master] Remove the hyphen from the datahub staging elasticsearch prefix

https://gerrit.wikimedia.org/r/904820

I've been trying all sorts of things to get this to work, but I'm still unable to log in to the staging instance of datahub.

The main error seems to be coming from the GMS component, which is unable to locate an elasticsearch index.

Suppressed: org.elasticsearch.client.ResponseException: method [POST], host [http://datahubsearch.svc.eqiad.wmnet:9200], URI [/staging_datahubpolicyindex_v2/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true], status line [HTTP/1.1 404 Not Found]
{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [staging_datahubpolicyindex_v2]","index":"staging_datahubpolicyindex_v2","resource.id":"staging_datahubpolicyindex_v2","resource.type":"index_or_alias","index_uuid":"_na_"}],"type":"index_not_found_exception","reason":"no such index [staging_datahubpolicyindex_v2]","index":"staging_datahubpolicyindex_v2","resource.id":"staging_datahubpolicyindex_v2","resource.type":"index_or_alias","index_uuid":"_na_"},"status":404}

However, in the past we have never had to create these indices; it has been done by the GMS component itself.

I've been looking at the latest create-indices.sh script from upstream and running the commands manually to set up OpenSearch for datahub in staging.

However, nothing that I can see references the datahubpolicyindex_v2 index during this setup phase.

The three resources we need in elasticsearch at setup time are:

  • _opendistro/_ism/policies/staging_datahub_usage_event_policy which is an index retention policy.
  • _template/staging_datahub_usage_event_index_template which is an index template.
  • staging_datahub_usage_event-000001 which is the first index, based on this template and managed by this retention policy.

The following three commands on any datahubsearch100[1-3] server show that these exist:
curl -s http://localhost:9200/_opendistro/_ism/policies/staging_datahub_usage_event_policy | jq .
curl -s http://localhost:9200/_template/staging_datahub_usage_event_index_template | jq .
curl -s http://localhost:9200/staging_datahub_usage_event-000001 | jq .

I've tried re-initializing the mariadb database in staging, but that didn't work.

Looking at that index on the production instance yields this.

btullis@datahubsearch1001:~$ curl -s http://localhost:9200/datahubpolicyindex_v2|jq .
{
  "datahubpolicyindex_v2_1661860445517": {
    "aliases": {
      "datahubpolicyindex_v2": {}
    },
    "mappings": {
      "properties": {
        "description": {
          "type": "keyword",
          "normalizer": "keyword_normalizer",
          "fields": {
            "delimited": {
              "type": "text",
              "analyzer": "word_delimited"
            },
            "keyword": {
              "type": "keyword"
            }
          }
        },
        "displayName": {
          "type": "keyword",
          "normalizer": "keyword_normalizer",
          "fields": {
            "delimited": {
              "type": "text",
              "analyzer": "word_delimited"
            },
            "keyword": {
              "type": "keyword"
            },
            "ngram": {
              "type": "text",
              "analyzer": "partial"
            }
          }
        },
        "lastUpdatedTimestamp": {
          "type": "date"
        },
        "runId": {
          "type": "keyword"
        },
        "urn": {
          "type": "keyword"
        }
      }
    },
    "settings": {
      "index": {
        "max_ngram_diff": "17",
        "number_of_shards": "1",
        "provided_name": "datahubpolicyindex_v2_1661860445517",
        "creation_date": "1661860445529",
        "analysis": {
          "filter": {
            "partial_filter": {
              "type": "edge_ngram",
              "min_gram": "3",
              "max_gram": "20"
            },
            "custom_delimiter": {
              "type": "word_delimiter",
              "preserve_original": "true",
              "split_on_numerics": "false"
            },
            "urn_stop_filter": {
              "type": "stop",
              "stopwords": [
                "urn",
                "li",
                "container",
                "datahubpolicy",
                "datahubaccesstoken",
                "datahubupgrade",
                "corpgroup",
                "dataprocess",
                "mlfeaturetable",
                "mlmodelgroup",
                "datahubexecutionrequest",
                "invitetoken",
                "datajob",
                "assertion",
                "dataplatforminstance",
                "schemafield",
                "tag",
                "glossaryterm",
                "mlprimarykey",
                "dashboard",
                "notebook",
                "mlmodeldeployment",
                "datahubretention",
                "dataplatform",
                "corpuser",
                "test",
                "mlmodel",
                "glossarynode",
                "mlfeature",
                "dataflow",
                "datahubingestionsource",
                "domain",
                "telemetry",
                "datahubsecret",
                "dataset",
                "chart",
                "dataprocessinstance"
              ]
            }
          },
          "normalizer": {
            "keyword_normalizer": {
              "filter": [
                "lowercase",
                "asciifolding"
              ]
            }
          },
          "analyzer": {
            "browse_path_hierarchy": {
              "tokenizer": "path_hierarchy"
            },
            "slash_pattern": {
              "filter": [
                "lowercase"
              ],
              "tokenizer": "slash_tokenizer"
            },
            "partial_urn_component": {
              "filter": [
                "lowercase",
                "urn_stop_filter",
                "custom_delimiter",
                "partial_filter"
              ],
              "tokenizer": "urn_char_group"
            },
            "word_delimited": {
              "filter": [
                "custom_delimiter",
                "lowercase",
                "stop"
              ],
              "tokenizer": "main_tokenizer"
            },
            "partial": {
              "filter": [
                "custom_delimiter",
                "lowercase",
                "partial_filter"
              ],
              "tokenizer": "main_tokenizer"
            },
            "urn_component": {
              "filter": [
                "lowercase",
                "urn_stop_filter",
                "custom_delimiter"
              ],
              "tokenizer": "urn_char_group"
            },
            "custom_keyword": {
              "filter": [
                "lowercase",
                "asciifolding"
              ],
              "tokenizer": "keyword"
            }
          },
          "tokenizer": {
            "main_tokenizer": {
              "pattern": "[ ./]",
              "type": "pattern"
            },
            "slash_tokenizer": {
              "pattern": "[/]",
              "type": "pattern"
            },
            "urn_char_group": {
              "pattern": "[:\\s(),]",
              "type": "pattern"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "yV5VuaDtRtGQRBNgW9JmcA",
        "version": {
          "created": "135238227"
        }
      }
    }
  }
}

Investigation continues.

JArguello-WMF set the point value for this task to 3. Apr 25 2023, 2:11 PM

I'm returning to look at this issue again. The first thing I'm going to try is building the datahub-upgrade image.
I believe that this might be required due to the requirements identified by @Stevemunene in T329514#8690904.

For now, I am going to add the datahub-upgrade container to the existing blubber pipeline.
I have identified some shortcomings with this build process and documented them in: T303381: Review and improve the build process for DataHub containers

If there's any progress that I can make on refactoring the build process whilst investigating this issue, I will do so.

Change 916483 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Add a datahub-upgrade container

https://gerrit.wikimedia.org/r/916483

Change 916483 merged by Btullis:

[analytics/datahub@wmf] Add a datahub-upgrade container

https://gerrit.wikimedia.org/r/916483

Change 917868 had a related patch set uploaded (by Btullis; author: Btullis):

[integration/config@master] Add pipelines for a datahub-upgrade container

https://gerrit.wikimedia.org/r/917868

I have created this change on the integration/config repository, which allows our datahub fork to build the datahub-upgrade container. Currently awaiting a review.

Change 917868 merged by jenkins-bot:

[integration/config@master] Add pipelines for a datahub-upgrade container

https://gerrit.wikimedia.org/r/917868

Change 918466 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the container image used to run datahub

https://gerrit.wikimedia.org/r/918466

Change 918466 merged by jenkins-bot:

[operations/deployment-charts@master] Update the container image used to run datahub

https://gerrit.wikimedia.org/r/918466