Upgrade datahub to v0.12.1
Closed, Resolved (Public)

Description

Due to recent issues with datahub versions below v0.12.0, we are upgrading our datahub deployment to v0.12.1. The latest stable release is v0.13.0, but it is only compatible with Java 17, so we are using v0.12.1, the last stable release that supports Java 11. This will be our first major upgrade since moving to GitLab from our previous process. This task tracks the changes and the process required for the upgrade.

Change log from v0.10.4 -> v0.12.1
v0.12.1
v0.11.0
v0.10.5

  • Build containers with tag v0.12.1
  • Update the helm charts for staging
  • Deploy on staging
  • Update the helm charts for production
  • Deploy on production

Related Objects

Status: Resolved
Assigned: Stevemunene

Event Timeline

Stevemunene added a parent task: Restricted Task.

Got the following errors during the build process

#32 75.40 FAILURE: Build failed with an exception.
#32 75.40 
#32 75.40 * What went wrong:
#32 75.40 A problem occurred configuring root project 'datahub'.
#32 75.40 > Could not resolve all files for configuration ':classpath'.
#32 75.40    > Could not resolve org.springframework.boot:spring-boot-gradle-plugin:3.2.1.
#32 75.40      Required by:
#32 75.40          project :
#32 75.40       > No matching variant of org.springframework.boot:spring-boot-gradle-plugin:3.2.1 was found. The consumer was configured to find a library for use during runtime, compatible with Java 11, packaged as a jar, and its dependencies declared externally, as well as attribute 'org.gradle.plugin.api-version' with value '8.0.2' but:
#32 75.40           - Variant 'apiElements' capability org.springframework.boot:spring-boot-gradle-plugin:3.2.1 declares a library, packaged as a jar, and its dependencies declared externally:
#32 75.40               - Incompatible because this component declares a component for use during compile-time, compatible with Java 17 and the consumer needed a component for use during runtime, compatible with Java 11
#32 75.40               - Other compatible attribute:
#32 75.40                   - Doesn't say anything about org.gradle.plugin.api-version (required '8.0.2')
#32 75.40           - Variant 'javadocElements' capability org.springframework.boot:spring-boot-gradle-plugin:3.2.1 declares a component for use during runtime, and its dependencies declared externally:
#32 75.41               - Incompatible because this component declares documentation and the consumer needed a library
#32 75.41               - Other compatible attributes:
#32 75.41                   - Doesn't say anything about its target Java version (required compatibility with Java 11)
#32 75.41                   - Doesn't say anything about its elements (required them packaged as a jar)
#32 75.41                   - Doesn't say anything about org.gradle.plugin.api-version (required '8.0.2')
#32 75.41           - Variant 'mavenOptionalApiElements' capability org.springframework.boot:spring-boot-gradle-plugin-maven-optional:3.2.1 declares a library, packaged as a jar, and its dependencies declared externally:
#32 75.41               - Incompatible because this component declares a component for use during compile-time, compatible with Java 17 and the consumer needed a component for use during runtime, compatible with Java 11
#32 75.41               - Other compatible attribute:
#32 75.41                   - Doesn't say anything about org.gradle.plugin.api-version (required '8.0.2')
#32 75.41           - Variant 'mavenOptionalRuntimeElements' capability org.springframework.boot:spring-boot-gradle-plugin-maven-optional:3.2.1 declares a library for use during runtime, packaged as a jar, and its dependencies declared externally:
#32 75.41               - Incompatible because this component declares a component, compatible with Java 17 and the consumer needed a component, compatible with Java 11
#32 75.41               - Other compatible attribute:
#32 75.41                   - Doesn't say anything about org.gradle.plugin.api-version (required '8.0.2')
#32 75.41           - Variant 'runtimeElements' capability org.springframework.boot:spring-boot-gradle-plugin:3.2.1 declares a library for use during runtime, packaged as a jar, and its dependencies declared externally:
#32 75.41               - Incompatible because this component declares a component, compatible with Java 17 and the consumer needed a component, compatible with Java 11
#32 75.41               - Other compatible attribute:
#32 75.41                   - Doesn't say anything about org.gradle.plugin.api-version (required '8.0.2')
#32 75.41           - Variant 'sourcesElements' capability org.springframework.boot:spring-boot-gradle-plugin:3.2.1 declares a component for use during runtime, and its dependencies declared externally:
#32 75.41               - Incompatible because this component declares documentation and the consumer needed a library
#32 75.41               - Other compatible attributes:
#32 75.41                   - Doesn't say anything about its target Java version (required compatibility with Java 11)
#32 75.41                   - Doesn't say anything about its elements (required them packaged as a jar)
#32 75.41                   - Doesn't say anything about org.gradle.plugin.api-version (required '8.0.2')

DataHub changed its requirements with v0.13.0, which only supports Java 17 (ref: "While it may be possible to build and run DataHub using newer versions of Java, we currently only support Java 17 (aka Java 17)."). We do not currently have Java 17 in our registry, which matches the build failure above: the Spring Boot Gradle plugin 3.2.1 is only published for Java 17, while our build is configured for Java 11.
Datahub v0.12.1, released on Dec 9 2023, still supports Java 11, so we can build it today, and it still solves the main challenges we intended to solve with the upgrade. I shall be building with v0.12.1 while we plan the move to Java 17.
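
The version choice above can be sketched as a small lookup. This is only an illustration, not part of the actual build: the table holds just the two releases discussed in this task, and treating the stated Java version as a minimum requirement is an assumption (the helper names are hypothetical).

```python
from typing import Optional, Tuple

# Java version required per DataHub release, as stated in this task.
# Assumption: the stated version is treated as a minimum requirement.
MIN_JAVA = {
    "v0.12.1": 11,
    "v0.13.0": 17,
}


def version_key(tag: str) -> Tuple[int, ...]:
    """Turn 'v0.12.1' into (0, 12, 1) so tags compare numerically."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def latest_buildable(available_java: int) -> Optional[str]:
    """Return the newest listed release whose Java requirement we can meet."""
    candidates = [tag for tag, java in MIN_JAVA.items() if java <= available_java]
    return max(candidates, key=version_key) if candidates else None


print(latest_buildable(11))  # v0.12.1
print(latest_buildable(17))  # v0.13.0
```

With only Java 11 available, the lookup lands on v0.12.1, which is the release this task settles on.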

Stevemunene renamed this task from "Upgrade datahub to v0.13.0" to "Upgrade datahub to v0.12.1". Apr 4 2024, 8:38 AM
Stevemunene updated the task description.

Change #1019729 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/deployment-charts@master] Upgrading datahub to v0.12.1

We are upgrading datahub to v0.12.1 in response to some vulnerabilities in versions < v0.12.0. v0.13.0 is the latest stable release but is only compatible with Java 17, thus we are using v0.12.1, the last stable release that supports Java 11.

https://gerrit.wikimedia.org/r/1019729

Change #1019729 merged by jenkins-bot:

[operations/deployment-charts@master] Upgrading datahub to v0.12.1

https://gerrit.wikimedia.org/r/1019729

Datahub v0.12.1 is running successfully in staging without any issues, so we can proceed with the main upgrade.

Mentioned in SAL (#wikimedia-analytics) [2024-04-16T11:02:10Z] <stevemunene> upgrade datahub to v0.12.1 T361688

The first upgrade attempt failed on codfw, with errors from the datahub-main-nocode-migration-job:

ANTLR Tool version 4.5 used for code generation does not match the current runtime version 4.8
ANTLR Runtime version 4.5 used for parser compilation does not match the current runtime version 4.8
ANTLR Tool version 4.5 used for code generation does not match the current runtime version 4.8
ANTLR Runtime version 4.5 used for parser compilation does not match the current runtime version 4.8
java.net.ConnectException: Connection timed out (Connection timed out)

and

2024-04-16 11:31:12,768 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:15 - ERROR: Cannot connect to GMSat https://host datahub-gms-main-tls-service.datahub.svc.cluster.local port 8501. Make sure GMS is on the latest version and is running at that host before starting the migration.

Currently investigating this.

I've seen a similar thing before, but it might not be quite the same.
I had to add this value: https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/services/datahub/values.yaml#L96

BOOTSTRAP_SYSTEM_UPDATE_WAIT_FOR_SYSTEM_UPDATE: false

Maybe something about that behaviour has changed. Were there any useful logs from the GMS component when it started up?

datahub-gms-main was in an error state and was rolled back before I could get any error logs from it. Should we revert BOOTSTRAP_SYSTEM_UPDATE_WAIT_FOR_SYSTEM_UPDATE to the default? mce and mae were both OK.

From the community and docs, it seems we do need to change BOOTSTRAP_SYSTEM_UPDATE_WAIT_FOR_SYSTEM_UPDATE (https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/services/datahub/values.yaml#L96) when upgrading datahub to a new version, since the system update job performs tasks on upgrades. Re: very informative Slack discussion. The downside is some downtime, which was expected for the v0.11.0 upgrade. Details below:

Source: v0.11.0 Release highlights
Potential Downtime
This release introduces substantial improvements to search ranking which require reindexing indices.

During the reindexing:

  • a system-update job will set indices to read-only and create a backup/clone of each index
  • new components will be prevented from start-up until the reindex completes
  • Helm deployments will go into read-only mode and new ingestion runs will fail

This process can take anywhere from 5 minutes to multiple hours; as a rough estimate, please expect it to take 1 hour for every 2.3 million entities. After the reindex is complete, please check your ingestion run to re-run any that did not complete.
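
As a rough illustration of the estimate quoted above, the sketch below scales the "1 hour for every 2.3 million entities" figure linearly. The function name and the entity counts are hypothetical; the release notes only give the rate, and real durations will depend on the cluster.

```python
ENTITIES_PER_HOUR = 2_300_000  # rate quoted in the v0.11.0 release highlights


def estimate_reindex_hours(entity_count: int) -> float:
    """Rough reindex duration in hours, assuming linear scaling."""
    if entity_count < 0:
        raise ValueError("entity_count must be non-negative")
    return entity_count / ENTITIES_PER_HOUR


# Hypothetical entity counts, for illustration only.
for count in (500_000, 2_300_000, 10_000_000):
    print(f"{count:>10,} entities -> ~{estimate_reindex_hours(count):.2f} h")
```

The release notes also mention a lower bound of about 5 minutes, so very small results from this formula should be read as "a few minutes" rather than taken literally.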

Change #1020295 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/deployment-charts@master] configure datahub to wait for upgrade before starting

https://gerrit.wikimedia.org/r/1020295

Ran into the same issue as before, even with BOOTSTRAP_SYSTEM_UPDATE_WAIT_FOR_SYSTEM_UPDATE set to the default. The datahub-main-system-update-job-lnk9h ran to completion and successfully reindexed the indices, then the remaining pods (datahub frontend, mae-consumer, mce-consumer-main, gms-main) were all recreated without any error.
The current error is from the datahub-main-nocode-migration-job being unable to reach gms-main, with the endpoint returning a 503:

2024-04-17 11:51:54,739 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:15 - ERROR: Cannot connect to GMSat https://host datahub-gms-main-tls-service.datahub.svc.cluster.local port 8501. Make sure GMS is on the latest version and is running at that host before starting the migration.
java.io.IOException: Server returned HTTP response code: 503 for URL: https://datahub-gms-main-tls-service.datahub.svc.cluster.local:8501/config
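
To tell "GMS answers with an error status" apart from "GMS is not reachable at all", a probe along the following lines could be run from a pod inside the cluster. This is a minimal sketch, not what the migration job does internally: `probe_gms` and its `opener` parameter are hypothetical names, and the URL is simply the one from the log line above.

```python
import urllib.error
import urllib.request

# URL taken from the log above; it only resolves from inside the cluster.
GMS_CONFIG_URL = (
    "https://datahub-gms-main-tls-service.datahub.svc.cluster.local:8501/config"
)


def probe_gms(url=GMS_CONFIG_URL, timeout=5.0, opener=urllib.request.urlopen):
    """Return a short string describing whether GMS answers at `url`.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return f"ok ({response.status})"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return f"http error ({err.code})"
    except OSError as err:
        # No HTTP answer at all: DNS failure, refused connection, timeout.
        # urllib.error.URLError is a subclass of OSError.
        return f"unreachable ({err})"
```

Calling `print(probe_gms())` would report `http error (503)` for the situation in the second log above and an `unreachable (...)` message for the connection timeout seen in the first attempt.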

We found that the GMS pod wasn't starting properly on production, so it looks like it's unrelated to BOOTSTRAP_SYSTEM_UPDATE_WAIT_FOR_SYSTEM_UPDATE.

The error from the GMS pod was as follows.

2024-04-18 13:41:29,191 [main] INFO  c.l.metadata.boot.BootstrapManager:30 - Executing bootstrap step 5/14 with name IngestDataPlatformsStep...
2024-04-18 13:41:29,731 [main] ERROR c.l.metadata.boot.BootstrapManager:38 - Caught exception while executing bootstrap step IngestDataPlatformsStep. Exiting...
java.lang.NullPointerException: null

I found a useful-looking Slack post about this error.

It seems that it may be related to the database collation type.

Our collation type is seemingly incorrect.

MariaDB [datahub]> SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
    -> FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'datahub';
+----------------------------+------------------------+
| DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME |
+----------------------------+------------------------+
| latin1                     | latin1_swedish_ci      |
+----------------------------+------------------------+
1 row in set (0.000 sec)

I have changed the collation type with the following command.

MariaDB [datahub]> alter table metadata_aspect_v2 convert to character set utf8mb4 collate utf8mb4_bin;
Query OK, 91737 rows affected (11.482 sec)             
Records: 91737  Duplicates: 0  Warnings: 0
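
The query above reports the schema default, while the fix is applied per table. A sketch of how the same conversion could be generated for any table that is not yet on the target collation is shown below. It only builds SQL strings and does not connect to a database; `conversion_statements` is a hypothetical helper, and the example input assumes the table had the schema's default collation, which the task does not state explicitly.

```python
TARGET_CHARSET = "utf8mb4"
TARGET_COLLATION = "utf8mb4_bin"

# Lists each table's collation in the schema. TABLE_COLLATION is a standard
# INFORMATION_SCHEMA.TABLES column in MariaDB and MySQL.
LIST_TABLE_COLLATIONS = (
    "SELECT TABLE_NAME, TABLE_COLLATION "
    "FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'datahub';"
)


def conversion_statements(table_collations):
    """Yield an ALTER TABLE statement for each table not on the target collation.

    `table_collations` maps table name -> current collation, for example the
    rows returned by LIST_TABLE_COLLATIONS.
    """
    for table, collation in sorted(table_collations.items()):
        if collation != TARGET_COLLATION:
            yield (
                f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET "
                f"{TARGET_CHARSET} COLLATE {TARGET_COLLATION};"
            )


# Hypothetical input: only metadata_aspect_v2 is named in this task.
example = {"metadata_aspect_v2": "latin1_swedish_ci"}
for statement in conversion_statements(example):
    print(statement)
```

Generating the statements first and reviewing them before running anything keeps the change auditable, since the conversion rewrites every row of the table.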

Now we're deploying again.

The deployment was successful and we are now running datahub v0.12.1

Change #1020295 abandoned by Stevemunene:

[operations/deployment-charts@master] configure datahub to wait for upgrade before starting

Reason:

This was not required for the current upgrade; the blocker was related to the database collation type. https://phabricator.wikimedia.org/T361688#9726571

https://gerrit.wikimedia.org/r/1020295