
Upgrade Datahub to v0.10.4
Closed, ResolvedPublic3 Estimated Story Points

Assigned To
Authored By
EChetty
Feb 13 2023, 1:46 PM
Referenced Files
F37136000: image.png
Jul 11 2023, 2:33 PM
F37135853: image.png
Jul 11 2023, 12:00 PM
F37135199: image.png
Jul 10 2023, 9:53 PM
F37132406: image.png
Jul 7 2023, 6:14 PM
F37132163: image.png
Jul 7 2023, 12:31 PM
F37119912: image.png
Jun 26 2023, 11:26 AM
F37119907: image.png
Jun 26 2023, 11:26 AM
F37114856: image.png
Jun 23 2023, 5:08 PM

Description

ToDo

Follow: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/DataHub/Upgrading

  • Clone wmf datahub
  • Check out locally and add the upstream remote if it does not already exist
  • Pull the master branch from the upstream remote.
  • Push the master branch from the upstream repository to our gerrit repository.
  • Push the tags to the remote gerrit repository
  • Make the needed changes
  • Rebase current branch against the tag of the new version v0.10.0
  • Fix merge conflicts
  • Force-push branch to gerrit (a shell sketch of these git steps follows this list)
  • Create a feature branch in the deployment-charts repository and update the image version in the helm charts
  • Create a feature branch in the packaged-environments repository and update the datahub version
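
A minimal shell sketch of the git steps above, following the wiki procedure linked here; the remote names, URLs and tag are illustrative rather than definitive:

# Rough sketch only; adjust remote names/URLs to the actual repositories.
git clone ssh://gerrit.wikimedia.org:29418/analytics/datahub
cd datahub
git remote add upstream https://github.com/datahub-project/datahub.git
git fetch upstream master --tags
git push origin upstream/master:refs/heads/master   # sync master to our gerrit copy
git push origin --tags                              # push the new release tags
git checkout wmf
git rebase v0.10.4                                  # resolve merge conflicts as they arise
git push --force origin wmf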

Details

Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/deployment-charts | master | +9 -2
analytics/datahub | wmf | +3 -1
operations/deployment-charts | master | +93 -3
operations/puppet | production | +1 -1
analytics/refinery | master | +1 -1
operations/deployment-charts | master | +3 -1
operations/deployment-charts | master | +6 -16
operations/deployment-charts | master | +3 -1
operations/deployment-charts | master | +8 -7
operations/puppet | production | +8 -1
operations/puppet | production | +2 -2
operations/deployment-charts | master | +3 -3
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +24 -7
operations/deployment-charts | master | +6 -6
operations/deployment-charts | master | +8 -17
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +1 -1
analytics/datahub | wmf | +2 -2
operations/deployment-charts | master | +4 -4
operations/deployment-charts | master | +1 -0
operations/deployment-charts | master | +127 -3
operations/deployment-charts | master | +2 -2
analytics/datahub | wmf | +2 -2
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +1 -1
analytics/datahub | wmf | +1 -1
operations/deployment-charts | master | +1 -1
analytics/datahub | wmf | +2 -2
operations/deployment-charts | master | +12 -12
analytics/datahub | wmf | +6 -6
analytics/datahub | wmf | +430 -216
operations/deployment-charts | master | +2 -8
operations/deployment-charts | master | +1 -1
analytics/datahub | wmf | +5 -5
operations/puppet | production | +4 -4
operations/puppet | production | +2 -0
operations/puppet | production | +53 -0
operations/puppet | production | +11 -0
operations/deployment-charts | master | +1 -1
analytics/datahub | wmf | +11 -6
operations/deployment-charts | master | +1 -1
analytics/datahub | wmf | +89 -45
operations/deployment-charts | master | +50 -3
analytics/datahub | wmf | +11 -5
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +1 -1
analytics/datahub | wmf | +1 -1
analytics/datahub | wmf | +66 -878 K
operations/deployment-charts | master | +4 -2
analytics/datahub | wmf | +1 -36
operations/deployment-charts | master | +46 -1
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +6 -2
operations/deployment-charts | master | +4 -2
operations/deployment-charts | master | +3 -3
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +4 -0
operations/deployment-charts | master | +2 -5
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +4 -1
operations/deployment-charts | master | +3 -0
operations/deployment-charts | master | +1 -0
operations/deployment-charts | master | +1 -1
analytics/datahub | wmf | +6 -0
operations/deployment-charts | master | +5 -3
operations/deployment-charts | master | +4 -4
operations/deployment-charts | master | +4 -5
operations/deployment-charts | master | +8 -8
operations/deployment-charts | master | +863 -6
operations/deployment-charts | master | +1 -1
analytics/datahub | wmf | +55 -2
integration/config | master | +5 -0
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +12 -12
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Bump the version of the datahub packaged environment | repos/data-engineering/airflow-dags!473 | btullis | bump_datahub_version | main
Update the schema registry used for airflow lineage in test | repos/data-engineering/airflow-dags!454 | btullis | update_schema_registry_datahub_staging | main

Related Objects

Mentioned In
T327969: null shown in the user profile dropdown in datahub
R3101:cf7ff2b18f24: Update the MAE and MCE entrypoints
R3101:9f22244af05c: Update the path to the jar file for the MCE and MAE consumer images
R3101:4d0e6d840137: Update the GMS container to address a path issue
R3101:a83bd830a49c: Update the setup-elasticsearch container to fix path issue
R3101:93a1fc4dcc13: Update the datahub-frontend container to fix path issues
R3101:8b32b77187ca: Begin un-forking datahub from the upstream
R3101:a490018a499d: Use an updated version of kafka for the datahub kafka-setup image
R3101:0aa704eb7b21: Fix the kafka-setup container for datahub
R3101:192792646b67: Improve the datahub kafka-setup container
R3101:046939ce79f3: Fix the path to the init.sql file in the mysql-setup container
R3101:d406f50aaae5: Update the kafka-setup conainer of datahub
R3101:ac34586bbe26: Use the latest version of the create_indices script
R3101:d0279c27a068: Update the datahub-upgrade image to include the entity-registry
R3101:83ddf0c8d817: Add a datahub-upgrade container
R3101:72b861ff73cd: Bump to version 0.10.0 of DataHub
R3101:8cf7e304db83: Use an updated version of kafka for the datahub kafka-setup image
R3101:fc142a03d0a9: Fix the kafka-setup container for datahub
R3101:c7ea319e2960: Improve the datahub kafka-setup container
R3101:6d044eca409f: Update the kafka-setup conainer of datahub
R3101:5c392f038b8b: Fix the path to the init.sql file in the mysql-setup container
R3101:aeff181914f9: Use the latest version of the create_indices script
R3101:95319d09691e: Update the datahub-upgrade image to include the entity-registry
R3101:c3edc0313fd8: Add a datahub-upgrade container
T333580: The staging and production deployments of datahub share an Opensearch cluster
R3101:27b3f8b1a805: Build containers v0.10.0
R3101:3fafb00b8f38: Bump to version 0.10.0 of DataHub
R3101:fc50b5a323f0: Bump to version 0.10.0 of DataHub
R3101:5af0bdf861e8: Bump to version 0.10.0 of DataHub
R3101:cdb674884b3d: Bump to version 0.10.0 of DataHub
Mentioned Here
T341464: eqiad: 1 VM requested for karapace in support of datahub in staging
P49498 kafka-setup job output
P49480 datahub-upgrade container output
T303381: Review and improve the build process for DataHub containers
T333580: The staging and production deployments of datahub share an Opensearch cluster

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 936035 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Deploy a new image for the datahub service

https://gerrit.wikimedia.org/r/936035

Change 936035 merged by jenkins-bot:

[operations/deployment-charts@master] Deploy a new image for the datahub service

https://gerrit.wikimedia.org/r/936035

Change 936042 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Enable the datahub systemupdate job

https://gerrit.wikimedia.org/r/936042

Change 936042 merged by jenkins-bot:

[operations/deployment-charts@master] Enable the datahub systemupdate job

https://gerrit.wikimedia.org/r/936042

Change 936061 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Update the path to the jar file for the MCE and MAE consumer images

https://gerrit.wikimedia.org/r/936061

Change 936061 merged by jenkins-bot:

[analytics/datahub@wmf] Update the path to the jar file for the MCE and MAE consumer images

https://gerrit.wikimedia.org/r/936061

Change 936228 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump datahub image and deploy standalone MAE/MCE consumers

https://gerrit.wikimedia.org/r/936228

Change 936228 merged by jenkins-bot:

[operations/deployment-charts@master] Bump datahub image and deploy standalone MAE/MCE consumers

https://gerrit.wikimedia.org/r/936228

Change 936250 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the datahub charts with new environment variables

https://gerrit.wikimedia.org/r/936250

Change 936250 merged by jenkins-bot:

[operations/deployment-charts@master] Update the datahub charts with new environment variables

https://gerrit.wikimedia.org/r/936250

Change 936260 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Update the MAE and MCE entrypoints

https://gerrit.wikimedia.org/r/936260

At long last I have got into the datahub frontend on staging, running version 0.10.4. I had to use the default credentials of datahub:datahub, so it looks like the JAAS configuration for LDAP isn't being used yet.

image.png (518×1 px, 56 KB)

BTullis renamed this task from Upgrade Datahub to v0.10.0 to Upgrade Datahub to v0.10.4.Jul 7 2023, 12:31 PM

Change 936271 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Configure datahub-gms not to wait for upgrade before starting

https://gerrit.wikimedia.org/r/936271

Change 936271 merged by jenkins-bot:

[operations/deployment-charts@master] Configure datahub-gms not to wait for upgrade before starting

https://gerrit.wikimedia.org/r/936271

Change 936272 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Fix the path to the jaas configuration file for the datahub-frontend

https://gerrit.wikimedia.org/r/936272

Change 936272 merged by jenkins-bot:

[operations/deployment-charts@master] Fix the path to the jaas configuration file for the datahub-frontend

https://gerrit.wikimedia.org/r/936272

Change 936260 merged by jenkins-bot:

[analytics/datahub@wmf] Update the MAE and MCE entrypoints

https://gerrit.wikimedia.org/r/936260

Change 936295 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Deploy a new datahub image

https://gerrit.wikimedia.org/r/936295

Change 936295 merged by jenkins-bot:

[operations/deployment-charts@master] Deploy a new datahub image

https://gerrit.wikimedia.org/r/936295

Change 936301 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump the version number of the datahub-frontend chart

https://gerrit.wikimedia.org/r/936301

Change 936301 merged by jenkins-bot:

[operations/deployment-charts@master] Bump the version number of the datahub-frontend chart

https://gerrit.wikimedia.org/r/936301

Change 936314 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Fix the datahub frontend authentication

https://gerrit.wikimedia.org/r/936314

Change 936314 merged by jenkins-bot:

[operations/deployment-charts@master] Fix the datahub frontend authentication

https://gerrit.wikimedia.org/r/936314

Change 936324 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update datahub jaas volume name

https://gerrit.wikimedia.org/r/936324

Change 936324 merged by jenkins-bot:

[operations/deployment-charts@master] Update datahub jaas volume name

https://gerrit.wikimedia.org/r/936324

I've fixed the LDAP issue now, so I can log into the staging version of datahub again. Next I have to check that the MAE and MCE consumers are operating correctly.

image.png (556×982 px, 57 KB)

Change 936651 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Use an internal schema registry for datahub on staging

https://gerrit.wikimedia.org/r/936651

Change 936651 merged by jenkins-bot:

[operations/deployment-charts@master] Use an internal schema registry for datahub on staging

https://gerrit.wikimedia.org/r/936651

Change 936656 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Enable the kafka-setup job for datahub in staging

https://gerrit.wikimedia.org/r/936656

I've noticed that the production and staging instances of datahub share a single schema registry, namely karapace1001.eqiad.wmnet:8081

I think that this is likely to cause issues for us relating to incompatible schemas, which is the sort of error I'm seeing at the moment from the MAE consumer job in staging.

So I'm taking the opportunity to switch the staging deployment to use its internal schema registry, instead of karapace. This feature was added by upstream at our request after our initial deployment, but we haven't used it up until this point.
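
As a quick cross-check of what the shared registry currently holds, Karapace implements the Confluent-compatible REST API, so listing the registered subjects should show everything that production and staging have written to it. A hedged sketch, assuming plain HTTP on the port above:

# Illustrative check of the subjects registered on the shared karapace instance.
curl -s http://karapace1001.eqiad.wmnet:8081/subjects | jq .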

Change 936656 merged by jenkins-bot:

[operations/deployment-charts@master] Enable the kafka-setup job for datahub in staging

https://gerrit.wikimedia.org/r/936656

Change 936658 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump the datahub top-level chart

https://gerrit.wikimedia.org/r/936658

Change 936658 merged by jenkins-bot:

[operations/deployment-charts@master] Bump the datahub top-level chart

https://gerrit.wikimedia.org/r/936658

Oh, now we have a really useful error from the kafka-setup job.

Error while executing config command with args '--command-config /tmp/connection.properties --bootstrap-server kafka-test1006.eqiad.wmnet:9092 --entity-type topics --entity-name DataHubUpgradeHistory_v1 --alter --add-config retention.ms=-1'
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.UnsupportedVersionException: The broker does not support INCREMENTAL_ALTER_CONFIGS

The majority of the script worked perfectly and the topics have been created/verified on the kafka-test cluster. Here's the rest of the output from the kafka-setup job.

btullis@deploy1002:~$ kubectl logs -f datahub-main-kafka-setup-job-fbr4r 
Error: Could not find or load main class io.confluent.admin.utils.cli.KafkaReadyCommand
Caused by: java.lang.ClassNotFoundException: io.confluent.admin.utils.cli.KafkaReadyCommand
Using log4j config /etc/cp-base-new/log4j.properties
/tmp/fifo-qomF
will start 1
will start 2
will start 3
will start 4
worker 2 started
worker 1 started
waiting, started 2 of 4
worker 4 started
worker 3 started
sending MetadataAuditEvent_v4 --partitions 1 --topic MetadataAuditEvent_v4
sending MetadataChangeEvent_v4 --partitions 1 --topic MetadataChangeEvent_v4
sending FailedMetadataChangeEvent_v4 --partitions 1 --topic FailedMetadataChangeEvent_v4
sending MetadataChangeLog_Versioned_v1 --partitions 1 --topic MetadataChangeLog_Versioned_v1
sending MetadataChangeLog_Timeseries_v1 --partitions 1 --config retention.ms=7776000000 --topic MetadataChangeLog_Timeseries_v1
sending MetadataChangeProposal_v1 --partitions 1 --topic MetadataChangeProposal_v1
sending FailedMetadataChangeProposal_v1 --partitions 1 --topic FailedMetadataChangeProposal_v1
sending PlatformEvent_v1 --partitions 1 --topic PlatformEvent_v1
sending DataHubUpgradeHistory_v1 --partitions 1 --config retention.ms=-1 --topic DataHubUpgradeHistory_v1
sending DataHubUsageEvent_v1 --partitions 1 --topic DataHubUsageEvent_v1
2 got work_id=MetadataAuditEvent_v4 topic_args=--partitions 1 --topic MetadataAuditEvent_v4
1 got work_id=MetadataChangeEvent_v4 topic_args=--partitions 1 --topic MetadataChangeEvent_v4
4 got work_id=FailedMetadataChangeEvent_v4 topic_args=--partitions 1 --topic FailedMetadataChangeEvent_v4
3 got work_id=MetadataChangeLog_Versioned_v1 topic_args=--partitions 1 --topic MetadataChangeLog_Versioned_v1
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic MetadataAuditEvent_v4.
2 got work_id=MetadataChangeLog_Timeseries_v1 topic_args=--partitions 1 --config retention.ms=7776000000 --topic MetadataChangeLog_Timeseries_v1
1 got work_id=MetadataChangeProposal_v1 topic_args=--partitions 1 --topic MetadataChangeProposal_v1
Created topic FailedMetadataChangeEvent_v4.
4 got work_id=FailedMetadataChangeProposal_v1 topic_args=--partitions 1 --topic FailedMetadataChangeProposal_v1
3 got work_id=PlatformEvent_v1 topic_args=--partitions 1 --topic PlatformEvent_v1
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic FailedMetadataChangeProposal_v1.
1 got work_id=DataHubUpgradeHistory_v1 topic_args=--partitions 1 --config retention.ms=-1 --topic DataHubUpgradeHistory_v1
Created topic PlatformEvent_v1.
2 got work_id=DataHubUsageEvent_v1 topic_args=--partitions 1 --topic DataHubUsageEvent_v1
4 done working
3 done working
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic DataHubUsageEvent_v1.
2 done working
1 done working
Topic Creation Complete.

The specific command in question is shown here: https://github.com/datahub-project/datahub/blob/v0.10.4/docker/kafka-setup/kafka-setup.sh#L155

It's a workaround for this bug: https://github.com/datahub-project/datahub/issues/7882. This is relevant for us because we are currently deploying with the environment variable BOOTSTRAP_SYSTEM_UPDATE_WAIT_FOR_SYSTEM_UPDATE: false, set by this commit: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/936271

So for now, I will once again disable the kafka-setup job, now that the topics have been created, then proceed with the deploy.
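
If the retention setting ever has to be applied after topic creation, the zookeeper-based tooling on the brokers should be able to do it without the INCREMENTAL_ALTER_CONFIGS API. A hedged sketch, assuming the same kafka wrapper that backs the kafka topics commands further down also wraps kafka-configs; not run here:

# Illustrative only: apply the retention config via the zookeeper-based CLI,
# which does not rely on the INCREMENTAL_ALTER_CONFIGS broker API.
kafka configs --entity-type topics --entity-name DataHubUpgradeHistory_v1 \
  --alter --add-config retention.ms=-1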

Change 936670 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Disable the kafka-setup job in datahub

https://gerrit.wikimedia.org/r/936670

Change 936670 merged by jenkins-bot:

[operations/deployment-charts@master] Disable the kafka-setup job in datahub

https://gerrit.wikimedia.org/r/936670

Change 936675 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Use plaintext port 8080 for local schema registry in datahub

https://gerrit.wikimedia.org/r/936675

Change 936675 merged by jenkins-bot:

[operations/deployment-charts@master] Use plaintext port 8080 for local schema registry in datahub

https://gerrit.wikimedia.org/r/936675

Initial testing of the internal schema registry for datahub didn't work very well, so rather than proceeding with that right now I'm going to create a second karapace instance in T341464: eqiad: 1 VM requested for karapace in support of datahub in staging

I'll then configure the staging instance of datahub to use this.

Change 936706 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a second karapace VM

https://gerrit.wikimedia.org/r/936706

Change 936706 merged by Btullis:

[operations/puppet@production] Add a second karapace VM

https://gerrit.wikimedia.org/r/936706

Change 936753 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure karapace1001 to use the kafka-jumbo cluster

https://gerrit.wikimedia.org/r/936753

Change 936753 merged by Btullis:

[operations/puppet@production] Configure karapace1001 to use the kafka-jumbo cluster

https://gerrit.wikimedia.org/r/936753

Change 936791 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Configure datahub staging to use the new karapace instance

https://gerrit.wikimedia.org/r/936791

Change 936791 merged by jenkins-bot:

[operations/deployment-charts@master] Configure datahub staging to use the new karapace instance

https://gerrit.wikimedia.org/r/936791

Change 936792 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure the test datahub jobs to use the staging schema registry

https://gerrit.wikimedia.org/r/936792

Change 936793 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Permit staging datahub to access karapace1002

https://gerrit.wikimedia.org/r/936793

Change 936793 merged by jenkins-bot:

[operations/deployment-charts@master] Permit staging datahub to access karapace1002

https://gerrit.wikimedia.org/r/936793

This is a first! I've successfully ingested sample data to the staging deployment of datahub. This is great because it shows that end-to-end ingestion works with 0.10.4.

image.png (1×1 px, 103 KB)

In order to do it, I needed the following things:

  • An entry in my /etc/hosts file on my workstation, spoofing my local DNS resolution to resolve datahub-gms.k8s-staging.discovery.wmnet as 127.0.0.1
btullis@marlin:~$ head -n 1 /etc/hosts
127.0.0.1	localhost datahub-frontend.k8s-staging.discovery.wmnet datahub-gms.k8s-staging.discovery.wmnet
  • An SSH tunnel to the ingress endpoint on the (eqiad) staging cluster: ssh -N -L 30443:k8s-ingress-staging.svc.eqiad.wmnet:30443 deploy1002.eqiad.wmnet
  • A conda environment set up on my workstation with pip install acryl-datahub==0.10.4 having been run.
  • A recipe for ingestion that looks like this:
(datahub) btullis@marlin:~/src/datahub-ingestion$ cat datahub.yml 
source:
  type: demo-data
  config: {}
sink:
  type: "datahub-rest"
  config:
    server: "https://datahub-gms.k8s-staging.discovery.wmnet:30443"
    disable_ssl_verification: true

The ingestion was initiated like this:

datahub ingest -c datahub.yml

There were some certificate verification warnings, but the CLI report from the ingestion run was like this:

Cli report:
{'cli_version': '0.10.4',
 'cli_entry_location': '/home/btullis/miniconda3/envs/datahub/lib/python3.11/site-packages/datahub/__init__.py',
 'py_version': '3.11.4 (main, Jul  5 2023, 13:45:01) [GCC 11.2.0]',
 'py_exec_path': '/home/btullis/miniconda3/envs/datahub/bin/python',
 'os_details': 'Linux-6.2.0-24-generic-x86_64-with-glibc2.37',
 'peak_memory_usage': '89.17 MB',
 'mem_info': '89.17 MB',
 'peak_disk_usage': '840.21 GB',
 'disk_info': {'total': '981.13 GB', 'used': '840.21 GB', 'free': '91.01 GB'}}
Source (demo-data) report:
{'events_produced': 101,
 'events_produced_per_sec': 21,
 'entities': {'corpuser': ['urn:li:corpuser:datahub', 'urn:li:corpuser:jdoe'],
              'corpGroup': ['urn:li:corpGroup:jdoe', 'urn:li:corpGroup:bfoo'],
              'dataset': ['urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:s3,project/root/events/logging_events_bckp,PROD)'],
              'dataJob': ['urn:li:dataJob:(urn:li:dataFlow:(airflow,dag_abc,PROD),task_123)',
                          'urn:li:dataJob:(urn:li:dataFlow:(airflow,dag_abc,PROD),task_456)'],
              'dataFlow': ['urn:li:dataFlow:(airflow,dag_abc,PROD)'],
              'chart': ['urn:li:chart:(looker,baz1)', 'urn:li:chart:(looker,baz2)'],
              'dashboard': ['urn:li:dashboard:(looker,baz)'],
              'mlModel': ['urn:li:mlModel:(urn:li:dataPlatform:science,scienceModel,PROD)'],
              'tag': ['urn:li:tag:Legacy', 'urn:li:tag:NeedsDocumentation'],
              'dataPlatform': ['urn:li:dataPlatform:adlsGen1',
                               'urn:li:dataPlatform:adlsGen2',
                               'urn:li:dataPlatform:ambry',
                               'urn:li:dataPlatform:couchbase',
                               'urn:li:dataPlatform:hive',
                               'urn:li:dataPlatform:kafka',
                               'urn:li:dataPlatform:snowflake',
                               'urn:li:dataPlatform:redshift',
                               'urn:li:dataPlatform:bigquery',
                               'urn:li:dataPlatform:glue',
                               '... sampled of 27 total elements'],
              'mlPrimaryKey': ['urn:li:mlPrimaryKey:(test_feature_table_all_feature_dtypes,dummy_entity_1)',
                               'urn:li:mlPrimaryKey:(test_feature_table_all_feature_dtypes,dummy_entity_2)',
                               'urn:li:mlPrimaryKey:(test_feature_table_no_labels,dummy_entity_2)',
                               'urn:li:mlPrimaryKey:(test_feature_table_single_feature,dummy_entity_1)',
                               'urn:li:mlPrimaryKey:(user_features,user_name)',
                               'urn:li:mlPrimaryKey:(user_features,user_id)',
                               'urn:li:mlPrimaryKey:(user_analytics,user_name)'],
              'mlFeature': ['urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_BOOL_LIST_feature)',
                            'urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_DOUBLE_LIST_feature)',
                            'urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_INT32_LIST_feature)',
                            'urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_INT32_feature)',
                            'urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_INT64_feature)',
                            'urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_STRING_LIST_feature)',
                            'urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_STRING_feature)',
                            'urn:li:mlFeature:(test_feature_table_no_labels,test_BYTES_feature)',
                            'urn:li:mlFeature:(test_feature_table_single_feature,test_BYTES_feature)',
                            'urn:li:mlFeature:(user_analytics,date_joined)',
                            '... sampled of 20 total elements'],
              'mlFeatureTable': ['urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,test_feature_table_all_feature_dtypes)',
                                 'urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,test_feature_table_no_labels)',
                                 'urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,test_feature_table_single_feature)',
                                 'urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,user_features)',
                                 'urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,user_analytics)'],
              'glossaryTerm': ['urn:li:glossaryTerm:CustomerAccount', 'urn:li:glossaryTerm:SavingAccount', 'urn:li:glossaryTerm:AccountBalance'],
              'glossaryNode': ['urn:li:glossaryNode:ClientsAndAccounts'],
              'container': ['urn:li:container:DATABASE', 'urn:li:container:SCHEMA'],
              'assertion': ['urn:li:assertion:358c683782c93c2fc2bd4bdd4fdb0153'],
              'query': ['urn:li:query:test-query']},
 'aspects': {'corpuser': {'corpUserInfo': 2, 'corpUserStatus': 1, 'status': 2},
             'corpGroup': {'corpGroupInfo': 2, 'status': 2},
             'dataset': {'browsePaths': 2,
                         'datasetProperties': 5,
                         'ownership': 7,
                         'institutionalMemory': 6,
                         'schemaMetadata': 7,
                         'status': 7,
                         'upstreamLineage': 6,
                         'editableSchemaMetadata': 1,
                         'globalTags': 1,
                         'datasetProfile': 2,
                         'operation': 2,
                         'datasetUsageStatistics': 1,
                         'container': 1},
             'dataJob': {'status': 2, 'ownership': 2, 'dataJobInfo': 2, 'dataJobInputOutput': 2},
             'dataFlow': {'status': 1, 'ownership': 1, 'dataFlowInfo': 1},
             'chart': {'status': 2, 'chartInfo': 2, 'globalTags': 1},
             'dashboard': {'status': 1, 'ownership': 1, 'dashboardInfo': 1},
             'mlModel': {'ownership': 1,
                         'mlModelProperties': 1,
                         'mlModelTrainingData': 1,
                         'mlModelEvaluationData': 1,
                         'institutionalMemory': 1,
                         'intendedUse': 1,
                         'mlModelMetrics': 1,
                         'mlModelEthicalConsiderations': 1,
                         'mlModelCaveatsAndRecommendations': 1,
                         'status': 1,
                         'cost': 1},
             'tag': {'status': 2, 'tagProperties': 2, 'ownership': 2},
             'dataPlatform': {'dataPlatformInfo': 27},
             'mlPrimaryKey': {'status': 7, 'mlPrimaryKeyProperties': 7},
             'mlFeature': {'status': 20, 'mlFeatureProperties': 20},
             'mlFeatureTable': {'status': 5, 'browsePaths': 5, 'mlFeatureTableProperties': 5},
             'glossaryTerm': {'status': 3, 'glossaryTermInfo': 3, 'ownership': 3},
             'glossaryNode': {'glossaryNodeInfo': 1, 'ownership': 1, 'status': 1},
             'container': {'containerProperties': 2, 'subTypes': 2, 'dataPlatformInstance': 2, 'container': 1},
             'assertion': {'assertionInfo': 1, 'dataPlatformInstance': 1, 'assertionRunEvent': 1},
             'query': {'queryProperties': 1, 'querySubjects': 1}},
 'warnings': {},
 'failures': {},
 'total_num_files': 1,
 'num_files_completed': 1,
 'files_completed': ['/tmp/tmp6j7gd0nf.json'],
 'percentage_completion': '0%',
 'estimated_time_to_completion_in_minutes': -1,
 'total_bytes_read_completed_files': 120035,
 'current_file_size': 120035,
 'total_parse_time_in_seconds': 0.0,
 'total_count_time_in_seconds': 0.0,
 'total_deserialize_time_in_seconds': 0.0,
 'aspect_counts': {'datasetProfile': 2,
                   'operation': 2,
                   'datasetUsageStatistics': 1,
                   'containerProperties': 2,
                   'subTypes': 2,
                   'dataPlatformInstance': 3,
                   'container': 2,
                   'assertionInfo': 1,
                   'assertionRunEvent': 1,
                   'queryProperties': 1,
                   'querySubjects': 1},
 'entity_type_counts': {'dataset': 6, 'container': 7, 'assertion': 3, 'query': 2},
 'start_time': '2023-07-10 22:26:16.330640 (4.61 seconds ago)',
 'running_time': '4.61 seconds'}
Sink (datahub-rest) report:
{'total_records_written': 101,
 'records_written_per_second': 15,
 'warnings': [],
 'failures': [],
 'start_time': '2023-07-10 22:26:14.625833 (6.32 seconds ago)',
 'current_time': '2023-07-10 22:26:20.946109 (now)',
 'total_duration_in_seconds': 6.32,
 'gms_version': 'v0.10.4',
 'pending_requests': 0}

 Pipeline finished successfully; produced 101 events in 4.61 seconds.

I'm going to aim for an upgrade of the production deployments tomorrow at approximately 10:00 UTC.

I'll take a mydumper backup of the database on an-coord1001 before I start, in case I need to roll back.
I'll also take a backup of /srv/opensearch on each of datahubsearch100[1-3] as well, in case I need to roll back that component.

Change 937057 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Enable the required upgrade jobs for datahub in production

https://gerrit.wikimedia.org/r/937057

I've taken an on-disk backup of each of the datahubsearch nodes in sequence by doing the following:

sudo depool
sudo systemctl stop opensearch_1@datahub.service
sudo tar czf ~/datahubsearch1001_backup_T329514.tar.gz /srv/opensearch/
sudo systemctl start opensearch_1@datahub.service
sudo pool

I know it's not 100% consistent to do them all in sequence like this, but I think it would be good enough.

I created a backup of the production datahub database with the following command on db1108:

btullis@db1108:~$ sudo mysqldump -S /var/run/mysqld/mysqld.analytics_meta.sock --single-transaction --databases datahub >> ~/datahub_backup_T329514.sql
btullis@db1108:~$ ls -lh datahub_backup_T329514.sql 
-rw-r--r-- 1 btullis wikidev 120M Jul 11 10:19 datahub_backup_T329514.sql

I'm still going to have to configure kafka manually, since we have errors from the kafka-setup job.
The list of topics is:

MetadataAuditEvent_v4
MetadataChangeEvent_v4
FailedMetadataChangeEvent_v4
MetadataChangeLog_Versioned_v1
MetadataChangeLog_Timeseries_v1
MetadataChangeProposal_v1
FailedMetadataChangeProposal_v1
PlatformEvent_v1
DataHubUpgradeHistory_v1
DataHubUsageEvent_v1

I can see that they all exist on kafka-jumbo, except DataHubUpgradeHistory_v1

btullis@kafka-jumbo1001:~$ for t in $(cat datahub-topics.txt); do kafka topics --describe --topic $t; done
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic MetadataAuditEvent_v4
Topic:MetadataAuditEvent_v4	PartitionCount:5	ReplicationFactor:3	Configs:
	Topic: MetadataAuditEvent_v4	Partition: 0	Leader: 1005	Replicas: 1005,1003,1006	Isr: 1003,1005,1006
	Topic: MetadataAuditEvent_v4	Partition: 1	Leader: 1008	Replicas: 1008,1004,1002	Isr: 1002,1004,1008
	Topic: MetadataAuditEvent_v4	Partition: 2	Leader: 1007	Replicas: 1007,1006,1002	Isr: 1002,1006,1007
	Topic: MetadataAuditEvent_v4	Partition: 3	Leader: 1009	Replicas: 1009,1002,1005	Isr: 1002,1005,1009
	Topic: MetadataAuditEvent_v4	Partition: 4	Leader: 1001	Replicas: 1001,1005,1008	Isr: 1001,1005,1008
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic MetadataChangeEvent_v4
Topic:MetadataChangeEvent_v4	PartitionCount:5	ReplicationFactor:3	Configs:
	Topic: MetadataChangeEvent_v4	Partition: 0	Leader: 1003	Replicas: 1003,1004,1006	Isr: 1003,1004,1006
	Topic: MetadataChangeEvent_v4	Partition: 1	Leader: 1004	Replicas: 1004,1006,1002	Isr: 1002,1004,1006
	Topic: MetadataChangeEvent_v4	Partition: 2	Leader: 1006	Replicas: 1006,1002,1005	Isr: 1002,1005,1006
	Topic: MetadataChangeEvent_v4	Partition: 3	Leader: 1002	Replicas: 1002,1005,1008	Isr: 1002,1005,1008
	Topic: MetadataChangeEvent_v4	Partition: 4	Leader: 1005	Replicas: 1005,1008,1001	Isr: 1001,1005,1008
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic FailedMetadataChangeEvent_v4
Topic:FailedMetadataChangeEvent_v4	PartitionCount:5	ReplicationFactor:3	Configs:
	Topic: FailedMetadataChangeEvent_v4	Partition: 0	Leader: 1009	Replicas: 1009,1001,1003	Isr: 1001,1003,1009
	Topic: FailedMetadataChangeEvent_v4	Partition: 1	Leader: 1001	Replicas: 1001,1003,1004	Isr: 1001,1003,1004
	Topic: FailedMetadataChangeEvent_v4	Partition: 2	Leader: 1003	Replicas: 1003,1004,1006	Isr: 1003,1004,1006
	Topic: FailedMetadataChangeEvent_v4	Partition: 3	Leader: 1004	Replicas: 1004,1006,1002	Isr: 1002,1004,1006
	Topic: FailedMetadataChangeEvent_v4	Partition: 4	Leader: 1006	Replicas: 1006,1002,1005	Isr: 1002,1005,1006
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic MetadataChangeLog_Versioned_v1
Topic:MetadataChangeLog_Versioned_v1	PartitionCount:5	ReplicationFactor:3	Configs:
	Topic: MetadataChangeLog_Versioned_v1	Partition: 0	Leader: 1001	Replicas: 1001,1005,1008	Isr: 1001,1005,1008
	Topic: MetadataChangeLog_Versioned_v1	Partition: 1	Leader: 1003	Replicas: 1003,1008,1007	Isr: 1003,1007,1008
	Topic: MetadataChangeLog_Versioned_v1	Partition: 2	Leader: 1004	Replicas: 1004,1009,1001	Isr: 1001,1004,1009
	Topic: MetadataChangeLog_Versioned_v1	Partition: 3	Leader: 1006	Replicas: 1006,1001,1003	Isr: 1001,1003,1006
	Topic: MetadataChangeLog_Versioned_v1	Partition: 4	Leader: 1002	Replicas: 1002,1003,1004	Isr: 1002,1003,1004
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic MetadataChangeLog_Timeseries_v1
Topic:MetadataChangeLog_Timeseries_v1	PartitionCount:5	ReplicationFactor:3	Configs:retention.ms=7776000000
	Topic: MetadataChangeLog_Timeseries_v1	Partition: 0	Leader: 1005	Replicas: 1005,1003,1006	Isr: 1003,1005,1006
	Topic: MetadataChangeLog_Timeseries_v1	Partition: 1	Leader: 1008	Replicas: 1008,1004,1002	Isr: 1002,1004,1008
	Topic: MetadataChangeLog_Timeseries_v1	Partition: 2	Leader: 1007	Replicas: 1007,1006,1002	Isr: 1002,1006,1007
	Topic: MetadataChangeLog_Timeseries_v1	Partition: 3	Leader: 1009	Replicas: 1009,1002,1005	Isr: 1002,1005,1009
	Topic: MetadataChangeLog_Timeseries_v1	Partition: 4	Leader: 1001	Replicas: 1001,1005,1008	Isr: 1001,1005,1008
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic MetadataChangeProposal_v1
Topic:MetadataChangeProposal_v1	PartitionCount:5	ReplicationFactor:3	Configs:
	Topic: MetadataChangeProposal_v1	Partition: 0	Leader: 1002	Replicas: 1002,1003,1004	Isr: 1002,1003,1004
	Topic: MetadataChangeProposal_v1	Partition: 1	Leader: 1005	Replicas: 1005,1003,1006	Isr: 1003,1005,1006
	Topic: MetadataChangeProposal_v1	Partition: 2	Leader: 1008	Replicas: 1008,1004,1002	Isr: 1002,1004,1008
	Topic: MetadataChangeProposal_v1	Partition: 3	Leader: 1007	Replicas: 1007,1006,1002	Isr: 1002,1006,1007
	Topic: MetadataChangeProposal_v1	Partition: 4	Leader: 1009	Replicas: 1009,1002,1005	Isr: 1002,1005,1009
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic FailedMetadataChangeProposal_v1
Topic:FailedMetadataChangeProposal_v1	PartitionCount:5	ReplicationFactor:3	Configs:
	Topic: FailedMetadataChangeProposal_v1	Partition: 0	Leader: 1005	Replicas: 1005,1003,1006	Isr: 1003,1005,1006
	Topic: FailedMetadataChangeProposal_v1	Partition: 1	Leader: 1008	Replicas: 1008,1004,1002	Isr: 1002,1004,1008
	Topic: FailedMetadataChangeProposal_v1	Partition: 2	Leader: 1007	Replicas: 1007,1006,1002	Isr: 1002,1006,1007
	Topic: FailedMetadataChangeProposal_v1	Partition: 3	Leader: 1009	Replicas: 1009,1002,1005	Isr: 1002,1005,1009
	Topic: FailedMetadataChangeProposal_v1	Partition: 4	Leader: 1001	Replicas: 1001,1005,1008	Isr: 1001,1005,1008
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic PlatformEvent_v1
Topic:PlatformEvent_v1	PartitionCount:1	ReplicationFactor:3	Configs:
	Topic: PlatformEvent_v1	Partition: 0	Leader: 1005	Replicas: 1005,1003,1006	Isr: 1003,1005,1006
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic DataHubUpgradeHistory_v1
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic DataHubUsageEvent_v1
Topic:DataHubUsageEvent_v1	PartitionCount:5	ReplicationFactor:3	Configs:
	Topic: DataHubUsageEvent_v1	Partition: 0	Leader: 1002	Replicas: 1002,1005,1008	Isr: 1002,1005,1008
	Topic: DataHubUsageEvent_v1	Partition: 1	Leader: 1005	Replicas: 1005,1008,1001	Isr: 1001,1005,1008
	Topic: DataHubUsageEvent_v1	Partition: 2	Leader: 1008	Replicas: 1008,1007,1001	Isr: 1001,1007,1008
	Topic: DataHubUsageEvent_v1	Partition: 3	Leader: 1007	Replicas: 1007,1009,1001	Isr: 1001,1007,1009
	Topic: DataHubUsageEvent_v1	Partition: 4	Leader: 1009	Replicas: 1009,1001,1003	Isr: 1001,1003,1009

They all have five partitions, apart from PlatformEvent_v1 which has one partition.

I have created the missing topic with:

btullis@kafka-jumbo1001:~$ kafka topics --create --if-not-exists --partitions 1 --replication-factor 3 --config retention.ms=-1 --topic DataHubUpgradeHistory_v1
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --create --if-not-exists --partitions 1 --replication-factor 3 --config retention.ms=-1 --topic DataHubUpgradeHistory_v1
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic "DataHubUpgradeHistory_v1".

Verified its existence with:

btullis@kafka-jumbo1001:~$ kafka topics --describe --topic DataHubUpgradeHistory_v1
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic DataHubUpgradeHistory_v1
Topic:DataHubUpgradeHistory_v1	PartitionCount:1	ReplicationFactor:3	Configs:retention.ms=-1
	Topic: DataHubUpgradeHistory_v1	Partition: 0	Leader: 1007	Replicas: 1007,1009,1001	Isr: 1007,1009,1001

I'm also checking that the schema cleanup policy is correct on the _schemas topic.

btullis@kafka-jumbo1001:~$ kafka topics --describe --topic _schemas
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic _schemas
Topic:_schemas	PartitionCount:1	ReplicationFactor:1	Configs:cleanup.policy=compact
	Topic: _schemas	Partition: 0	Leader: 1007	Replicas: 1007	Isr: 1007

This is correct: it has cleanup.policy=compact, which matches what the kafka-setup.sh script would set here, if we were able to support this configuration.

Also, according to the kafka-setup.sh script, there is no reason why PlatformEvent_v1 should have only one partition. I tried altering that with:

btullis@kafka-jumbo1001:~$ kafka topics --alter partitions=5 --topic PlatformEvent_v1
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --alter partitions=5 --topic PlatformEvent_v1

...but it didn't seem to take:

btullis@kafka-jumbo1001:~$ kafka topics --describe --topic PlatformEvent_v1
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic PlatformEvent_v1
Topic:PlatformEvent_v1	PartitionCount:1	ReplicationFactor:3	Configs:
	Topic: PlatformEvent_v1	Partition: 0	Leader: 1005	Replicas: 1005,1003,1006	Isr: 1003,1005,1006

Oh well, I'll come back to that.
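
For what it's worth, the alter probably didn't take because the partition count has to be passed as a flag rather than as partitions=5. A hedged sketch of the corrected command, not yet run here:

# Illustrative: kafka-topics expects --partitions <n> when altering a topic.
kafka topics --alter --topic PlatformEvent_v1 --partitions 5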

Change 937057 merged by jenkins-bot:

[operations/deployment-charts@master] Enable the required upgrade jobs for datahub in production

https://gerrit.wikimedia.org/r/937057

The upgrade has gone well, I think. The only thing is that it looks like the sample data I ingested into the staging instance yesterday ended up in production too.

image.png (985×1 px, 94 KB)

I've run a restore-indices job to see if I can clean these up.
I couldn't run the job as the normal datahub user, due to a permissions issue.

btullis@deploy1002:~$ kubectl create job --from=cronjob/datahub-main-restore-indices-job-template datahub-restore-indices-job
error: failed to create job: jobs.batch is forbidden: User "datahub" cannot create resource "jobs" in API group "batch" in the namespace "datahub"

So I created the job using the admin user.

btullis@deploy1002:~$ sudo -i
root@deploy1002:~# kube-env admin eqiad
root@deploy1002:~# kubectl create job -n datahub --from=cronjob/datahub-main-restore-indices-job-template datahub-restore-indices-job
job.batch/datahub-restore-indices-job created
root@deploy1002:~# logout

Oh, it looks like the cleanup step hasn't been requested.

2023-07-11 11:54:47,375 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Starting upgrade with id RestoreIndices...
2023-07-11 11:54:47,375 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Cleanup has not been requested.
2023-07-11 11:54:47,375 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Skipping Step 1/3: ClearSearchServiceStep...
2023-07-11 11:54:47,375 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Cleanup has not been requested.
2023-07-11 11:54:47,375 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Skipping Step 2/3: ClearGraphServiceStep...
2023-07-11 11:54:47,375 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Executing Step 3/3: SendMAEStep...
2023-07-11 11:54:47,383 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Sending MAE from local DB

So it makes sure that everything is present, but it hasn't removed entries from the search indices that aren't in the database. I'll see if it's possible to do this any other way.
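
For reference, upstream's datahub-upgrade utility documents a clean argument for the RestoreIndices upgrade, which enables the ClearSearchServiceStep and ClearGraphServiceStep that were skipped above. A hedged sketch of the underlying invocation; our helm job wraps this, so the exact arguments in the chart may differ:

# Illustrative: run the RestoreIndices upgrade with the cleanup steps enabled.
./datahub-upgrade.sh -u RestoreIndices -a clean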

Change 937099 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add the option to clean datahub indices to the restore job

https://gerrit.wikimedia.org/r/937099

Change 937099 merged by jenkins-bot:

[operations/deployment-charts@master] Add the option to clean datahub indices to the restore job

https://gerrit.wikimedia.org/r/937099

This looks better. It's deleting the existing indices and then issuing an MAE for each aspect.

2023-07-11 14:27:18,884 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Starting upgrade with id RestoreIndices...
2023-07-11 14:27:18,885 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Executing Step 1/3: ClearSearchServiceStep...
2023-07-11 14:27:22,088 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Completed Step 1/3: ClearSearchServiceStep successfully.
2023-07-11 14:27:22,089 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Executing Step 2/3: ClearGraphServiceStep...
2023-07-11 14:27:22,980 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Completed Step 2/3: ClearGraphServiceStep successfully.
2023-07-11 14:27:22,980 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Executing Step 3/3: SendMAEStep...
2023-07-11 14:27:22,981 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Sending MAE from local DB
2023-07-11 14:27:23,280 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Found 39667 latest aspects in aspects table in 0.00 minutes.
2023-07-11 14:27:23,289 [pool-12-thread-1] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Args are RestoreIndicesArgs(start=0, batchSize=1000, numThreads=1, batchDelayMs=100, aspectName=null, urn=null, urnLike=null)
2023-07-11 14:27:23,289 [pool-12-thread-1] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Reading rows 0 through 1000 from the aspects table started.
2023-07-11 14:27:23,292 [pool-12-thread-1] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Reading rows 0 through 1000 from the aspects table completed.
2023-07-11 14:27:30,375 [pool-12-thread-1] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Args are RestoreIndicesArgs(start=1000, batchSize=1000, numThreads=1, batchDelayMs=100, aspectName=null, urn=null, urnLike=null)
2023-07-11 14:27:30,375 [pool-12-thread-1] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Reading rows 1000 through 2000 from the aspects table started.
2023-07-11 14:27:30,375 [pool-12-thread-1] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Reading rows 1000 through 2000 from the aspects table completed.
...

Success! The cleanup job has successfully removed all of the errant data from elasticsearch and rebuilt the indices.

image.png (1×1 px, 101 KB)

Change 937137 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/refinery@master] Update the datahub packaged environment to v0.10.4

https://gerrit.wikimedia.org/r/937137

@Milimetric - I could do with your help to update the conda environment for the datahub client please, if you have a moment.
I can see the README but I'm getting confused about some of the details.

Change 937137 merged by Milimetric:

[analytics/refinery@master] Update the datahub packaged environment to v0.10.4

https://gerrit.wikimedia.org/r/937137

Change 936792 merged by Btullis:

[operations/puppet@production] Configure the test datahub jobs to use the staging schema registry

https://gerrit.wikimedia.org/r/936792

Change 938214 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add missing global values to the datahub subcharts to fix CI

https://gerrit.wikimedia.org/r/938214

Change 938214 merged by jenkins-bot:

[operations/deployment-charts@master] Add missing global values to the datahub subcharts to fix CI

https://gerrit.wikimedia.org/r/938214

Change 898956 abandoned by Stevemunene:

[analytics/datahub@wmf] Build datahub v0.10.0 containers

Reason:

datahub was upgraded to v0.14.0

https://gerrit.wikimedia.org/r/898956

Change 943549 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add a missing environment variable to datahub/mae-consumer

https://gerrit.wikimedia.org/r/943549

Change 943549 merged by jenkins-bot:

[operations/deployment-charts@master] Add a missing environment variable to datahub/mae-consumer

https://gerrit.wikimedia.org/r/943549