⚓ T329514 Upgrade Datahub to v0.10.4

Subject	Repo	Branch	Lines +/-
Add a missing environment variable to datahub/mae-consumer	operations/deployment-charts	master	+9 -2
Build datahub v0.10.0 containers	analytics/datahub	wmf	+3 -1
Add missing global values to the datahub subcharts to fix CI	operations/deployment-charts	master	+93 -3
Configure the test datahub jobs to use the staging schema registry	operations/puppet	production	+1 -1
Update the datahub packaged environment to v0.10.4	analytics/refinery	master	+1 -1
Add the option to clean datahub indices to the restore job	operations/deployment-charts	master	+3 -1
Enable the required upgrade jobs for datahub in production	operations/deployment-charts	master	+6 -16
Permit staging datahub to access karapace1002	operations/deployment-charts	master	+3 -1
Configure datahub staging to use the new karapace instance	operations/deployment-charts	master	+8 -7
Configure karapace1001 to use the kafka-jumbo cluster	operations/puppet	production	+8 -1
Add a second karapace VM	operations/puppet	production	+2 -2
Use plaintext port 8080 for local schema registry in datahub	operations/deployment-charts	master	+3 -3
Disable the kafka-setup job in datahub	operations/deployment-charts	master	+1 -1
Bump the datahub top-level chart	operations/deployment-charts	master	+1 -1
Enable the kafka-setup job for datahub in staging	operations/deployment-charts	master	+1 -1
Use an internal schema registry for datahub on staging	operations/deployment-charts	master	+24 -7
Update datahub jaas volume name	operations/deployment-charts	master	+6 -6
Fix the datahub frontend authentication	operations/deployment-charts	master	+8 -17
Bump the version number of the datahub-frontend chart	operations/deployment-charts	master	+2 -2
Deploy a new datahub image	operations/deployment-charts	master	+1 -1
Update the MAE and MCE entrypoints	analytics/datahub	wmf	+2 -2
Fix the path to the jaas configuration file for the datahub-frontend	operations/deployment-charts	master	+4 -4
Configure datahub-gms not to wait for upgrade before starting	operations/deployment-charts	master	+1 -0
Update the datahub charts with new environment variables	operations/deployment-charts	master	+127 -3
Bump datahub image and deploy standalone MAE/MCE consumers	operations/deployment-charts	master	+2 -2
Update the path to the jar file for the MCE and MAE consumer images	analytics/datahub	wmf	+2 -2
Enable the datahub systemupdate job	operations/deployment-charts	master	+1 -1
Deploy a new image for the datahub service	operations/deployment-charts	master	+1 -1
Update the GMS container to address a path issue	analytics/datahub	wmf	+1 -1
Bump the version of the datahub image	operations/deployment-charts	master	+1 -1
Update the setup-elasticsearch container to fix path issue	analytics/datahub	wmf	+2 -2
Bump the image of datahub to the new 0.10.4 containers	operations/deployment-charts	master	+12 -12
Update the datahub-frontend container to fix path issues	analytics/datahub	wmf	+6 -6
Begin un-forking datahub from the upstream	analytics/datahub	wmf	+430 -216
Remove the GMS SSL and port options from the datahub GMS chart	operations/deployment-charts	master	+2 -8
Deploy a new version of the datahub images	operations/deployment-charts	master	+1 -1
Use an updated version of kafka for the datahub kafka-setup image	analytics/datahub	wmf	+5 -5
Correct a problem with the confluent7 component updating	operations/puppet	production	+4 -4
Configure the confluent7 component for reprepro updates	operations/puppet	production	+2 -0
Add the GPG key for the Confluent Platform 7 repository	operations/puppet	production	+53 -0
Add an apt mirror for the confluent-kafka 7.4 release	operations/puppet	production	+11 -0
Bump the datahub image	operations/deployment-charts	master	+1 -1
Fix the kafka-setup container for datahub	analytics/datahub	wmf	+11 -6
Deploy updated datahub images	operations/deployment-charts	master	+1 -1
Improve the datahub kafka-setup container	analytics/datahub	wmf	+89 -45
Enable the kafka-setup job for datahub in staging.	operations/deployment-charts	master	+50 -3
Update the kafka-setup conainer of datahub	analytics/datahub	wmf	+11 -5
Disable the datahub upgrade job	operations/deployment-charts	master	+1 -1
Bump the version of the datahub image in use	operations/deployment-charts	master	+1 -1
Fix the path to the init.sql file in the mysql-setup container	analytics/datahub	wmf	+1 -1
Experimental refactor of the datahub container build process	analytics/datahub	wmf	+66 -878 K
datahub: Use new image and fix elasticsearch setup	operations/deployment-charts	master	+4 -2
Use the latest version of the create_indices script	analytics/datahub	wmf	+1 -36
Enable the datahub setup jobs for mysql and elasticsearch	operations/deployment-charts	master	+46 -1
Re-enable the use of TLS for datahub's database connection in staging	operations/deployment-charts	master	+1 -1
Configure the datahub-gms server to use SSL when a client of itself	operations/deployment-charts	master	+6 -2
Add an evironment variable to datahub-gms for its port	operations/deployment-charts	master	+4 -2
Configure datahub-upgrade to use HTTPS to communicate with the GMS	operations/deployment-charts	master	+3 -3
Bump the version of the datahub chart that is deployed	operations/deployment-charts	master	+1 -1
Configure the datahub-upgrade jobs to use TLS to contact the GMS	operations/deployment-charts	master	+4 -0
Revert changes to the GMS networkpolicy in datahub	operations/deployment-charts	master	+2 -5
Bump datahub top-level chart version	operations/deployment-charts	master	+1 -1
Permit datahub batch jobs to contact the GMS service	operations/deployment-charts	master	+4 -1
Enable the service mesh for the top-level datahub deployment	operations/deployment-charts	master	+3 -0
Specify the schema registry type for datahub	operations/deployment-charts	master	+1 -0
Bump the version of the datahub image that is deployed	operations/deployment-charts	master	+1 -1
Update the datahub-upgrade image to include the entity-registry	analytics/datahub	wmf	+6 -0
Fix the networkpolicy selector for datahub maintenance jobs	operations/deployment-charts	master	+5 -3
Enable the networkpolicy for datahub batch jobs	operations/deployment-charts	master	+4 -4
Ensure that the datahub secrets are available when upgrading releases	operations/deployment-charts	master	+4 -5
Fix some naming issues with the datahub-upgrade jobs	operations/deployment-charts	master	+8 -8
Add support for upgrading datahub	operations/deployment-charts	master	+863 -6
Update the container image used to run datahub	operations/deployment-charts	master	+1 -1
Add a datahub-upgrade container	analytics/datahub	wmf	+55 -2
Add pipelines for a datahub-upgrade container	integration/config	master	+5 -0
Remove the hyphen from the datahub staging elasticsearch prefix	operations/deployment-charts	master	+1 -1
Run the datahub consumers in the GMS context	operations/deployment-charts	master	+1 -1
Bump datahub version to 0.10.0 and re-enable standalone consumers	operations/deployment-charts	master	+12 -12

	Title	Reference	Author	Source Branch	Dest Branch
	Bump the version of the datahub packaged environment	repos/data-engineering/airflow-dags!473	btullis	bump_datahub_version	main
	Update the schema registry used for airflow lineage in test	repos/data-engineering/airflow-dags!454	btullis	update_schema_registry_datahub_staging	main

Change 936035 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Deploy a new image for the datahub service

https://gerrit.wikimedia.org/r/936035

Change 936035 merged by jenkins-bot:

[operations/deployment-charts@master] Deploy a new image for the datahub service

https://gerrit.wikimedia.org/r/936035

Change 936042 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Enable the datahub systemupdate job

https://gerrit.wikimedia.org/r/936042

Change 936042 merged by jenkins-bot:

[operations/deployment-charts@master] Enable the datahub systemupdate job

https://gerrit.wikimedia.org/r/936042

Change 936061 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Update the path to the jar file for the MCE and MAE consumer images

https://gerrit.wikimedia.org/r/936061

Change 936061 merged by jenkins-bot:

[analytics/datahub@wmf] Update the path to the jar file for the MCE and MAE consumer images

https://gerrit.wikimedia.org/r/936061

BTullis mentioned this in R3101:9f22244af05c: Update the path to the jar file for the MCE and MAE consumer images.Jul 6 2023, 5:16 PM

Change 936228 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump datahub image and deploy standalone MAE/MCE consumers

https://gerrit.wikimedia.org/r/936228

Change 936228 merged by jenkins-bot:

[operations/deployment-charts@master] Bump datahub image and deploy standalone MAE/MCE consumers

https://gerrit.wikimedia.org/r/936228

Change 936250 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the datahub charts with new environment variables

https://gerrit.wikimedia.org/r/936250

Change 936250 merged by jenkins-bot:

[operations/deployment-charts@master] Update the datahub charts with new environment variables

https://gerrit.wikimedia.org/r/936250

Change 936260 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Update the MAE and MCE entrypoints

https://gerrit.wikimedia.org/r/936260

At long last I have got into the datahub frontend on staging, running version 0.10.4. I had to use the default credentials of datahub:datahub so it looks like the JAAS configuration for LDAP isn't being used yet.

BTullis renamed this task from Upgrade Datahub to v0.10.0 to Upgrade Datahub to v0.10.4.Jul 7 2023, 12:31 PM

Change 936271 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Configure datahub-gms not to wait for upgrade before starting

https://gerrit.wikimedia.org/r/936271

Change 936271 merged by jenkins-bot:

[operations/deployment-charts@master] Configure datahub-gms not to wait for upgrade before starting

https://gerrit.wikimedia.org/r/936271

Change 936272 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Fix the path to the jaas configuration file for the datahub-frontend

https://gerrit.wikimedia.org/r/936272

Change 936272 merged by jenkins-bot:

[operations/deployment-charts@master] Fix the path to the jaas configuration file for the datahub-frontend

https://gerrit.wikimedia.org/r/936272

Change 936260 merged by jenkins-bot:

[analytics/datahub@wmf] Update the MAE and MCE entrypoints

https://gerrit.wikimedia.org/r/936260

BTullis mentioned this in R3101:cf7ff2b18f24: Update the MAE and MCE entrypoints.Jul 7 2023, 1:12 PM

Change 936295 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Deploy a new datahub image

https://gerrit.wikimedia.org/r/936295

Change 936295 merged by jenkins-bot:

[operations/deployment-charts@master] Deploy a new datahub image

https://gerrit.wikimedia.org/r/936295

Change 936301 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump the version number of the datahub-frontend chart

https://gerrit.wikimedia.org/r/936301

Change 936301 merged by jenkins-bot:

[operations/deployment-charts@master] Bump the version number of the datahub-frontend chart

https://gerrit.wikimedia.org/r/936301

Change 936314 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Fix the datahub frontend authentication

https://gerrit.wikimedia.org/r/936314

Change 936314 merged by jenkins-bot:

[operations/deployment-charts@master] Fix the datahub frontend authentication

https://gerrit.wikimedia.org/r/936314

Change 936324 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update datahub jaas volume name

https://gerrit.wikimedia.org/r/936324

Change 936324 merged by jenkins-bot:

[operations/deployment-charts@master] Update datahub jaas volume name

https://gerrit.wikimedia.org/r/936324

I've fixed the LDAP issue now, so I can log into the staging version of datahub again. Next I have to check that the MAE and MCE consumers are operating correctly.

Change 936651 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Use an internal schema registry for datahub on staging

https://gerrit.wikimedia.org/r/936651

Change 936651 merged by jenkins-bot:

[operations/deployment-charts@master] Use an internal schema registry for datahub on staging

https://gerrit.wikimedia.org/r/936651

Maintenance_bot removed a project: Patch-For-Review.Jul 10 2023, 9:30 AM

Change 936656 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Enable the kafka-setup job for datahub in staging

https://gerrit.wikimedia.org/r/936656

gerritbot added a project: Patch-For-Review.Jul 10 2023, 9:34 AM

I've noticed that the production and staging instances of datahub share a single schema registry, namely karapace1001.eqiad.wmnet:8081

I think that this is likely to cause issue for us relating to incompatible schemas, which is the sort of error I'm seeing at the moment from the MAE consumer job in staging.

So I'm taking the opportunity to switch the staging deployment to use its internal schema registry, instead of karapace. This feature was added by upstream at our request after our initial deployment, but we haven't used it up until this point.

Change 936656 merged by jenkins-bot:

[operations/deployment-charts@master] Enable the kafka-setup job for datahub in staging

https://gerrit.wikimedia.org/r/936656

Change 936658 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump the datahub top-level chart

https://gerrit.wikimedia.org/r/936658

Change 936658 merged by jenkins-bot:

[operations/deployment-charts@master] Bump the datahub top-level chart

https://gerrit.wikimedia.org/r/936658

Oh, now we have a really useful error from the kafka-setup job.

Error while executing config command with args '--command-config /tmp/connection.properties --bootstrap-server kafka-test1006.eqiad.wmnet:9092 --entity-type topics --entity-name DataHubUpgradeHistory_v1 --alter --add-config retention.ms=-1'
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.UnsupportedVersionException: The broker does not support INCREMENTAL_ALTER_CONFIGS

The majority of the script worked perfectly and the topics have been created/verified on the kafka-test cluster. Here's the rest of the output from the kafka-setup job.

btullis@deploy1002:~$ kubectl logs -f datahub-main-kafka-setup-job-fbr4r 
Error: Could not find or load main class io.confluent.admin.utils.cli.KafkaReadyCommand
Caused by: java.lang.ClassNotFoundException: io.confluent.admin.utils.cli.KafkaReadyCommand
Using log4j config /etc/cp-base-new/log4j.properties
/tmp/fifo-qomF
will start 1
will start 2
will start 3
will start 4
worker 2 started
worker 1 started
waiting, started 2 of 4
worker 4 started
worker 3 started
sending MetadataAuditEvent_v4 --partitions 1 --topic MetadataAuditEvent_v4
sending MetadataChangeEvent_v4 --partitions 1 --topic MetadataChangeEvent_v4
sending FailedMetadataChangeEvent_v4 --partitions 1 --topic FailedMetadataChangeEvent_v4
sending MetadataChangeLog_Versioned_v1 --partitions 1 --topic MetadataChangeLog_Versioned_v1
sending MetadataChangeLog_Timeseries_v1 --partitions 1 --config retention.ms=7776000000 --topic MetadataChangeLog_Timeseries_v1
sending MetadataChangeProposal_v1 --partitions 1 --topic MetadataChangeProposal_v1
sending FailedMetadataChangeProposal_v1 --partitions 1 --topic FailedMetadataChangeProposal_v1
sending PlatformEvent_v1 --partitions 1 --topic PlatformEvent_v1
sending DataHubUpgradeHistory_v1 --partitions 1 --config retention.ms=-1 --topic DataHubUpgradeHistory_v1
sending DataHubUsageEvent_v1 --partitions 1 --topic DataHubUsageEvent_v1
2 got work_id=MetadataAuditEvent_v4 topic_args=--partitions 1 --topic MetadataAuditEvent_v4
1 got work_id=MetadataChangeEvent_v4 topic_args=--partitions 1 --topic MetadataChangeEvent_v4
4 got work_id=FailedMetadataChangeEvent_v4 topic_args=--partitions 1 --topic FailedMetadataChangeEvent_v4
3 got work_id=MetadataChangeLog_Versioned_v1 topic_args=--partitions 1 --topic MetadataChangeLog_Versioned_v1
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic MetadataAuditEvent_v4.
2 got work_id=MetadataChangeLog_Timeseries_v1 topic_args=--partitions 1 --config retention.ms=7776000000 --topic MetadataChangeLog_Timeseries_v1
1 got work_id=MetadataChangeProposal_v1 topic_args=--partitions 1 --topic MetadataChangeProposal_v1
Created topic FailedMetadataChangeEvent_v4.
4 got work_id=FailedMetadataChangeProposal_v1 topic_args=--partitions 1 --topic FailedMetadataChangeProposal_v1
3 got work_id=PlatformEvent_v1 topic_args=--partitions 1 --topic PlatformEvent_v1
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic FailedMetadataChangeProposal_v1.
1 got work_id=DataHubUpgradeHistory_v1 topic_args=--partitions 1 --config retention.ms=-1 --topic DataHubUpgradeHistory_v1
Created topic PlatformEvent_v1.
2 got work_id=DataHubUsageEvent_v1 topic_args=--partitions 1 --topic DataHubUsageEvent_v1
4 done working
3 done working
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic DataHubUsageEvent_v1.
2 done working
1 done working
Topic Creation Complete.

The specific command in question is shown here: https://github.com/datahub-project/datahub/blob/v0.10.4/docker/kafka-setup/kafka-setup.sh#L155

It's a workaround for this bug: https://github.com/datahub-project/datahub/issues/7882 which is relevant for us, as we are currently deploying with the environment variable: BOOTSTRAP_SYSTEM_UPDATE_WAIT_FOR_SYSTEM_UPDATE: false set from this commit: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/936271

So for now, I will once again disable the kafka-setup job, now that the topics have been created, then proceed with the deploy.

Change 936670 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Disable the kafka-setup job in datahub

https://gerrit.wikimedia.org/r/936670

Change 936670 merged by jenkins-bot:

[operations/deployment-charts@master] Disable the kafka-setup job in datahub

https://gerrit.wikimedia.org/r/936670

Maintenance_bot removed a project: Patch-For-Review.Jul 10 2023, 11:10 AM

Change 936675 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Use plaintext port 8080 for local schema registry in datahub

https://gerrit.wikimedia.org/r/936675

gerritbot added a project: Patch-For-Review.Jul 10 2023, 11:23 AM

Change 936675 merged by jenkins-bot:

[operations/deployment-charts@master] Use plaintext port 8080 for local schema registry in datahub

https://gerrit.wikimedia.org/r/936675

Initial testing of the internal schema registry for datahub didn't work very well, so rather than proceeding with that right now I'm going to create a second karapace instance in T341464: eqiad: 1 VM requested for karapace in support of datahub in staging

I'll then configure the staging instance of datahub to use this.

Maintenance_bot removed a project: Patch-For-Review.Jul 10 2023, 1:29 PM

Change 936706 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a second karapace VM

https://gerrit.wikimedia.org/r/936706

Change 936706 merged by Btullis:

[operations/puppet@production] Add a second karapace VM

https://gerrit.wikimedia.org/r/936706

Maintenance_bot removed a project: Patch-For-Review.Jul 10 2023, 2:10 PM

Change 936753 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure karapace1001 to use the kafka-jumbo cluster

https://gerrit.wikimedia.org/r/936753

gerritbot added a project: Patch-For-Review.Jul 10 2023, 3:19 PM

Change 936753 merged by Btullis:

[operations/puppet@production] Configure karapace1001 to use the kafka-jumbo cluster

https://gerrit.wikimedia.org/r/936753

Maintenance_bot removed a project: Patch-For-Review.Jul 10 2023, 3:30 PM

Change 936791 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Configure datahub staging to use the new karapace instance

https://gerrit.wikimedia.org/r/936791

gerritbot added a project: Patch-For-Review.Jul 10 2023, 8:28 PM

Change 936791 merged by jenkins-bot:

[operations/deployment-charts@master] Configure datahub staging to use the new karapace instance

https://gerrit.wikimedia.org/r/936791

Change 936792 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure the test datahub jobs to use the staging schema registry

https://gerrit.wikimedia.org/r/936792

Change 936793 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Permit staging datahub to access karapace1002

https://gerrit.wikimedia.org/r/936793

Change 936793 merged by jenkins-bot:

[operations/deployment-charts@master] Permit staging datahub to access karapace1002

https://gerrit.wikimedia.org/r/936793

This is a first! I've successfully ingested sample data to the staging deployment of datahub. This is great because it shows that end-to-end ingestion works with 0.10.4.

In order to do it, I need the following things:

An entry in my /etc/hosts file on my workstation, spoofing my local DNS resolution to resolve datahub-gms.k8s-staging.discovery.wmnet as 127.0.0.1

btullis@marlin:~$ head -n 1 /etc/hosts
127.0.0.1	localhost datahub-frontend.k8s-staging.discovery.wmnet datahub-gms.k8s-staging.discovery.wmnet

An SSH tunnel to the ingress endpoint on the (eqiad) staging cluster: ssh -N -L 30443:k8s-ingress-staging.svc.eqiad.wmnet:30443 deploy1002.eqiad.wmnet
A conda environment set up on my workstation with pip install acryl-datahub==0.10.4 having been run.
A recipe for ingestion that looks like this:

(datahub) btullis@marlin:~/src/datahub-ingestion$ cat datahub.yml 
source:
  type: demo-data
  config: {}
sink:
  type: "datahub-rest"
  config:
    server: "https://datahub-gms.k8s-staging.discovery.wmnet:30443"
    disable_ssl_verification: true

The ingestion was initiated like this:

datahub ingest -c datahub.yml

Thre were some certificate verifiation warnings, but the CLI report from the ingestion run was like this:

Cli report:
{'cli_version': '0.10.4',
 'cli_entry_location': '/home/btullis/miniconda3/envs/datahub/lib/python3.11/site-packages/datahub/__init__.py',
 'py_version': '3.11.4 (main, Jul  5 2023, 13:45:01) [GCC 11.2.0]',
 'py_exec_path': '/home/btullis/miniconda3/envs/datahub/bin/python',
 'os_details': 'Linux-6.2.0-24-generic-x86_64-with-glibc2.37',
 'peak_memory_usage': '89.17 MB',
 'mem_info': '89.17 MB',
 'peak_disk_usage': '840.21 GB',
 'disk_info': {'total': '981.13 GB', 'used': '840.21 GB', 'free': '91.01 GB'}}
Source (demo-data) report:
{'events_produced': 101,
 'events_produced_per_sec': 21,
 'entities': {'corpuser': ['urn:li:corpuser:datahub', 'urn:li:corpuser:jdoe'],
              'corpGroup': ['urn:li:corpGroup:jdoe', 'urn:li:corpGroup:bfoo'],
              'dataset': ['urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)',
                          'urn:li:dataset:(urn:li:dataPlatform:s3,project/root/events/logging_events_bckp,PROD)'],
              'dataJob': ['urn:li:dataJob:(urn:li:dataFlow:(airflow,dag_abc,PROD),task_123)',
                          'urn:li:dataJob:(urn:li:dataFlow:(airflow,dag_abc,PROD),task_456)'],
              'dataFlow': ['urn:li:dataFlow:(airflow,dag_abc,PROD)'],
              'chart': ['urn:li:chart:(looker,baz1)', 'urn:li:chart:(looker,baz2)'],
              'dashboard': ['urn:li:dashboard:(looker,baz)'],
              'mlModel': ['urn:li:mlModel:(urn:li:dataPlatform:science,scienceModel,PROD)'],
              'tag': ['urn:li:tag:Legacy', 'urn:li:tag:NeedsDocumentation'],
              'dataPlatform': ['urn:li:dataPlatform:adlsGen1',
                               'urn:li:dataPlatform:adlsGen2',
                               'urn:li:dataPlatform:ambry',
                               'urn:li:dataPlatform:couchbase',
                               'urn:li:dataPlatform:hive',
                               'urn:li:dataPlatform:kafka',
                               'urn:li:dataPlatform:snowflake',
                               'urn:li:dataPlatform:redshift',
                               'urn:li:dataPlatform:bigquery',
                               'urn:li:dataPlatform:glue',
                               '... sampled of 27 total elements'],
              'mlPrimaryKey': ['urn:li:mlPrimaryKey:(test_feature_table_all_feature_dtypes,dummy_entity_1)',
                               'urn:li:mlPrimaryKey:(test_feature_table_all_feature_dtypes,dummy_entity_2)',
                               'urn:li:mlPrimaryKey:(test_feature_table_no_labels,dummy_entity_2)',
                               'urn:li:mlPrimaryKey:(test_feature_table_single_feature,dummy_entity_1)',
                               'urn:li:mlPrimaryKey:(user_features,user_name)',
                               'urn:li:mlPrimaryKey:(user_features,user_id)',
                               'urn:li:mlPrimaryKey:(user_analytics,user_name)'],
              'mlFeature': ['urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_BOOL_LIST_feature)',
                            'urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_DOUBLE_LIST_feature)',
                            'urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_INT32_LIST_feature)',
                            'urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_INT32_feature)',
                            'urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_INT64_feature)',
                            'urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_STRING_LIST_feature)',
                            'urn:li:mlFeature:(test_feature_table_all_feature_dtypes,test_STRING_feature)',
                            'urn:li:mlFeature:(test_feature_table_no_labels,test_BYTES_feature)',
                            'urn:li:mlFeature:(test_feature_table_single_feature,test_BYTES_feature)',
                            'urn:li:mlFeature:(user_analytics,date_joined)',
                            '... sampled of 20 total elements'],
              'mlFeatureTable': ['urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,test_feature_table_all_feature_dtypes)',
                                 'urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,test_feature_table_no_labels)',
                                 'urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,test_feature_table_single_feature)',
                                 'urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,user_features)',
                                 'urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,user_analytics)'],
              'glossaryTerm': ['urn:li:glossaryTerm:CustomerAccount', 'urn:li:glossaryTerm:SavingAccount', 'urn:li:glossaryTerm:AccountBalance'],
              'glossaryNode': ['urn:li:glossaryNode:ClientsAndAccounts'],
              'container': ['urn:li:container:DATABASE', 'urn:li:container:SCHEMA'],
              'assertion': ['urn:li:assertion:358c683782c93c2fc2bd4bdd4fdb0153'],
              'query': ['urn:li:query:test-query']},
 'aspects': {'corpuser': {'corpUserInfo': 2, 'corpUserStatus': 1, 'status': 2},
             'corpGroup': {'corpGroupInfo': 2, 'status': 2},
             'dataset': {'browsePaths': 2,
                         'datasetProperties': 5,
                         'ownership': 7,
                         'institutionalMemory': 6,
                         'schemaMetadata': 7,
                         'status': 7,
                         'upstreamLineage': 6,
                         'editableSchemaMetadata': 1,
                         'globalTags': 1,
                         'datasetProfile': 2,
                         'operation': 2,
                         'datasetUsageStatistics': 1,
                         'container': 1},
             'dataJob': {'status': 2, 'ownership': 2, 'dataJobInfo': 2, 'dataJobInputOutput': 2},
             'dataFlow': {'status': 1, 'ownership': 1, 'dataFlowInfo': 1},
             'chart': {'status': 2, 'chartInfo': 2, 'globalTags': 1},
             'dashboard': {'status': 1, 'ownership': 1, 'dashboardInfo': 1},
             'mlModel': {'ownership': 1,
                         'mlModelProperties': 1,
                         'mlModelTrainingData': 1,
                         'mlModelEvaluationData': 1,
                         'institutionalMemory': 1,
                         'intendedUse': 1,
                         'mlModelMetrics': 1,
                         'mlModelEthicalConsiderations': 1,
                         'mlModelCaveatsAndRecommendations': 1,
                         'status': 1,
                         'cost': 1},
             'tag': {'status': 2, 'tagProperties': 2, 'ownership': 2},
             'dataPlatform': {'dataPlatformInfo': 27},
             'mlPrimaryKey': {'status': 7, 'mlPrimaryKeyProperties': 7},
             'mlFeature': {'status': 20, 'mlFeatureProperties': 20},
             'mlFeatureTable': {'status': 5, 'browsePaths': 5, 'mlFeatureTableProperties': 5},
             'glossaryTerm': {'status': 3, 'glossaryTermInfo': 3, 'ownership': 3},
             'glossaryNode': {'glossaryNodeInfo': 1, 'ownership': 1, 'status': 1},
             'container': {'containerProperties': 2, 'subTypes': 2, 'dataPlatformInstance': 2, 'container': 1},
             'assertion': {'assertionInfo': 1, 'dataPlatformInstance': 1, 'assertionRunEvent': 1},
             'query': {'queryProperties': 1, 'querySubjects': 1}},
 'warnings': {},
 'failures': {},
 'total_num_files': 1,
 'num_files_completed': 1,
 'files_completed': ['/tmp/tmp6j7gd0nf.json'],
 'percentage_completion': '0%',
 'estimated_time_to_completion_in_minutes': -1,
 'total_bytes_read_completed_files': 120035,
 'current_file_size': 120035,
 'total_parse_time_in_seconds': 0.0,
 'total_count_time_in_seconds': 0.0,
 'total_deserialize_time_in_seconds': 0.0,
 'aspect_counts': {'datasetProfile': 2,
                   'operation': 2,
                   'datasetUsageStatistics': 1,
                   'containerProperties': 2,
                   'subTypes': 2,
                   'dataPlatformInstance': 3,
                   'container': 2,
                   'assertionInfo': 1,
                   'assertionRunEvent': 1,
                   'queryProperties': 1,
                   'querySubjects': 1},
 'entity_type_counts': {'dataset': 6, 'container': 7, 'assertion': 3, 'query': 2},
 'start_time': '2023-07-10 22:26:16.330640 (4.61 seconds ago)',
 'running_time': '4.61 seconds'}
Sink (datahub-rest) report:
{'total_records_written': 101,
 'records_written_per_second': 15,
 'warnings': [],
 'failures': [],
 'start_time': '2023-07-10 22:26:14.625833 (6.32 seconds ago)',
 'current_time': '2023-07-10 22:26:20.946109 (now)',
 'total_duration_in_seconds': 6.32,
 'gms_version': 'v0.10.4',
 'pending_requests': 0}

 Pipeline finished successfully; produced 101 events in 4.61 seconds.

I'm going to aim for an upgrade of the production deployments tomorrow at approximately 10:00 UTC.

I'll take a mydumper backup of the database on an-coord1001 before I start, in case I need to roll back.
I'll also take a backup of /srv/opensearch on each of datahubsearch100[1-3] as well, in case I need to roll back that component.

Change 937057 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Enable the required upgrade jobs for datahub in production

https://gerrit.wikimedia.org/r/937057

I've taken an on-disk backup of each of the datahubsearch nodes in seuqnce by doing the following:

sudo depool
sudo systemctl stop opensearch_1@datahub.service
sudo tar czf ~/datahubsearch1001_backup_T329514.tar.gz /srv/opensearch/
sudo systemctl start opensearch_1@datahub.service
sudo pool

I know it's not 100% consistent to do them all in sequence like this, but I think it would be good enough.

I created a backup of the production datahub database with the following command on db1108

btullis@db1108:~$ sudo mysqldump -S /var/run/mysqld/mysqld.analytics_meta.sock --single-transaction --databases datahub >> ~/datahub_backup_T329514.sql
btullis@db1108:~$ ls -lh datahub_backup_T329514.sql 
-rw-r--r-- 1 btullis wikidev 120M Jul 11 10:19 datahub_backup_T329514.sql

I'm still going to have to configure kafka manually, since we have errors from the kafka-setup job.
The list of topics is:

MetadataAuditEvent_v4
MetadataChangeEvent_v4
FailedMetadataChangeEvent_v4
MetadataChangeLog_Versioned_v1
MetadataChangeLog_Timeseries_v1
MetadataChangeProposal_v1
FailedMetadataChangeProposal_v1
PlatformEvent_v1
DataHubUpgradeHistory_v1
DataHubUsageEvent_v1

I can see that they all exist on kafka-jumbo, except DataHubUpgradeHistory_v1

btullis@kafka-jumbo1001:~$ for t in $(cat datahub-topics.txt); do kafka topics --describe --topic $t; done
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic MetadataAuditEvent_v4
Topic:MetadataAuditEvent_v4	PartitionCount:5	ReplicationFactor:3	Configs:
	Topic: MetadataAuditEvent_v4	Partition: 0	Leader: 1005	Replicas: 1005,1003,1006	Isr: 1003,1005,1006
	Topic: MetadataAuditEvent_v4	Partition: 1	Leader: 1008	Replicas: 1008,1004,1002	Isr: 1002,1004,1008
	Topic: MetadataAuditEvent_v4	Partition: 2	Leader: 1007	Replicas: 1007,1006,1002	Isr: 1002,1006,1007
	Topic: MetadataAuditEvent_v4	Partition: 3	Leader: 1009	Replicas: 1009,1002,1005	Isr: 1002,1005,1009
	Topic: MetadataAuditEvent_v4	Partition: 4	Leader: 1001	Replicas: 1001,1005,1008	Isr: 1001,1005,1008
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic MetadataChangeEvent_v4
Topic:MetadataChangeEvent_v4	PartitionCount:5	ReplicationFactor:3	Configs:
	Topic: MetadataChangeEvent_v4	Partition: 0	Leader: 1003	Replicas: 1003,1004,1006	Isr: 1003,1004,1006
	Topic: MetadataChangeEvent_v4	Partition: 1	Leader: 1004	Replicas: 1004,1006,1002	Isr: 1002,1004,1006
	Topic: MetadataChangeEvent_v4	Partition: 2	Leader: 1006	Replicas: 1006,1002,1005	Isr: 1002,1005,1006
	Topic: MetadataChangeEvent_v4	Partition: 3	Leader: 1002	Replicas: 1002,1005,1008	Isr: 1002,1005,1008
	Topic: MetadataChangeEvent_v4	Partition: 4	Leader: 1005	Replicas: 1005,1008,1001	Isr: 1001,1005,1008
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic FailedMetadataChangeEvent_v4
Topic:FailedMetadataChangeEvent_v4	PartitionCount:5	ReplicationFactor:3	Configs:
	Topic: FailedMetadataChangeEvent_v4	Partition: 0	Leader: 1009	Replicas: 1009,1001,1003	Isr: 1001,1003,1009
	Topic: FailedMetadataChangeEvent_v4	Partition: 1	Leader: 1001	Replicas: 1001,1003,1004	Isr: 1001,1003,1004
	Topic: FailedMetadataChangeEvent_v4	Partition: 2	Leader: 1003	Replicas: 1003,1004,1006	Isr: 1003,1004,1006
	Topic: FailedMetadataChangeEvent_v4	Partition: 3	Leader: 1004	Replicas: 1004,1006,1002	Isr: 1002,1004,1006
	Topic: FailedMetadataChangeEvent_v4	Partition: 4	Leader: 1006	Replicas: 1006,1002,1005	Isr: 1002,1005,1006
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic MetadataChangeLog_Versioned_v1
Topic:MetadataChangeLog_Versioned_v1	PartitionCount:5	ReplicationFactor:3	Configs:
	Topic: MetadataChangeLog_Versioned_v1	Partition: 0	Leader: 1001	Replicas: 1001,1005,1008	Isr: 1001,1005,1008
	Topic: MetadataChangeLog_Versioned_v1	Partition: 1	Leader: 1003	Replicas: 1003,1008,1007	Isr: 1003,1007,1008
	Topic: MetadataChangeLog_Versioned_v1	Partition: 2	Leader: 1004	Replicas: 1004,1009,1001	Isr: 1001,1004,1009
	Topic: MetadataChangeLog_Versioned_v1	Partition: 3	Leader: 1006	Replicas: 1006,1001,1003	Isr: 1001,1003,1006
	Topic: MetadataChangeLog_Versioned_v1	Partition: 4	Leader: 1002	Replicas: 1002,1003,1004	Isr: 1002,1003,1004
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic MetadataChangeLog_Timeseries_v1
Topic:MetadataChangeLog_Timeseries_v1	PartitionCount:5	ReplicationFactor:3	Configs:retention.ms=7776000000
	Topic: MetadataChangeLog_Timeseries_v1	Partition: 0	Leader: 1005	Replicas: 1005,1003,1006	Isr: 1003,1005,1006
	Topic: MetadataChangeLog_Timeseries_v1	Partition: 1	Leader: 1008	Replicas: 1008,1004,1002	Isr: 1002,1004,1008
	Topic: MetadataChangeLog_Timeseries_v1	Partition: 2	Leader: 1007	Replicas: 1007,1006,1002	Isr: 1002,1006,1007
	Topic: MetadataChangeLog_Timeseries_v1	Partition: 3	Leader: 1009	Replicas: 1009,1002,1005	Isr: 1002,1005,1009
	Topic: MetadataChangeLog_Timeseries_v1	Partition: 4	Leader: 1001	Replicas: 1001,1005,1008	Isr: 1001,1005,1008
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic MetadataChangeProposal_v1
Topic:MetadataChangeProposal_v1	PartitionCount:5	ReplicationFactor:3	Configs:
	Topic: MetadataChangeProposal_v1	Partition: 0	Leader: 1002	Replicas: 1002,1003,1004	Isr: 1002,1003,1004
	Topic: MetadataChangeProposal_v1	Partition: 1	Leader: 1005	Replicas: 1005,1003,1006	Isr: 1003,1005,1006
	Topic: MetadataChangeProposal_v1	Partition: 2	Leader: 1008	Replicas: 1008,1004,1002	Isr: 1002,1004,1008
	Topic: MetadataChangeProposal_v1	Partition: 3	Leader: 1007	Replicas: 1007,1006,1002	Isr: 1002,1006,1007
	Topic: MetadataChangeProposal_v1	Partition: 4	Leader: 1009	Replicas: 1009,1002,1005	Isr: 1002,1005,1009
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic FailedMetadataChangeProposal_v1
Topic:FailedMetadataChangeProposal_v1	PartitionCount:5	ReplicationFactor:3	Configs:
	Topic: FailedMetadataChangeProposal_v1	Partition: 0	Leader: 1005	Replicas: 1005,1003,1006	Isr: 1003,1005,1006
	Topic: FailedMetadataChangeProposal_v1	Partition: 1	Leader: 1008	Replicas: 1008,1004,1002	Isr: 1002,1004,1008
	Topic: FailedMetadataChangeProposal_v1	Partition: 2	Leader: 1007	Replicas: 1007,1006,1002	Isr: 1002,1006,1007
	Topic: FailedMetadataChangeProposal_v1	Partition: 3	Leader: 1009	Replicas: 1009,1002,1005	Isr: 1002,1005,1009
	Topic: FailedMetadataChangeProposal_v1	Partition: 4	Leader: 1001	Replicas: 1001,1005,1008	Isr: 1001,1005,1008
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic PlatformEvent_v1
Topic:PlatformEvent_v1	PartitionCount:1	ReplicationFactor:3	Configs:
	Topic: PlatformEvent_v1	Partition: 0	Leader: 1005	Replicas: 1005,1003,1006	Isr: 1003,1005,1006
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic DataHubUpgradeHistory_v1
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic DataHubUsageEvent_v1
Topic:DataHubUsageEvent_v1	PartitionCount:5	ReplicationFactor:3	Configs:
	Topic: DataHubUsageEvent_v1	Partition: 0	Leader: 1002	Replicas: 1002,1005,1008	Isr: 1002,1005,1008
	Topic: DataHubUsageEvent_v1	Partition: 1	Leader: 1005	Replicas: 1005,1008,1001	Isr: 1001,1005,1008
	Topic: DataHubUsageEvent_v1	Partition: 2	Leader: 1008	Replicas: 1008,1007,1001	Isr: 1001,1007,1008
	Topic: DataHubUsageEvent_v1	Partition: 3	Leader: 1007	Replicas: 1007,1009,1001	Isr: 1001,1007,1009
	Topic: DataHubUsageEvent_v1	Partition: 4	Leader: 1009	Replicas: 1009,1001,1003	Isr: 1001,1003,1009

They all have five partitions, apart from PlatformEvent_v1 which has one partition.

I have created the missing topic with:

btullis@kafka-jumbo1001:~$ kafka topics --create --if-not-exists --partitions 1 --replication-factor 3 --config retention.ms=-1 --topic DataHubUpgradeHistory_v1
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --create --if-not-exists --partitions 1 --replication-factor 3 --config retention.ms=-1 --topic DataHubUpgradeHistory_v1
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic "DataHubUpgradeHistory_v1".

Verified its existence with:

btullis@kafka-jumbo1001:~$ kafka topics --describe --topic DataHubUpgradeHistory_v1
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic DataHubUpgradeHistory_v1
Topic:DataHubUpgradeHistory_v1	PartitionCount:1	ReplicationFactor:3	Configs:retention.ms=-1
	Topic: DataHubUpgradeHistory_v1	Partition: 0	Leader: 1007	Replicas: 1007,1009,1001	Isr: 1007,1009,1001

I'm also checking that the schema cleanup policy is correct on the _schemas topic.

btullis@kafka-jumbo1001:~$ kafka topics --describe --topic _schemas
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic _schemas
Topic:_schemas	PartitionCount:1	ReplicationFactor:1	Configs:cleanup.policy=compact
	Topic: _schemas	Partition: 0	Leader: 1007	Replicas: 1007	Isr: 1007

This is correct and has the value: Configs:cleanup.policy=compact - This matches what the kafka-setup.sh script would set here, if we were able to support this configuration.

Also, according to the kafka-setup.sh script there is no reason why PlatformEvent_v1 should only have one partition. I have tried altering that with:

btullis@kafka-jumbo1001:~$ kafka topics --alter partitions=5 --topic PlatformEvent_v1
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --alter partitions=5 --topic PlatformEvent_v1

...but it didn't seem to take:

btullis@kafka-jumbo1001:~$ kafka topics --describe --topic PlatformEvent_v1
kafka-topics --zookeeper conf1007.eqiad.wmnet,conf1008.eqiad.wmnet,conf1009.eqiad.wmnet/kafka/jumbo-eqiad --describe --topic PlatformEvent_v1
Topic:PlatformEvent_v1	PartitionCount:1	ReplicationFactor:3	Configs:
	Topic: PlatformEvent_v1	Partition: 0	Leader: 1005	Replicas: 1005,1003,1006	Isr: 1003,1005,1006

Oh well, I'll come back to that.

Change 937057 merged by jenkins-bot:

[operations/deployment-charts@master] Enable the required upgrade jobs for datahub in production

https://gerrit.wikimedia.org/r/937057

The upgrade has gone well, I think. The only this is that it looks like the sample data I ingested into the staging instance yesterday ended up in production too.

I've run a restore-indices-job to see if I can cleanse these.
I couldn't run the job as the normal datahub user, due to a permissions issue.

btullis@deploy1002:~$ kubectl create job --from=cronjob/datahub-main-restore-indices-job-template datahub-restore-indices-job
error: failed to create job: jobs.batch is forbidden: User "datahub" cannot create resource "jobs" in API group "batch" in the namespace "datahub"

So I created the job using the admin user.

btullis@deploy1002:~$ sudo -i
root@deploy1002:~# kube-env admin eqiad
root@deploy1002:~# kubectl create job -n datahub --from=cronjob/datahub-main-restore-indices-job-template datahub-restore-indices-job
job.batch/datahub-restore-indices-job created
root@deploy1002:~# logout

Oh, it looks like the cleanup step hasn't been requested.

2023-07-11 11:54:47,375 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Starting upgrade with id RestoreIndices...
2023-07-11 11:54:47,375 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Cleanup has not been requested.
2023-07-11 11:54:47,375 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Skipping Step 1/3: ClearSearchServiceStep...
2023-07-11 11:54:47,375 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Cleanup has not been requested.
2023-07-11 11:54:47,375 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Skipping Step 2/3: ClearGraphServiceStep...
2023-07-11 11:54:47,375 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Executing Step 3/3: SendMAEStep...
2023-07-11 11:54:47,383 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Sending MAE from local DB

So it makes sure that everything is present, but it hasn't removed entries from the search indices that aren't in the database. I'll see if it's possible to do this any other way.

BTullis mentioned this in T327969: null shown in the user profile dropdown in datahub.Jul 11 2023, 1:40 PM

Change 937099 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add the option to clean datahub indices to the restore job

https://gerrit.wikimedia.org/r/937099

Change 937099 merged by jenkins-bot:

[operations/deployment-charts@master] Add the option to clean datahub indices to the restore job

https://gerrit.wikimedia.org/r/937099

This looks better. It's deleting the existing indices and than issuing and MAE for each aspect.

2023-07-11 14:27:18,884 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Starting upgrade with id RestoreIndices...
2023-07-11 14:27:18,885 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Executing Step 1/3: ClearSearchServiceStep...
2023-07-11 14:27:22,088 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Completed Step 1/3: ClearSearchServiceStep successfully.
2023-07-11 14:27:22,089 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Executing Step 2/3: ClearGraphServiceStep...
2023-07-11 14:27:22,980 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Completed Step 2/3: ClearGraphServiceStep successfully.
2023-07-11 14:27:22,980 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Executing Step 3/3: SendMAEStep...
2023-07-11 14:27:22,981 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Sending MAE from local DB
2023-07-11 14:27:23,280 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Found 39667 latest aspects in aspects table in 0.00 minutes.
2023-07-11 14:27:23,289 [pool-12-thread-1] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Args are RestoreIndicesArgs(start=0, batchSize=1000, numThreads=1, batchDelayMs=100, aspectName=null, urn=null, urnLike=null)
2023-07-11 14:27:23,289 [pool-12-thread-1] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Reading rows 0 through 1000 from the aspects table started.
2023-07-11 14:27:23,292 [pool-12-thread-1] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Reading rows 0 through 1000 from the aspects table completed.
2023-07-11 14:27:30,375 [pool-12-thread-1] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Args are RestoreIndicesArgs(start=1000, batchSize=1000, numThreads=1, batchDelayMs=100, aspectName=null, urn=null, urnLike=null)
2023-07-11 14:27:30,375 [pool-12-thread-1] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Reading rows 1000 through 2000 from the aspects table started.
2023-07-11 14:27:30,375 [pool-12-thread-1] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Reading rows 1000 through 2000 from the aspects table completed.
...

Success! The cleanup job has successfully removed all of the errant data from elasticsearch and rebuilt the indices.

BTullis moved this task from In Progress to Needs Reporting on the Data-Platform-SRE board.Jul 11 2023, 2:34 PM

Change 937137 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/refinery@master] Update the datahub packaged environment to v0.10.4

https://gerrit.wikimedia.org/r/937137

@Milimetric - I could do with your help to update the conda environment for the datahub client please, if you have a moment.
I can see the README but I'm getting confused about some of the details.

BTullis updated the task description. (Show Details)Jul 11 2023, 3:47 PM

Change 937137 merged by Milimetric:

[analytics/refinery@master] Update the datahub packaged environment to v0.10.4

https://gerrit.wikimedia.org/r/937137

Change 936792 merged by Btullis:

[operations/puppet@production] Configure the test datahub jobs to use the staging schema registry

https://gerrit.wikimedia.org/r/936792

btullis opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/454

Update the schema registry used for airflow lineage in test

Change 938214 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add missing global values to the datahub subcharts to fix CI

https://gerrit.wikimedia.org/r/938214

Change 938214 merged by jenkins-bot:

[operations/deployment-charts@master] Add missing global values to the datahub subcharts to fix CI

https://gerrit.wikimedia.org/r/938214

BTullis moved this task from Needs Reporting to Done on the Data-Platform-SRE board.Jul 18 2023, 5:01 PM

Change 898956 abandoned by Stevemunene:

[analytics/datahub@wmf] Build datahub v0.10.0 containers

Reason:

datahub was upgraded to v0.14.0

https://gerrit.wikimedia.org/r/898956

Gehel closed this task as Resolved.Jul 21 2023, 12:53 PM

Change 943549 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add a missing environment variable to datahub/mae-consumer

https://gerrit.wikimedia.org/r/943549

Change 943549 merged by jenkins-bot:

[operations/deployment-charts@master] Add a missing environment variable to datahub/mae-consumer

https://gerrit.wikimedia.org/r/943549

btullis opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/473

Bump the version of the datahub packaged environment

btullis merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/473

Bump the version of the datahub packaged environment

aqu merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/454

Update the schema registry used for airflow lineage in test

Upgrade Datahub to v0.10.4
Closed, ResolvedPublic3 Estimated Story Points
Actions

Description

Details

Related Objects

Event Timeline

	• EChetty
	Feb 13 2023, 1:46 PM

	F37136000: image.png
	Jul 11 2023, 2:33 PM

	F37135853: image.png
	Jul 11 2023, 12:00 PM

	F37135199: image.png
	Jul 10 2023, 9:53 PM

Upgrade Datahub to v0.10.4Closed, ResolvedPublic3 Estimated Story PointsActions

Description

Details

Related Objects

Event Timeline

Upgrade Datahub to v0.10.4
Closed, ResolvedPublic3 Estimated Story Points
Actions