
Update DataHub to latest version for MVP
Closed, Resolved (Public)

Description

When we initially set out to create this MVP, we selected the latest version at the time, v0.8.28.

Several new versions have been released in the intervening time, so the latest is now v0.8.32.

https://datahubproject.io/docs/releases/

There are several features and fixes that will be of interest, but the most useful one for us will likely be this feature from v0.8.32:

RBAC Functionality: View-Based Policies to further fine-tune what your DataHub end-users can see & do in DataHub

We should also take this opportunity to ensure that the upgrade process for DataHub is clearly defined and repeatable.

Event Timeline

Here are the guidelines from @hashar on how this might best be achieved.

image.png (384×1 px, 87 KB)

BTullis moved this task from Backlog to In Progress on the Data-Catalog board.
BTullis moved this task from Next Up to In Progress on the Data-Engineering-Kanban board.

I have added the original upstream repository as a remote in my checkout of the analytics/datahub repo.

git remote add linkedin-github git@github.com:datahub-project/datahub.git

I then fetched from the new remote.

btullis@marlin-wsl:~/wmf/datahub$ git remote update linkedin-github
Fetching linkedin-github
remote: Enumerating objects: 208019, done.
remote: Counting objects: 100% (5961/5961), done.
remote: Compressing objects: 100% (419/419), done.
remote: Total 208019 (delta 4228), reused 5871 (delta 4199), pack-reused 202058
Receiving objects: 100% (208019/208019), 180.63 MiB | 7.99 MiB/s, done.
Resolving deltas: 100% (100333/100333), completed with 933 local objects.
From github.com:datahub-project/datahub
 * [new branch]            custom-kafka-registry             -> linkedin-github/custom-kafka-registry
 * [new branch]            elasticsearch-5-legacy            -> linkedin-github/elasticsearch-5-legacy
 * [new branch]            feature/file-lineage-ingestion    -> linkedin-github/feature/file-lineage-ingestion
 * [new branch]            gh-pages                          -> linkedin-github/gh-pages
 * [new branch]            master                            -> linkedin-github/master
 * [new branch]            shirshanka-patch-1                -> linkedin-github/shirshanka-patch-1
 * [new branch]            update-yarn-lock-for-os-migration -> linkedin-github/update-yarn-lock-for-os-migration
 * [new branch]            wherehows                         -> linkedin-github/wherehows
 * [new branch]            yarn-auto-gen                     -> linkedin-github/yarn-auto-gen
 * [new tag]               RC-v0.8.28                        -> RC-v0.8.28
 * [new tag]               v0.8.27                           -> v0.8.27
 * [new tag]               v0.8.28                           -> v0.8.28
 * [new tag]               v0.8.28rc1                        -> v0.8.28rc1
 * [new tag]               v0.8.29                           -> v0.8.29
 * [new tag]               v0.8.30                           -> v0.8.30
 * [new tag]               v0.8.31                           -> v0.8.31
 * [new tag]               v0.8.32                           -> v0.8.32
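To make this step repeatable, the two commands above could be wrapped so that re-running them is harmless. This is only a sketch: the function name `sync_upstream` and its arguments are hypothetical and not part of the analytics/datahub repo.

```shell
#!/bin/sh
# Sketch: add the upstream remote only if it is missing, then fetch it.
# The function name and its arguments are hypothetical.
sync_upstream() {
    name="$1"   # remote name, e.g. linkedin-github
    url="$2"    # remote URL
    # Add the remote only when it is not already configured.
    git remote get-url "$name" >/dev/null 2>&1 || git remote add "$name" "$url"
    # Fetch the remote's branches and all of its tags.
    git fetch --quiet --tags "$name"
}
```

It would be called as `sync_upstream linkedin-github git@github.com:datahub-project/datahub.git` from inside the checkout.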

I have created a new ACL in gerrit, allowing members of the Analytics group to push to the master branch:
https://gerrit.wikimedia.org/r/admin/repos/analytics/datahub,access

image.png (170×746 px, 17 KB)

I then pushed the updates from upstream master branch to our master branch.

btullis@marlin-wsl:~/wmf/datahub$ git push origin linkedin-github/master:master
Enumerating objects: 8825, done.
Counting objects: 100% (8825/8825), done.
Delta compression using up to 16 threads
Compressing objects: 100% (3395/3395), done.
Writing objects: 100% (7696/7696), 4.09 MiB | 3.25 MiB/s, done.
Total 7696 (delta 4428), reused 6454 (delta 3323), pack-reused 0
remote: Resolving deltas: 100% (4428/4428)
remote: Processing changes: refs: 1, done
remote: commit f99d27f: warning: subject >100 characters; use shorter first paragraph
remote: commit f99d27f: warning: too many message lines longer than 120 characters; manually wrap lines
remote: commit a20012f: warning: subject >100 characters; use shorter first paragraph
remote: commit c09834d: warning: subject >100 characters; use shorter first paragraph
remote: commit a69eac8: warning: subject >100 characters; use shorter first paragraph
remote: commit a69eac8: warning: too many message lines longer than 120 characters; manually wrap lines
remote: commit 5c80177: warning: subject >100 characters; use shorter first paragraph
remote: commit 12bb2e1: warning: subject >100 characters; use shorter first paragraph
remote: commit 30ed5f2: warning: subject >100 characters; use shorter first paragraph
remote: commit bb413be: warning: subject >100 characters; use shorter first paragraph
remote: commit c713b60: warning: subject >100 characters; use shorter first paragraph
remote: commit 92b0e1c: warning: subject >100 characters; use shorter first paragraph
remote: commit 1438abf: warning: subject >100 characters; use shorter first paragraph
remote: commit 7f4cb87: warning: subject >100 characters; use shorter first paragraph
remote: commit e3599c5: warning: subject >100 characters; use shorter first paragraph
remote: commit f37bdad: warning: subject >100 characters; use shorter first paragraph
To ssh://gerrit.wikimedia.org:29418/analytics/datahub
   6c214add36..5a59d5a1dd  linkedin-github/master -> master
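Before a push like this, it may be worth confirming that our master is an ancestor of the upstream master, so that the push is a plain fast-forward. A minimal sketch, assuming a hypothetical helper named `can_fast_forward`:

```shell
#!/bin/sh
# Sketch: succeed only if moving from $1 to $2 would be a fast-forward.
# The helper name is hypothetical.
can_fast_forward() {
    current="$1"     # e.g. origin/master
    candidate="$2"   # e.g. linkedin-github/master
    # Exit status 0 when $current is an ancestor of $candidate.
    git merge-base --is-ancestor "$current" "$candidate"
}
```

For example: `can_fast_forward origin/master linkedin-github/master && git push origin linkedin-github/master:master`.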

I then pushed the tags as well.

btullis@marlin-wsl:~/wmf/datahub$ git push origin --tags
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
remote: Processing changes: refs: 8, done
To ssh://gerrit.wikimedia.org:29418/analytics/datahub
 * [new tag]               RC-v0.8.28 -> RC-v0.8.28
 * [new tag]               v0.8.27 -> v0.8.27
 * [new tag]               v0.8.28 -> v0.8.28
 * [new tag]               v0.8.28rc1 -> v0.8.28rc1
 * [new tag]               v0.8.29 -> v0.8.29
 * [new tag]               v0.8.30 -> v0.8.30
 * [new tag]               v0.8.31 -> v0.8.31
 * [new tag]               v0.8.32 -> v0.8.32

I tried merging the new tag into the wmf branch:

git checkout wmf
git merge v0.8.32

I got several merge conflicts, which I addressed.

I then tried git review wmf.

However, I had to resolve the same merge conflicts again, and git review then reported a large number of outstanding commits.

btullis@marlin-wsl:~/wmf/datahub$ git review wmf
You are about to submit multiple commits. This is expected if you are
submitting a commit that is dependent on one or more in-review
commits, or if you are submitting multiple self-contained but
dependent changes. Otherwise you should consider squashing your
changes into one commit before submitting (for indivisible changes) or
submitting from separate branches (for independent changes).

The outstanding commits are:

801db1c3ae (HEAD -> wmf) fix(search): handle commas in search queries in the UI (#4570)
5d8e2a6d94 fix: replace direct and indirect references to linkedin with datahub-project (#4557)
c1651b9709 fix(policy): Add view entity page priv to all entity types (#4569)
919f9cfeac fix(bigquery): missing dependency (#4567)
<snip>

Should I squash this into one merge commit before doing a git review wmf?

I looked at this with @Milimetric and @Ottomata. For now we have decided to continue with a rebase model, since we want our modifications to be in a single commit at the HEAD of the wmf branch.

So we just did:

git rebase -i v0.8.32
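The rebase model can be sketched as a small helper that rebases the branch onto the new tag and reports how many local commits end up on top of it. The function name `rebase_onto_tag` is hypothetical, and the sketch uses a plain non-interactive rebase rather than `-i`, so it does not squash anything itself.

```shell
#!/bin/sh
# Sketch: rebase a local-modifications branch onto a new upstream tag.
# The function name and its arguments are hypothetical.
rebase_onto_tag() {
    branch="$1"   # e.g. wmf
    tag="$2"      # e.g. v0.8.32
    git checkout --quiet "$branch" || return 1
    # Non-interactive rebase; stops with a non-zero status on conflicts.
    git rebase --quiet "$tag" || return 1
    # Print the number of local commits now sitting on top of the tag.
    git rev-list --count "$tag..$branch"
}
```

Running `rebase_onto_tag wmf v0.8.32` prints the number of commits on wmf that are not in the tag; a result of 1 would match the goal of keeping our modifications in a single commit.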

Change 779898 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update datahub to use version 0.8.32

https://gerrit.wikimedia.org/r/779898

Change 779898 merged by jenkins-bot:

[operations/deployment-charts@master] Update datahub to use version 0.8.32

https://gerrit.wikimedia.org/r/779898

Change 780884 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Add a new config file to the datahub-gms image

https://gerrit.wikimedia.org/r/780884

Change 780884 merged by jenkins-bot:

[analytics/datahub@wmf] Add a new config file to the datahub-gms image

https://gerrit.wikimedia.org/r/780884

Change 780906 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the container images used for datahub

https://gerrit.wikimedia.org/r/780906

Change 780906 merged by jenkins-bot:

[operations/deployment-charts@master] Update the container images used for datahub

https://gerrit.wikimedia.org/r/780906

Change 780924 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Update the location of the datahub war file

https://gerrit.wikimedia.org/r/780924

Change 780924 merged by jenkins-bot:

[analytics/datahub@wmf] Update the location of the datahub war file

https://gerrit.wikimedia.org/r/780924

Change 784235 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the datahub image used for deployment

https://gerrit.wikimedia.org/r/784235

Change 784235 merged by jenkins-bot:

[operations/deployment-charts@master] Update the datahub image used for deployment

https://gerrit.wikimedia.org/r/784235

This is now done. Version 0.8.32 has been rolled out to datahub.wikimedia.org.

There was an issue with an incompatible schema between versions, but this was resolved by manually deleting the schema versions stored in Karapace.

btullis@karapace1001:/etc/karapace$ curl -X DELETE http://karapace1001:8081/subjects/MetadataChangeLog_Versioned_v1-value
[1,2]
btullis@karapace1001:/etc/karapace$ curl -X DELETE http://karapace1001:8081/subjects/MetadataAuditEvent_v4-value
[1]

I referred to the Karapace docs here to obtain the state information from Karapace: https://github.com/aiven/karapace#quickstart
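For future upgrades, it might help to list the versions registered for a subject before deleting it. The sketch below assumes the schema-registry style REST endpoints described in the Karapace quickstart; the function names, the `REGISTRY_URL` default, and the `CURL` override (useful for a dry run) are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: inspect a subject's versions, then delete the subject.
# REGISTRY_URL, CURL and the function names are assumptions.
REGISTRY_URL="${REGISTRY_URL:-http://karapace1001:8081}"
CURL="${CURL:-curl}"   # set CURL=echo to print the request instead of sending it

list_versions() {
    # GET /subjects/<subject>/versions returns a JSON array of version numbers.
    $CURL --silent "$REGISTRY_URL/subjects/$1/versions"
}

delete_subject() {
    # DELETE /subjects/<subject> returns the list of deleted versions.
    $CURL --silent -X DELETE "$REGISTRY_URL/subjects/$1"
}
```

For example, `list_versions MetadataAuditEvent_v4-value` followed by `delete_subject MetadataAuditEvent_v4-value`.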