
New Service Request: DataHub
Closed, Resolved · Public

Description

Description: An instance of the WMF fork of LinkedIn's DataHub software. This will act as a metadata repository, facilitating data discovery by users and improving overall data governance. This is in support of the Foundation's Data as a Service OKR.

The design document for this is here. (Currently restricted to WMF.)

Timeline: As soon as possible please. We are at an MVP phase and we would like to begin working with the system as soon as possible.
Diagram: Here is a simplified diagram (src), showing the four types of daemon pod, along with their expected network requests. The Kubernetes deployments are all stateless and can be deployed to either data centre, but the back-end data tiers (MariaDB, Elasticsearch, Kafka, Karapace) are all located in eqiad.

DataHub Deployment.png (simplified deployment diagram, 77 KB)

Technologies: All components are written in Java, but the front-end also has a React component.
Point person: @BTullis and anyone else from the Data-Engineering team

All of the containers have been created using PipelineLib and are hosted on docker-registry.wikimedia.org

Changelog:

  • Uploaded a second version of the diagram, clarifying traffic paths, removing default port numbers, and including the karapace backend.

Event Timeline


How can I tell what the source IP address(es) of my services will be, as seen by the back-end data stores?
Will these be predictable and when can I find them out?

The reason I ask is that I will need to add firewall rules to allow these applications to connect to their data stores (MariaDB, Elasticsearch etc).
The earlier I could know these, the better for me, thanks.

The diagram doesn't cover Prometheus support, but it is included in the deployment.
I have added: prometheus.io/port: 4318 and prometheus.io/scrape: true to the pods: datahub-gms, datahub-mce-consumer, and datahub-mae-consumer
I think that the datahub-frontend will also expose prometheus metrics, but I will return to configure this later. It's not enabled at the moment.
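
For reference, these annotations sit on the pod template metadata. Here is a minimal sketch using a generic Deployment skeleton (the real chart structure and image reference will differ; only the two prometheus.io/* values above come from this task):

```
# Sketch only: generic Deployment skeleton showing where the scrape
# annotations from the comment above would live.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datahub-gms
spec:
  selector:
    matchLabels:
      app: datahub-gms
  template:
    metadata:
      labels:
        app: datahub-gms
      annotations:
        prometheus.io/port: "4318"   # port on which the pod exposes metrics
        prometheus.io/scrape: "true" # opt the pod into Prometheus scraping
    spec:
      containers:
        - name: datahub-gms
          image: docker-registry.wikimedia.org/datahub-gms:latest  # illustrative image reference
```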

JMeybohm triaged this task as Medium priority. Mar 7 2022, 9:51 AM

The helm charts and helmfile deployment are now passing the CI helm-lint stage.

How can I tell what the source IP address(es) of my services will be, as seen by the back-end data stores?
Will these be predictable and when can I find them out?

Those will be the IP ranges of the different k8s clusters (the non-ML ones). You can look those up in netbox: https://netbox.wikimedia.org/search/?q=kubernetes+pod&obj_type=#prefixes

Those will be the IP ranges of the different k8s clusters (the non-ML ones). You can look those up in netbox: https://netbox.wikimedia.org/search/?q=kubernetes+pod&obj_type=#prefixes

Thanks. Will do.

A note with regard to traffic: currently, the level of traffic for both of the services is expected to be very low.

  • datahub.wikimedia.org - This will be public-facing, but for the MVP phase there will be an authentication page shown to visitors. This will require an LDAP account, hopefully with CAS-SSO integration.
  • datahub.discovery.wmnet - This will be an internal only service, but the level of traffic during this phase is expected to be minimal. The frontend and consumer pods will exchange traffic with this service, but there will not be a lot of traffic from other sources.

How can I tell what the source IP address(es) of my services will be, as seen by the back-end data stores?
Will these be predictable and when can I find them out?

The reason I ask is that I will need to add firewall rules to allow these applications to connect to their data stores (MariaDB, Elasticsearch etc).

You can use already created and ready ferm macros for that. See https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/maps/postgresql_common.pp#53 for an example of use

Thus you will use an abstraction layer that is easier to maintain than listing pod IP addresses on your own.

The earlier I could know these, the better for me, thanks.

Great. Thanks both. I'm now working through the first set of comments left by @JMeybohm on the patch, trying to make it use the scaffolding more effectively.

As far as I am concerned, this service request LGTM. Thanks for the very detailed diagram (including a link to the source), repos and design doc links. The traffic flows depicted are compliant with our current setup, including reaching out to services that are present in the current analytics vlan (though that separation will become way less important soon, see T298087). Per my understanding the service will reside in the wikikube cluster for the MVP phase, despite being a bad fit for it per https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#Goal. The justification is the current lack of a proper cluster for it, and that's acceptable. The final place for it will be the dse-* cluster that is going to be built, and which feels way more like the natural home for these applications.

Deploying to the staging cluster can indeed help validate some of the code and assumptions (e.g. the LDAP/CAS integration) before proceeding to fully deploying to production.

I think once the comments on the patches are addressed, we can resolve this.

I'd like to add the proposal of using Ingress (T290966) for the frontend (to not have to configure LVS for that). For the consumers I'm not so sure, as we can't put restrictions onto the Ingress LVS (like source networks etc.). If we really need that, I think that needs to be a dedicated LVS.

I'd like to add the proposal of using Ingress (T290966) for the frontend (to not have to configure LVS for that).

Sounds good to me. What can I do to help?
Would this solution be good for the GMS as well?

For the consumers I'm not so sure, as we can't put restrictions onto the Ingress LVS (like source networks etc.). If we really need that, I think that needs to be a dedicated LVS.

I'm a bit confused by this bit, because as far as I know the consumers don't have any ingress requirements. They only make connections to Kafka, OpenSearch, and the GMS service. (I mistakenly omitted the consumer -> OpenSearch arrow from the diagram.)

Per my understanding the service will reside in the wikikube cluster for the MVP phase, despite being a bad fit for it per https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#Goal. The justification is the current lack of a proper cluster for it, and that's acceptable. The final place for it will be the dse-* cluster that is going to be built, and which feels way more like the natural home for these applications.

Yep, that's 100% the way that I understand it too. The dse-* cluster will use the same deployment pipeline and tooling, so it should be a relatively simple job to lift and shift datahub from wikikube to the new cluster when it is ready.

I'd like to add the proposal of using Ingress (T290966) for the frontend (to not have to configure LVS for that).

Sounds good to me. What can I do to help?
Would this solution be good for the GMS as well?

For the consumers I'm not so sure, as we can't put restrictions onto the Ingress LVS (like source networks etc.). If we really need that, I think that needs to be a dedicated LVS.

I'm a bit confused by this bit, because as far as I know the consumers don't have any ingress requirements. They only make connections to Kafka, OpenSearch, and the GMS service. (I mistakenly omitted the consumer -> OpenSearch arrow from the diagram.)

Sorry, totally my fault! I meant the GMS, not the consumer. From what you wrote in T301454#7741876 it sounds like you just don't want this to be publicly reachable, right? So no further internal restrictions.
If that's correct, we can use Ingress for that as well if you can refrain from requiring a specific port (as Ingress will force that to be 30443, assuming GMS is talking HTTP).

Sorry, totally my fault! I meant the GMS, not the consumer. From what you wrote in T301454#7741876 it sounds like you just don't want this to be publicly reachable, right? So no further internal restrictions.
If that's correct, we can use Ingress for that as well if you can refrain from requiring a specific port (as Ingress will force that to be 30443, assuming GMS is talking HTTP).

Yes, that's right.

  • No further internal restrictions on which IP addresses can reach the GMS - (If we want to add restrictions later, we will do this via an authentication and authorization layer within the app itself.)
  • Yes, the GMS speaks HTTP. There are two specific APIs served on this port: GraphQL and Rest.li, and they both use HTTP transfer methods.

So I'll change the global.datahub.gms.port value to 30443 in helmfile.d/services/datahub/values.yaml - is that correct?

Yes, that's right.

Great!

So I'll change the global.datahub.gms.port value to 30443 in helmfile.d/services/datahub/values.yaml - is that correct?

I can't tell for sure right now, as I haven't gone over all the charts yet. Actually I was just referring to the diagram, as it mentions specific ports and I wanted to make sure that's not a fixed requirement. I'd say let's turn towards implementing the Ingress part once review of the charts in their current state is done (it should be easy to add).

Actually I was just referring to the diagram, as it mentions specific ports and I wanted to make sure that's not a fixed requirement.

No, that's not a fixed requirement. 9002 is the default for the frontend and 8080 is the default for the GMS service, but there's no need to stick to those.
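
As a rough illustration of the kind of override being discussed (whether global.datahub.gms.port is the only value involved has not been confirmed, so treat this purely as a sketch):

```
# helmfile.d/services/datahub/values.yaml (hypothetical fragment)
global:
  datahub:
    gms:
      port: 30443   # Ingress-terminated port, replacing the upstream default of 8080
```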

OK, the latest patchset addresses a few of the low-hanging fruit, but I'm still working locally on a refactor of the secrets handling.

I've updated the helm charts for datahub so that the secrets handling is compatible with our puppet based secret handling method.

There is still an outstanding question of how to configure the DATAHUB_GMS_HOST variable when TLS is enabled on the GMS server, which is puzzling me a bit.

I've updated the diagram to clarify the way that traffic is intended to flow within the deployment - i.e. requests to the GMS do not need to go via the discovery record.

The default port numbers have also been removed and I've included the karapace schema registry backend store.

Change 769993 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Allow access to MariaDB analytics-meta from Kubernetes pods

https://gerrit.wikimedia.org/r/769993

Change 769993 merged by Btullis:

[operations/puppet@production] Allow access to MariaDB analytics-meta from Kubernetes pods

https://gerrit.wikimedia.org/r/769993

Change 771363 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add dummy deployment user/tokens for datahub

https://gerrit.wikimedia.org/r/771363

I have created deployment users and tokens in the profile::kubernetes::infrastructure_users: key in the private repo, as well as corresponding dummy values in the labs/private repo.

Their names are: datahub and datahub-deploy

The tokens are 22 character [A-z0-9] random strings, as per the documentation.
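
For anyone following along, the dummy side of this in labs/private looks roughly like the sketch below. The exact schema is an assumption and should be checked against the existing infrastructure_users entries; the real tokens live only in the private repo.

```
# labs/private hieradata (sketch with placeholder values only)
profile::kubernetes::infrastructure_users:
  datahub:
    token: dummydatahubtoken0000000  # placeholder; real value is a 22-character random string
  datahub-deploy:
    token: dummydatahubdeploy000000  # placeholder; real value is a 22-character random string
```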

Change 771407 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a kubeconfig configuration for datahub

https://gerrit.wikimedia.org/r/771407

Change 771409 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add a namespace for datahub

https://gerrit.wikimedia.org/r/771409

Change 771363 merged by Btullis:

[labs/private@master] Add dummy deployment user/tokens for datahub

https://gerrit.wikimedia.org/r/771363

Change 771563 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add dummy secrets for datahub deployment

https://gerrit.wikimedia.org/r/771563

I'm sorry to be a pain, but I'm under some pressure to implement this new service as soon as it's practicable, for which I really need help from serviceops.
The reason that it is urgent is that one of our team's OKRs is dependent on this service and its implementation has now been declared at risk. (cc @odimitrijevic )

I've carried out as many of the changes as I can from here: https://wikitech.wikimedia.org/wiki/Kubernetes#Add_a_new_service but I'm still unable to proceed to a deployment to staging.
I'm not really sure whether or not there is anything else I can do at the moment to help this along.

  • I haven't created TLS certificates for datahub.wikimedia.org or datahub.discovery.wmnet - although I'm happy to follow this procedure if advised.
  • I'm not sure if DNS changes are required, but again I'm happy to make them if advised.

I understand that ingress is to be used for the datahub-frontend service, but not for the datahub-gms service. Is that right, or will they both be using ingress?

If not using ingress for the GMS, should I go ahead and add a typical LVS service for datahub-gms, along with DNS discovery records, as per this procedure? https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service

Do I need to make any DNS changes for the services that use ingress? I've been reading this: https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Configuration_(for_service_owners) but I'm not sure how best to proceed.

Once again, many thanks for your help and apologies for the continual requests for assistance.

I haven't created TLS certificates for datahub.wikimedia.org

I don't believe you will need a cert for this, IIUC it should use the wikimedia.org wildcard cert.

datahub.discovery.wmnet

I believe you should be able to create this cert following instructions at https://wikitech.wikimedia.org/wiki/PKI/Clients

(Caveat, I have not done this in a while, so I may be WAY OFF here. SRE please correct me!)

I'm trying to get back to this today/tomorrow. You don't need to create any TLS certificates, and we can use Ingress for both frontend and gms.

Change 771407 merged by Btullis:

[operations/puppet@production] Add a kubeconfig configuration for datahub

https://gerrit.wikimedia.org/r/771407

Change 771563 merged by Btullis:

[labs/private@master] Add dummy secrets for datahub deployment

https://gerrit.wikimedia.org/r/771563

Change 771409 merged by Btullis:

[operations/deployment-charts@master] Add a namespace for datahub

https://gerrit.wikimedia.org/r/771409

For the Ingress part we will need to use two different names/discovery records for the services (as we can't distinguish by port). Maybe datahub.discovery.wmnet (or datahub-frontend.discovery.wmnet to be very explicit) and datahub-gms.discovery.wmnet?

Change 773256 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add correct tlsHostnames and extra SAN to datahub cert

https://gerrit.wikimedia.org/r/773256

For the Ingress part we will need to use two different names/discovery records for the services (as we can't distinguish by port). Maybe datahub.discovery.wmnet (or datahub-frontend.discovery.wmnet to be very explicit) and datahub-gms.discovery.wmnet?

Personally, I'd elect for the more explicit option:

  • datahub-frontend.discovery.wmnet
  • datahub-gms.discovery.wmnet

It's not going to affect the public-facing (but authenticated) URL of https://datahub.wikimedia.org for the frontend service, is it?

Personally, I'd elect for the more explicit option:

I did as well :-) See https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/773256/1

It's not going to affect the public-facing (but authenticated) URL of https://datahub.wikimedia.org for the frontend service, is it?

No, that won't be affected as termination for *.wikimedia.org happens at the CDN already. Those discovery names (and certificates) are just for internal use.

Change 773256 merged by JMeybohm:

[operations/deployment-charts@master] Add correct tlsHostnames and extra SAN to datahub cert

https://gerrit.wikimedia.org/r/773256

Change 777314 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Allow wikikube staging pod range to access kafka eqiad-test cluster

https://gerrit.wikimedia.org/r/777314

Change 777329 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Allow wikikube staging pods to access the analytics-meta test instance

https://gerrit.wikimedia.org/r/777329

Change 777329 merged by Btullis:

[operations/puppet@production] Allow wikikube staging pods to access the analytics-meta test instance

https://gerrit.wikimedia.org/r/777329

Change 777314 merged by Btullis:

[operations/puppet@production] Allow wikikube staging pod range to access kafka eqiad-test cluster

https://gerrit.wikimedia.org/r/777314

Change 779839 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Add an A record for datahub.wikimedia.org

https://gerrit.wikimedia.org/r/779839

Change 779840 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a trafficserver backend mapping rule for datahub

https://gerrit.wikimedia.org/r/779840

I'm unsure what else I need to do now to make this new service available.

I've successfully deployed the service to staging, eqiad and codfw using helmfile.

I have read T305358: service::catalog entries and dnsdisc for Kubernetes services under Ingress but I'm not clear from that what's the best thing for me to do regarding the creation of discovery records.

Change 779840 merged by Btullis:

[operations/puppet@production] Add a trafficserver backend mapping rule for datahub

https://gerrit.wikimedia.org/r/779840

Change 779839 merged by Btullis:

[operations/dns@master] Add an A record for datahub.wikimedia.org

https://gerrit.wikimedia.org/r/779839

datahub.wikimedia.org is now up and running.

image.png (screenshot, 68 KB)

Now working on getting the datahub-gms.discovery.wmnet service up and running too.

Change 780651 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add datahub-gms to the service catalog

https://gerrit.wikimedia.org/r/780651

Change 780658 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Add a CNAME reference for datahub-gms.discovery.wmnet

https://gerrit.wikimedia.org/r/780658

Change 780658 merged by Btullis:

[operations/dns@master] Add a CNAME reference for datahub-gms.discovery.wmnet

https://gerrit.wikimedia.org/r/780658

Should we call this done, or should we leave it open pending an outcome on T305358: service::catalog entries and dnsdisc for Kubernetes services under Ingress?
Many thanks again for all your support with this request @JMeybohm.

Please keep this open as it is absolutely in a hacky state currently (DNS + service::catalog wise).

Change 787759 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] trafficserver: Switch datahub to new k8s-ingress-wikikube discovery

https://gerrit.wikimedia.org/r/787759

Change 787759 merged by JMeybohm:

[operations/puppet@production] trafficserver: Switch datahub to new k8s-ingress-wikikube discovery

https://gerrit.wikimedia.org/r/787759

I finally managed to verify and document the steps needed to put a service under Ingress. I did also update the general
https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service documentation (which contains a link to the Ingress specific part).
@BTullis: I'd very much like you to go over the new docs to verify those are useful to others. From what I remember datahub still needs:

  • most of the DNS CNAME records (currently only datahub-gms.discovery.wmnet exists)
  • service::catalog entries for datahub-frontend and datahub-gms
  • to make use of datahub-frontend.discovery.wmnet in hieradata/common/profile/trafficserver/backend.yaml (a rough sketch of the last two items follows below)
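
To make the last two items a little more concrete, here is a rough, hypothetical sketch. The keys are based on my reading of existing entries and should be checked against hieradata/common/service.yaml and the current trafficserver backend configuration rather than taken as authoritative; the 30443 port is the Ingress port assumed from the earlier discussion.

```
# hieradata/common/service.yaml (sketch of one catalogue entry; keys illustrative)
datahub-gms:
  description: DataHub Generalized Metadata Service
  encryption: true
  port: 30443            # assumed Ingress port
  sites:
    - eqiad
    - codfw
  state: service_setup   # promoted to production once everything is in place

# hieradata/common/profile/trafficserver/backend.yaml (sketch of the mapping rule)
profile::trafficserver::backend::mapping_rules:
  - type: map
    target: https://datahub.wikimedia.org
    replacement: https://datahub-frontend.discovery.wmnet:30443  # assuming the Ingress port discussed above
```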

Thanks @JMeybohm - Those docs are really useful. I will proceed to make the changes required.

There's one part that I'm not clear on from the docs. (The instructions are perfectly clear but I don't understand why they say what they do.)

From this section: https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#DNS_changes

image.png (229×618 px, 15 KB)

I understand that it's something to do with DNS Discovery - but it seems counter-intuitive to have to refer to a read-only DNS record if the service supports writing in both DCs. Have I misunderstood something, or is it just a quirk of the setup that I have to get used to?

I understand that it's something to do with DNS Discovery - but it seems counter-intuitive to have to refer to a read-only DNS record if the service supports writing in both DCs. Have I misunderstood something, or is it just a quirk of the setup that I have to get used to?

I was under the impression that datahub should only run/be used in the active datacenter because it relies on state in MySQL and other datastores which are not equally available in both DCs. If your question is about naming (-ro/-rw), that is indeed not super ideal. It's just the way other services that need this distinction are named. As I did not come up with anything better/more clear, I decided to just stick to the standard. But of course, please feel free to suggest something if you have an idea!

I was under the impression that datahub should only run/be used in the active datacenter because it relies on state in MySQL and other datastores which are not equally available in both DCs.

Thanks. You're right that the stateful components only exist in eqiad, so running datahub in codfw is going to be slower than eqiad. However, the network policies are in place so that it should work in codfw.
I'm happy to take advice on whether this should be set up as an active/passive or active/active service. Do you think active/passive would be better, if our preferred site is eqiad?

If your question is about naming (-ro/-rw), that is indeed not super ideal. It's just the way other services that need this distinction are named. As I did not come up with anything better/more clear, I decided to just stick to the standard. But of course, please feel free to suggest something if you have an idea!

Cool. I haven't got any ideas at the moment, but thanks for the clarification.

I was under the impression that datahub should only run/be used in the active datacenter because it relies on state in MySQL and other datastores which are not equally available in both DCs.

Thanks. You're right that the stateful components only exist in eqiad, so running datahub in codfw is going to be slower than eqiad. However, the network policies are in place so that it should work in codfw.
I'm happy to take advice on whether this should be set up as an active/passive or active/active service. Do you think active/passive would be better, if our preferred site is eqiad?

If the service is set up as active/active, traffic from end-users that are closer to codfw will be directed to codfw. Those users will get the slower experience you describe, probably coupled with timeouts (we've tried that in the past in some unrelated tests; 40ms of latency for regular SQL query patterns is a huge pain).

One question that I do have is whether we should be switching over this service or not during the next DC switchover[1]. From what I understand the underlying datastores are eqiad only, so if we did switch it over we'd effectively be penalizing it, performance-wise, so the answer would be no. Am I understanding correctly?

[1] https://wikitech.wikimedia.org/wiki/Switch_Datacenter

FWIW, we hope that Datahub will one day be a service for more than just analytics data, but for now, it isn't, and can be considered part of the 'analytics cluster' which only exists in eqiad. What that means for DNS names and DC failover for now I'll leave up to yall :)

FWIW, we hope that Datahub will one day be a service for more than just analytics data, but for now, it isn't, and can be considered part of the 'analytics cluster' which only exists in eqiad. What that means for DNS names and DC failover for now I'll leave up to yall :)

Thanks for this input. I think I should add that we use part of the DC failover process whenever we need to upgrade or perform maintenance operations on our kubernetes clusters. What this means for Datahub is that when we upgrade the eqiad cluster (we are planning to upgrade from 1.16 to probably 1.23 this quarter), we'll have to switch to the codfw cluster. For the duration of the upgrade plus some safety time windows before/after the upgrade, traffic will always be served from codfw and thus have the worse experience mentioned above.

For the duration of the upgrade plus some safety time windows before/after the upgrade, traffic will always be served from codfw and thus have the worse experience mentioned above.

Great, seems acceptable to me!

I finally managed to verify and document the steps needed to put a service under Ingress. I did also update the general
https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service documentation (which contains a link to the Ingress specific part).
@BTullis: I'd very much like you to go over the new docs to verify those are useful to others. From what I remember datahub still needs:

  • most of the DNS CNAME records (currently only datahub-gms.discovery.wmnet exists)
  • service::catalog entries for datahub-frontend and datahub-gms
  • to make use of datahub-frontend.discovery.wmnet in hieradata/common/profile/trafficserver/backend.yaml

Sorry for nudging @BTullis - are you missing any information or do you need any assistance regarding the remaining steps? Just asking because I would like datahub out of its snowflake state as far as possible.

Sorry for nudging @BTullis - are you missing any information or do you need any assistance regarding the remaining steps? Just asking because I would like datahub out of its snowflake state as far as possible.

Apologies for the delay @JMeybohm, I'll get right on it. I think I have everything I need, thanks - it's just that I've been busy with other things recently. I'll add you to the CRs for review.

Change 805328 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Add DNS CNAME records for datahub ingress on k8s

https://gerrit.wikimedia.org/r/805328

Change 805331 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the trafficserver rule for datahub

https://gerrit.wikimedia.org/r/805331

OK @JMeybohm I've created three CRs that I think should do what we need to finish this.

  • Adding CNAME records to DNS
  • Adding service catalog entries
  • Switching the trafficserver rule (dependent on the DNS change)

Please feel free to review whenever it's convenient for you.

Cool, thanks! +1ed the first two.
The service::catalog entries should be in state production before switching trafficserver to the discovery record, just to be sure.
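
In other words, assuming the entries use the standard state field, the ordering would be roughly:

```
# Sketch: both catalogue entries reach state production first,
# then the trafficserver rule is switched to the discovery record.
datahub-frontend:
  state: production
datahub-gms:
  state: production
```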

Change 805328 merged by Btullis:

[operations/dns@master] Add DNS CNAME records for datahub ingress on k8s

https://gerrit.wikimedia.org/r/805328

Change 780651 merged by Btullis:

[operations/puppet@production] Add DataHub GMS and frontend services to the service catalog

https://gerrit.wikimedia.org/r/780651

Change 805395 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Promote the datahub services to production

https://gerrit.wikimedia.org/r/805395

Change 805395 merged by Btullis:

[operations/puppet@production] Promote the datahub services to production

https://gerrit.wikimedia.org/r/805395

Change 805331 merged by Btullis:

[operations/puppet@production] Update the trafficserver rule for datahub

https://gerrit.wikimedia.org/r/805331

All merged. Thanks! 🎉