Datahub errors in staging-codfw
Closed, Resolved · Public

Description

As part of the upgrade of the Kubernetes clusters to version 1.23, we have been testing all of our deployments on staging-codfw.

Datahub showed errors and memory leaks.

Event Timeline

Change 883226 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Add reverse DNS IPv4 entries for the staging-codfw k8s cluster

https://gerrit.wikimedia.org/r/883226

Change 883226 merged by Btullis:

[operations/dns@master] Add reverse DNS IPv4 entries for the staging-codfw k8s cluster

https://gerrit.wikimedia.org/r/883226

I tried adding reverse DNS entries for the staging-codfw cluster, since this was a difference between it and the staging-eqiad cluster.

While the patch itself worked, it didn't fix the issue with datahub. What happens is that, on startup, the GMS container exhibits a runaway memory leak and is killed by the OOM killer.

We have tried increasing the memory limit for this pod from 2 GB to 3 GB of RAM, but it still used it all. On the staging-eqiad cluster, the GMS pod uses no more than 1 GB and is stable.
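
For the record, the OOM kill and the pod's usage against its limit can be confirmed with kubectl (a sketch; the namespace and pod name are placeholders, and kubectl top requires metrics-server):

# Confirm the last restart was an OOM kill (Reason: OOMKilled, exit code 137)
kubectl -n datahub describe pod datahub-gms-<pod-id> | grep -A 4 'Last State'

# Watch live memory usage per container against the 3 GB limit
kubectl -n datahub top pod --containers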

I have checked the firewall rules and any other IP restrictions on all of the back-end data stores, but I can't find anything that would be blocked.
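
A check of that kind can also be run from inside the GMS pod itself, to rule out network policy as well as firewalls (a sketch; the pod name, hostnames, and ports below are placeholders for the real back-end endpoints):

# Probe each back-end store from inside the pod using bash's /dev/tcp
for target in mariadb.example:3306 elasticsearch.example:9200 kafka.example:9092; do
  kubectl -n datahub exec datahub-gms-<pod-id> -- \
    timeout 3 bash -c "</dev/tcp/${target%:*}/${target#*:}" \
    && echo "$target reachable" || echo "$target unreachable"
done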

Finally, I have also tried running the mce-consumer and mae-consumer in their own pods, as we used to, but that didn't help either. The GMS logs are very verbose on startup, but I'm now looking through them for any other clues as to why it isn't working.

I did not find any real clues either. What I do see is that GMS gets killed at random stages during startup, and the logs do not differ between retries.
Maybe something that does not log at all is eating all the CPU and memory during startup? I don't know a thing about datahub/GMS, so it's really hard to dig around.
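
To catch a process that eats memory without logging, one could sample the process table inside the container during startup (a sketch; assumes ps is present in the image, and the pod name is a placeholder):

# Print the five largest processes by resident memory every two seconds
kubectl -n datahub exec datahub-gms-<pod-id> -- sh -c \
  'while true; do ps -eo pid,rss,comm --sort=-rss | head -n 5; sleep 2; done'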

Is this the last blocker to upgrading staging-eqiad to 1.23, @JMeybohm?

If so, I wonder whether we should proceed with the upgrade anyway. This would tell us whether the errors are related to the 1.23 upgrade or to something else in staging-codfw.

I treated it like a blocker for now, and it is the last one, yes. The thing is that if this turns out to be a problem related to 1.23, you would no longer have a staging system to deploy to.
Do you have a minikube testbed for datahub? Maybe you could try running that on k8s 1.23 to see if something comes up.
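
For a local reproduction, minikube can pin the cluster version (a sketch; the exact 1.23 patch release is an assumption):

# Start a throwaway local cluster on the Kubernetes version under test
minikube start --kubernetes-version=v1.23.0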

I doubt this is k8s-related, though (gut feeling). But maybe it has something to do with underlying changes (I'm thinking of the switch to cgroup v2, for example).
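
For what it's worth, the cgroup version in use can be read from the filesystem type of the cgroup mount (a sketch, run on a k8s node or inside a pod):

# "cgroup2fs" means cgroup v2 (unified hierarchy); "tmpfs" means legacy v1
stat -fc %T /sys/fs/cgroup/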

Thinking about cgroups tickled my memory and surfaced https://bugs.openjdk.org/browse/JDK-8230305?subTaskView=all
Datahub is currently based on OpenJDK 11.0.9, so it might be worth trying an update to 11.0.16.
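
If that bug is in play, the JVM on 11.0.9 cannot read the memory limit from cgroup v2 and sizes its default heap from the host's RAM instead, which would match the runaway growth seen here. The JVM's own view can be checked with unified logging (a sketch, run inside the GMS container):

# Ask the JVM which container limits it detected at startup (JDK 11+ flag)
java -Xlog:os+container=info -version
# If no container memory limit is reported, the default max heap is derived
# from host memory, so the pod can grow past its cgroup limit and be OOM-killed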

I've pushed updated openjdk-11-jdk images, so this is just a matter of re-running the datahub blubber pipeline now, I suppose.

Change 884280 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Rebuild the datahub containers to pick up new base JDK

https://gerrit.wikimedia.org/r/884280

Change 884280 merged by jenkins-bot:

[analytics/datahub@wmf] Rebuild the datahub containers to pick up new base JDK

https://gerrit.wikimedia.org/r/884280

Change 884287 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Deploy the new datahub image

https://gerrit.wikimedia.org/r/884287

Change 884287 merged by jenkins-bot:

[operations/deployment-charts@master] Deploy the new datahub image

https://gerrit.wikimedia.org/r/884287

Just a quick double-check that it correctly picked up the new version of Java.

btullis@marlin:~$ docker run -it --entrypoint=/bin/bash docker-registry.wikimedia.org/wikimedia/datahub-gms:d1f814d28d2b838b11ab9d544323b67dedde9a9f-production
runuser@26227b2bfa1a:/datahub/datahub-gms$ java -version
openjdk version "11.0.16" 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Debian-1deb10u1)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Debian-1deb10u1, mixed mode, sharing)
runuser@26227b2bfa1a:/datahub/datahub-gms$

...which it did. 👍

Attempting a deploy to staging-eqiad first, followed by staging-codfw.

This upgrade to the JRE has worked! Thanks @JMeybohm

I have also verified that it allows user logins on the staging-codfw deployment, using the following configuration (a headless check with curl is sketched after the list).

  • Modify my local /etc/hosts file to allow datahub-frontend.k8s-staging.discovery.wmnet to resolve to 127.0.0.1
  • Create an SSH tunnel using the ingress port of the staging-codfw cluster: ssh -N -L 30443:k8s-ingress-staging.svc.codfw.wmnet:30443 deploy2002.codfw.wmnet
  • Open this URL in a browser: https://datahub-frontend.k8s-staging.discovery.wmnet:30443/
  • Accept the security warning and log in to DataHub
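
The same login endpoint can be checked headlessly; curl can pin the hostname to the tunnel's local end, so the /etc/hosts edit isn't needed (a sketch):

# With the SSH tunnel from step 2 running; -k accepts the certificate
# behind the browser security warning
curl -kI --resolve datahub-frontend.k8s-staging.discovery.wmnet:30443:127.0.0.1 \
  https://datahub-frontend.k8s-staging.discovery.wmnet:30443/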

I think we can say that this ticket is resolved and is no longer blocking T327664: Update staging-eqiad to k8s 1.23.