Datahub errors in staging-codfw
Closed, Resolved · Public

Description

As part of the upgrade of the Kubernetes clusters to version 1.23, we have been testing all of our deployments on staging-codfw.

Datahub showed errors and memory leaks.

Event Timeline

Change 883226 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Add reverse DNS IPv4 entries for the staging-codfw k8s cluster

https://gerrit.wikimedia.org/r/883226

Change 883226 merged by Btullis:

[operations/dns@master] Add reverse DNS IPv4 entries for the staging-codfw k8s cluster

https://gerrit.wikimedia.org/r/883226

I tried adding reverse DNS entries for the staging-codfw cluster, since this was a difference between it and the staging-eqiad cluster.

While the patch itself worked, it didn't fix the issue with datahub. What happens is that, on startup, the GMS container exhibits a runaway memory leak and is killed by the OOM killer.

We have tried increasing the memory limit for this pod from 2 GB to 3 GB of RAM, but it still used it all. On the staging-eqiad cluster, the GMS pod uses no more than 1 GB and is stable.
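
For the record, the OOM kill and the pod's usage against its limit can be confirmed with kubectl (a sketch; the namespace and pod name are placeholders, and kubectl top requires metrics-server):

# Confirm the last restart was an OOM kill (Reason: OOMKilled, exit code 137)
kubectl -n datahub describe pod datahub-gms-<pod-id> | grep -A 4 'Last State'

# Watch live memory usage per container against the 3 GB limit
kubectl -n datahub top pod --containers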

I have checked the firewall rules and any other IP restrictions on all of the back-end data stores, but I can't find anything that would be blocked.
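
A check of that kind can also be run from inside the GMS pod itself, to rule out network policy as well as firewalls (a sketch; the pod name, hostnames, and ports below are placeholders for the real back-end endpoints):

# Probe each back-end store from inside the pod using bash's /dev/tcp
for target in mariadb.example:3306 elasticsearch.example:9200 kafka.example:9092; do
  kubectl -n datahub exec datahub-gms-<pod-id> -- \
    timeout 3 bash -c "</dev/tcp/${target%:*}/${target#*:}" \
    && echo "$target reachable" || echo "$target unreachable"
done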

Finally, I have also tried running the mce-consumer and mae-consumer in their own pods, as we used to, but that didn't help either. The GMS logs are very verbose on startup, but I'm now looking through them for any other clues as to why it isn't working.

I did not find any real clues either. What I do see is that GMS gets killed at random stages during startup, and the logs do not differ between retries.
Maybe something that does not log at all is eating all the CPU and memory during startup? I don't know a thing about datahub/GMS, so it's really hard to dig around.
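
To catch a process that eats memory without logging, one could sample the process table inside the container during startup (a sketch; assumes ps is present in the image, and the pod name is a placeholder):

# Print the five largest processes by resident memory every two seconds
kubectl -n datahub exec datahub-gms-<pod-id> -- sh -c \
  'while true; do ps -eo pid,rss,comm --sort=-rss | head -n 5; sleep 2; done'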

Is this the last blocker to upgrading staging-eqiad to 1.23, @JMeybohm?

If so, I wonder whether we should proceed with the upgrade anyway. This would tell us whether the errors are related to the 1.23 upgrade or to something else in staging-codfw.

I treated it like a blocker for now, and it is the last one, yes. The thing is that if this turns out to be a problem related to 1.23, you would no longer have a staging system to deploy to.
Do you have a minikube testbed for datahub? Maybe you could try running that on k8s 1.23 to see if something comes up.
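
For a local reproduction, minikube can pin the cluster version (a sketch; the exact 1.23 patch release is an assumption):

# Start a throwaway local cluster on the Kubernetes version under test
minikube start --kubernetes-version=v1.23.0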

I doubt this is k8s-related, though (gut feeling). But maybe it has something to do with underlying changes (I'm thinking of the switch to cgroup v2, for example).
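
For what it's worth, the cgroup version in use can be read from the filesystem type of the cgroup mount (a sketch, run on a k8s node or inside a pod):

# "cgroup2fs" means cgroup v2 (unified hierarchy); "tmpfs" means legacy v1
stat -fc %T /sys/fs/cgroup/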

Thinking about cgroups tickled my memory and surfaced https://bugs.openjdk.org/browse/JDK-8230305?subTaskView=all
Datahub is currently based on OpenJDK 11.0.9, so it might be worth trying an update to 11.0.16.
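
If that bug is in play, the JVM on 11.0.9 cannot read the memory limit from cgroup v2 and sizes its default heap from the host's RAM instead, which would match the runaway growth seen here. The JVM's own view can be checked with unified logging (a sketch, run inside the GMS container):

# Ask the JVM which container limits it detected at startup (JDK 11+ flag)
java -Xlog:os+container=info -version
# If no container memory limit is reported, the default max heap is derived
# from host memory, so the pod can grow past its cgroup limit and be OOM-killed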

I've pushed updated openjdk-11-jdk images, so this is just a matter of re-running the datahub blubber pipeline now, I suppose.

Change 884280 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/datahub@wmf] Rebuild the datahub containers to pick up new base JDK

https://gerrit.wikimedia.org/r/884280

Change 884280 merged by jenkins-bot:

[analytics/datahub@wmf] Rebuild the datahub containers to pick up new base JDK

https://gerrit.wikimedia.org/r/884280

Change 884287 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Deploy the new datahub image

https://gerrit.wikimedia.org/r/884287

Change 884287 merged by jenkins-bot:

[operations/deployment-charts@master] Deploy the new datahub image

https://gerrit.wikimedia.org/r/884287

Just a quick double-check that it correctly picked up the new version of Java.

btullis@marlin:~$ docker run -it --entrypoint=/bin/bash docker-registry.wikimedia.org/wikimedia/datahub-gms:d1f814d28d2b838b11ab9d544323b67dedde9a9f-production
runuser@26227b2bfa1a:/datahub/datahub-gms$ java -version
openjdk version "11.0.16" 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Debian-1deb10u1)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Debian-1deb10u1, mixed mode, sharing)
runuser@26227b2bfa1a:/datahub/datahub-gms$

...which it did. 👍

Attempting a deploy to staging-eqiad first, followed by staging-codfw.

This upgrade to the JRE has worked! Thanks @JMeybohm

I have also verified that it allows user logins on the staging-codfw deployment, using the following configuration (a headless check with curl is sketched after the list).

  • Modify my local /etc/hosts file to allow datahub-frontend.k8s-staging.discovery.wmnet to resolve to 127.0.0.1
  • Create an SSH tunnel using the ingress port of the staging-codfw cluster: ssh -N -L 30443:k8s-ingress-staging.svc.codfw.wmnet:30443 deploy2002.codfw.wmnet
  • Open this URL in a browser: https://datahub-frontend.k8s-staging.discovery.wmnet:30443/
  • Accept the security warning and log in to DataHub
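
The same login endpoint can be checked headlessly; curl can pin the hostname to the tunnel's local end, so the /etc/hosts edit isn't needed (a sketch):

# With the SSH tunnel from step 2 running; -k accepts the certificate
# behind the browser security warning
curl -kI --resolve datahub-frontend.k8s-staging.discovery.wmnet:30443:127.0.0.1 \
  https://datahub-frontend.k8s-staging.discovery.wmnet:30443/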

I think we can say that this ticket is resolved and is no longer blocking T327664: Update staging-eqiad to k8s 1.23.