
Stop / remove linkrecommendation-production-load-datasets-1618311600-hn6k8
Closed, ResolvedPublic

Description

There was a bug in the load-datasets.py script, which runs as a CronJob in the linkrecommendation service. While iterating over the list of wikis (there are ten), the script keeps appending each wiki's datasets to a single list of datasets to import instead of re-initializing that list for every wiki. The result is output like:

== Importing datasets (anchors, redirects, pageids, w2vfiltered, model) for cswiki ==
== Importing datasets (anchors, redirects, pageids, w2vfiltered, model, anchors, redirects, pageids, w2vfiltered, model) for simplewiki ==
== Importing datasets (anchors, redirects, pageids, w2vfiltered, model, anchors, redirects, pageids, w2vfiltered, model, anchors, redirects, pageids, w2vfiltered, model) for arwiki ==

And so on. There's no harm beyond wasted CPU resources and time (we'd like the datasets to finish updating sooner rather than later). It would be nice if linkrecommendation-production-load-datasets-1618311600-hn6k8 could be removed so that a new CronJob pod is created with the latest deployed code (the fix was in https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/678918).
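For illustration, here is a minimal Python sketch of the accumulation bug and its fix. The function names, loop structure, and dataset handling are hypothetical simplifications, not the actual code from load-datasets.py:

DATASET_TYPES = ["anchors", "redirects", "pageids", "w2vfiltered", "model"]

def import_datasets_buggy(wikis):
    datasets_to_import = []  # BUG: initialized once, outside the per-wiki loop
    for wiki in wikis:
        datasets_to_import.extend(DATASET_TYPES)  # keeps growing on every iteration
        print(f"== Importing datasets ({', '.join(datasets_to_import)}) for {wiki} ==")
        # ... actual import work would happen here ...

def import_datasets_fixed(wikis):
    for wiki in wikis:
        datasets_to_import = list(DATASET_TYPES)  # FIX: re-initialize for each wiki
        print(f"== Importing datasets ({', '.join(datasets_to_import)}) for {wiki} ==")
        # ... actual import work would happen here ...

Calling import_datasets_buggy(["cswiki", "simplewiki", "arwiki"]) reproduces the growing dataset list shown in the output above.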

Event Timeline

Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2021-04-13T20:47:45Z] <mutante> [kubemaster1001:~] $ sudo kubectl delete pod linkrecommendation-production-load-datasets-1618311600-hn6k8 -n linkrecommendation (T280076)

[kubemaster1001:~] $ sudo kubectl get pods -n linkrecommendation
NAME                                                           READY   STATUS      RESTARTS   AGE
linkrec-load-dataset-debug-pc5pg                               1/1     Running     0          10h
linkrecommendation-external-5c5bb5c69c-m2f9m                   3/3     Running     0          73m
linkrecommendation-production-5c6896d49c-2rzqr                 3/3     Running     0          71m
linkrecommendation-production-5c6896d49c-5vk5g                 3/3     Running     0          73m
linkrecommendation-production-5c6896d49c-fhpp9                 3/3     Running     0          73m
linkrecommendation-production-5c6896d49c-gns6l                 3/3     Running     0          72m
linkrecommendation-production-5c6896d49c-ltn82                 3/3     Running     0          73m
linkrecommendation-production-5c6896d49c-mrg7g                 3/3     Running     0          72m
linkrecommendation-production-5c6896d49c-pzc5b                 3/3     Running     0          73m
linkrecommendation-production-5c6896d49c-sq4ph                 3/3     Running     0          72m
linkrecommendation-production-load-datasets-1616997600-8tkcv   0/1     Completed   0          15d
linkrecommendation-production-load-datasets-1617001200-g9bds   0/1     Completed   0          15d
linkrecommendation-production-load-datasets-1617004800-77j4h   0/1     Completed   0          15d
linkrecommendation-production-load-datasets-1618308000-m455v   0/1     Error       0          10h
linkrecommendation-production-load-datasets-1618308000-pc5pg   0/1     Error       0          10h
linkrecommendation-production-load-datasets-1618311600-hn6k8   1/1     Running     0          9h
tiller-974b97fc7-qrlp7                                         1/1     Running     0          5d12h


[kubemaster1001:~] $ sudo kubectl delete pod linkrecommendation-production-load-datasets-1618311600-hn6k8 -n linkrecommendation

pod "linkrecommendation-production-load-datasets-1618311600-hn6k8" deleted
Dzahn claimed this task.

I deleted the pod as requested ^. Hope that was correct, as I had not done it before in production.

Perfect, thank you @Dzahn! I have also confirmed that the new cronjob pod is working correctly.

Hmm, I think I spoke too soon:

kubectl describe pod/linkrecommendation-production-load-datasets-1618311600-mccln
Image:         docker-registry.discovery.wmnet/wikimedia/research-mwaddlink:2021-04-08-210153-production

But it should be using the 2021-04-13-190913-production image.

What I think happened here is:

  • The CronJob (with the old image set in its spec) created a Job "linkrecommendation-production-load-datasets-1618311600"
  • That Job created a Pod (linkrecommendation-production-load-datasets-1618311600-hn6k8) - with the old image, of course.
  • That Pod was deleted by @Dzahn
  • The Job (watching over the Pod) could not find the Pod and scheduled a new one
  • The CronJob, on the other hand, would not schedule a new Job, as concurrency is limited to one and there already was an "active" Job (a quick way to check this is sketched below).
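As a way to verify that picture, here is a minimal sketch using the kubernetes Python client (an assumption for illustration; the actual checks on the cluster were done with kubectl). It reads the Job's active-pod count and the image pinned in its pod template:

from kubernetes import client, config

config.load_kube_config()  # assumes kubeconfig access to the cluster
batch = client.BatchV1Api()

job = batch.read_namespaced_job(
    name="linkrecommendation-production-load-datasets-1618311600",
    namespace="linkrecommendation",
)
# A non-zero active count is what keeps the CronJob from starting a new Job.
print("active pods:", job.status.active)
# The Job's pod template still pins the old image, so any re-created Pod reuses it.
print("image:", job.spec.template.spec.containers[0].image)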

What I did:

  • Deleted the Job (I actually deleted both of the erroring Jobs to be sure)
  • Which in turn deletes the Pod(s), because they are managed by the Job (a cascading delete, essentially; a client-side equivalent is sketched after the output below)
root@deploy1002:~(eqiad:kube-system)# kubectl -n linkrecommendation get jobs                     
NAME                                                     COMPLETIONS   DURATION   AGE                                                                    
linkrecommendation-production-load-datasets-1616997600   1/1           20s        16d                                                                    
linkrecommendation-production-load-datasets-1617001200   1/1           19s        16d                                                                    
linkrecommendation-production-load-datasets-1617004800   1/1           19s        16d                                                                    
linkrecommendation-production-load-datasets-1618308000   0/1           23h        23h                                                                    
linkrecommendation-production-load-datasets-1618311600   0/1           22h        22h                                                                    
root@deploy1002:~(eqiad:kube-system)# kubectl -n linkrecommendation delete job linkrecommendation-production-load-datasets-1618308000 linkrecommendation-production-load-datasets-1618311600
job.batch "linkrecommendation-production-load-datasets-1618308000" deleted                                                                               
job.batch "linkrecommendation-production-load-datasets-1618311600" deleted

Now the CronJob created a new Job (with the new image), which in turn created a new Pod that is currently running:

$ kubectl get job,pod
NAME                                                               COMPLETIONS   DURATION   AGE
job.batch/linkrecommendation-production-load-datasets-1616997600   1/1           20s        16d
job.batch/linkrecommendation-production-load-datasets-1617001200   1/1           19s        16d
job.batch/linkrecommendation-production-load-datasets-1617004800   1/1           19s        16d
job.batch/linkrecommendation-production-load-datasets-1618390800   0/1           9m59s      9m59s
$ kubectl get po -l job-name=linkrecommendation-production-load-datasets-1618390800
NAME                                                           READY   STATUS    RESTARTS   AGE
linkrecommendation-production-load-datasets-1618390800-np4ch   1/1     Running   0          11m
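To double-check that the recreated Pod picked up the new image, a small sketch (again assuming the kubernetes Python client and kubeconfig access; the expected tag is the one mentioned above):

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(
    namespace="linkrecommendation",
    label_selector="job-name=linkrecommendation-production-load-datasets-1618390800",
)
for pod in pods.items:
    # Expect the 2021-04-13-190913-production tag rather than the 2021-04-08 one.
    print(pod.metadata.name, pod.spec.containers[0].image)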

@JMeybohm I appreciate the detailed explanation, including the commands. TIL!