
linkrecommendation-internal-load-datasets pod is failing
Closed, Resolved (Public)

Description

While working on T316079#9109164, I noticed that the linkrecommendation-internal-load-datasets k8s pods are failing:

[urbanecm@deploy1002 ~]$ kube_env linkrecommendation eqiad
[urbanecm@deploy1002 ~]$ kubectl get pods
NAME                                                       READY   STATUS      RESTARTS   AGE
[...]
linkrecommendation-internal-load-datasets-28209900-l7c5w   0/1     Completed   0          31h
linkrecommendation-internal-load-datasets-28209960-nr7nb   0/1     Completed   0          30h
linkrecommendation-internal-load-datasets-28210020-rjff6   0/1     Completed   0          29h
linkrecommendation-internal-load-datasets-28211760-7dxzs   0/1     Error       0          52m
linkrecommendation-internal-load-datasets-28211760-k2plz   0/1     Error       0          52m
[urbanecm@deploy1002 ~]$

Upon investigating the errors, I saw the following:

[urbanecm@deploy1002 ~]$ kubectl logs linkrecommendation-internal-load-datasets-28211760-k2plz > load-datasets-logs.json
[urbanecm@deploy1002 ~]$ jq -r .exc_info < load-datasets-logs.json
Traceback (most recent call last):
  File "load-datasets.py", line 440, in <module>
    main()
  File "load-datasets.py", line 433, in main
    run(args)
  File "load-datasets.py", line 255, in run
    % (dataset, remote_checksum.status_code)
RuntimeError: Unable to download checksum for anchors, status code: 404.
[urbanecm@deploy1002 ~]$ jq -r .output < load-datasets-logs.json
[...]
== Attempting to download datasets (anchors, redirects, pageids, w2vfiltered, model) for gagwiki ==
   Checksum in database matches remote checksum, skipping download for anchors
   Checksum in database matches remote checksum, skipping download for redirects
   Checksum in database matches remote checksum, skipping download for pageids
   Checksum in database matches remote checksum, skipping download for w2vfiltered
   Checksum in database matches remote checksum, skipping download for model
   All datasets for gagwiki are up-to-date!
== Attempting to download datasets (anchors, redirects, pageids, w2vfiltered, model) for ganwiki ==
[urbanecm@deploy1002 ~]$
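
To find which wiki the job was processing when it failed, the last "Attempting to download datasets" header in the output can be extracted programmatically. The following is a minimal sketch, assuming the header format shown in the output above; the function name `last_attempted_wiki` is hypothetical and not part of load-datasets.py.

```python
import re
from typing import Optional

# Matches the per-wiki header lines seen in the job output above, e.g.
# "== Attempting to download datasets (anchors, ...) for ganwiki =="
HEADER_RE = re.compile(
    r"^== Attempting to download datasets \((?P<datasets>[^)]*)\) for (?P<wiki>\S+) ==$"
)


def last_attempted_wiki(output: str) -> Optional[str]:
    """Return the wiki named in the last header line, or None if there is none.

    Hypothetical helper: if the job aborts on the first failure, the last
    header printed names the wiki that triggered it.
    """
    wiki = None
    for line in output.splitlines():
        match = HEADER_RE.match(line.strip())
        if match:
            wiki = match.group("wiki")
    return wiki
```

Feeding it the text printed by `jq -r .output` would return the wiki whose download was attempted last.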

It seems fetching the data failed for ganwiki because its datasets are not available. The current state at https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/ shows that the ganwiki and krcwiki directories don't exist, yet both wikis are listed in wikis.txt, which is the source of truth for the linkrecommendation k8s service.

We need to update wikis.txt to match reality.
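
The mismatch described above (wikis listed in wikis.txt without a published dataset directory) could be detected ahead of time with a small consistency check. This is only a sketch under stated assumptions: it assumes wikis.txt holds one wiki ID per line, and the names `parse_wikis` and `find_unavailable_wikis` are hypothetical. The existence check is passed in as a callable so that no URL layout is assumed here.

```python
from typing import Callable, Iterable, List


def parse_wikis(wikis_txt: str) -> List[str]:
    """Parse the contents of wikis.txt.

    Assumption: one wiki ID per line; blank lines are ignored.
    """
    return [line.strip() for line in wikis_txt.splitlines() if line.strip()]


def find_unavailable_wikis(
    wikis: Iterable[str], dataset_exists: Callable[[str], bool]
) -> List[str]:
    """Return the listed wikis for which dataset_exists(wiki) is False.

    dataset_exists is supplied by the caller, e.g. a function that checks
    whether the wiki's directory is present on the publishing host.
    """
    return [wiki for wiki in wikis if not dataset_exists(wiki)]
```

For example, with a set of directory names that are known to exist:

```python
published = {"gagwiki"}
listed = parse_wikis("gagwiki\nganwiki\nkrcwiki\n")
print(find_unavailable_wikis(listed, lambda wiki: wiki in published))
```

Any wiki reported by such a check would either need its datasets published or its entry removed from wikis.txt.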

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2023-08-22T13:01:33Z] <urbanecm> stat1008: Remove krcwiki and ganwiki from /srv/published/datasets/one-off/research-mwaddlink/wikis.txt (T344686)

Should be fixed now. Once the pod runs again, it should not fail.

This seems to be resolved now:

[urbanecm@deploy1002 ~]$ kubectl get pods
NAME                                                       READY   STATUS      RESTARTS   AGE
[...]
linkrecommendation-internal-load-datasets-28210020-rjff6   0/1     Completed   0          31h
linkrecommendation-internal-load-datasets-28211760-7dxzs   0/1     Error       0          157m
linkrecommendation-internal-load-datasets-28211760-k2plz   0/1     Error       0          157m
linkrecommendation-internal-load-datasets-28211820-czrgz   0/1     Completed   0          97m
linkrecommendation-internal-load-datasets-28211880-4hbqn   0/1     Completed   0          37m
[urbanecm@deploy1002 ~]$

Resolving this task; I filed T344711 to find a long-term solution.