While working on T316079#9109164, I noticed that the linkrecommendation-internal-load-datasets k8s pods are failing:
[urbanecm@deploy1002 ~]$ kube_env linkrecommendation eqiad
[urbanecm@deploy1002 ~]$ kubectl get pods
NAME                                                      READY   STATUS      RESTARTS   AGE
[...]
linkrecommendation-internal-load-datasets-28209900-l7c5w   0/1     Completed   0          31h
linkrecommendation-internal-load-datasets-28209960-nr7nb   0/1     Completed   0          30h
linkrecommendation-internal-load-datasets-28210020-rjff6   0/1     Completed   0          29h
linkrecommendation-internal-load-datasets-28211760-7dxzs   0/1     Error       0          52m
linkrecommendation-internal-load-datasets-28211760-k2plz   0/1     Error       0          52m
[urbanecm@deploy1002 ~]$
The logs of one of the failed pods show the following:
[urbanecm@deploy1002 ~]$ kubectl logs linkrecommendation-internal-load-datasets-28211760-k2plz > load-datasets-logs.json
[urbanecm@deploy1002 ~]$ jq -r .exc_info < load-datasets-logs.json
Traceback (most recent call last):
  File "load-datasets.py", line 440, in <module>
    main()
  File "load-datasets.py", line 433, in main
    run(args)
  File "load-datasets.py", line 255, in run
    % (dataset, remote_checksum.status_code)
RuntimeError: Unable to download checksum for anchors, status code: 404.
[urbanecm@deploy1002 ~]$ jq -r .output < load-datasets-logs.json
[...]
== Attempting to download datasets (anchors, redirects, pageids, w2vfiltered, model) for gagwiki ==
Checksum in database matches remote checksum, skipping download for anchors
Checksum in database matches remote checksum, skipping download for redirects
Checksum in database matches remote checksum, skipping download for pageids
Checksum in database matches remote checksum, skipping download for w2vfiltered
Checksum in database matches remote checksum, skipping download for model
All datasets for gagwiki are up-to-date!
== Attempting to download datasets (anchors, redirects, pageids, w2vfiltered, model) for ganwiki ==
[urbanecm@deploy1002 ~]$
It seems that fetching the data failed for ganwiki because its datasets are not available: the checksum request for the first dataset (anchors) returned a 404. Reviewing the current state at https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/ shows that the ganwiki (and krcwiki) directories don't exist, yet both wikis are listed in wikis.txt, which is the source of truth for the linkrecommendation k8s service.
We need to update wikis.txt so that it matches reality.
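To find every stale entry rather than only the two spotted by hand, the comparison between wikis.txt and the published directories could be scripted. The following is a minimal sketch, not the service's own logic. It assumes that wikis.txt contains one wiki ID per line and that each wiki's datasets live in a directory named after the wiki under the base URL above. The helper names (parse_wikis, url_exists, find_missing_wikis) are hypothetical.

```python
from typing import Callable, Iterable, List
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Base URL taken from this task; the per-wiki layout below is an assumption.
BASE_URL = "https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/"


def parse_wikis(text: str) -> List[str]:
    """Return wiki IDs from wikis.txt content (assumed: one ID per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Return False on HTTP 404, True on success; re-raise any other HTTP error."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout):
            return True
    except HTTPError as error:
        if error.code == 404:
            return False
        raise


def find_missing_wikis(
    wikis: Iterable[str],
    exists: Callable[[str], bool] = url_exists,
    base_url: str = BASE_URL,
) -> List[str]:
    """Return the listed wikis whose dataset directory does not exist."""
    return [wiki for wiki in wikis if not exists(f"{base_url}{wiki}/")]
```

Calling find_missing_wikis(parse_wikis(text)) with the contents of wikis.txt as text would list the entries to remove. The exists parameter is injectable so the comparison can be checked offline without issuing HTTP requests.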