The image (2024-11-08-142328-production) works fine with the wmcloud instance (https://recommend.wmcloud.org/) but deploying to production (at the moment on staging) breaks with errors listed in https://phabricator.wikimedia.org/P70998
From logs https://phabricator.wikimedia.org/P70998
None of the API calls are succeeding, whether to *.wikipedia.org, meta.wikimedia.org or wikidata. All of them fail with connection timeouts. That raises a question about the proxy, localhost:6500: from the logs it seems that the proxy endpoint is not reachable at all. However, the same proxy is used in production, and the currently deployed version is running fine.
Some notes from testing done by @elukey:
- localhost:6500 works on new pods
- It is worth following up with the error:
log.error(f"Error response {exc.response.status_code} while requesting {exc.request.url!r}.")
^^^^^^^^^^^^
AttributeError: 'ConnectTimeout' object has no attribute 'response'
That should be https://gerrit.wikimedia.org/r/plugins/gitiles/research/recommendation-api/+/refs/heads/master/recommendation/external_data/fetcher.py#63 In theory a ConnectTimeout should be handled differently from an error response, which would give us better information.
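A minimal sketch of what handling the timeout separately could look like. It relies only on what the traceback shows: a connect timeout carries no response, so the attribute has to be read defensively. `describe_request_error` is a hypothetical helper, not the code in fetcher.py, and the two exception classes are stand-ins for whatever the HTTP client raises.

```python
def describe_request_error(exc: Exception, url: str) -> str:
    """Build a log message for a failed request without assuming a response exists."""
    # Status errors carry a response; transport errors (timeouts,
    # refused connections) do not, so look the attribute up defensively.
    response = getattr(exc, "response", None)
    if response is not None:
        return f"Error response {response.status_code} while requesting {url!r}."
    return f"{type(exc).__name__} while requesting {url!r}: {exc}"


# Stand-in classes for illustration only (assumption): the real
# exceptions come from the HTTP client library used by the service.
class FakeResponse:
    status_code = 503


class FakeStatusError(Exception):
    def __init__(self, message, response):
        super().__init__(message)
        self.response = response


class ConnectTimeout(Exception):
    pass
```

With this shape, a timeout is logged with its own class name instead of raising a second AttributeError inside the error handler.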
I think I found the issue, this is what I see from a new pod (available only for the time of the deployment, then helmfile/helm rolls it back):
elukey@ml-staging2002:~$ sudo nsenter -t 885379 -n netstat -tunap | grep SYN
tcp 0 1 10.194.61.140:54914 208.80.153.224:443 SYN_SENT 885396/python
tcp 0 1 10.194.61.140:54890 208.80.153.224:443 SYN_SENT 885397/python
tcp 0 1 10.194.61.140:54910 208.80.153.224:443 SYN_SENT 885399/python
tcp 0 1 10.194.61.140:54894 208.80.153.224:443 SYN_SENT 885398/python
The IP 208.80.153.224 is text-lb.codfw.wikimedia.org, one of our front-end LBs, so I think that (part of) the new code tries to connect to the MW API directly without going through the localhost:6500 proxy.
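As a rough illustration of the check above, the following sketch groups connections stuck in SYN_SENT by remote endpoint. The sample text is the netstat output quoted in this comment; `count_syn_sent_by_remote` is a hypothetical helper name, and the column layout assumed is that of `netstat -tunap` for TCP lines.

```python
from collections import Counter

NETSTAT_OUTPUT = """\
tcp 0 1 10.194.61.140:54914 208.80.153.224:443 SYN_SENT 885396/python
tcp 0 1 10.194.61.140:54890 208.80.153.224:443 SYN_SENT 885397/python
tcp 0 1 10.194.61.140:54910 208.80.153.224:443 SYN_SENT 885399/python
tcp 0 1 10.194.61.140:54894 208.80.153.224:443 SYN_SENT 885398/python
"""


def count_syn_sent_by_remote(netstat_text: str) -> Counter:
    """Count connections stuck in SYN_SENT, keyed by remote address."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, remote, state, pid/program
        if len(fields) >= 6 and fields[5] == "SYN_SENT":
            counts[fields[4]] += 1
    return counts
```

If every stuck connection points at the same remote address and none at localhost:6500, that address is the place to look.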
Change #1090590 had a related patch set uploaded (by Santhosh; author: Santhosh):
[research/recommendation-api@master] Improve logging and exception handling
Change #1090593 had a related patch set uploaded (by Santhosh; author: Santhosh):
[operations/deployment-charts@master] recommendation-api-ng: fix wikidata host header
Thanks @elukey. Since our logs show all requests going through localhost:6500 while some requests still bypass it, I suspected redirects from the API endpoints.
As illustrated in this notebook, the host header for wikidata must be www.wikidata.org and not wikidata.org. wikidata.org works, but it involves a redirect, and that redirect does not go through localhost:6500. That is why we see bypassed requests.
https://colab.research.google.com/drive/1Fsx27n4yK6zpPNt-YLirXHNWvhehkMEi#scrollTo=eAHIbhqyD18L
In the above patches, I disabled auto redirect handling, improved error logging and fixed the configuration.
This patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1090593/1/helmfile.d/ml-services/recommendation-api-ng/values.yaml alone will fix the issue. The recommendation-api patch is about avoiding these surprises in the future.
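To make the redirect problem concrete, here is a small sketch that checks whether following a redirect would leave the proxy. The URLs are illustrative assumptions (the thread does not show the exact request URL or Location header), and `redirect_leaves_proxy` is a hypothetical helper, not part of either patch.

```python
from urllib.parse import urljoin, urlsplit


def redirect_leaves_proxy(request_url: str, location: str) -> bool:
    """Return True if following `location` connects to a different origin than `request_url`."""
    # Resolve relative Location values against the original request URL.
    target = urlsplit(urljoin(request_url, location))
    origin = urlsplit(request_url)
    return (target.scheme, target.netloc) != (origin.scheme, origin.netloc)


# Assumed shape of a proxied request: the client talks to the local proxy
# and selects the backend through the Host header.
PROXY_URL = "http://localhost:6500/w/api.php"
```

An absolute redirect such as `https://www.wikidata.org/w/api.php` makes the function return True, which is the bypass seen in the netstat output. A relative redirect stays on the proxy. Disabling automatic redirects turns such a case into a visible 3xx response instead of a silent direct connection.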
Change #1090593 merged by jenkins-bot:
[operations/deployment-charts@master] recommendation-api-ng: fix wikidata host header
I've deployed staging with patch https://gerrit.wikimedia.org/r/1090593 and it seems to be working fine. Thanks @santhosh
Mentioned in SAL (#wikimedia-operations) [2024-11-13T08:54:40Z] <kart_> Updated recommedation-api to 2024-11-08-142328-production and fix wikidata host header (T379592)
Production deployment also went well but we will keep the task open till https://gerrit.wikimedia.org/r/1090590 is deployed as well.
Change #1090590 merged by jenkins-bot:
[research/recommendation-api@master] Avoid following redirects in external API calls, improve error handling
Change #1089964 had a related patch set uploaded (by KartikMistry; author: KartikMistry):
[operations/deployment-charts@master] Update recommendation api to 2024-11-13-183159-production
Change #1089964 merged by jenkins-bot:
[operations/deployment-charts@master] Update recommendation api to 2024-11-13-183159-production
Mentioned in SAL (#wikimedia-operations) [2024-11-18T12:37:19Z] <kart_> Updated recommendation api to 2024-11-13-183159-production (T379592, T379037)