
Unable to deploy new version of recommendation-api to production due to connectivity issues
Closed, Resolved | Public | 4 Estimated Story Points

Description

The image (2024-11-08-142328-production) works fine with the wmcloud instance (https://recommend.wmcloud.org/) but deploying to production (at the moment on staging) breaks with errors listed in https://phabricator.wikimedia.org/P70998

Event Timeline

From the logs at https://phabricator.wikimedia.org/P70998:

What I see is that none of the API calls are succeeding, whether to *.wikipedia.org, meta.wikimedia.org, or Wikidata. All of them are connection timeouts. That raises a question about the proxy, localhost:6500: from the logs it seems that the proxy endpoint is not reachable at all. However, that is the same proxy being used in production, and the currently deployed version is running fine.

Nikerabbit raised the priority of this task from High to Unbreak Now!. Tue, Nov 12, 7:46 AM
Nikerabbit set the point value for this task to 4.
Nikerabbit subscribed.

Imho deployment blockers are UBN!.

Nikerabbit renamed this task from recommendation-api: Breaks deployment with 2024-11-08-142328-production to Unable to deploy new version of recommendation-api to production due to connectivity issues. Tue, Nov 12, 8:40 AM

Some notes from testing done by @elukey:

  • localhost:6500 works on new pods
  • It is worth following up on this error:
    log.error(f"Error response {exc.response.status_code} while requesting {exc.request.url!r}.")
                                ^^^^^^^^^^^^
AttributeError: 'ConnectTimeout' object has no attribute 'response'

That should be https://gerrit.wikimedia.org/r/plugins/gitiles/research/recommendation-api/+/refs/heads/master/recommendation/external_data/fetcher.py#63 . In theory, a ConnectTimeout error should be handled differently there, which would give us better info.

I think I found the issue. This is what I see from a new pod (available only for the duration of the deployment, after which helmfile/helm rolls it back):

elukey@ml-staging2002:~$ sudo nsenter -t 885379 -n netstat -tunap | grep SYN
tcp        0      1 10.194.61.140:54914     208.80.153.224:443      SYN_SENT    885396/python       
tcp        0      1 10.194.61.140:54890     208.80.153.224:443      SYN_SENT    885397/python       
tcp        0      1 10.194.61.140:54910     208.80.153.224:443      SYN_SENT    885399/python       
tcp        0      1 10.194.61.140:54894     208.80.153.224:443      SYN_SENT    885398/python

The IP 208.80.153.224 is text-lb.codfw.wikimedia.org, one of our front-end load balancers, so I think that (part of) the new code tries to connect to the MW API directly without going through the localhost:6500 proxy.
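
When a pod has many sockets open, the same check can be summarized with a few lines of Python. This is only an illustrative sketch: `syn_sent_destinations` is a hypothetical helper, and it assumes the column layout of the `netstat -tunap` output shown above. The sample lines are copied from that output.

```python
from collections import Counter

# Two lines copied from the netstat output above.
SAMPLE = """\
tcp        0      1 10.194.61.140:54914     208.80.153.224:443      SYN_SENT    885396/python
tcp        0      1 10.194.61.140:54890     208.80.153.224:443      SYN_SENT    885397/python
"""


def syn_sent_destinations(netstat_output: str) -> Counter:
    """Count remote endpoints of TCP connections stuck in SYN_SENT."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, foreign, state, pid/program
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "SYN_SENT":
            counts[fields[4]] += 1
    return counts


print(syn_sent_destinations(SAMPLE))
```

Any destination other than the local proxy that piles up in SYN_SENT is a candidate for traffic that is leaving the pod directly.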

Change #1090590 had a related patch set uploaded (by Santhosh; author: Santhosh):

[research/recommendation-api@master] Improve logging and exception handling

https://gerrit.wikimedia.org/r/1090590

Change #1090593 had a related patch set uploaded (by Santhosh; author: Santhosh):

[operations/deployment-charts@master] recommendation-api-ng: fix wikidata host header

https://gerrit.wikimedia.org/r/1090593

Thanks @elukey. Since all our logs show requests going through localhost:6500, yet some requests are bypassing it, I suspected redirects from the API endpoints.

As illustrated in this notebook, the host header for Wikidata must be www.wikidata.org and not wikidata.org. wikidata.org will work but involves a redirect, and the redirected request does not go through localhost:6500. That is why we see bypassed requests.

https://colab.research.google.com/drive/1Fsx27n4yK6zpPNt-YLirXHNWvhehkMEi#scrollTo=eAHIbhqyD18L

In the above patches, I disabled auto redirect handling, improved error logging and fixed the configuration.
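
The bypass can be illustrated without any network access. This is a sketch, assuming the client resolves a redirect's `Location` header against the URL it originally requested, as HTTP clients normally do. `redirect_leaves_proxy` is a hypothetical helper, not part of recommendation-api, and the URLs and scheme used for the proxy are illustrative.

```python
from urllib.parse import urljoin, urlsplit


def redirect_leaves_proxy(requested_url: str, location: str) -> bool:
    """Return True if following `location` would contact a different host
    than the one originally requested (here, the local proxy)."""
    # A relative Location stays on the requested host; an absolute one replaces it.
    target = urljoin(requested_url, location)
    return urlsplit(target).netloc != urlsplit(requested_url).netloc


# Request sent to the proxy with "Host: wikidata.org"; the reply redirects
# to an absolute URL on www.wikidata.org.
print(redirect_leaves_proxy(
    "http://localhost:6500/w/api.php",
    "https://www.wikidata.org/w/api.php",
))
```

With automatic redirect handling disabled, such a response surfaces as a 3xx status in the caller instead of silently opening a direct connection.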

Change #1090593 merged by jenkins-bot:

[operations/deployment-charts@master] recommendation-api-ng: fix wikidata host header

https://gerrit.wikimedia.org/r/1090593

I've deployed staging with patch https://gerrit.wikimedia.org/r/1090593 and it seems to be working fine. Thanks @santhosh

Mentioned in SAL (#wikimedia-operations) [2024-11-13T08:54:40Z] <kart_> Updated recommedation-api to 2024-11-08-142328-production and fix wikidata host header (T379592)

Production deployment also went well but we will keep the task open till https://gerrit.wikimedia.org/r/1090590 is deployed as well.

Nikerabbit lowered the priority of this task from Unbreak Now! to High. Wed, Nov 13, 9:29 AM

Change #1090590 merged by jenkins-bot:

[research/recommendation-api@master] Avoid following redirects in external API calls, improve error handling

https://gerrit.wikimedia.org/r/1090590

Change #1089964 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update recommendation api to 2024-11-13-183159-production

https://gerrit.wikimedia.org/r/1089964

Change #1089964 merged by jenkins-bot:

[operations/deployment-charts@master] Update recommendation api to 2024-11-13-183159-production

https://gerrit.wikimedia.org/r/1089964

Mentioned in SAL (#wikimedia-operations) [2024-11-18T12:37:19Z] <kart_> Updated recommendation api to 2024-11-13-183159-production (T379592, T379037)

Since we did 2 successful deployments recently, deployments are no longer blocked!