Call to section recommendation API mysteriously failing
Closed, ResolvedPublic

Description

In development, we noticed that calls to the recommendation API for sections translation are failing repeatedly until they finally work.

Recent examples include

In the context of the topic-based suggestions, it looks like it fails when I select a topic I haven't used before. Then that topic eventually works and it works fine when I go back to it.

Event Timeline

How is this ticket a LPL Hypothesis? Sounds like Recommendation-API or SectionTranslation instead?

LPL Hypothesis is the stream of work this task is part of. Those other tags also apply.

I created a simple script to send 50 consecutive requests to the endpoint for this URL: /api/v1/translation/sections?source=en&target=el&seed=Grimes. In my local env, all 50 requests were successful. When I tried the same for the LiftWing production API, 10 out of 50 requests failed.

Thus, I would suppose that it's an issue from the LiftWing API side.
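The script itself was not posted. A minimal sketch of that kind of check, using only the Python standard library, could look like the following. The helper names and the base URL passed in are assumptions for illustration; only the path and query parameters come from the comment above.

```python
from collections import Counter
from typing import Callable
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode
from urllib.request import urlopen

# Path and parameters quoted in the comment above.
PATH = "/api/v1/translation/sections"
PARAMS = {"source": "en", "target": "el", "seed": "Grimes"}


def build_url(base_url: str) -> str:
    """Join a base URL (an assumption, supplied by the caller) with the tested path."""
    return f"{base_url.rstrip('/')}{PATH}?{urlencode(PARAMS)}"


def http_status(url: str) -> int:
    """Return the HTTP status code, or 0 if no HTTP response was received."""
    try:
        with urlopen(url, timeout=30) as resp:
            return resp.status
    except HTTPError as err:  # 4xx/5xx responses are raised as exceptions
        return err.code
    except URLError:  # DNS failure, refused connection, timeout, ...
        return 0


def tally_statuses(
    url: str, attempts: int = 50, fetch: Callable[[str], int] = http_status
) -> Counter:
    """Send `attempts` consecutive requests and count the status codes seen."""
    return Counter(fetch(url) for _ in range(attempts))
```

Calling `tally_statuses(build_url(base))` once against the local instance and once against the production endpoint gives a per-status count that can be compared directly.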

Any suggestions about how we should go forward with it?

SBisson moved this task from Backlog to Prioritized on the LPL Hypothesis board.

When you say failed, what is the exact error/code/message?

500 Internal server error

What is the User Agent you use? Do you have exact timestamps when the errors occurred?

This will help us find the error in our own logs.

I tried this URL just now and it failed with the same error. UA was Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36

It looks like this is a problem with one of the backends to this service (which is serving 503s, which we then sort of forward).

Specifically, it seems cxserver is serving us 503s, which result in this: https://phabricator.wikimedia.org/P69061 in the rec-api-ng server.

Note the URL http://localhost:6015/v2/suggest/sections/Beetlejuice/en/fr is magically (Istio) forwarded to the CXSERVER endpoint and results in a 503.

The magic bit in turn logs internally:

[2024-09-12T15:35:01.378Z] "GET /v2/suggest/sections/The%20Godfather/en/fr HTTP/1.1" 503 UC 0 95 0 - "-" "WMF Recommendation API (https://recommend.wmflabs.org/; leila@wikimedia.org)" "99ecd643-c19b-46a9-ab91-57b287c2e5bb" "cxserver.wikimedia.org" "10.2.2.18:4002"

The UC code means that the upstream server (at the IP:port at the end of the log line) reset the connection mid-request, or just after the GET was done.
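To pull the status code, response flags and upstream address out of lines like the one above, a small parser helps. This is a sketch that assumes the proxy writes Envoy's default access log format (the sample line is consistent with it); the field names are my own labels, not anything the proxy emits.

```python
import re
from typing import Optional

# Assumes Envoy's default access log layout:
# [start] "METHOD PATH PROTOCOL" code flags rx tx duration upstream_time
# "x-forwarded-for" "user-agent" "request-id" "authority" "upstream_host"
LOG_RE = re.compile(
    r'\[(?P<start_time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<flags>\S+) '
    r'(?P<bytes_received>\d+) (?P<bytes_sent>\d+) '
    r'(?P<duration_ms>\d+) (?P<upstream_time>\S+) '
    r'"(?P<forwarded_for>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'"(?P<request_id>[^"]*)" "(?P<authority>[^"]*)" '
    r'"(?P<upstream_host>[^"]*)"'
)

# The log line quoted in the comment above.
SAMPLE = (
    '[2024-09-12T15:35:01.378Z] "GET /v2/suggest/sections/The%20Godfather/en/fr '
    'HTTP/1.1" 503 UC 0 95 0 - "-" "WMF Recommendation API '
    '(https://recommend.wmflabs.org/; leila@wikimedia.org)" '
    '"99ecd643-c19b-46a9-ab91-57b287c2e5bb" "cxserver.wikimedia.org" '
    '"10.2.2.18:4002"'
)


def parse_access_log(line: str) -> Optional[dict]:
    """Return the named fields of one access log line, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def is_upstream_reset(fields: dict) -> bool:
    """True for a 503 whose response flags include UC."""
    return fields["status"] == "503" and "UC" in fields["flags"].split(",")
```

Filtering a log dump with `is_upstream_reset` and grouping by `upstream_host` would show whether the resets cluster on particular backends.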

I set up a complete local stack with the app calling a local recommendation API, which in turn calls a local CX server. After a great many tries I got it to fail twice. Both failures were only reported by Python as "ConnectError" from the recommendation API to the MW search API.

If this is working fine locally but breaking on cxserver, could it be a resource problem? I don't know whether that is running on a VM, which might be starved for some resource.

This URL failed many many times for me just now before finally succeeding. Wondering if there's something that can be found in the log, or if there's a way I can look for it myself. Is it in logstash?

This is a link to logs on logstash.
I tried the same request, got a 500 and found this error in the logs:

Client error '404 Not Found' for url 'http://localhost:6015/v2/suggest/sections/9/11%20conspiracy%20theories/en/vi'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404

and the related stack trace (also shown on logstash)

I'm not sure this accounts for all the errors we're seeing, since some are intermittent, but from the log above it looks like the article title encoding is insufficient when the title contains a forward slash (9/11 conspiracy theories). When the slash is left intact, it results in a URL path that CX server doesn't recognize, hence the 404. I tested the fix locally and it works. Patch incoming.
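To illustrate the encoding problem: Python's `urllib.parse.quote` treats `/` as safe by default, so a title containing a slash turns into an extra path segment unless `safe=""` is passed. The helper below is a hypothetical sketch of that idea, not the code from the patch; the path layout is taken from the URL in the log above.

```python
from urllib.parse import quote


def section_suggestion_path(title: str, source: str, target: str) -> str:
    """Build the CX server path with the title encoded as a single path segment."""
    # safe="" makes quote() percent-encode "/" as %2F in addition to spaces etc.
    encoded_title = quote(title, safe="")
    return f"/v2/suggest/sections/{encoded_title}/{source}/{target}"


# Default quoting leaves the slash intact, which yields the 404 path seen in the log.
broken = quote("9/11 conspiracy theories")
fixed = section_suggestion_path("9/11 conspiracy theories", "en", "vi")
print(broken)  # 9/11%20conspiracy%20theories
print(fixed)   # /v2/suggest/sections/9%2F11%20conspiracy%20theories/en/vi
```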

Change #1074423 had a related patch set uploaded (by Sbisson; author: Sbisson):

[research/recommendation-api@master] Encode forward slash in article title before including in URL

https://gerrit.wikimedia.org/r/1074423

Change #1074423 merged by jenkins-bot:

[research/recommendation-api@master] Encode forward slash in article title before including in URL

https://gerrit.wikimedia.org/r/1074423

Change #1074847 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update rec-api image

https://gerrit.wikimedia.org/r/1074847

Change #1074847 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update rec-api image

https://gerrit.wikimedia.org/r/1074847

Still seeing a lot of failures here where the rec-api fails quickly because it gets a 503 from cx server. The client-side app is set to retry several times, so I see a lot of those failures in quick succession and then one call takes some time and comes back successfully. Can it be that cx server doesn't have the capacity to serve the traffic from rec-api? How would we identify/troubleshoot that?

Change #1075142 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[research/recommendation-api@master] Add error handling for fetch section suggestion requests

https://gerrit.wikimedia.org/r/1075142

Change #1075142 merged by jenkins-bot:

[research/recommendation-api@master] Add concurrency and error handling for fetch section suggestion requests

https://gerrit.wikimedia.org/r/1075142
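The patch itself is on Gerrit and is not reproduced here. As a rough illustration of what bounded concurrency with per-request error handling can look like in asyncio, here is a sketch. The function names, the concurrency limit and the injected `fetch` coroutine are all assumptions, not the recommendation-api's actual code.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict, List, Optional

log = logging.getLogger(__name__)


async def fetch_section_suggestions(
    titles: List[str],
    fetch: Callable[[str], Awaitable[Any]],  # hypothetical per-title fetcher
    max_concurrency: int = 5,                # hypothetical limit
) -> Dict[str, Any]:
    """Fetch suggestions for all titles, skipping the ones whose request fails."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(title: str) -> Optional[Any]:
        async with semaphore:  # cap the number of in-flight upstream requests
            try:
                return await fetch(title)
            except Exception as exc:  # e.g. a 503 or connection reset upstream
                log.warning("Section suggestion failed for %r: %s", title, exc)
                return None

    results = await asyncio.gather(*(fetch_one(t) for t in titles))
    # One failing title no longer fails the whole batch; it is simply left out.
    return {t: r for t, r in zip(titles, results) if r is not None}
```

With this shape, a single 503 from the upstream costs one suggestion instead of turning the whole response into a 500.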

Hi @kevinbazira , would you be able to update rec-api in production to include the patch above? This is a bit time sensitive. Thanks!

Change #1075245 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update rec-api image

https://gerrit.wikimedia.org/r/1075245

Change #1075231 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20240925

https://gerrit.wikimedia.org/r/1075231

Change #1075245 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update rec-api image

https://gerrit.wikimedia.org/r/1075245

Hi @SBisson, the patch for concurrency and error handling to fetch section suggestions has been deployed in production.

Change #1075231 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20240925

https://gerrit.wikimedia.org/r/1075231

Change #1075567 had a related patch set uploaded (by Sbisson; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@wmf/1.43.0-wmf.24] CX3 Build 0.2.0+20240925

https://gerrit.wikimedia.org/r/1075567

Change #1075567 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@wmf/1.43.0-wmf.24] CX3 Build 0.2.0+20240925

https://gerrit.wikimedia.org/r/1075567

Mentioned in SAL (#wikimedia-operations) [2024-09-25T14:22:59Z] <kartik@deploy1003> Finished scap sync-world: Backport for [[gerrit:1075567|CX3 Build 0.2.0+20240925 (T374387 T370746 T368422 T374567 T355780 T374559 T374886 T375410)]] (duration: 14m 06s)

Change #1088276 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update recommendation-api to 2024-11-06-190017-production

https://gerrit.wikimedia.org/r/1088276

SBisson claimed this task.

This has been stable in development and production for over a month. Calling it resolved!