
linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout"
Closed, Resolved · Public

Description

When trying to run extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php in my local development setup, I got an "HTTP request timed out" error. While debugging, I noticed the following:

murbanec@martins-mbp core % curl 'https://api.wikimedia.org/service/linkrecommendation/v1/linkrecommendations/wikipedia/cs/Farmakologick%C3%A9_metody_ti%C5%A1en%C3%AD_bolesti?threshold=0.7&max_recommendations=10&language_code=en'
{"httpCode":504,"httpReason":"upstream request timeout"}
murbanec@martins-mbp core %

Apparently, the linkrecommendation service itself is having some issues as of now.

Upon further investigation, I noticed that the production endpoint is down as well:

[urbanecm@deploy1002 ~]$ curl 'https://linkrecommendation.discovery.wmnet:4005/v1/linkrecommendations/wikipedia/cs/Barack_Obama?threshold=0.5&max_recommendations=15'
upstream connect error or disconnect/reset before headers. reset reason: connection termination
[urbanecm@deploy1002 ~]$
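
The two responses above fail in different ways: the public API returns a JSON error envelope with `httpCode` 504, while the internal endpoint returns a plain-text proxy error. A minimal sketch of telling them apart in a monitoring script follows. The function name and the category labels are hypothetical and not part of any Wikimedia tooling, and the HTTP status is passed in separately because the curl transcripts do not show it.

```python
import json


def classify_probe(status, body):
    """Classify one probe of a linkrecommendation endpoint.

    Returns "upstream_reset", "gateway_timeout", "ok", or "other".
    The labels are illustrative only.
    """
    text = body.strip()

    # Plain-text proxy error, as seen on the internal endpoint.
    if text.startswith("upstream connect error"):
        return "upstream_reset"

    # JSON error envelope, as seen on the public API endpoint.
    try:
        payload = json.loads(text)
    except ValueError:
        payload = None
    if isinstance(payload, dict) and payload.get("httpCode") == 504:
        return "gateway_timeout"

    # Fall back to the raw HTTP status when the body is not conclusive.
    if status == 504:
        return "gateway_timeout"
    if 200 <= status < 300:
        return "ok"
    return "other"
```

Both signatures point at the backend rather than at the gateway itself, which matches what was found later in this task.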

Event Timeline

Restricted Application added a subscriber: Aklapper.
Urbanecm_WMF renamed this task from External linkrecommendation API is down with HTTP 504: "upstream request timeout" to linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout". (Edited Jun 29 2023, 6:46 PM)
Urbanecm_WMF triaged this task as High priority.
Urbanecm_WMF updated the task description.

This also affects the internal endpoint; updated title/description to reflect that. Setting priority to high.

Change 934400 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/deployment-charts@master] helmfile.d: linkrecommendation: add dbproxy1027 to network policies

https://gerrit.wikimedia.org/r/934400

Change 934402 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/deployment-charts@master] linkrecommendation: Add dbproxy1025 to the networkpolicy block

https://gerrit.wikimedia.org/r/934402

Change 934400 abandoned by Majavah:

[operations/deployment-charts@master] helmfile.d: linkrecommendation: add dbproxy1027 to network policies

Reason:

wrong proxy, 934402 has the correct one

https://gerrit.wikimedia.org/r/934400

Change 934402 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: Add dbproxy1025 to the networkpolicy block

https://gerrit.wikimedia.org/r/934402

This happened due to rODNS39c3aaa3b4d5: wmnet: Failover m2-master to dbproxy1025, which failed m2-master over to dbproxy1025. At that point linkrecommendation's network policies did not allow connections to dbproxy1025, so the service could not reach its database. This needs a more permanent solution to keep it from happening again, but the outage is now resolved.
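
The failure mode can be sketched as an allowlist check: the service may only connect to addresses listed in its network policy, so once m2-master points at a proxy that is not listed, connections fail. The snippet below is only an illustration. The function name is hypothetical, and the addresses are placeholders from the RFC 5737 documentation range, not the real dbproxy addresses or the actual contents of the deployment-charts values.

```python
import ipaddress


def is_egress_allowed(target_ip, allowed_cidrs):
    """Return True if target_ip falls inside any of the allowed CIDR blocks."""
    addr = ipaddress.ip_address(target_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Placeholder addresses (RFC 5737 documentation range), not real hosts.
PREVIOUS_PROXY = "192.0.2.10"  # stands in for the proxy used before the failover
NEW_PROXY = "192.0.2.25"       # stands in for dbproxy1025

# Allowlist as it looked before the fix: only the previous proxy is present.
allowed_before_fix = ["192.0.2.10/32"]
# Allowlist after a change like 934402: the new proxy is added.
allowed_after_fix = allowed_before_fix + ["192.0.2.25/32"]

print(is_egress_allowed(PREVIOUS_PROXY, allowed_before_fix))  # True
print(is_egress_allowed(NEW_PROXY, allowed_before_fix))       # False
print(is_egress_allowed(NEW_PROXY, allowed_after_fix))        # True
```

A check along these lines, run against the real policy before a failover, would be one way to automate the manual verification step.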

Special thanks go to @taavi, who identified the underlying cause of this problem.

FYI @Marostegui as they did the DB master failover.

Since this is fixed, should we resolve this? Do we need a follow-up for a more permanent solution as @Urbanecm_WMF suggests? Any ideas for a more permanent solution?

Since this is fixed, should we resolve this?

This task probably can be resolved.

Do we need a follow-up for a more permanent solution as @Urbanecm_WMF suggests? Any ideas for a more permanent solution?

I think a follow-up would be nice, although I am not sure it needs a separate task. I am also unsure what exactly can be done (apart from manually verifying that the new server is in the relevant network policies when doing a failover). I see @JMeybohm linked T331894 to this task, which seems to be a solution that would prevent this issue from happening. Maybe that is sufficient and we don't need another task at this point?

akosiaris claimed this task.

Since this is fixed, should we resolve this?

This task probably can be resolved.

Thanks, done.

Do we need a follow-up for a more permanent solution as @Urbanecm_WMF suggests? Any ideas for a more permanent solution?

I think a follow-up would be nice, although I am not sure it needs a separate task. I am also unsure what exactly can be done (apart from manually verifying that the new server is in the relevant network policies when doing a failover). I see @JMeybohm linked T331894 to this task, which seems to be a solution that would prevent this issue from happening. Maybe that is sufficient and we don't need another task at this point?

I've filed T340843. We should probably discuss how best to solve this, and it's probably prudent not to have that discussion in this task.

Change 952843 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/deployment-charts@master] linkrecommendation: Add dbproxy1023 to network policies

https://gerrit.wikimedia.org/r/952843

This issue is happening again, for the same reason (the failover commit in this case is rODNSdbcbf47d17e4: wmnet: Failover m2-master).

Change 952843 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: Add dbproxy1023 to network policies

https://gerrit.wikimedia.org/r/952843

Urbanecm_WMF closed this task as Resolved. (Edited Aug 28 2023, 12:17 PM)

Deployed the network policy change; the service is up again.