linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout"
Closed, Resolved (Public)

Authored By

	Urbanecm_WMF
	Jun 29 2023, 6:27 PM

Description

When trying to run extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php in my local development setup, I got an "HTTP request timed out" error. While debugging, I noticed the following:

murbanec@martins-mbp core % curl 'https://api.wikimedia.org/service/linkrecommendation/v1/linkrecommendations/wikipedia/cs/Farmakologick%C3%A9_metody_ti%C5%A1en%C3%AD_bolesti?threshold=0.7&max_recommendations=10&language_code=en'
{"httpCode":504,"httpReason":"upstream request timeout"}
murbanec@martins-mbp core %

Apparently, the linkrecommendation service itself is having some issues as of now.

Upon further investigation, I noticed that the production endpoint is down as well:

[urbanecm@deploy1002 ~]$ curl 'https://linkrecommendation.discovery.wmnet:4005/v1/linkrecommendations/wikipedia/cs/Barack_Obama?threshold=0.5&max_recommendations=15'
upstream connect error or disconnect/reset before headers. reset reason: connection termination
[urbanecm@deploy1002 ~]$
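
The two responses above differ in shape: the public endpoint returns a JSON envelope with an httpCode of 504, while the internal endpoint returns a plain-text proxy error. A minimal sketch of how a debugging script might tell them apart, in Python. The function name and category labels are hypothetical and not part of any existing tooling; only the two response bodies come from this task.

```python
import json


def classify_response(body: str) -> str:
    """Classify a raw response body from a linkrecommendation endpoint.

    The category names returned here are illustrative only.
    """
    text = body.strip()
    # The public gateway wraps upstream failures in a small JSON envelope.
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        payload = None
    if isinstance(payload, dict) and payload.get("httpCode") == 504:
        return "gateway-timeout"
    # The internal endpoint answered with a plain-text proxy error instead.
    if text.startswith("upstream connect error"):
        return "upstream-connect-error"
    return "other"
```

Both categories point at the same condition here (the service cannot complete requests), so a check like this would mostly be useful for logging which layer reported the failure.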

Details

Subject	Repo	Branch	Lines +/-
linkrecommendation: Add dbproxy1023 to network policies	operations/deployment-charts	master	+4 -0
linkrecommendation: Add dbproxy1025 to the networkpolicy block	operations/deployment-charts	master	+4 -0
helmfile.d: linkrecommendation: add dbproxy1027 to network policies	operations/deployment-charts	master	+4 -0

Related Objects

Mentioned In: T341710: Alert the Growth team when linkrecommendation service is unavailable
T340843: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over
T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies)
Mentioned Here: rODNSdbcbf47d17e4: wmnet: Failover m2-master
T340843: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over
T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies)
rODNS39c3aaa3b4d5: wmnet: Failover m2-master to dbproxy1025

Event Timeline

Urbanecm_WMF created this task. Jun 29 2023, 6:27 PM

Restricted Application added a project: Growth-Team. Jun 29 2023, 6:27 PM

Restricted Application added a subscriber: Aklapper.

This also affects the internal endpoint; updated title/description to reflect that. Setting priority to high.

kostajh added projects: serviceops, SRE. Jun 29 2023, 6:49 PM

Change 934400 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/deployment-charts@master] helmfile.d: linkrecommendation: add dbproxy1027 to network policies

https://gerrit.wikimedia.org/r/934400

gerritbot added a project: Patch-For-Review. Jun 29 2023, 7:06 PM

Change 934402 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/deployment-charts@master] linkrecommendation: Add dbproxy1025 to the networkpolicy block

https://gerrit.wikimedia.org/r/934402

Change 934400 abandoned by Majavah:

[operations/deployment-charts@master] helmfile.d: linkrecommendation: add dbproxy1027 to network policies

Reason:

wrong proxy, 934402 has the correct one

https://gerrit.wikimedia.org/r/934400

Change 934402 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: Add dbproxy1025 to the networkpolicy block

https://gerrit.wikimedia.org/r/934402

Maintenance_bot removed a project: Patch-For-Review. Jun 29 2023, 7:30 PM

This happened due to rODNS39c3aaa3b4d5: wmnet: Failover m2-master to dbproxy1025, which failed m2-master over to dbproxy1025. The linkrecommendation network policies did not allow connections to dbproxy1025, so the service could no longer reach m2-master. The outage is now resolved, but this needs a more permanent solution so that it does not recur on a future failover.
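
The failure mode described here, a master alias moving to a host that the service's egress rules do not cover, can be checked mechanically before a failover. A minimal sketch in Python, assuming the allowed egress destinations are available as a list of CIDR strings and the failover target's IP address is known. The function name and the addresses (taken from documentation ranges) are hypothetical; they are not the real values from operations/deployment-charts.

```python
import ipaddress
from typing import Iterable


def is_egress_allowed(target_ip: str, allowed_cidrs: Iterable[str]) -> bool:
    """Return True if target_ip falls inside any of the allowed CIDR blocks."""
    target = ipaddress.ip_address(target_ip)
    # An address of a different IP version than the network is never contained.
    return any(target in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Hypothetical allowlist: the old proxy is covered, the new one is not.
allowed = ["192.0.2.10/32", "2001:db8::10/128"]
old_proxy_ok = is_egress_allowed("192.0.2.10", allowed)
new_proxy_ok = is_egress_allowed("192.0.2.25", allowed)
```

With these made-up values, old_proxy_ok is True and new_proxy_ok is False, which is the situation this task ran into after the failover.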

Special thanks go to @taavi, who identified the underlying cause of this problem. Thanks!

FYI @Marostegui as they did the DB master failover.

Thank you!!

JMeybohm mentioned this in T331894: Improve how we address outside k8s infrastructure from within charts (e.g. network policies). Jun 30 2023, 6:27 AM

Urbanecm_WMF edited projects, added Growth-Team (Sprint 0 (Growth Team)); removed Growth-Team. Jun 30 2023, 10:07 AM

Since this is fixed, should we resolve this? Do we need a followup for a more permanent solution as @Urbanecm_WMF suggests? Ideas on the more permanent solution?

In T340780#8980138, @akosiaris wrote:

Since this is fixed, should we resolve this?

This task probably can be resolved.

Do we need a followup for a more permanent solution as @Urbanecm_WMF suggests? Ideas on the more permanent solution?

I think a follow up would be nice, although I am not sure if it needs a separate task. I am also unsure what exactly can be done (apart from manually verifying the new server is in relevant network policies when doing a failover). I see @JMeybohm linked T331894 to this task, which seems to be a solution that'd prevent this issue from happening. Maybe that is sufficient and we don't need any other task at this point?

In T340780#8980148, @Urbanecm_WMF wrote:

In T340780#8980138, @akosiaris wrote:

Since this is fixed, should we resolve this?

This task probably can be resolved.

Thanks, done.

Do we need a followup for a more permanent solution as @Urbanecm_WMF suggests? Ideas on the more permanent solution?

I think a follow up would be nice, although I am not sure if it needs a separate task. I am also unsure what exactly can be done (apart from manually verifying the new server is in relevant network policies when doing a failover). I see @JMeybohm linked T331894 to this task, which seems to be a solution that'd prevent this issue from happening. Maybe that is sufficient and we don't need any other task at this point?

I've filed T340843. We should probably discuss how best to solve this, and it is prudent not to have that discussion in this task.

Urbanecm_WMF claimed this task. Jun 30 2023, 1:30 PM

Urbanecm_WMF updated the task description. Jul 10 2023, 5:21 PM

Urbanecm_WMF mentioned this in T341710: Alert the Growth team when linkrecommendation service is unavailable. Jul 12 2023, 4:24 PM

Change 952843 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/deployment-charts@master] linkrecommendation: Add dbproxy1023 to network policies

https://gerrit.wikimedia.org/r/952843

gerritbot added a project: Patch-For-Review. Aug 28 2023, 11:50 AM

This issue is happening again, for the same reason (rODNSdbcbf47d17e4: wmnet: Failover m2-master is the failover commit this time).

FYI @Marostegui, who merged https://gerrit.wikimedia.org/r/c/operations/dns/+/952213.

Change 952843 merged by jenkins-bot: