Collection data unavailable in several rec-api hosts
Closed, Resolved | Public | 8 Estimated Story Points

Description

On Oct 8 2025, around noon UTC, it was discovered that collections were not working as expected in the CX dashboard.

Upon closer inspection, it looks like some recommendation API instances have their collection cache filled and some don't.

Based on a test deployment to staging, it looks like the calls from the rec-api to MediaWiki are refused by the localhost:6500 proxy. (example)


Derived Requirement

All rec-api hosts must successfully retrieve and serve collection data to the CX dashboard. Calls from the recommendation API to MediaWiki must not be blocked or refused by the localhost:6500 proxy. Each rec-api instance must have a functioning and populated collection cache so that collection-based recommendations consistently load for all users.

Test Steps

Test Case 1: Verify rec-api Host Serves Collection Data

  1. Directly query a rec-api host endpoint that returns collection data (staging or beta).
  2. Observe the response payload.
  3. AC1: Confirm that the rec-api host returns populated collection data rather than empty or missing results.
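
For anyone who wants to script the AC1 check rather than eyeball the payload, here is a minimal sketch. It assumes the endpoint returns JSON and that "populated" means a non-empty list or object. The helper names `fetch_json` and `has_collection_data` are made up for this example and are not part of the rec-api code base; the URL is left as a parameter because the exact host and path depend on the environment.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the response body as JSON.

    Hypothetical helper: the caller supplies the full rec-api URL.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def has_collection_data(payload):
    """Return True only for a non-empty JSON list or object.

    Assumption: an empty list, an empty object, or a missing (None)
    payload all count as "no collection data" for AC1.
    """
    return isinstance(payload, (list, dict)) and len(payload) > 0
```

Usage would be something like `has_collection_data(fetch_json(url))`, where `url` points at the collection endpoint of the host under test.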

QA Results - Logstash

AC | Status | Details
1 | | T406854#11403708

Event Timeline

SBisson triaged this task as Unbreak Now! priority. Oct 9 2025, 12:19 PM

Change #1194958 had a related patch set uploaded (by Sbisson; author: Sbisson):

[research/recommendation-api@master] Common log prefix for cache update code paths

https://gerrit.wikimedia.org/r/1194958

Change #1194958 merged by jenkins-bot:

[research/recommendation-api@master] Common log prefix for cache update code paths

https://gerrit.wikimedia.org/r/1194958

Change #1194971 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update Recommendation API to 2025-10-09-145754-production

https://gerrit.wikimedia.org/r/1194971

Change #1194971 merged by jenkins-bot:

[operations/deployment-charts@master] Update Recommendation API to 2025-10-09-145754-production

https://gerrit.wikimedia.org/r/1194971

SBisson added a project: Machine-Learning-Team.

Tagging Machine-Learning-Team in case they can provide insight into what's happening with the localhost:6500 proxy

SBisson lowered the priority of this task from Unbreak Now! to Medium. Oct 10 2025, 3:17 PM

It looks like all rec-api instances have been able to update their cache in the last hour.

I'm hesitant to close this task since there was clearly a problem that we don't understand, but I'm not sure what the next step of the investigation should be. Maybe even more logging?

SBisson updated Other Assignee, removed: KartikMistry.
SBisson moved this task from In-progress to Incoming on the LPL Hypothesis board.

Pushing back to incoming since I'm not working on it and I'm not even sure what should be done.

SBisson claimed this task.

I don't think we can do anything about it at this point.

We're seeing this problem again right now.

Change #1201718 had a related patch set uploaded (by Sbisson; author: Sbisson):

[research/recommendation-api@master] Improve periodic update flow and error handling

https://gerrit.wikimedia.org/r/1201718

Change #1201718 merged by jenkins-bot:

[research/recommendation-api@master] Improve periodic update flow and error handling

https://gerrit.wikimedia.org/r/1201718

Change #1202376 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update Recommnedation API to 2025-11-05-230545-production

https://gerrit.wikimedia.org/r/1202376

Change #1202376 merged by jenkins-bot:

[operations/deployment-charts@master] Update Recommnedation API to 2025-11-05-230545-production

https://gerrit.wikimedia.org/r/1202376

Change #1203450 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update Recommnedation API to 2025-11-07-162011-production

https://gerrit.wikimedia.org/r/1203450

Change #1203450 merged by jenkins-bot:

[operations/deployment-charts@master] Update Recommnedation API to 2025-11-07-162011-production

https://gerrit.wikimedia.org/r/1203450

SBisson raised the priority of this task from Medium to High. Thu, Nov 13, 6:07 PM
SBisson moved this task from Needs QA to In-progress on the LPL Hypothesis board.

This is happening again.

According to Logstash, all instances have updated their cache successfully, but when queried, the /page-collections endpoint is returning [] about 75% of the time.
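
A rough way to put a number on the "empty about 75% of the time" observation is to call the endpoint repeatedly and count the empty payloads. The sketch below is only an illustration: `empty_response_rate` is a hypothetical helper, and it takes a `fetch` callable so it can be pointed at a real HTTP request or at canned responses. It assumes that repeated requests can land on different instances and that an empty payload is anything falsy once decoded (such as `[]`).

```python
def empty_response_rate(fetch, attempts=20):
    """Call fetch() `attempts` times and return the fraction of empty payloads.

    `fetch` is any zero-argument callable returning the decoded response,
    for example a function that requests the /page-collections endpoint.
    """
    if attempts <= 0:
        raise ValueError("attempts must be a positive integer")
    empty = 0
    for _ in range(attempts):
        payload = fetch()
        if not payload:  # [], {}, None and "" all count as empty here
            empty += 1
    return empty / attempts


# Simulated responses standing in for four requests that hit different
# instances; three of them answer with an empty list.
canned = iter([[], [{"name": "example"}], [], []])
rate = empty_response_rate(lambda: next(canned), attempts=4)
print(f"empty responses: {rate:.0%}")
```

With enough attempts against the real service, a rate well above zero while the logs report successful cache updates would match the symptom described here.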

Change #1204944 had a related patch set uploaded (by Sbisson; author: Sbisson):

[research/recommendation-api@master] Error handling and update schedule

https://gerrit.wikimedia.org/r/1204944

Change #1204944 merged by jenkins-bot:

[research/recommendation-api@master] Error handling and update schedule

https://gerrit.wikimedia.org/r/1204944

Nikerabbit set the point value for this task to 8.

Change #1206859 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update Recommendation API to 2025-11-17-092813-production

https://gerrit.wikimedia.org/r/1206859

Change #1206859 merged by jenkins-bot:

[operations/deployment-charts@master] Update Recommendation API to 2025-11-17-092813-production

https://gerrit.wikimedia.org/r/1206859

Mentioned in SAL (#wikimedia-operations) [2025-11-18T13:35:51Z] <kart_> Update Recommendation API to 2025-11-17-092813-production (T406854)

@SBisson Recommendation API is showing that it's not emptying the cache, as seen in the screenshot. I will move this to Sign-off. Thanks for all your work!

Test Result - Logstash

Status: ✅ PASS
Environment: Logstash
OS: macOS Tahoe 26.1
Browser: Chrome 142
Device: MBA
Emulated Device: NA

Test Artifact(s):

Test Steps

Test Case 1: Verify rec-api Host Serves Collection Data

  1. Directly query a rec-api host endpoint that returns collection data (staging or beta).
  2. Observe the response payload.
  3. AC1: Confirm that the rec-api host returns populated collection data rather than empty or missing results.

2025-11-24_16-15-38.png (881×1 px, 409 KB)

GMikesell-WMF updated the task description.
GMikesell-WMF moved this task from Needs QA to Design Signoff on the LPL Hypothesis board.