
Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing
Closed, Resolved (Public)

Description

In T347475, we discovered that the rec-api-ng container hosted on LiftWing was not able to access endpoints external to kubernetes.

To resolve this, we shall configure envoy settings to enable the rec-api-ng container hosted on LiftWing to access wikimedia, wikidata, and wikipedia apis.

Below are the steps we are going to take:

  1. add envoy listener(s) to the rec-api-ng deployment settings
  2. update the specified api urls to use localhost:<PORT> in recommendation_liftwing.ini
  3. configure requests to the above endpoints to send their respective Host headers (see example)
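Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the actual rec-api-ng code: the ini section and key names (`endpoints`, `endpoint_host_headers`) and the helper `build_request` are hypothetical, and the port and host values are taken from the settings reported later in this task. The request is only constructed, not sent.

```python
import configparser
import urllib.request

# Hypothetical excerpt of recommendation_liftwing.ini; the section and key
# names are illustrative assumptions, not the real file layout.
INI_TEXT = """
[endpoints]
wikidata = http://localhost:6500/w/api.php

[endpoint_host_headers]
wikidata = www.wikidata.org
"""


def build_request(config, name, query=""):
    """Build a request that targets the local envoy listener (step 2)
    and carries the Host header of the real external service (step 3)."""
    url = config["endpoints"][name]
    host = config["endpoint_host_headers"][name]
    if query:
        url = f"{url}?{query}"
    return urllib.request.Request(url, headers={"Host": host})


config = configparser.ConfigParser()
config.read_string(INI_TEXT)

request = build_request(config, "wikidata", "action=query&format=json")
print(request.full_url)            # URL points at localhost:<PORT>
print(request.get_header("Host"))  # Host header names the external service
```

The URL tells the client where the envoy listener is, while the Host header tells the proxied service which site the request is for.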

Event Timeline

Change 965142 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[research/recommendation-api@master] Use envoy proxy to access endpoints external to k8s/LiftWing

https://gerrit.wikimedia.org/r/965142

Change 965142 merged by jenkins-bot:

[research/recommendation-api@master] Use envoy proxy to access endpoints external to k8s/LiftWing

https://gerrit.wikimedia.org/r/965142

Change 965585 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update the recommendation-api-ng image

https://gerrit.wikimedia.org/r/965585

Change 965585 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update the recommendation-api-ng image

https://gerrit.wikimedia.org/r/965585

Change 966826 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[research/recommendation-api@master] Update external endpoint ports used on LiftWing

https://gerrit.wikimedia.org/r/966826

Change 966826 merged by jenkins-bot:

[research/recommendation-api@master] Update external endpoint ports used on LiftWing

https://gerrit.wikimedia.org/r/966826

Change 966827 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: add listeners for cxserver and eventgate-analytics to rec-api-ng

https://gerrit.wikimedia.org/r/966827

Change 966827 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add listeners for cxserver and eventgate-analytics to rec-api-ng

https://gerrit.wikimedia.org/r/966827

We configured envoy settings and the rec-api-ng container hosted on LiftWing is able to access endpoints external to k8s/LiftWing. Below are the settings we configured:

| endpoint name | endpoint host header | envoy listener | port |
| --- | --- | --- | --- |
| language_pairs | cxserver.wikimedia.org | cxserver | 6015 |
| pageviews | wikimedia.org | mw-api-int-async-ro | 6500 |
| wikipedia | {source}.wikipedia.org | mw-api-int-async-ro | 6500 |
| wikidata | www.wikidata.org | mw-api-int-async-ro | 6500 |
| event_logger | intake-analytics.wikimedia.org | eventgate-analytics | 6004 |

Change 967401 had a related patch set uploaded (by Elukey; author: Elukey):

[research/recommendation-api@master] Fix pageviews base endpoint

https://gerrit.wikimedia.org/r/967401

Change 967401 merged by Elukey:

[research/recommendation-api@master] Fix pageviews base endpoint

https://gerrit.wikimedia.org/r/967401

Change 966836 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update the recommendation-api-ng image

https://gerrit.wikimedia.org/r/966836

Change 966836 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update the recommendation-api-ng image

https://gerrit.wikimedia.org/r/966836

We discovered that the pageviews endpoint uses the aqs listener rather than mw-api-int-async-ro. The full settings that enabled the rec-api-ng container hosted on LiftWing to access endpoints external to k8s/LiftWing through the envoy proxy are:

| endpoint name | endpoint host header | endpoint uri | envoy listener |
| --- | --- | --- | --- |
| language_pairs | cxserver.wikimedia.org | http://localhost:6015/v1/languagepairs | cxserver |
| pageviews | wikimedia.org | http://localhost:6033/wikimedia.org/v1/metrics/pageviews | rest-gateway |
| wikipedia | {source}.wikipedia.org | http://localhost:6500/w/api.php | mw-api-int-async-ro |
| wikidata | www.wikidata.org | http://localhost:6500/w/api.php | mw-api-int-async-ro |
| event_logger | intake-analytics.wikimedia.org | http://localhost:6004/v1/events?hasty=true | eventgate-analytics |
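For reference, the settings above can be expressed as data, with the `{source}` placeholder in the wikipedia Host header filled in per request. This is only a sketch: the `Endpoint` type, the `ENDPOINTS` mapping, and the `resolve` helper are hypothetical names, not taken from the rec-api-ng codebase.

```python
from typing import NamedTuple


class Endpoint(NamedTuple):
    host_header: str  # may contain a {source} placeholder
    uri: str          # local envoy listener address plus path
    listener: str     # envoy listener name in the deployment settings


# Values copied from the settings listed in this comment.
ENDPOINTS = {
    "language_pairs": Endpoint(
        "cxserver.wikimedia.org",
        "http://localhost:6015/v1/languagepairs",
        "cxserver",
    ),
    "pageviews": Endpoint(
        "wikimedia.org",
        "http://localhost:6033/wikimedia.org/v1/metrics/pageviews",
        "rest-gateway",
    ),
    "wikipedia": Endpoint(
        "{source}.wikipedia.org",
        "http://localhost:6500/w/api.php",
        "mw-api-int-async-ro",
    ),
    "wikidata": Endpoint(
        "www.wikidata.org",
        "http://localhost:6500/w/api.php",
        "mw-api-int-async-ro",
    ),
    "event_logger": Endpoint(
        "intake-analytics.wikimedia.org",
        "http://localhost:6004/v1/events?hasty=true",
        "eventgate-analytics",
    ),
}


def resolve(name, source=""):
    """Return the (uri, headers) pair to use for a named endpoint."""
    endpoint = ENDPOINTS[name]
    host = endpoint.host_header.format(source=source)
    return endpoint.uri, {"Host": host}


uri, headers = resolve("wikipedia", source="en")
print(uri, headers)
```

Note that wikipedia and wikidata share the same listener and uri, so the Host header is the only thing that distinguishes the two targets.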

Update 1: Following T348607#9275348, the pageviews uri has been updated from http://localhost:6020/analytics.wikimedia.org/v1/metrics/pageviews to http://localhost:6020/analytics.wikimedia.org/v1/pageviews

Update 2: Following T348607#9283681, the pageviews uri has been updated from http://localhost:6020/analytics.wikimedia.org/v1/pageview to http://localhost:6033/wikimedia.org/v1/metrics/pageviews and envoy listener from aqs to rest-gateway

To help others who run into the same issue, I have added a note to the envoy proxy docs showing that the same host can use different envoy listeners depending on the endpoint being accessed:
https://wikitech.wikimedia.org/w/index.php?title=Envoy&oldid=2121603#Example_(calling_mw-api)

Change 967916 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[research/recommendation-api@master] Update pageviews external endpoint uri used on LiftWing

https://gerrit.wikimedia.org/r/967916

As reported in T347475#9269007, the pageviews endpoint is failing. I reproduced this issue on staging, and the logs below show that the endpoint fails because the URI returns a 404 (Not Found):

2023-10-24 05:09:12,783 recommendation.utils.event_logger log_api_request():37 INFO -- Logging event: {"schema": "TranslationRecommendationAPIRequests", "$schema": "/analytics/legacy/$translationrecommendationapirequests/1.0.0", "revision": 16261139, "event": {"timestamp": 1698124152, "sourceLanguage": "en", "targetLanguage": "es", "searchAlgorithm": "related_articles"}, "webHost": "recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443", "client_dt": "2023-10-24T05:09:12.783157", "meta": {"stream": "eventlogging_TranslationRecommendationAPIRequests", "domain": "recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443"}}
2023-10-24 05:09:12,801 recommendation.api.external_data.fetcher get():26 INFO -- Request failed: {"url": "http://localhost:6020/analytics.wikimedia.org/v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/22", "error": "404 Client Error: Not Found for url: http://localhost:6020/analytics.wikimedia.org/v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/22"}
2023-10-24 05:09:12,801 recommendation.api.external_data.fetcher get_most_popular_articles():143 INFO -- pageview query failed
2023-10-24 05:09:12,847 recommendation.api.types.translation.translation process_request():195 INFO -- Request processed in 0.064753 seconds
[pid: 135|app: 0|req: 7360/10465] 127.0.0.1 () {44 vars in 2182 bytes} [Tue Oct 24 05:09:12 2023] GET /types/translation/v1/articles?source=en&target=es&seed=&search=related_articles&application=CX => generated 3 bytes in 65 msecs (HTTP/1.1 200) 5 headers in 195 bytes (1 switches on core 0)

Change 967916 merged by jenkins-bot:

[research/recommendation-api@master] Update pageviews external endpoint uri used on LiftWing

https://gerrit.wikimedia.org/r/967916

Change 967917 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update recommendation-api-ng image

https://gerrit.wikimedia.org/r/967917

Change 967917 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update recommendation-api-ng image

https://gerrit.wikimedia.org/r/967917

Since we couldn't experiment and check the correct pageviews uri from k8s, an SRE had to log into one of the AQS nodes and test the uri with curl until we got the correct one: http://localhost:6020/analytics.wikimedia.org/v1/pageviews. We've added it to the rec-api-ng, deployed it on staging, tested that it works, and then deployed it to prod.

Change 967921 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[research/recommendation-api@master] Update pageviews endpoint to match rest-gateway envoy listener port

https://gerrit.wikimedia.org/r/967921

Change 967921 merged by jenkins-bot:

[research/recommendation-api@master] Update pageviews endpoint to match rest-gateway envoy listener port

https://gerrit.wikimedia.org/r/967921

Change 967922 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: add rest-gateway listener for rec-api-ng

https://gerrit.wikimedia.org/r/967922

Change 967922 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add rest-gateway listener for rec-api-ng

https://gerrit.wikimedia.org/r/967922

Change 967925 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[research/recommendation-api@master] Update pageviews endpoint to match rest-gateway envoy listener

https://gerrit.wikimedia.org/r/967925

Change 967925 merged by jenkins-bot:

[research/recommendation-api@master] Update pageviews endpoint to match rest-gateway envoy listener

https://gerrit.wikimedia.org/r/967925

Change 968966 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update recommendation-api-ng image

https://gerrit.wikimedia.org/r/968966

Change 968966 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update recommendation-api-ng image

https://gerrit.wikimedia.org/r/968966

Folks from wikimedia-analytics notified us that they are migrating the pageviews endpoint as shown in the screenshot below:

notice - pageviews endpoint migration from restbase to rest-gateway : Screenshot from 2023-10-24 18-06-29.png (459×942 px, 83 KB)

We have updated the rec-api-ng envoy settings for the pageviews endpoint from:

uri: http://localhost:6020/analytics.wikimedia.org/v1/pageview
listener: aqs

to

uri: http://localhost:6033/wikimedia.org/v1/metrics/pageviews
listener: rest-gateway
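
To illustrate the new setting, the sketch below builds the full "top articles" request URL on the rest-gateway base uri, following the path pattern seen in the staging logs earlier in this task (`/top/<project>/all-access/<yyyy>/<mm>/<dd>`). The helper name `top_pageviews_url` is hypothetical, and the Host header value is the one listed for pageviews in the settings above.

```python
from datetime import date

# Base uri for the rest-gateway envoy listener, as given in this task.
NEW_BASE = "http://localhost:6033/wikimedia.org/v1/metrics/pageviews"

# Host header listed for the pageviews endpoint in this task.
PAGEVIEWS_HOST_HEADER = {"Host": "wikimedia.org"}


def top_pageviews_url(base, project, day, access="all-access"):
    """Build the top-pageviews URL using the path pattern from the logs."""
    return f"{base}/top/{project}/{access}/{day:%Y/%m/%d}"


url = top_pageviews_url(NEW_BASE, "en.wikipedia", date(2023, 10, 22))
print(url, PAGEVIEWS_HOST_HEADER)
```

Only the base uri and listener port change with the migration; the path suffix appended by the application stays the same.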

calbon triaged this task as Medium priority. Nov 2 2023, 7:25 PM

The rec-api-ng container is now able to access endpoints external to k8s/LiftWing, even after wikimedia-analytics migrated the pageviews endpoint.