When handling GET requests, we currently make HTTP requests directly to e.g. cs.wikipedia.org/api.php. We should use Envoy instead.
Description
Details
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | MMiller_WMF | T252822 [EPIC] Growth: "add a link" structured task 1.0 | |||
Resolved | kostajh | T266437 Add a link engineering: backend product specifications | |||
Resolved | kostajh | T261396 Add a link: engineering tasks for initial release | |||
Resolved | hnowlan | T269581 Add Link engineering: Allow external traffic to linkrecommendation service | |||
Resolved | kostajh | T276217 Use Envoy for making GET requests to lang.wikipedia.org/api.php |
Event Timeline
Change 667868 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/deployment-charts@master] [WIP] Use Envoy for requests to MediaWiki API
So AIUI what needs to happen:
- add the mwapi-async listener to the helmfile values.yaml (done in the WIP patch)
- change the application code, currently the code which calls MW API looks like this:
getPageDict(page_title, wiki_id, os.environ.get("MEDIAWIKI_API_URL")) [...] def getPageDict(title: str, wiki_id: str, api_url: str = None) -> dict: [...] api_url = api_url or "https://{0}.wikipedia.org/w/api.php".format( wiki_id.replace("wiki", "").replace("_", "-") ) headers = {"User-Agent": "mwaddlink"} req = requests.get(api_url, headers=headers, params=params)
So in the helm chart, we should be setting MEDIAWIKI_API_URL to https://api-rw.discovery.wmnet/w/api.php (although it looks like some charts use localhost:6500 instead) and then the getPageDict() method above should add a header for e.g. {"Host": "cs.wikipedia.org"}when calling the API endpoint.
Does this sound correct @JMeybohm?
You need to set MEDIAWIKI_API_URL to your local envoy listener instance http://localhost:6500/w/api.php (which will make your request flow through envoy towards mw-api). The Host header looks fine to me (although I must admit that I'm guessing here).
Indeed you should enable the mwapi-async listener and then:
- change MEDIAWIKI_API_URL to http://localhost:6500/w/api.php
- send the Host header with the actual wiki host you want to reach
Change 668028 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Use Host header for GET requests made via Envoy
One more question, as we just need read-only access, should be using https://api-ro.discovery.wmnet/w/api.php as I see some services are doing?
Unfortunately there is on advantage/enforcement of the -ro endpoint at this time, so we decided against adding it as envoy listener currently as it would just duplicate things.
Change 668028 merged by jenkins-bot:
[research/mwaddlink@main] Use Host header for GET requests made via Envoy
Change 667868 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Use Envoy for requests to MediaWiki API
This works \o/ https://api.wikimedia.org/service/linkrecommendation/v0/linkrecommendations/cswiki/Lipsko?max_recommendations=1
The timeouts when more time is needed to process a request can be dealt with in T277297: 504 timeout and 503 errors when accessing linkrecommendation service
Change 673473 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Enable monitoring of GET endpoint for linkrecommendations
Change 673473 merged by jenkins-bot:
[research/mwaddlink@main] Enable monitoring of GET endpoint for linkrecommendations
@akosiaris @JMeybohm @hnowlan, @Tgr pointed out that our application code could be simplified a bit if we called https://api.wikimedia.org/core/v1/wikipedia/en/page/{title} instead of proxying requests to the /w/api.php MediaWiki endpoint. Should we do that? Is it possible, and would we still need a listener with our helm chart?
Going through the edge caches (as those are the ones responsible for api.wikimedia.org) ? No, I am afraid that's not a good pattern, for a number of reasons, mostly revolving around polluting the edge caches with non organic content, artificially increasing traffic to them, making operations during maintenance or a switchover more difficult and also complicating things during debugging (it's way more easy, especially under stress, to reason about component X talking to component Y directly instead of component X talking to Y via components A, B and C).
OK, thanks. What about adding a listener for communicating with /w/rest.php instead of /w/api.php, do we have a listener for that currently? (See also T278718: For external traffic release, perform lookup in GrowthExperiments cache before generating recommendations)
I haven't checked, but that endpoint is powered by mediawiki so should already be possible using the mwapi-async listener. Feel free to use it.
Change 675753 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):
[research/mwaddlink@main] Use rest.php for handling GET requests
I don't think we care about caching (although we have no reason not to take advantage of it if it's possible), what we'd like is to avoid replicating the api-gateway project-to-domain mapping logic (although for now we only support wikipedias so it's simple). So I think we'd want to go through the API gateway but not necessarily the edge caches.
Change 675753 merged by jenkins-bot:
[research/mwaddlink@main] Use rest.php for handling GET requests