Page MenuHomePhabricator

Use Envoy for making GET requests to lang.wikipedia.org/api.php
Closed, ResolvedPublic

Description

When handling GET requests, we currently make HTTP requests directly to e.g. cs.wikipedia.org/api.php. We should use Envoy instead.

Event Timeline

kostajh renamed this task from Use envoy for making GET requests to lang.wikipedia.org/api.php to Use Envoy for making GET requests to lang.wikipedia.org/api.php.Mar 2 2021, 1:15 PM

Change 667868 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/deployment-charts@master] [WIP] Use Envoy for requests to MediaWiki API

https://gerrit.wikimedia.org/r/667868

So AIUI what needs to happen:

  1. add the mwapi-async listener to the helmfile values.yaml (done in the WIP patch)
  1. change the application code, currently the code which calls MW API looks like this:
getPageDict(page_title, wiki_id, os.environ.get("MEDIAWIKI_API_URL"))

[...]

def getPageDict(title: str, wiki_id: str, api_url: str = None) -> dict:
   [...]
    api_url = api_url or "https://{0}.wikipedia.org/w/api.php".format(
        wiki_id.replace("wiki", "").replace("_", "-")
    )
    headers = {"User-Agent": "mwaddlink"}
    req = requests.get(api_url, headers=headers, params=params)

So in the helm chart, we should be setting MEDIAWIKI_API_URL to https://api-rw.discovery.wmnet/w/api.php (although it looks like some charts use localhost:6500 instead) and then the getPageDict() method above should add a header for e.g. {"Host": "cs.wikipedia.org"}when calling the API endpoint.

Does this sound correct @JMeybohm?

So in the helm chart, we should be setting MEDIAWIKI_API_URL to https://api-rw.discovery.wmnet/w/api.php (although it looks like some charts use localhost:6500 instead) and then the getPageDict() method above should add a header for e.g. {"Host": "cs.wikipedia.org"}when calling the API endpoint.

Does this sound correct @JMeybohm?

You need to set MEDIAWIKI_API_URL to your local envoy listener instance http://localhost:6500/w/api.php (which will make your request flow through envoy towards mw-api). The Host header looks fine to me (although I must admit that I'm guessing here).

So AIUI what needs to happen:

  1. add the mwapi-async listener to the helmfile values.yaml (done in the WIP patch)
  1. change the application code, currently the code which calls MW API looks like this:
getPageDict(page_title, wiki_id, os.environ.get("MEDIAWIKI_API_URL"))

[...]

def getPageDict(title: str, wiki_id: str, api_url: str = None) -> dict:
   [...]
    api_url = api_url or "https://{0}.wikipedia.org/w/api.php".format(
        wiki_id.replace("wiki", "").replace("_", "-")
    )
    headers = {"User-Agent": "mwaddlink"}
    req = requests.get(api_url, headers=headers, params=params)

So in the helm chart, we should be setting MEDIAWIKI_API_URL to https://api-rw.discovery.wmnet/w/api.php (although it looks like some charts use localhost:6500 instead) and then the getPageDict() method above should add a header for e.g. {"Host": "cs.wikipedia.org"}when calling the API endpoint.

Does this sound correct @JMeybohm?

Indeed you should enable the mwapi-async listener and then:

  • change MEDIAWIKI_API_URL to http://localhost:6500/w/api.php
  • send the Host header with the actual wiki host you want to reach

Change 668028 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Use Host header for GET requests made via Envoy

https://gerrit.wikimedia.org/r/668028

Thanks @Joe and @JMeybohm, I'll give it a try.

One more question, as we just need read-only access, should be using https://api-ro.discovery.wmnet/w/api.php as I see some services are doing?

One more question, as we just need read-only access, should be using https://api-ro.discovery.wmnet/w/api.php as I see some services are doing?

Unfortunately there is on advantage/enforcement of the -ro endpoint at this time, so we decided against adding it as envoy listener currently as it would just duplicate things.

Change 668028 merged by jenkins-bot:
[research/mwaddlink@main] Use Host header for GET requests made via Envoy

https://gerrit.wikimedia.org/r/668028

Change 667868 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Use Envoy for requests to MediaWiki API

https://gerrit.wikimedia.org/r/667868

Change 673473 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Enable monitoring of GET endpoint for linkrecommendations

https://gerrit.wikimedia.org/r/673473

Change 673473 merged by jenkins-bot:
[research/mwaddlink@main] Enable monitoring of GET endpoint for linkrecommendations

https://gerrit.wikimedia.org/r/673473

@akosiaris @JMeybohm @hnowlan, @Tgr pointed out that our application code could be simplified a bit if we called https://api.wikimedia.org/core/v1/wikipedia/en/page/{title} instead of proxying requests to the /w/api.php MediaWiki endpoint. Should we do that? Is it possible, and would we still need a listener with our helm chart?

@akosiaris @JMeybohm @hnowlan, @Tgr pointed out that our application code could be simplified a bit if we called https://api.wikimedia.org/core/v1/wikipedia/en/page/{title} instead of proxying requests to the /w/api.php MediaWiki endpoint. Should we do that? Is it possible, and would we still need a listener with our helm chart?

Going through the edge caches (as those are the ones responsible for api.wikimedia.org) ? No, I am afraid that's not a good pattern, for a number of reasons, mostly revolving around polluting the edge caches with non organic content, artificially increasing traffic to them, making operations during maintenance or a switchover more difficult and also complicating things during debugging (it's way more easy, especially under stress, to reason about component X talking to component Y directly instead of component X talking to Y via components A, B and C).

@akosiaris @JMeybohm @hnowlan, @Tgr pointed out that our application code could be simplified a bit if we called https://api.wikimedia.org/core/v1/wikipedia/en/page/{title} instead of proxying requests to the /w/api.php MediaWiki endpoint. Should we do that? Is it possible, and would we still need a listener with our helm chart?

Going through the edge caches (as those are the ones responsible for api.wikimedia.org) ? No, I am afraid that's not a good pattern, for a number of reasons, mostly revolving around polluting the edge caches with non organic content, artificially increasing traffic to them, making operations during maintenance or a switchover more difficult and also complicating things during debugging (it's way more easy, especially under stress, to reason about component X talking to component Y directly instead of component X talking to Y via components A, B and C).

OK, thanks. What about adding a listener for communicating with /w/rest.php instead of /w/api.php, do we have a listener for that currently? (See also T278718: For external traffic release, perform lookup in GrowthExperiments cache before generating recommendations)

@akosiaris @JMeybohm @hnowlan, @Tgr pointed out that our application code could be simplified a bit if we called https://api.wikimedia.org/core/v1/wikipedia/en/page/{title} instead of proxying requests to the /w/api.php MediaWiki endpoint. Should we do that? Is it possible, and would we still need a listener with our helm chart?

Going through the edge caches (as those are the ones responsible for api.wikimedia.org) ? No, I am afraid that's not a good pattern, for a number of reasons, mostly revolving around polluting the edge caches with non organic content, artificially increasing traffic to them, making operations during maintenance or a switchover more difficult and also complicating things during debugging (it's way more easy, especially under stress, to reason about component X talking to component Y directly instead of component X talking to Y via components A, B and C).

OK, thanks. What about adding a listener for communicating with /w/rest.php instead of /w/api.php, do we have a listener for that currently? (See also T278718: For external traffic release, perform lookup in GrowthExperiments cache before generating recommendations)

I haven't checked, but that endpoint is powered by mediawiki so should already be possible using the mwapi-async listener. Feel free to use it.

Change 675753 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):
[research/mwaddlink@main] Use rest.php for handling GET requests

https://gerrit.wikimedia.org/r/675753

@akosiaris @JMeybohm @hnowlan, @Tgr pointed out that our application code could be simplified a bit if we called https://api.wikimedia.org/core/v1/wikipedia/en/page/{title} instead of proxying requests to the /w/api.php MediaWiki endpoint. Should we do that? Is it possible, and would we still need a listener with our helm chart?

Going through the edge caches (as those are the ones responsible for api.wikimedia.org) ? No, I am afraid that's not a good pattern, for a number of reasons, mostly revolving around polluting the edge caches with non organic content, artificially increasing traffic to them, making operations during maintenance or a switchover more difficult and also complicating things during debugging (it's way more easy, especially under stress, to reason about component X talking to component Y directly instead of component X talking to Y via components A, B and C).

OK, thanks. What about adding a listener for communicating with /w/rest.php instead of /w/api.php, do we have a listener for that currently? (See also T278718: For external traffic release, perform lookup in GrowthExperiments cache before generating recommendations)

I haven't checked, but that endpoint is powered by mediawiki so should already be possible using the mwapi-async listener. Feel free to use it.

👍 thanks

Going through the edge caches (as those are the ones responsible for api.wikimedia.org) ?

I don't think we care about caching (although we have no reason not to take advantage of it if it's possible), what we'd like is to avoid replicating the api-gateway project-to-domain mapping logic (although for now we only support wikipedias so it's simple). So I think we'd want to go through the API gateway but not necessarily the edge caches.

Change 675753 merged by jenkins-bot:

[research/mwaddlink@main] Use rest.php for handling GET requests

https://gerrit.wikimedia.org/r/675753