
Make it possible to access the Realtime API and On-demand API without authentication
Open, Needs Triage, Public, Feature

Description

Feature summary (what you would like to be able to do and where):

I would like to access (instances of) the Realtime API and On-demand API publicly without authentication.

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):

As a community developer, I would like to integrate the Realtime API and On-demand API into public end-user tools without authentication. Currently I can either expose authentication information in clients or create open proxies; both options are bad enough to be disqualifying.

Benefits (why should this be implemented?):

Allowing public access to the APIs would let developers integrate them into end-user software without creating open proxies or exposing authentication credentials.

Given the claim that Wikimedia Enterprise makes money off "speed and volume of throughput, the SLA guaranteeing stability/uptime/response time, and the ‘hold my hand’ customer service", setting up public instances of these APIs under the same terms as other public APIs should not impact Wikimedia Enterprise sales.

Event Timeline

@Abbe98 Let me look into your requests. I will get back to you in a few days.
In the meantime, a request which I think is similar to yours was previously posted on our talk page back in January. Please check if my response there addresses any of your questions/concerns:
https://meta.wikimedia.org/wiki/Talk:Wikimedia_Enterprise#c-HShaikh_(WMF)-20240112213100-LWyatt_(WMF)-20240104000900

Hello @Abbe98. Sorry for the delay, and thank you for your patience in waiting for my reply. I’m posting a reply here in Phabricator (and @LWyatt will link to it on the Mastodon thread where this question originated). Moreover, because the question you have asked might come up again in the future, we will add “why does the API require authentication?” to our FAQ on metawiki soon.

I see in the Mastodon discussion you had recently about Enterprise that:
“I would love to use some of these new capabilities for Govdirectory.org and FornPunkt.se…” but that you “...can't use these new capabilities or build tooling around them as they would just become open proxies for WMFs paid service.”

I understand that you would like the tool/code you create to be open sourced and open access, and, from your description above, that you would also like to incorporate content from the Enterprise API within it. However, I am not understanding how that would necessarily make it an open proxy for use. From my understanding, most cases where APIs are being incorporated into code that is meant to be open sourced can be done via use of libraries or config files. The developer in this case uses the config file to allow for easy credential substitution, and the user can sign up for the service on their own. Would such a usage pattern align with your design for “open access” usage?
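
A minimal sketch of the config-file pattern described above. The environment variable names (`ENTERPRISE_USERNAME`, `ENTERPRISE_PASSWORD`) and the `[enterprise]` config section are hypothetical, chosen for illustration; they are not defined by the Enterprise API. The open-source code ships without secrets and each user supplies their own:

```python
import configparser
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Credentials:
    username: str
    password: str


def load_credentials(env: Mapping[str, str] = os.environ,
                     config_path: Optional[str] = None) -> Credentials:
    """Load user-supplied credentials; the environment wins over the config file."""
    # Hypothetical variable names, used here for illustration only.
    username = env.get("ENTERPRISE_USERNAME")
    password = env.get("ENTERPRISE_PASSWORD")
    if username and password:
        return Credentials(username, password)

    if config_path is not None:
        parser = configparser.ConfigParser()
        parser.read(config_path)  # a missing file is skipped without error
        if parser.has_section("enterprise"):
            section = parser["enterprise"]
            if "username" in section and "password" in section:
                return Credentials(section["username"], section["password"])

    raise RuntimeError("No credentials found; sign up and configure your own account.")
```

This pattern fits server-side or command-line tools, where the configured values stay on the machine of the person who signed up.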

In the case that you described, credentials are openly shared to simulate anonymous use where authentication is not needed. One problem that arises from that approach is the inability of the service’s owner to know whether individual users are using a disproportionate amount of resources, or even whether there is an abuser in the mix. This is the classic “tragedy of the commons” situation. Also, the use of authentication here is not to keep an eye on who is using the API and for what reason. Anyone can use an anonymous email account to create an account and use that without disclosing any personal information. It is mostly meant to keep users of the APIs updated on changes and to give warnings about deprecations. This all assumes that the user of the APIs has not abandoned the open source project and someone is still maintaining it. I also agree that the current barrier to entry for non-enterprise users with the auth mechanism is not straightforward, and as we see more users trying to use it in a smaller capacity, we are looking to add open libraries that will help with access key management and make it easier to authenticate to our services.

The Enterprise APIs originated from the need to give the very high-volume reusers of Wikimedia APIs a separate pipeline for the same content, built using whatever technical formats (and service requirements) those reusers prefer. This removes a major part of their traffic from the existing Wikimedia APIs, freeing up bandwidth for public use. Allowing unauthenticated use of Enterprise APIs would have the same “tragedy of the commons” effect, but on those APIs too, and the SLAs would still need to be maintained, at even greater expense. The WMF tracks reusers that consume a major share of the bandwidth and at times enforces limits as well, but not all reusers are easy to identify because of the anonymous usage of the APIs. Our awesome SRE team just handles these things very well.

I am not understanding how that would necessarily make it an open proxy for use. From my understanding, most cases where APIs are being incorporated into code that is meant to be open sourced can be done via use of libraries or config files.

Could you share an example of how you can protect credentials within a web-based JavaScript client? I don't see how that would be possible.

One problem that arises from that approach is the inability of the service’s owner to know whether individual users are using a disproportionate amount of resources, or even whether there is an abuser in the mix. This is the classic “tragedy of the commons” situation. Also, the use of authentication here is not to keep an eye on who is using the API and for what reason. Anyone can use an anonymous email account to create an account and use that without disclosing any personal information. It is mostly meant to keep users of the APIs updated on changes and to give warnings about deprecations. This all assumes that the user of the APIs has not abandoned the open source project and someone is still maintaining it.

This could be said for all Wikimedia APIs.

Allowing unauthenticated use of Enterprise APIs would have the same “tragedy of the commons” effect, but on those APIs too, and the SLAs would still need to be maintained, at even greater expense. The WMF tracks reusers that consume a major share of the bandwidth and at times enforces limits as well, but not all reusers are easy to identify because of the anonymous usage of the APIs.

That's why I'm suggesting that separate instances of these APIs be set up on WMF infrastructure with the same terms as other public APIs.

@Abbe98 thanks for the explicit ask on the code snippet scenario; it clarifies the request for me. I was under the impression this was something that could be done server side. In the case of client-side-only scripts, the solution would not be simple to implement. Some possibilities are listed below (I understand that some of these might be options that cannot be used or are ill advised for some use cases):

  1. Ask the user of the JavaScript client to authenticate via username and password. That way you are facilitating the use of the APIs, but the procurement of the credentials is left to the end user. (Then follow best practices for storing and managing API keys client side for the session.)
  2. Use a proxy. WMCS provides authentication-free access to Enterprise APIs. Having code running there that ferries the requests over is a possibility.
  3. Implement server-side auth management that allows client-side calls to be made. This would be more complicated than the proxy mentioned in 2, but could allow for a longer-running instance, which might be limited on WMCS. (Authenticating and obtaining a JWT is typically done on the server side: you send user credentials to your server, and the server returns a JWT.)
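
A minimal sketch of the server-side token handling in option 3, under stated assumptions: the `login` callable is a hypothetical stand-in for whatever call exchanges the stored credentials for a token, and the `Bearer` header scheme is assumed rather than taken from the Enterprise documentation. The point is only that the token lives on the server and is attached to forwarded requests there:

```python
import time
from typing import Callable, Dict, Optional, Tuple

# A login function returns (token, lifetime_in_seconds). How it talks to the
# auth service is deliberately left out; it is a hypothetical stand-in.
LoginFn = Callable[[], Tuple[str, float]]


class TokenCache:
    """Keeps one server-side token and refreshes it shortly before expiry."""

    def __init__(self, login: LoginFn,
                 clock: Callable[[], float] = time.monotonic,
                 refresh_margin: float = 60.0) -> None:
        self._login = login
        self._clock = clock
        self._margin = refresh_margin
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            token, lifetime = self._login()
            self._token = token
            self._expires_at = now + lifetime
        return self._token


def with_auth(headers: Dict[str, str], cache: TokenCache) -> Dict[str, str]:
    """Return a copy of the outgoing headers with the server's token attached."""
    # Drop any Authorization header sent by the browser; only the server's is used.
    out = {k: v for k, v in headers.items() if k.lower() != "authorization"}
    out["Authorization"] = f"Bearer {cache.get()}"  # header scheme is an assumption
    return out
```

The browser never sees the credentials or the token; it only talks to the backend that calls `with_auth` before forwarding.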

On your second point, that “it is true for WMF APIs as well”, I do not dispute that. This is something that the SRE team at WMF also has to handle, and from the data I have seen there are times when traffic from some specific user agents is way over the recommended limit, which can, and does, cause momentary failures on API calls. This can be solved by retrying later, but it is a non-ideal scenario that inconveniences the intended users: the community.

The creation of such services on the WMF infrastructure, provided through the existing APIs, is a request that I am afraid is out of my wheelhouse. The mandate for the Enterprise team was and is specifically set to provide services for, and optimized for, large reusers. The product we have created is based on research with that in mind. To ensure equity, it will not create features that are contractually exclusive to paying customers and will not disrupt the existing WMF API offering. (In fact, all Enterprise offerings depend on the full functionality of the WMF APIs to continue existing.) Thus access to the Enterprise APIs is available, but it does not follow the regular WMF API mechanism. The creation of such APIs for community use is not something on our short-term road map, as that is the scope of other teams at the Foundation. However, we will be restructuring the no-cost version of the existing Enterprise API to make sure it can be used continually by the community etc., not merely as a temporary trial (I know this is not the same as the issue about auth; it is an attempt to show the spirit of sharing and leveling the playing field). In the long term we plan to do cooperative work with the MediaWiki team to make it easier, but the timeline and prioritization depend on annual plan priorities.
A client-side API would need to be created with the community in mind and led by a team dedicated to supporting that case. (I don’t mean to shift the work to “someone else”, but the mandate and capacity of the staff that I am responsible for, the Enterprise team, don’t give us room to take on a second offering for a different audience.) The Enterprise team would be happy to work with other teams to facilitate the creation of such a product.

We have run many recorded video calls for the community to discuss the Enterprise API over previous years, and if you would like, we would be happy to schedule a call with you (and anyone who would like to join). A real-time conversation would be a more efficient way to have this discussion, if that would be of interest to you.

One problem that arises from that approach is the inability of the service’s owner to know whether individual users are using a disproportionate amount of resources, or even whether there is an abuser in the mix. This is the classic “tragedy of the commons” situation. Also, the use of authentication here is not to keep an eye on who is using the API and for what reason. Anyone can use an anonymous email account to create an account and use that without disclosing any personal information. It is mostly meant to keep users of the APIs updated on changes and to give warnings about deprecations. This all assumes that the user of the APIs has not abandoned the open source project and someone is still maintaining it.

This could be said for all Wikimedia APIs.

Yes, and it's true there as well. From the POV of SRE and our normal public infrastructure and APIs, this is a key argument /for/ Enterprise. We do generally have issues where one or a few parties consume massive resources on our public endpoints, causing outages and interference for the general public access we're trying to offer. Enterprise is an authenticated interface that high-load consumers can be offloaded towards. Because the access there is not anonymous, excessive consumption can be negotiated for both paying and non-paying consumers and reasonable limits enforced on a true per-customer level, which in turn frees up capacity and resiliency on the general public service we offer via our public infrastructure. It also allows us to slowly ratchet down anti-abuse limits on individual IPs/UAs accessing the general open/shared interfaces we host, with the aim of preserving their uptime and utility for normal-scale consumption from the broader public.

[Also, "Tragedy of the Commons" is definitely not a myth when it comes to hosting important Internet resources on a shoestring budget and trying to keep them up and available for all in the face of sometimes-unmindful access patterns by a few accidentally-heavy users]

The creation of such services on the WMF infrastructure, provided through the existing APIs, is a request that I am afraid is out of my wheelhouse. The mandate for the Enterprise team was and is specifically set to provide services for, and optimized for, large reusers.

@LWyatt asked you here and pointed me here. If this is not under your responsibilities just say so instead of suggesting that I breach the terms of use to resolve my use cases.

@BBlack & @HShaikh, at no point have I claimed that a few parties consuming massive resources isn't an issue; I know it is. I suggested setting up separate instances in my initial suggestion because of this, and I don't see what blocks that, unless you are arguing that the existing public APIs should be put behind auth as well.

If I understand correctly, @Abbe98 is describing a direct frontend integration, where traffic is not routed via an app-specific backend that adds the token and forwards the traffic to the credentialed API endpoint?

This indeed would not be possible, as you would always disclose the credentials in the frontend. You always need to route the traffic via a backend, which adds extra implementation cost for developers used to the open Wikimedia ecosystem.

A solution to this could indeed be an API proxy host that an app can route traffic to, which filters on referrer and applies rate limits configurable by the app developer, etc. Maybe with some additional controls over which APIs/paths the incoming traffic is allowed to use (a bit like an OAuth grant).

This would turn anonymous uncontrolled traffic of a specific app into controlled credentialed traffic for that app, and people who don't want to do this themselves do not have to reimplement it for each of their applications.
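
A rough sketch of the per-app checks such a proxy host could run before forwarding a request. Everything here is hypothetical: the `AppPolicy` fields, the example host and path prefix, and the choice of a token bucket for rate limiting are illustrative assumptions, not an existing Wikimedia service:

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class AppPolicy:
    allowed_hosts: Tuple[str, ...]          # hosts accepted in the Referer header
    allowed_path_prefixes: Tuple[str, ...]  # upstream paths the app may call
    rate_per_second: float                  # token-bucket refill rate per client
    burst: int                              # token-bucket capacity per client


@dataclass
class _Bucket:
    tokens: float
    updated: float


class ProxyGate:
    """Decides whether one incoming request may be forwarded upstream."""

    def __init__(self, policy: AppPolicy,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._policy = policy
        self._clock = clock
        self._buckets: Dict[str, _Bucket] = {}

    def check(self, client_id: str, referer: str, path: str) -> Tuple[bool, str]:
        host = urlsplit(referer).hostname or ""
        if host not in self._policy.allowed_hosts:
            return False, "referrer not allowed"
        if not any(path.startswith(p) for p in self._policy.allowed_path_prefixes):
            return False, "path not allowed"

        # Token bucket: refill by elapsed time, then spend one token per request.
        now = self._clock()
        bucket = self._buckets.setdefault(
            client_id, _Bucket(float(self._policy.burst), now))
        refill = (now - bucket.updated) * self._policy.rate_per_second
        bucket.tokens = min(float(self._policy.burst), bucket.tokens + refill)
        bucket.updated = now
        if bucket.tokens < 1:
            return False, "rate limit exceeded"
        bucket.tokens -= 1
        return True, "ok"
```

Note that a `Referer` header can be forged by non-browser clients, so the referrer check is a filter rather than authentication; the rate limit is what bounds the traffic an app can generate.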

This is essentially what I do for Nominatim on Toolforge as well. I take in WM traffic, anonymise, rate limit and cache it, and then forward anonymous requests to OSM, including my own contact info in case they do have an issue with our traffic.
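
The anonymise, cache and forward steps could look roughly like the sketch below. It is not the actual Toolforge code: the header list, the `fetch` callable and the TTL are assumptions made for illustration, and rate limiting is left out to keep it short:

```python
import time
from typing import Callable, Dict, Tuple

# Request headers that could identify the end user; stripped before forwarding.
IDENTIFYING_HEADERS = {"cookie", "authorization", "referer",
                       "x-forwarded-for", "user-agent"}


def anonymise(headers: Dict[str, str], contact_user_agent: str) -> Dict[str, str]:
    """Drop identifying headers and send the proxy operator's contact info instead."""
    out = {k: v for k, v in headers.items() if k.lower() not in IDENTIFYING_HEADERS}
    out["User-Agent"] = contact_user_agent
    return out


class CachingForwarder:
    """Serves repeated URLs from a TTL cache and forwards the rest anonymously."""

    def __init__(self, fetch: Callable[[str, Dict[str, str]], bytes],
                 contact_user_agent: str, ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch  # hypothetical upstream call, injected by the caller
        self._ua = contact_user_agent
        self._ttl = ttl
        self._clock = clock
        self._cache: Dict[str, Tuple[float, bytes]] = {}

    def get(self, url: str, headers: Dict[str, str]) -> bytes:
        now = self._clock()
        hit = self._cache.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        body = self._fetch(url, anonymise(headers, self._ua))
        self._cache[url] = (now, body)
        return body
```

Because the upstream only ever sees the proxy's `User-Agent` with contact details, the upstream operator has someone to reach if the combined traffic becomes a problem.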

I think the issue mostly identified here is that the cost of protecting the rate limits is now offloaded to the implementer, which is fine if you're Google, but has a high impact on smaller users of the API.

Two things can be true at once. The Commons makes great things, but it also destroyed free tier CI pipelines because people ran bitcoin miners in them.