
Provide a public pull API endpoint
Closed, Resolved | Public | Feature

Description

Feature summary:

My experience with Wikimedia Enterprise is as a community member using the public dumps at https://dumps.wikimedia.org/other/enterprise_html/.

I like the JSON format and contents of the HTML dump. It is simple yet contains many useful things. I would like a public API endpoint (or MediaWiki special page) that returns the same JSON for an individual article. I think Enterprise has a pull API endpoint for this purpose, but it would be nice if the general public could also get the data for an article this way. From my understanding, all data in the JSON is already available from multiple existing API endpoints, but one would have to manually combine them into what this JSON provides. Having this JSON API endpoint would make this simpler and would also put less load on Wikipedia servers than building the JSON manually through multiple API requests.
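To illustrate the manual route described above, here is a rough Python sketch of the requests one might have to combine today. The choice of MediaWiki Action API modules and parameters (`action=parse`, `action=query` with `info`, `revisions`, and `categories`) is my assumption about a reasonable approximation, not how Enterprise builds its JSON, and `build_manual_requests` is a hypothetical helper name.

```python
from typing import List
from urllib.parse import urlencode

# Public Action API endpoint for English Wikipedia.
ACTION_API = "https://en.wikipedia.org/w/api.php"


def build_manual_requests(title: str) -> List[str]:
    """Return the URLs one would fetch and merge by hand for one article.

    Assumption: two calls roughly cover the body (HTML and wikitext) and
    the metadata (URL, revision, categories) found in the dump JSON.
    """
    parse_params = {
        "action": "parse",
        "page": title,
        "prop": "text|wikitext",
        "format": "json",
    }
    query_params = {
        "action": "query",
        "titles": title,
        "prop": "info|revisions|categories",
        "inprop": "url",
        "rvprop": "ids|timestamp",
        "format": "json",
    }
    return [f"{ACTION_API}?{urlencode(p)}" for p in (parse_params, query_params)]
```

Each additional field in the dump JSON would likely mean another module or request here, which is the extra load and extra code path a single endpoint would avoid.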

Use case(s):

The use case is when one wants the contents of a single article/revision, but in the same useful JSON format that the HTML dump provides. Sometimes one does not need the data for all articles. Moreover, it is hard to find a single article in the whole dump (there is currently no index file provided for the dump).
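For context on why this is hard without an index: the only option is a linear scan. A minimal sketch, assuming the dump is a gzip-compressed tar archive of newline-delimited JSON files and that each record carries the article title in a `name` field (both are assumptions about the dump layout; `iter_dump_records` and `find_article` are hypothetical helper names):

```python
import io
import json
import tarfile
from typing import BinaryIO, Iterator, Optional


def iter_dump_records(fileobj: BinaryIO) -> Iterator[dict]:
    """Yield one dict per non-empty NDJSON line in a .tar.gz archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            extracted = archive.extractfile(member)
            if extracted is None:  # directories and other non-file members
                continue
            for raw_line in extracted:
                if raw_line.strip():
                    yield json.loads(raw_line)


def find_article(fileobj: BinaryIO, title: str) -> Optional[dict]:
    """Scan the archive until a record with the given title is found."""
    for record in iter_dump_records(fileobj):
        # Assumption: the title is stored under the "name" key.
        if record.get("name") == title:
            return record
    return None
```

In the worst case this decompresses and parses the entire dump to return one record, which is what a per-article endpoint would avoid.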

Benefits:

Having an API that provides data in the same JSON format means that the same tools can be used to process both the HTML dump and individual articles. Moreover, having an API improves the chances that a consistent format, structure, and data are provided (a separate code path that reconstructs this data from other API endpoints might introduce slight differences).

Event Timeline

This is interesting - thanks for posting interest in this. I think this makes sense and I'm glad you're finding the JSON schema useful! It might take a little bit to plan this sort of thing out but I agree having this available does make sense especially sitting alongside the dumps.

Posted on the other ticket you made, but I'll keep this floating on our board and kick off some conversations with the team.

In the meantime, is the code which produces the current JSON available somewhere as open source? I could use that to generate similar JSON for myself while waiting for the API endpoint.

Hey @Mitar - open source code is published here + we'll be releasing a trial feature later this month that will allow you to pull the public endpoint. Thanks for your patience here.

Hi from the Enterprise Dev Team!

Have you been able to try our push service, with our Real-Time API?

Happy to have more feedback. I assume we can close this ticket now that the push endpoint has been up and running for a long time.

If I don't hear back, I'll close this ticket in a week.

Thank you
Ruairi O'Donnell

Is the push endpoint available to community members, or just to enterprise customers? Could you please provide a link to the documentation? Maybe I missed something?

You are right. For the community, we don't offer push. Sorry for my miscommunication.

The On-demand API offers a single-article response, instead of going through an entire project dump: https://enterprise.wikimedia.com/docs/on-demand/

And is the article lookup API not something that would be made public? Maybe as a deployment with a lower SLA and rate limited differently than for enterprise users? I simply find the per-article contents aggregated in JSON very useful to get in one API query, especially if you start with a dump and then want to fill in gaps or update local info about articles.
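To make the "start with a dump, then fill in gaps or update" workflow concrete, here is a minimal sketch of the merge step. It assumes each record has a title under `name` and a revision ID under `version.identifier` (my reading of the dump schema, not confirmed here); `revision_id` and `merge_fresh_records` are hypothetical helper names.

```python
from typing import Dict, Iterable


def revision_id(record: dict) -> int:
    """Return the record's revision ID, or 0 when it is missing."""
    # Assumption: the revision ID lives at record["version"]["identifier"].
    return int(record.get("version", {}).get("identifier", 0))


def merge_fresh_records(local: Dict[str, dict], fresh: Iterable[dict]) -> Dict[str, dict]:
    """Combine dump records with freshly fetched ones, keeping the newest.

    `local` maps article title to the record loaded from the dump;
    `fresh` holds records fetched per article from an API.
    """
    merged = dict(local)
    for record in fresh:
        title = record["name"]
        current = merged.get(title)
        # Fill gaps (title unknown locally) or update stale entries.
        if current is None or revision_id(record) > revision_id(current):
            merged[title] = record
    return merged
```

Because both sources would share one schema, the same comparison works regardless of where a record came from.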

Can we first address this:
Is the On-Demand API (docs link I gave prior) what you're looking for in regards to functionality?
It grabs the current version of a single article. That is what you're asking for, yes?

Secondly:
If the 'method' by which you obtain said data isn't agreeable, I believe that to be a different issue from the data you're asking about. On-demand endpoint access is free for the first 10,000 requests with an account, which is also free. We don't offer other access at this time.

Our policy has always been that if there's a genuine community use-case (see the 4th bullet point under "free" at https://meta.wikimedia.org/wiki/Wikimedia_Enterprise#Access) that cannot be served by any of the existing other methods or tools, we can provide longer term access via the enterprise-portal at no cost upon request. If that's the case then create a new ticket with that context.

Feel free to sign up to use On-Demand (and snapshots, like the dumps you've been using): https://enterprise.wikimedia.com/signup/

@Mitar Preferably, we would like you to sign up and use the trial account to see if that meets your needs. But as mentioned earlier, access to our APIs is also available through the WMCS services.
From the Enterprise section on https://wikitech.wikimedia.org/wiki/Portal:Data_Services you can follow the link to PAWS and create a JupyterLab instance to access the WME services.

A quick curl command in the Jupyter shell shows that I can easily request articles by name, e.g.
curl https://api.enterprise.wikimedia.com/v2/articles/Gaza_Strip
which will get you the JSON-formatted version of https://en.wikipedia.org/wiki/Gaza_Strip
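For anyone who prefers a notebook cell to the shell, the curl request above could be sketched in Python roughly as follows. The URL pattern comes from the command above; the optional `Authorization: Bearer` header is an assumption for use outside PAWS with an account token, and `build_article_request` and `fetch_article` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import quote

# Base URL taken from the curl example in this thread.
API_BASE = "https://api.enterprise.wikimedia.com/v2/articles/"


def build_article_request(title: str, token: Optional[str] = None) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for one article by name."""
    path = quote(title.replace(" ", "_"), safe="")
    request = urllib.request.Request(API_BASE + path, headers={"Accept": "application/json"})
    if token:
        # Assumption: account tokens are passed as a bearer token.
        request.add_header("Authorization", f"Bearer {token}")
    return request


def fetch_article(title: str, token: Optional[str] = None):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_article_request(title, token), timeout=30) as response:
        return json.load(response)
```

Calling `fetch_article("Gaza_Strip")` should be equivalent to the curl command, subject to the same access requirements.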

Is the On-Demand API (docs link I gave prior) what you're looking for in regards to functionality? It grabs the current version of a single article. That is what you're asking for yes?

It looks like it. I could not 100% confirm it, but it looks like it is returning data in this schema, which is the same as what is in the HTML enterprise dumps, so yes, this API looks like what I am looking for in regards to functionality.

If that's the case then create a new ticket with that context.

That is interesting. Thank you. I will consider it. It might work for my use case (I am working on an open source non-profit search engine over Wikipedia content).

But I still think that the on-demand API should simply be offered for free to the general community, like other MediaWiki APIs, with some reasonable rate limiting defaults and without having to specially ask for it. I understand the idea behind the Enterprise offering, and that it can include higher rate limits, higher SLAs, and other types of API endpoints (real-time, push, etc.). But the pull API endpoints themselves should be open in my view. That would also enable the open source community to build more tools around it, like the ones I have made for dumps but not for the enterprise API because I do not have access to it (yet).

creynolds claimed this task.

so yes, this API looks like what I am looking for in regards to functionality.

Nice, that's great to hear!

re: rate limiting and non-email login access convo - out of scope for this ticket but feel free to hit up our wiki talk page.