
WikiDev 16 working area: Content access and APIs
Closed, DeclinedPublic

Description

This is a potential area for work at Wikimedia-Developer-Summit-2016. "Content access and APIs" is about getting our data in-and-out of the system (e.g. rest.wikimedia.org). The central problem in this area: "how do we make accessing and distributing our data easier and more useful?"

Session proposals to consider in this area:

Other working areas (and the meta conversations about the idea of working areas) can/should be found here: T119018: Working groups/areas for macro-organization of RfCs for the summit

Related Objects

Event Timeline

RobLa-WMF claimed this task.
RobLa-WMF raised the priority of this task from to Needs Triage.
RobLa-WMF updated the task description. (Show Details)
RobLa-WMF updated the task description. (Show Details)
RobLa-WMF set Security to None.
RobLa-WMF renamed this task from T119022: WikiDev 16 working area: Content access and APIs to WikiDev 16 working area: Content access and APIs. Nov 19 2015, 12:50 AM

The number one most important topic listed here for me is T113210: How should Wikimedia software support non-Wikimedia deployments of its software?. Having a decision on the topic of non-Wikimedia usage of MediaWiki would make resolving many other technical and resourcing debates possible. For example: who will be supporting software not used on the Wikimedia cluster (databases, caching solutions, PHP runtimes, ...), who will be supporting features not used on the Wikimedia cluster (installers, package formats, ...), who will be supporting platforms not used on the Wikimedia cluster (anything other than the current WMF stack), when are the needs/desires of the Wikimedia cluster subordinate to the needs/desires of other MediaWiki users.

I believe the fundamental question is "does the Wikimedia Foundation see supporting third-party usage of MediaWiki and the health of MediaWiki as a FLOSS product as a fundamental component of their funded mission or is that a by-product that should be managed by volunteers?"

@GWicke - would you be available to help out with a last minute sweep through this area for any must haves, and make sure we have them on the "must have" list (T119593)? I'm going to try doing some more organization of this tonight so that we can collectively see what we neglected to get into the schedule, but I think someone (not primarily me) is going to need to be the champion for anything purported to answer the question "how do we make accessing and distributing our data easier and more useful?"

My apologies for not formally asking you sooner. I overestimated my ability to catch you for a quick conversation on the subject.

Here are the issues already on the tentative list at T119593: Define the list of "must have" sessions for WikiDev '16 (note, these are not certain choices, but are on an early list):

None of the others are on our tentative list yet. Which are the must haves in this area?

I would be most interested in discussing and working on the search / structured data RFC, recent changes, and maps. Dumps are also quite important, and I would be interested in those too.

T114019: Dumps 2.0 for realz (planning/architecture session)
T113526: Discuss the future of Maps and Geo-related projects at WMDS2016
T114474: More flexible and modernized Recent Changes code
T89733: Allow ContentHandler to expose structured data to the search engine.

Some initial grouping proposals:

SOA & third party

Content structure & APIs

This is in turn closely related to the API driven frontend track:

Media

API / feature feedback

Cross-wiki / namespace transclusion

Offline

Misc

Better in "software engineering" track?

Hackathon

Possibly not the best fit for the summit

Qgil triaged this task as Medium priority. Dec 11 2015, 8:11 AM

@GWicke: thanks for breaking these down by area! Do you have an opinion about which proposals are the most important to discuss at the summit, and which ones are merely "nice to have"?

The only proposal you're proposing to eliminate, T114019: Dumps 2.0 for realz (planning/architecture session), has pretty broad support for further discussion. Why does that one not make the cut?

@Qgil: could you work with @GWicke to help prioritize and clarify this area? I fear that @GWicke's answer to the question "how do we make accessing and distributing our data easier and more useful?" doesn't account for the breadth of interest you allude to in T119593#1886999. We need to use this summit as an opportunity to bring these people together and hear each other, rather than divide them into tidy parallel sessions where they don't have to deal with the inconvenience of working together on common solutions.

@GWicke, thanks for the groupings! I think this clarifies (in my mind) what the scope of this area should be about.

I'll reiterate the central question, since I think it's critical to how we should think about highlighting proposals for this area:

How do we make accessing and distributing our data easier and more useful?

Below is my attempt to iterate on the taxonomy that @GWicke created. This is not in order of priority, but rather, the issues in "primary interests for this area" are basically assigned to this area, whereas the overlaps might be more likely to find champions in other areas.

I'm talking in the SF office with @GWicke about this now. He's pointed out T93396: Decide on format options for HTML and possibly other dumps to me, which seems to be stalled out. Is getting that unstalled relevant to this conversation?

@RobLa-WMF: I commented on T93396 as well. I think if that can be focused more on prioritization, then it could make for a good break-out session.

I haven't found the time to fully think through the schedule yet. I am not working today, but will work tomorrow. I'll have a detailed proposal by tomorrow noon.

Currently, "content format" and "software engineering" tracks have a lot of topics of general importance on their respective candidate lists:

  • Content structure, -storage and APIs
    • Media storage & APIs
  • API-driven frontend
  • SOA & third party users

Given the amount of material and overlap with content access and APIs, I think we should see if those can be covered adequately in those tracks, and if not consider using some time in this slot to make sure that they are adequately covered.

On the other hand, my impression is that several of the sessions listed as primary interest for the content access & API track seem to be focused on more specialized audiences.

T113540: What can the Search API do for you? and T112956: Developer summit session: Pageview API from the Event Bus perspective are basically presentation / feedback sessions about specific APIs, which could probably also be effective in a break-out session. T113526: Discuss the future of Maps and Geo-related projects at WMDS2016 seems to be more about feature brainstorming, again with a fairly focused audience. T89733: Allow ContentHandler to expose structured data to the search engine. was originally accepted, but seems to be in a "back to the drawing board" kind of situation after encountering some difficulties. The discussion seems to be fairly technical, and possibly better in a smaller group.

T114019: Dumps 2.0 for realz (planning/architecture session) has a fairly elaborate design proposal, but seems to lack clarity on the problem(s) it is trying to solve. The topic is again somewhat specialized, and I think better handled in a break-out session.

The offline RFCs T106898: Offline editing to support people with intermittent or no internet access (e.g. Kiwix, mobile) and T113004: Make it easy to fork, branch, and merge pages (or more) are ambitious and of relatively wide significance. My main concern with those sessions is that the number of participants familiar with the thorny issues around merging is likely to be relatively small. The express goal of T113004: Make it easy to fork, branch, and merge pages (or more) is to identify an attack strategy, but doesn't propose a concrete plan of action so far.

T114247: Data analysis with (python) MediaWiki-Utilities -- A unix philosophy-inspired collection of packages is a mix of a presentation, workshop & feedback session. I think it would work best in a break-out session.

Finally, T114474: RFC: More flexible and modernized ChangesList formatting for Recent Changes seems to be primarily a proposal to clean up the RecentChanges code in MediaWiki. It might be possible to discuss this topic in a regular RFC meeting.

I agree with Gabriel that having https://phabricator.wikimedia.org/T113540 informal and for feedback would be best. The phab task has already been really helpful to get some of that feedback. We can think about it un-conference style.

The part of the discussion that @Halfak and @Anomie are having at T112956#1909946 seems like a broader meta point for discussion in this area. In a nutshell, they are discussing the value of creating a shim in the action API to a system outside of our PHP-based MediaWiki core (the Pageview API discussed in T112956). @Halfak suggested that he and @Anomie discuss a possible MediaWiki-Action-API shim for ORES. He also mentioned conversations he's having with @GWicke about a RESTBase shim for ORES.

That's admittedly my cherrypicking of the conversation. Earlier conversation in T112956 suggested that @Nuria and @Milimetric probably have/had different opinions on this subject.

It would be good to have more of us agree on our aspiration in this area as a result of Wikimedia-Developer-Summit-2016.

notes from the etherpad:

https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit_2016/T119029
https://phabricator.wikimedia.org/T119029
Slides:
https://docs.google.com/presentation/d/1G0adWzJBiBcaTupW7XfeLcFA35w8g6dNbOT6ZP6w_c8/edit?usp=sharing
https://etherpad.wikimedia.org/p/WikiDev16-AllNotes

Summary

(during notetaking, if there are enough notetakers, collaborate on a summary in addition to a full summary)

Questions discussed:

  • What should our APIs look like longer term?
    • REST vs RPC (speed+simplicity vs batching). We talked about cachability (and when that's important) and simplicity of each
  • How many APIs should we aim for?
  • How should we get there?

Other questions raised:
Are there benefits to simplicity of REST vs power of Action API?
How do you discover APIs?
Can we make api.php change more quickly?
Can we have someone improve the API documentation?
How much functionality is simpler when it's site-specific vs cross-cutting?
Does introducing reliable layers make the system more reliable or less?
Shared place for documentation?
How should data dumps be generated and updated?
How are we considering domain name choices for APIs? How do we balance centralization vs contextual access to APIs?

Roles:

Facilitator: Gabriel
Gatekeeper: Daniel, then Timo
Scriber: Katie (aude)
Timekeeper: Andrew (?)

Detailed notes

Gabriel: let's talk about high level picture of where we are going w/ apis

high level questions:

  • rest vs. rpc?
  • versioning for api? (action has format param, rest has major version in path, etc.)
  • how many apis to have? (one per service? one for everything? something in between?)
  • how to get there?

new functionality built on apis (e.g. apps)

main apis:

  • action api for serving requests (e.g. action=edit)
  • rest api (mostly content based)
  • other smaller, service apis

caching issues

  • can't easily cache api requests with query params (can be in different orders)
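One common way to work around the parameter-order problem noted above is to normalize the URL before using it as a cache key. The following is a minimal sketch of that idea, not a description of how the Wikimedia caching layer actually behaves; the function name `normalize_url` and the `example.org` URLs are assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return the URL with query parameters sorted by name, then value.

    Hypothetical helper: two requests that differ only in parameter
    order map to the same string, so they can share one cache entry.
    """
    parts = urlsplit(url)
    # keep_blank_values so that "?foo=" is not silently dropped
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), "")
    )


a = normalize_url("https://example.org/w/api.php?format=json&action=query&titles=Foo")
b = normalize_url("https://example.org/w/api.php?titles=Foo&action=query&format=json")
print(a == b)  # both orderings yield the same cache key
```

A REST-style path such as `/page/html/{title}` avoids the issue altogether, because there is only one way to spell the request.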

anomie: one api is not a good solution, rest is good for giving information about specific things (e.g. an image), rest is hard for requesting info about 500 things (use case of many bots, etc). we need both approaches.
gabriel: (batching vs speed+simplicity) generators may help with batching, or rest is quick and clients can handle batching?
gabriel: more high traffic endpoints, can we make them cachable? more complex interaction and queries might not have as much volume and need so much caching
bryan: rest v rpc question is false dichotomy, there are many varied and complex use cases. more content export requests align better with rest ("tell me about this noun"), action can be a bit more suitable for query/search type requests
jordan: is this about restful? v rpc api? or specifically about action api?
gabriel: post vs cachable
jordan: could have restful, but have client libraries that provide more flexibility
_joe_: two apis serve different purposes, action can be overkill for simple things, for casual users restful can be more simple and easy; complex requests don't necessarily need caching or might be difficult
grpc
robla: reminder, topic is getting access to content, getting it in and out and infrastructure around that
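
The batching versus speed-and-simplicity trade-off discussed above can be made concrete by counting requests. This is a rough sketch under stated assumptions: the base URLs and the REST path are placeholders, the helper names are hypothetical, and the batch size of 50 reflects the usual action API limit on pipe-separated `titles` for ordinary clients (bots may be allowed more).

```python
from typing import Iterator, List, Sequence
from urllib.parse import quote, urlencode


def chunk(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def action_api_urls(api_base: str, titles: Sequence[str], batch_size: int = 50) -> List[str]:
    """Build one action=query URL per batch of pipe-joined titles."""
    return [
        api_base + "?" + urlencode(
            {"action": "query", "format": "json", "titles": "|".join(batch)}
        )
        for batch in chunk(titles, batch_size)
    ]


def rest_urls(rest_base: str, titles: Sequence[str]) -> List[str]:
    """Build one URL per title; each is simple and individually cacheable."""
    return [rest_base + "/" + quote(t.replace(" ", "_"), safe="") for t in titles]


titles = [f"Page {i}" for i in range(120)]
print(len(action_api_urls("https://example.org/w/api.php", titles)))  # 3 batched requests
print(len(rest_urls("https://example.org/page/html", titles)))        # 120 single requests
```

The batched form needs far fewer round trips, while the per-title form produces stable URLs that an HTTP cache can serve without reaching the backend. That is the tension the participants describe.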

dan: (had session earlier https://etherpad.wikimedia.org/p/WikiDev16-DataFlows), talked about general concept of data flows, processing and serving data and how it is handled by different teams (e.g. search, fundraising...) EventBus could solve different use cases and could work together on it.
dan: eventbus is for structuring data (with schema) about events through a stream and not worry so much about consumption part.

krinkle: batch requests in action api, very important use case for a lot of bots and tools; but, alternatives... also restbase is sometimes a proxy for some other service or api (e.g. pageview data)
moritz: performance? (hadoop)
dan: don't want to query hadoop, reliability issues (keep it up), sometimes have to restart it, only have one hadoop cluster, etc. separate storage layer for this data is important
brian: granularity in end points... restbase can be more defined in what you get (schema), nicer to work with as a developer
Gabriel: Do we want services to expose their APIs to users directly (ores.wikimedia.org), or via restbase. Shared generated documentation. Shared discovery.
corey: multiple apis... problem with discoverability of apis, so many, need better documentation
adam: api.php documentation is not so bad; higher performance apis should be more visible (e.g. documented)
anomie: Re ORES, I wonder whether part of the reason for its own API was because it was initially implemented on Labs, and production services can't depend on Labs. I hear they're looking at integrating with both restbase and the action API (e.g. T122689)
_joe_: we don't have a good site for information/documentation of our APIs. We should probably have somebody specifically assigned to do that work of organizing documentation for end users

jordan: we (Google) decouple the transport mechanism, endpoint, and libraries from the API surface area via grpc. similar deduping at WMF would allow for less effort to learn, build, and use new APIs, and also allows for purpose-specific API functions/property names. We don't have a PHP server-side binding but could help build one if desired.
krinkle: regarding merging of services behind restbase, focus on reliability and uptake; if ores is separate, not affected by downtime in restbase
_joe_: centralized not so good for stability
gabriel: varnish can help with stability
krinkle: shared place for documentation... S worked on it on mediawiki.org and krinkle on wikitech. there also is developer hub.
krinkle: also unified interface and versioning with centralized api
aude: we have https://www.mediawiki.org/wiki/How_to_contribute linked ("Developers") at the bottom of all wiki pages (e.g. enwiki), that prominently links to https://www.mediawiki.org/wiki/API:Web_APIs_hub (could be improved)
aude: wikidata also has a "Data access" link in the footer: https://www.wikidata.org/wiki/Wikidata:Data_access (which i think helps and people do notice)
corey: fetching Wikipedia+Commons from the same api is convenient, but maybe less common for wikipedia+wiktionary
Andrew Otto: "One API": does that mean wikipedia.org/api and wikinews.org/api are similar, or that we migrate to rest.wikimedia.org/{domain}/api?
gabriel:

ariel: session about dumps https://etherpad.wikimedia.org/p/WikiDev16-T114019, main way users get data is dumps and datasets. dumps get slower and slower to run and old architecture is very difficult to maintain. would be good to toss the old system and have something new.
also to make incremental dumps available and feeds
Gabriel: We also have html dumps.
ariel: multiple formats coming out from different sources. makes it harder to export with different frequencies and formats. html dumps are a good example of this issue.
Gabriel: .. can be updated in place ..
Scott MacLeod: what is the strategy for translation moving forward with, for example, Wikipedia, Wikidata, Content Translation and Wiktionary, say from a Wikipedia article in Mandarin to a Wikipedia article in English or from a CC MIT OCW course in Mandarin to one in English, and especially using Wiktionary and Content Translation? is this about API unification?
Gabriel: we need to ensure site data access policy is also handled
aude: incremental updates to dumps needs to be addressed (openstreetmap is pretty good example of how incremental dumps can work, with structured changesets stored in the database, tools to apply them to your copy of osm data. and the dumps + stable rest api have allowed proliferation of third parties using osm and tools available; approach could at least work for wikidata)
Gabriel: there's a connection between distribution and the APIs we use
Adam: how are we considering domain name choices?
krinkle: with restbase.wikimedia.org, can have one domain and could even share varnish cache (use case countervandalism project, makes api request to many different wikis based on recent changes, still uses api.php but could use restbase and maintain a connection to the api)
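
As an illustration of aude's point about incremental dumps above, the sketch below applies an ordered stream of changes to a local snapshot instead of re-downloading a full dump. Everything here is hypothetical: the `Change` record, its fields, and the `apply_changes` helper are invented for this example and do not describe the OpenStreetMap changeset format or any existing Wikimedia dump format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Change:
    """One hypothetical entry in an incremental dump."""
    seq: int                    # monotonically increasing sequence number
    page_id: int
    op: str                     # "upsert" or "delete"
    text: Optional[str] = None  # new content, required for "upsert"


def apply_changes(snapshot: Dict[int, str], changes: Iterable[Change], last_seq: int) -> int:
    """Apply changes newer than `last_seq` in order; return the new last_seq.

    Changes at or below `last_seq` are skipped, so re-applying the same
    incremental file is harmless.
    """
    for change in sorted(changes, key=lambda c: c.seq):
        if change.seq <= last_seq:
            continue
        if change.op == "upsert":
            if change.text is None:
                raise ValueError(f"upsert without text at seq {change.seq}")
            snapshot[change.page_id] = change.text
        elif change.op == "delete":
            snapshot.pop(change.page_id, None)
        else:
            raise ValueError(f"unknown op {change.op!r} at seq {change.seq}")
        last_seq = change.seq
    return last_seq


local_copy = {1: "first draft", 2: "to be removed"}
incoming = [
    Change(seq=3, page_id=2, op="delete"),
    Change(seq=2, page_id=1, op="upsert", text="second draft"),
    Change(seq=1, page_id=1, op="upsert", text="already applied"),
]
new_seq = apply_changes(local_copy, incoming, last_seq=1)
print(new_seq, local_copy)
```

Tracking a sequence number on the consumer side is the design choice that lets a third party poll for small updates at its own pace, which is the property the notes attribute to the OSM approach.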

Per T124504: Transition WikiDev '16 working areas into working groups, we should still probably have a working group centered on the central question of this area:
"how do we make accessing and distributing our data easier and more useful?". I think it would be foolish to conflate that question with the central question of T119022, since I think both represent very difficult problems that each deserve potentially different expertise, interest, and passion to tackle. @GWicke, is this still an area you want to lead? Everyone else, do you feel comfortable following GWicke's lead?

@GWicke: Could you please answer the last comment? Thanks in advance.

@GWicke: Could you please answer RobLa-WMF's last comment? Thanks in advance!

No answers to T119029#1970858 so I'm proposing to close this rotting task about a conference session that already took place half a year ago.

@Aklapper , I'm also very frustrated about the lack of response from @GWicke on this task. Do you think it'd be productive to ask @GWicke in a different context?

It might be fine to close this task, though I'd offer the caveat that I believe a similar area should exist for WikiDev 17.

What about both of you just having a chat face to face in the WMF office? :)

...and finally declining the task to reflect reality.

Feel free to reopen if you plan to follow up on this task.