
Re-implement reading lists REST interface outside RESTbase
Open, LowPublic

Description

The ReadingLists extension currently implements an action API module (see ReadingLists), which is exposed through a REST API in RESTbase that also adds summaries to the output (see https://github.com/wikimedia/restbase/blob/master/v1/lists.js#L159 in lists.js).

To unblock the sunsetting of RESTbase, the ReadingLists extension should implement its REST API in PHP and expose it through rest.php. This should be relatively easy by sharing code with the action API modules that are already implementing the logic.

The public endpoint URLs should be kept intact by applying the appropriate routing/rewrites in ATS or an API gateway. The new endpoint should follow the spec of the existing REST endpoint, see https://github.com/wikimedia/restbase/blob/master/v1/lists.yaml and https://en.wikipedia.org/api/rest_v1/#/Reading%20lists


Event Timeline

This should be relatively easy by sharing code with the action API modules that are already implementing the logic.

Not really. The RESTBase API fetches a list of (wiki ID, page ID) pairs from the action API (it doesn't matter which wiki), then sends parallel requests to the RESTBase summary endpoint to get the page titles and summaries (I think those are cached internally in RESTBase). I don't see a straightforward way of doing that in MediaWiki. It would probably require porting the summary endpoint for starters (which would have its benefits, e.g. T213505: RfC: OpenGraph descriptions in wiki pages, but again not straightforward), or at least building a persistent store of summaries that's not internal to RESTBase.

Not really. The RESTBase API fetches a list of (wiki ID, page ID) pairs from the action API (it doesn't matter which wiki), then sends parallel requests to the RESTBase summary endpoint to get the page titles and summaries (I think those are cached internally in RESTBase).

Ah, I missed the hydration aspect. I updated the task description to mention it.

We can call the summary endpoint from inside MediaWiki. That would be the simplest solution, though it may result in a MW -> summary -> MW loop if the summary endpoint needs to fetch HTML from parsoid, and parsoid output is no longer cached in RESTbase.

Or we re-implement this as a standalone Node service instead.

@Jgiannelos do you have thoughts on this?

We can call the summary endpoint from inside MediaWiki. That would be the simplest solution, though it may result in a MW -> summary -> MW loop if the summary endpoint needs to fetch HTML from parsoid, and parsoid output is no longer cached in RESTbase.

There is no batch summary endpoint, and a list can include hundreds of items. Probably no client needs to show hundreds of items at the same time but even tens of parallel requests seem problematic. RESTBase gets around this problem because summaries are cached, plus the calls to the summary endpoint are internal and not actual web requests; so in practice it boils down to the reading lists endpoint doing a bunch of Cassandra lookups, parallelized by Node.js.
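To make the constraint concrete, here is a minimal sketch (purely illustrative) of what a reimplementation would face: fanning out many summary lookups while keeping the number of in-flight requests bounded. `stub_fetch` stands in for a real HTTP call to the summary endpoint.

```python
import asyncio

async def fetch_summaries(titles, fetch, limit=10):
    # Bound the fan-out: at most `limit` lookups in flight at once.
    sem = asyncio.Semaphore(limit)

    async def bounded(title):
        async with sem:
            return await fetch(title)

    # gather() preserves input order in its results.
    return await asyncio.gather(*(bounded(t) for t in titles))

in_flight = 0
peak = 0

async def stub_fetch(title):
    # Stand-in for an HTTP request; records peak concurrency.
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0)  # yield, as a real network call would
    in_flight -= 1
    return {"title": title}

results = asyncio.run(
    fetch_summaries([f"Page_{i}" for i in range(100)], stub_fetch, limit=10)
)
```

Even with bounded concurrency, 100 entries per list means many round trips per API call unless the lookups stay as cheap as RESTBase's internal Cassandra reads.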

Or we re-implement this as a standalone Node service instead.

It can't really stand apart from the summary endpoint for the same reason.

API Platform has been asked to look at this, but we're not familiar with this feature or its implementation, so I'd like to take a step back and talk about context. Without being prescriptive about what should replace the current arrangement, can we dig a bit into what we have in place right now, for the benefit of those of us seeing this for the first time?

As I understand things, we have:

  • A user-facing feature in both the Android and iOS apps that allows creating lists of pages. Users might create these lists as bookmarks, things to read later, or whatever other purposes they find useful. Lists are private to the creating user and cannot be seen by others.
  • A public-facing API implemented in RESTBase that powers the feature.
  • Action API endpoints (via the ReadingLists extension) that provide related backend data
  • The RESTBase API depends on (and internally calls) the Action API endpoints
  • The RESTBase API also performs caching of summaries. This is essential to the user-facing feature, as there is (currently) no efficient way to get this data directly on demand.
  • The Action API endpoints may be called either indirectly via the RESTBase API, or directly.

I see a Reading List Service Grafana board but it doesn't look like it is actually gathering data. Data related to the RESTBase portion of the feature is lumped in with the RESTBase graphs, but not in a way that feels helpful for viewing that particular service.

Turnilo suggests that this is a relatively low-traffic feature, and that essentially all hits come from our apps.

The Action API endpoints are also relatively low traffic, but low != 0.

I'm not a metrics wizard, and I feel like I may be missing something in the above links. So if there's a better way to get usage data for APIs related to this feature, please advise.

Who are the interested parties that we should keep in mind? The extension page lists Gergo as maintainer, and he's already here and contributing (thank you!). I see that the Maintainers page lists Content Transform as the Steward (and some of that team's members are already subscribers to this task). I assume that Mobile Apps should at least be advised that we're messing with things, but I don't see them here. I can contact them for a quick chat on their perspective. Anyone else?

I have some early thoughts about requirements for a non-RESTBase implementation, but I'd like to make sure I have the above correct before I get too far ahead of myself there. What am I missing, and what else should we know?

Thanks for looking into this @BPirkle! What you write mostly matches my own superficial understanding.

The RESTBase API also performs caching of summaries. This is essential to the user-facing feature, as there is (currently) no efficient way to get this data directly on demand.

I think that's not quite right - the client can get summaries directly: https://en.wikipedia.org/api/rest_v1/page/summary/Earth. As far as I know, the caching happens in the RESTbase wrapper for the Summary endpoint. If the client was to fetch the summaries directly, it would benefit from that cache. In addition, it would also benefit from varnish caches, which have a good hit rate for summaries, since it's used for popups in the web interface.
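For illustration, the direct summary URL for a given project and title follows a simple pattern (a sketch, assuming the standard `/api/rest_v1/page/summary/{title}` route shown above):

```python
from urllib.parse import quote

def summary_url(domain: str, title: str) -> str:
    """Build the public summary URL for a page, following the pattern of
    https://en.wikipedia.org/api/rest_v1/page/summary/Earth.
    Titles use underscores for spaces and are percent-encoded."""
    normalized = title.replace(" ", "_")
    return f"https://{domain}/api/rest_v1/page/summary/{quote(normalized, safe='')}"
```

A client hydrating its own list entries this way would hit the same summary caches (RESTbase storage plus the edge caches) as the web popups do.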

I assume that Mobile Apps should at least be advised that we're messing with things, but I don't see them here. I can contact them for a quick chat on their perspective.

We should definitely talk to them. Please keep @MSantos in the loop.

The main question is whether we could do the enrichment on the client side.

  • The RESTBase API depends on (and internally calls) the Action API endpoints

Probably worth emphasizing that a single RESTBase API call might involve Action API calls to multiple wikis.

I see a Reading List Service grafana board but it doesn't look like it is actually gathering data.

It uses per-endpoint RESTBase stats, presumably those were dropped at some point?

Who are the interested parties that we should keep in mind?

The API was written by the Product Infrastructure team and inherited by Content Transform via merger. It isn't really related to content transformations.

The Web team wrote a browser extension that uses the action API but it's abandoned and planned to be removed from the extension store. I think they also had plans for a minimal desktop UI, but it was more of a side project? @Jdlrobson would probably know.

As far as I know, the caching happens in the RESTbase wrapper for the Summary endpoint.

RESTBase endpoints can call each other without making an HTTP request, much like the Action API can call itself internally within the same PHP process. So code-wise the caching is handled by the summary endpoint, but in practice the reading lists endpoint still has access to it.

  • The RESTBase API depends on (and internally calls) the Action API endpoints

Probably worth emphasizing that a single RESTBase API call might involve Action API calls to multiple wikis.

Can you elaborate on that?

RESTBase endpoints can call each other without making an HTTP request, much like the Action API can call itself internally within the same PHP process. So code-wise the caching is handled by the summary endpoint, but in practice the reading lists endpoint still has access to it.

Yes, my point was that it would still be there if we re-implemented the ReadingLists endpoints outside RESTbase.

A reading list chunk returned by the API consists of up to 100 (wiki, page title) pairs. Those could be 100 different wikis. The readinglists/readinglistentries action APIs are wiki-agnostic, but you need to add some details to each entry (does it exist? what's the Wikidata description? etc). Currently the summary endpoint does that. If you want to replace that with MediaWiki APIs, you have to make cross-wiki API calls.
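A reimplementation that hydrates via MediaWiki APIs would have to plan those cross-wiki calls itself. A rough sketch of the batching step (the batch size of 50 is an assumption about typical action API limits, not a confirmed figure):

```python
from collections import defaultdict

def plan_batches(entries, batch_size=50):
    """Group (wiki, title) pairs by wiki and split each wiki's titles
    into query-sized chunks, e.g. for action API `titles=` lookups."""
    by_wiki = defaultdict(list)
    for wiki, title in entries:
        by_wiki[wiki].append(title)
    return {
        wiki: [titles[i:i + batch_size] for i in range(0, len(titles), batch_size)]
        for wiki, titles in by_wiki.items()
    }

# A 100-entry chunk spread over two wikis still needs one or more
# API calls per wiki involved.
entries = [("enwiki", f"Page_{i}") for i in range(60)] + [("dewiki", "Berlin")]
plan = plan_batches(entries)
```

The number of outgoing calls scales with the number of distinct wikis on the list, which is what makes the per-list project distribution below relevant.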

Yes, my point was that it would still be there if we re-implemented the ReadingLists endpoints outside RESTbase.

But currently each summary lookup is a direct Cassandra query. If you reimplemented it outside RESTBase, you'd have to do HTTP requests to RESTBase instead. With up to 100 entries per request, that's a pretty significant performance difference.

This could be a problem... To put this in perspective, I dug around and made a histogram of how many different projects are referenced from individual reading lists.

The vast majority of lists use only one project. More than 99.9% use fewer than 10 sites. A total of 5 reading lists use more than 100 wikis.

wikiadmin2023@10.192.48.134(wikishared)> select np, count(*) as n from ( select rle_rl_id, count( distinct rle_rlp_id ) as np from reading_list_entry group by rle_rl_id order by np desc) t group by np;

+-----+---------+
| np  | n       |
+-----+---------+
|   1 | 2008280 |
|   2 |  233705 |
|   3 |   29183 |
|   4 |    5847 |
|   5 |    1879 |
|   6 |     834 |
|   7 |     423 |
|   8 |     253 |
|   9 |     149 |
|  10 |     104 |
|  11 |      52 |
|  12 |      60 |
|  13 |      35 |
|  14 |      31 |
|  15 |      22 |
|  16 |      22 |
|  17 |      16 |
|  18 |       8 |
|  19 |      14 |
|  20 |      10 |
|  21 |       4 |
|  22 |       7 |
|  23 |       3 |
|  24 |       9 |
|  25 |       3 |
|  26 |       4 |
|  27 |       3 |
|  28 |       3 |
|  29 |       2 |
|  30 |       2 |
|  31 |       4 |
|  32 |       5 |
|  35 |       4 |
|  36 |       2 |
|  37 |       4 |
|  40 |       2 |
|  41 |       2 |
|  42 |       1 |
|  46 |       1 |
|  49 |       3 |
|  51 |       1 |
|  53 |       1 |
|  54 |       1 |
|  55 |       1 |
|  56 |       1 |
|  59 |       1 |
|  60 |       1 |
|  65 |       2 |
|  66 |       1 |
|  67 |       1 |
|  76 |       1 |
|  77 |       2 |
|  79 |       2 |
|  80 |       1 |
|  82 |       4 |
|  83 |       1 |
|  84 |       1 |
|  87 |       1 |
|  91 |       1 |
| 166 |       2 |
| 206 |       1 |
| 216 |       1 |
| 272 |       1 |
+-----+---------+

So we will have to make a cost/benefit decision on the long tail...
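For reference, the figures quoted above can be recomputed from the histogram (np = distinct projects per list, n = number of lists):

```python
# Data transcribed from the query output above.
hist = {
    1: 2008280, 2: 233705, 3: 29183, 4: 5847, 5: 1879, 6: 834, 7: 423,
    8: 253, 9: 149, 10: 104, 11: 52, 12: 60, 13: 35, 14: 31, 15: 22,
    16: 22, 17: 16, 18: 8, 19: 14, 20: 10, 21: 4, 22: 7, 23: 3, 24: 9,
    25: 3, 26: 4, 27: 3, 28: 3, 29: 2, 30: 2, 31: 4, 32: 5, 35: 4,
    36: 2, 37: 4, 40: 2, 41: 2, 42: 1, 46: 1, 49: 3, 51: 1, 53: 1,
    54: 1, 55: 1, 56: 1, 59: 1, 60: 1, 65: 2, 66: 1, 67: 1, 76: 1,
    77: 2, 79: 2, 80: 1, 82: 4, 83: 1, 84: 1, 87: 1, 91: 1, 166: 2,
    206: 1, 216: 1, 272: 1,
}
total = sum(hist.values())
share_under_10 = sum(n for np, n in hist.items() if np < 10) / total
over_100 = sum(n for np, n in hist.items() if np > 100)
```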

Putting together the above discussion, combined with some discussion from Slack, plus a few other things I noticed:

  • all the current RESTBase Reading Lists endpoints are marked as "unstable", which gives us some flexibility in making changes (should we choose to do so)
  • /data/lists is deprecated since 2018 and (per the deprecation message) subject to removal without warning
  • all the API endpoints provided by the Reading Lists extension are marked as internal/unstable. I'm not suggesting we change anything, just noting it is possible.
  • if T346739 (Remove 'summary' field from reading list sync structures) proceeds as expected, the summary field will become unnecessary, allowing us to eliminate (or at least reduce) the summary caching concern
  • per the RESTBase summary schema, some additional fields (notably, thumbnail and description) are currently added by the summary hydration. But there are batch Action API endpoints allowing these to be efficiently pulled from MediaWiki (at least, more efficiently than summaries) if we wanted to get that information in different ways.
  • testing of the RESTBase endpoint /data/lists/{id}/entries does not show these additional fields actually being included in the response, although review of the code suggests they should be there. The reasons for this are currently unclear.
  • the summary hydration (and related information) is provided by the "summary service" within RESTBase, which is logically a separate part of the RESTBase code from the Reading Lists portion of the code. It would be technically possible to reimplement this Summary API outside of RESTBase in advance of reimplementing Reading Lists.
  • the cross-project nature of Reading Lists, especially combined with the need for hydration, suggests that it might be a good idea to create an external service to provide equivalents for the Reading Lists endpoints currently implemented in RESTBase. If we instead implement these via MediaWiki APIs, cross-wiki calls may be required. Alternatively, consumers might be able to do without the RESTBase endpoints and make the hydration call themselves, at the costs of pushing complexity to the consumers, and losing some caching opportunities.
  • regardless of what we do with the Reading Lists endpoints currently in RESTBase, the endpoints provided by the Reading Lists MW extension are expected to remain
  • most reading lists in the db use only one project. A very small number of reading lists use dozens or hundreds of projects
  • because the Reading Lists feature is lightly used, it might be desirable to sunset it completely rather than reinventing it. (I'm not advocating for that, but the possibility was mentioned in discussion, so I'm bringing it here.)

Let me know if I misunderstood or misrepresented any of that.

the cross-project nature of Reading Lists, especially combined with the need for hydration, suggests that it might be a good idea to create an external service to provide equivalents for the Reading Lists endpoints currently implemented in RESTBase.

The "especially" is not needed there. The action API is wiki-agnostic, lists are stored in a central DB so you get the same list no matter which wiki you query it on. An external service would only be needed because of hydration (not just summary hydration though, any field that's not in the readinglistentries action API response). Otherwise doing it as a MediaWiki REST API would be straightforward.

testing of the RESTBase endpoint /data/lists/{id}/entries does not show these additional fields actually being included in the response, although review of the code suggests they should be there. The reasons for this are currently unclear.

No idea what's going on here but if the hydration functionality really is broken, and the apps are doing fine, we should just drop it from the code and go with the simpler implementation as above.

the summary hydration (and related information) is provided by the "summary service" within RESTBase, which is logically a separate part of the RESTBase code from the Reading Lists portion of the code. It would be technically possible to reimplement this Summary API outside of RESTBase in advance of reimplementing Reading Lists.

But note that the Summary API only lets you look up one page per web request, but internally (thanks to some magic in the RESTBase routing logic) it's effectively doing batch lookups. Since we need up to 100 summaries per Reading Lists API response, this separate Summary API would probably need an external batch mode.

daniel renamed this task from Re-implement reading lists REST interface in PHP to Re-implement reading lists REST interface outside RESTbase.Sep 25 2023, 4:31 PM

testing of the RESTBase endpoint /data/lists/{id}/entries does not show these additional fields actually being included in the response, although review of the code suggests they should be there. The reasons for this are currently unclear.

Interestingly, the app shows the Wikidata "description" for each entry, as exposed by the "description" action API module in the Wikibase extension. I can't find a call to that in the readinglists code in RESTbase, so I assume it's done client side.

lists are stored in a central DB so you get the same list no matter which wiki you query it on.

I haven't previously encountered use of a central DB (or at least realized that I had encountered it). So for the benefit of anyone else following along who is also unfamiliar with this, here's what I learned:

I see this in InitialiseSettings.php:

'wgReadingListsCluster' => [
	'default' => 'extension1',
],
'wgReadingListsDatabase' => [
	'default' => 'wikishared',
],
'wgReadingListsCentralWiki' => [
	'default' => 'metawiki',
],

I also see an Extension Storage section on this page, which says in part:

The x1 cluster is used by MediaWiki at WMF for databases that are "global" or "cross-wiki" in nature, and are typically associated with a MediaWiki extension.

I also found instructions for connecting to this cluster via the command line:

$ sql wikishared
# Connected to x1.wikishared

Putting all that together, it seems we have a set of database servers (the x1 cluster, aka extension1) that contain (at least) a database (wikishared), dedicated to storing cross-wiki information, mostly on behalf of MW extensions.

The Reading Lists feature uses this cluster/database to store the reading lists, meaning that we can efficiently access these tables from the Reading Lists extension, regardless of which project the associated endpoint is invoked on. In other words, there is no practical difference between hitting one of the Reading List extension endpoints via enwiki vs metawiki vs any other wiki. In all cases, the extension uses a connection to the same x1.wikishared database, and accesses the same set of tables on that database.

On a local development wiki, or on a small third-party wiki, the Reading Lists extension would instead likely use the default database (which in a single-wiki setup is probably the only database). And for a small wiki farm that works just fine, because in most cases traffic is low enough to not need the same separate-cluster setup as we use in WMF's infrastructure. Third parties could, of course, set up an infrastructure similar to ours if they needed to.

For purposes of this task (and repeating what Tgr already said using different words), the fact that Reading Lists are cross-project does not, on its own, compel us to use a separate service. We can already efficiently access the Reading Lists db data via an Action API or MW REST API endpoint. The only thing we're aware of that might push us to a separate service is the (possible) need for hydration.

Those who are familiar with this, please correct anything I got wrong.

I can confirm that apps no longer require hydration of the summary piece in the reading lists responses. The iOS app never used it at all, and the Android app used it as an optimization (because it was there) to display the description and thumbnail of reading list items slightly sooner, before they get populated by our background service that saves the items for offline use. (We have a task T346739 for completely removing it from our code, and there shouldn't be any difficulty, since it's already working without it.)

Interestingly, the app shows the Wikidata "description" for each entry, as exposed by the "description" action API module in the Wikibase extension. I can't find a call to that in the readinglists code in RESTbase, so I assume it's done client side.

IIRC hydrateSummaries returns the entire summary endpoint data (which includes the Wikidata description and a ton of other things). I think I made things confusing by referring to two different things as "summary" but really in the RESTBase parlance "summary" is the collection of various data returned by the summary endpoint, and the slightly reformatted first paragraph of the article is called the "extract" (as it's a reimplementation of TextExtracts).

'wgReadingListsCentralWiki' => [
	'default' => 'metawiki',
],

FWIW that one is not related to the central DB. To quote the variable description: Database name of the central wiki. This is unrelated to data storage (see ReadingListsDatabase for that) and only used to identify which wiki should be used for jobs and such.

Also FWIW, the way central DBs are configured in extensions is not great, and there's a discussion in T330590: External LBs should not be exposed to developers on how to support the use case better.

I reproduced @Tgr's observations on the missing hydration information, by creating a new reading list for myself, then getting my list's entries (all via the production API Sandboxes). I also see none of the hydration information. Then I spent way too long looking at lists.js. My suspicion based on behavior is that something is triggering the catch() on line 190. But I'm not sure it is worth the time to dig further, because it sounds like fixing that, or even understanding the reason it isn't working as expected, isn't required.

@Tgr wrote:

No idea what's going on here but if the hydration functionality really is broken, and the apps are doing fine, we should just drop it from the code and go with the simpler implementation as above.

And @Dbrant wrote:

I can confirm that apps no longer require hydration of the summary piece in the reading lists responses

Given the above two statements, are we comfortable reimplementing the lists endpoints currently in RESTBase via the MediaWiki REST API (aka rest.php)? If not, what doubts do people have and/or what should we look at to remove uncertainty?

Looking a bit more at the implementation, it seems like most of the RESTBase "lists" endpoints just forward to Action API Endpoints.

For example, on one of the simpler ones (setup), I see this RESTBase endpoint:

https://en.wikipedia.org/api/rest_v1/#/Reading%20lists/post_data_lists_setup

forwarding to this Action API endpoint (implemented via the Reading Lists Extension):

https://www.mediawiki.org/wiki/Extension:ReadingLists#readinglists+setup

This appears to be controlled via the RESTBase configuration:

https://gerrit.wikimedia.org/g/mediawiki/services/restbase/+/c8915660aa2025d1695db797088b932b4c1d6215/v1/lists.yaml#72

using the x-request-handler feature of Hyperswitch:

https://www.mediawiki.org/wiki/HyperSwitch

It looks to me like most of the endpoints follow this pattern, albeit with additional parameters. The exception seems to be the GET operation for /lists/{id}/entries/, which has operationId: getListEntries that invokes getListEntries() within RESTBase.

Considering how we'd reimplement this: because RESTBase is a separate service, it actually has to make a call to the Action API over the network. Because the MediaWiki REST API is implemented alongside the Action API within MediaWiki itself, I'm thinking we can skip that extra network call and just invoke the Action API module directly. We already have ActionModuleBasedHandler, so it seems like we could build on that approach. (I think Daniel basically said that in the task description, I'm just rephrasing it to confirm I understand.)
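As a sketch of that wrapping approach, the parameter mapping for the setup endpoint could look roughly like this (the REST request shape and helper name are illustrative, not the actual ActionModuleBasedHandler API; action=readinglists&command=setup is the documented action API form):

```python
def action_params_for_setup(rest_request: dict) -> dict:
    """Translate a hypothetical REST setup request into the parameters
    of the readinglists action API module. In a real handler this
    mapping would also cover errors and response reshaping."""
    return {
        "action": "readinglists",
        "command": "setup",
        "format": "json",
        "token": rest_request["csrf_token"],
    }
```

Most of the endpoints would need similar, mostly mechanical mappings, plus translation of errors and the response structure.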

I'm also assuming we'd add any new MediaWiki REST API endpoints associated with Reading Lists to the Reading Lists extension.

I haven't looked at much RESTBase code. Am I understanding all that correctly? Is there anything in the above that people disagree with or would like to elaborate on?

That's correct as far as I can remember.

You could also just reimplement the Action API endpoints in REST, they don't do much work. The lists and list entries GET endpoints are somewhat complicated because of continuation and use as a generator, but that's going to be different for the REST API anyway.

The batch endpoints in particular are a horrible hack and could be done much more cleanly in REST. Although someone would have to finally write the JSON body validator.

Because the MediaWiki REST API is implemented alongside Action API within MediaWiki itself, I'm thinking we can skip that extra network call and just invoke the Action API module directly. We already have ActionModuleBasedHandler, so it seems like we could build on that approach. (I think Daniel basically said that in the task description, I'm just rephrasing it to confirm I understand.)

That's an option, though it's a little messy with all the mapping of parameters, errors, and the response structure. I'd generally prefer to factor the business logic out of the API modules, and call that logic from both the API module and the REST handler. That would be a bit of an exercise in refactoring, but the code looks pretty nice and clean, so it should not be a problem.
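A sketch of that layering, with a hypothetical service class standing in for the factored-out logic (this is not the real ReadingListRepository interface, just an illustration of both facades sharing one implementation):

```python
class ReadingListService:
    """Stand-in for business logic factored out of the API modules."""

    def __init__(self):
        self._lists = {}
        self._next_id = 1

    def create_list(self, name, description=""):
        list_id, self._next_id = self._next_id, self._next_id + 1
        self._lists[list_id] = {"id": list_id, "name": name, "description": description}
        return self._lists[list_id]


class ReadingListsActionModule:
    """Thin action API facade: translates action API parameters."""

    def __init__(self, service):
        self._service = service

    def execute(self, params):
        created = self._service.create_list(params["name"], params.get("description", ""))
        return {"create": created}


class ReadingListsRestHandler:
    """Thin REST facade over the same service."""

    def __init__(self, service):
        self._service = service

    def post(self, body):
        return self._service.create_list(body["name"], body.get("description", ""))


# Both facades operate on one shared service instance.
service = ReadingListService()
via_action = ReadingListsActionModule(service).execute({"name": "dogs"})
via_rest = ReadingListsRestHandler(service).post({"name": "cats"})
```

The point of the layering is that the facades stay thin enough that deleting the action API modules later removes almost no logic.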

I also note that the API modules are tagged as "internal". We don't need to keep them around for the benefit of third parties. As far as I know, nothing uses these action API endpoints directly; the apps go through RESTbase. Turnilo reports 0 hits for action=readinglists (though it would miss POST requests).

So we could simply copy the API modules, modify them so they become REST handlers, and then delete the action API modules when they are no longer used.

I like the suggestions about replacing the Action API modules. At a high level, it'd probably go something like:

  • review the ReadingListRepository class to see if there's anything we anticipate needing/changing to make the steps below go more smoothly
    • this would also be a good time to review its tests, to see if we notice anything we should add. At a glance, they look good but it never hurts to check.
    • if we make any changes at this stage, release them so that we don't inadvertently build on bugs
  • consider whether there is "business logic" in the Action API modules that could be factored out and reused between the Action API modules and the REST API handlers.
    • if yes, create whatever is helpful (including tests for any new/modified code) and refactor the Action API modules to use it
    • if we make any changes at this stage, release them so that we don't inadvertently build on bugs
  • create REST API handlers corresponding to the RESTBase endpoints, using ReadingListRepository (plus any new classes we may have created) to do most of the work
    • add/refine any pieces of the REST API structure we find we need. For example, JSON body validator.
    • because these endpoints should be call-compatible with the existing RESTBase endpoints, we should be able to run an external test suite against the live RESTBase endpoints and the new live REST API endpoints before rerouting traffic, to ensure we haven't missed anything. Does such a test suite exist? If not, we should create one.
    • we can hopefully run any such test suite against the new endpoints before taking them live, but we'll have to work out what data store they execute against.
  • reroute traffic from the RESTBase endpoints to the new REST API endpoints
    • Hopefully we didn't break anything, but if we did, fix it.
    • In an emergency, we could reroute back to the RESTBase endpoints
  • Wait a bit, just to make sure there are no surprises. As this is a relatively low traffic feature, bugs in edge cases won't necessarily show up immediately.
  • Remove the Action API endpoints from the Reading Lists extension

What did I forget? I'm assuming we'll create call-compatible replacements so that there is no impact on callers. An alternative would be to rethink the contracts currently offered by the RESTBase endpoints. But then we turn this into a larger project requiring coordination across multiple teams, rather than just a RESTBase retirement effort.
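If we build such a comparison suite, one piece it would need is response normalization, so that fields which legitimately differ between the two live systems don't produce false failures. A sketch (the volatile field names are examples, not the actual reading lists schema):

```python
def normalize(value, volatile=("id", "created", "updated")):
    """Recursively drop fields expected to differ between the RESTBase
    endpoints and a reimplementation, so the rest can be diffed."""
    if isinstance(value, dict):
        return {k: normalize(v, volatile) for k, v in value.items() if k not in volatile}
    if isinstance(value, list):
        return [normalize(v, volatile) for v in value]
    return value


def compatible(old_response, new_response):
    # Responses are compatible if they match after stripping volatile fields.
    return normalize(old_response) == normalize(new_response)
```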

I'm about to create a number of subtasks to help API Platform divide up this work. To help manage notification spam, I will not subscribe everyone from this parent task to the subtasks. Feel free to subscribe to any of them you choose.