RfC: Reading List service
Closed, ResolvedPublic

Description

Overview

Reading wants a service for supporting reading lists (lists of wiki articles) on the web interface, to synchronize list data between devices and to provide functionality that requires a backend. This is a Q4 goal for Reading Infrastructure. The primary customers of the project are the iOS, Android and Reading Web teams.

Reading Lists will be a REST service that provides CRUD operations for lists and list items (page titles), the ability to sort lists in arbitrary ways, and support for efficiently syncing list changes. (At a later stage, support for searching the content of the listed pages and for push notifications is also planned. That is not included in this RfC.) Lists are only available to authenticated users and they are private data. These lists are NOT wiki specific and might contain pages from any number of wikis.

The goal of this RfC is to decide on the MediaWiki/RESTBase issue and make sure there are no objections to the high-level implementation plan.

Infrastructure

(An alternative infrastructure proposal has been removed from here after the decision made, and is preserved in T164990#3350956.)
Consumers want a REST API and adding one to MediaWiki would be a prohibitively large project. Writing an external service that can interface with authentication data directly would be likewise too complicated. Thus, the service will have to involve parts in MediaWiki and RESTBase both.
The service is implemented as action API module(s) in a MediaWiki extension. Summaries are fetched from RESTBase. Requests are proxied through a minimal RESTBase-based service to add REST semantics and versioning.
(Alternatively, fetching the summaries might be handled in the RESTBase part if that turns out to be significantly more performant.)

Pros:

  • Can reuse existing MediaWiki logic and infrastructure (auth, DB handling, API helpers etc)
  • Can take advantage of extra action API functionality if needed (batching, generators, OAuth etc)
  • All the elements (creating a MediaWiki extension, writing action API modules, creating a RESTBase service for wrangling MediaWiki API calls) are well understood, achievable, and easy to estimate.
  • Can be more easily reused by third parties (if they don’t need REST format)

Cons:

  • Two codebases to maintain

API syntax

Opt in/out

Users must opt in explicitly before they can use the API, to spare storage space (of default lists) and bandwidth (of push notifications). Users can opt out again if they want to delete their data.

  • POST /lists/reading/setup: enables reading lists for the user (creates default list, sets up push notifications). A prerequisite for everything else.
  • POST /lists/reading/teardown: disables reading lists, deletes data

List CRUD

A list consists of an id, a name, a creation date, some metadata (description, image etc), and an array of entries.

  • GET /lists/reading/: get all lists for the user.
  • POST /lists/reading: creates a new list
  • PUT /lists/reading/{list_id}: updates a list
  • DELETE /lists/reading/{list_id}: deletes a list

List entry CRUD

An entry consists of an id, a project (wiki domain), a title and a creation date. The summary of the page (as given by the summary API) is also included.

  • GET /lists/reading/{list_id}/entries: gets all entries.
  • POST /lists/reading/{list_id}/entries: adds a new entry to the list
  • DELETE /lists/reading/{list_id}/entries/{entry_id}: deletes an entry

Sorting

Sorting works by getting/setting an array of all list / list entry ids. The number of lists per user / entries per list will be capped at some high but reasonable number so these arrays cannot grow too long.

  • GET /lists/reading/list_order: get order of lists
  • PUT /lists/reading/list_order: set order of lists
  • GET /lists/reading/{list_id}/entry_order: get order of list entries
  • PUT /lists/reading/{list_id}/entry_order: set order of list entries

Misc

  • GET /lists/reading/pages/{project}/{title}: gets which lists a the given page is a member of
  • GET /lists/reading/changes/since/{date}: gets all lists and entries which changed since the given date (for device sync). Returns a marker for deleted lists/entries as well.

Action API syntax

(in case option 1 is chosen above)

  • list=readinglists to get all lists of the current user
    • rlprop=entries to enable/disable returning the entries together with the list
    • rllimit, rlentrylimit
    • list=readinglists&rllist={list_id} to get the entries of a single list
    • list=readinglists&rlproject={project}&rltitle={title} to get the lists containing a given page
  • list=readinglistchanges&rlcsince={date} to get changed/deleted entries/lists
  • action=readinglists&command=[setup|teardown|create|update|delete|createentry|deleteentry|order] for all the write operations on lists. (This is slightly against action API conventions which would put all "commands" as separate actions, which IMO would make the help interfaces unnecessarily hard to use.)
  • action=readinglistentries&command=[create|delete|order] for all the write operations on list entries.

Data storage

Reading lists contain primary data (cannot be regenerated from other sources and losing it would have a major UX impact), and data needs to be fetched based on criteria other than the id (e.g. all lists containing a given page, all entries which have changed after a given date) so MariaDB will be used:

reading_list table:

  • rl_id
  • rl_user_id: central ID of the user
  • rl_is_default: flag to tell apart the initial list from the rest, for UX purposes and to forbid deleting it
  • rl_name: human-readable name, non-unique
  • rl_description
  • various other metadata: color, image, icon (TBD: should these be in a separate key-value table to make adding/removing metadata types easier?)
  • rl_date_created
  • rl_date_updated
  • rl_deleted (we need soft-delete for sync)
  • indexes: (rl_user_id, rl_date_updated) for the /since/ route, (rl_user_id, rl_deleted) for getting all

reading_list_entry table:

  • rle_id
  • rle_list_id
  • rl_user_id: central ID of the user (denormalized for the benefit of the /pages/ route)
  • rle_project: wiki project domain (TBD: use a lookup table / some other way of compression?)
  • rle_title: page title (can't easily use page ids due to the cross-wiki nature of the project. Also, page ids don't age well when content is deleted/moved.)
  • rle_date_created
  • rle_date_updated
  • rle_deleted
  • indexes: (rle_list_id, rle_date_updated) for the /since/ route and all entries in a list, (rle_user_id, rle_project, rle_title) for the /pages/ route

reading_list_sort_index table:

  • rlsi_rl_id
  • rlsi_index
  • indexes: (rlsi_index)

Deleted lists / entries will be purged by a periodic job when they are older than X days (so that other devices have time to sync the deletion).

TBD: are all the indexes worth it? We don't expect a single user to have lots of lists / a single list lots of items

Schedule and cross-dependencies

Q4: set up an MVP on the beta cluster.
Dependencies: Services and Tech Ops for consulting, Security for review,

Q1: integrate with apps, deploy to production. Add download sizes (will be handled by the page summary API).
Dependencies: Android/iOS for testing and iterating, Services for download sizes.

Q2: improve app integration, add search.
Dependencies: Discovery (Backend) for figuring out search (possibly on multiple wikis, in a limited set of pages). Android/iOS for continued iteration.

Usage projections

Android has a similar feature (except you don't need to be logged in), used by 10% of users. That results in 1500 list write operations per hour (the peak is around 2500). There are 6M Android users a month, and <1M iOS users. There are about 300K active logged-in web users a month on enwiki (counting user_touched - not very reliable since we have extended login durations to a year), and the churn for that group is nearly 100%. Enwiki tends to be about half of everything.
Put together, that suggests somewhere around 0.1 writes/sec (maybe add a magnitude to that in case throngs of readers register to use the new feature - this will be the first time we heavily promote registration to readers).

Syncing is done via a push model, so read volume should be below write volume.
TBD - apps store lists locally and will only use the API occasionally, for syncing. Can web do that? Otherwise, we might have way more reads.

Average number of pages per user is ~25 on Android, so assuming usage level is same across all device types and login rate does not grow exponentially, we can expect ~2.5M rows to be actively used at one time. Assuming 100% user churn and 200 byte per row that's 6GB storage space every year. (Keep in mind this is a very handwavy guesstimate as we have no way to know what fraction of users will actually use lists / how many readers will register an account just because of this.)

All of these estimates refer to Q4 2017-18, which is when the feature is enabled on the web if all goes according to plan. Before then, it will only be used by apps which will probably result in significantly smaller usage.

See also

For previous conversations and far more detail see the Technical Plan, T164805, T164808, T164236.
The (very different) former proposal for reading lists can be found at T128602: Create and deploy an extension that implements an authenticated key-value store..

Related Objects

There are a very large number of changes, so older changes are hidden. Show Older Changes
Tgr renamed this task from [DRAFT] RfC: Reading List service to RfC: Reading List service.May 16 2017, 11:27 PM
Tgr updated the task description. (Show Details)
bearND added a comment.Jun 1 2017, 6:59 PM

In light of T166734 do we want to change the schema to allow for lists of lists?

@bearND hashing out on the other ticket

bearND added a comment.Jun 1 2017, 8:05 PM

@Fjalapeno Sounds good to me. Thanks!

The task is ready for comments. Please if you have any feedback questions, you can provide them now.

Thanks!

FWIW

they are private data

Why? (Can be solved though, see below.) Avoid user input if possible, to reduce hosting issues.

A list consists of an id, a name, a creation date, some metadata (description, image etc), and an array of entries.

Sounds easy to fit into a user subpage.

Reading lists contain primary data (cannot be regenerated from other sources and losing it would have a major UX impact)

Nice, same expectation as wiki pages.

This has to be integrated into mediawiki or other existing service
This has to live on x1 [...] or, alternatively, be split into per-wiki storage.

Meta-Wiki's user namespace is quite spacious and can happily host JSON blobs listing content from other wikis.

Authentication is the most interesting part, I think. It would be nice to be able to register background/shadow accounts so that the user has some stable identifier and space on the wiki, connected to some cookie or other token, even without a full-fledged identity (username, password and email). A bit similar to the guest/pseudo-unregistered accounts on StackExchange.

This comment was removed by Nemo_bis.
Fjalapeno added a comment.EditedJun 7 2017, 5:21 PM

Sounds easy to fit into a user subpage.

While it could be fit onto a sub page, using a database makes it easier to do queries for operations such as push notifications. Also if we want to do queries on the pages and understand usage - it is a way for us to find very popular pages (aside from page views).

Additionally, with queries we can deliver results to mobile clients more efficiently by querying the DB for just changes and only sending the change set to users. This is important for users with low bandwidth and intermittent connections.

One other thing is that we want to make these cross wiki lists, A user page is usually associated with a particular project (it could be put somewhere like meta, but that may not be best for discoverability).
(I just realized this point could be emphasized more strongly in the RFC above, so I will do that now)

Authentication is the most interesting part, I think.

Yeah we have a big emphasis on keeping these private - as we want users to use this feature and be sure that their usage and reading preferences are not visible publicly. By storing this in a DB table that can be joined against the central auth table, we can ensure that users data can only be accessed by the actual logged in user.

Fjalapeno updated the task description. (Show Details)Jun 7 2017, 5:22 PM

@Nemo_bis forgot to ping you above with comments ^

Jdlrobson added a comment.EditedJun 7 2017, 5:50 PM

There are legitimate uses of private data for a reading list. I don't particularly want to share with people what I am reading without any context. That's my business. (Edit: also if we surface reader's public reading lists we are basically publishing information about their interests to marketers ... not cool!)

But yes if we want to make lists public a user namespace would be fine but at the expense of many features e.g. show me every list which features the page Barack Obama would not be possible

daniel moved this task from Inbox to Request IRC meeting on the TechCom-RfC board.Jun 7 2017, 8:22 PM
bearND updated the task description. (Show Details)Jun 8 2017, 4:24 PM

I saw the post on wikitech-l regarding this Rfc, but we haven't gotten much feedback there. I strongly suggest we create a subtask for this work to reach out to communities, particularly our more active communities, for some feedback. Early and frequent communications is a best practice for things like this. :)

If ya'll would like some liaison support on this, in the related task please use the following as a template to clearly define the scope of your request. This way the CL team can help prioritize work.

  • What is the problem?
  • How does success of this task look like? How do we know when we are done?
  • Is there any goal, program, project, team related with this request? Please provide links and use the corresponding Phabricator tags when available.
  • What is your expected timeline from start to end? Is there a hard deadline?
Tgr added a comment.Jun 14 2017, 7:01 PM

they are private data

Why?

Because that's the product specification this RfC is aiming to fulfill. This is not a good place to disagree with it. (I do not mean to shut down that line of discussion, just pointing out - somewhat belatedly - that this is a technology discussion that's not probably not being followed closely by the right people - neither researchers/UX designers who could explain what the user needs are, nor decisionmakers whom you could try to convince that the fundamental concept should be changed.)

Authentication is the most interesting part, I think. It would be nice to be able to register background/shadow accounts so that the user has some stable identifier and space on the wiki, connected to some cookie or other token, even without a full-fledged identity (username, password and email). A bit similar to the guest/pseudo-unregistered accounts on StackExchange.

This is not really related to reading lists, but if you are interested, there is a similar task: T133452: Create temporary accounts for anonymous editors

Tgr added a comment.Jun 14 2017, 7:21 PM

@CKoerner_WMF: outreach for technoogy RfCs is usually aimed at the developer community, wikitech-l is IMO the best place for that. Wider community outreach about use cases / product goals is nice (and hopefully already happened or has been planned) but it's product goals rather than implementation plans that's relevant to these communities.

GWicke added a comment.EditedJun 14 2017, 8:40 PM

This RFC was scheduled for today's IRC meeting. Daniel announced this on wikitech-l, but unfortunately none of us announced it here in time. I hope some of you can still make it to today's meeting on #wikimedia-office!

Tgr updated the task description. (Show Details)Jun 15 2017, 10:37 AM

The RfC was discussed int the IRC office hour E615: ArchCom RFC Meeting: Reading List service (2017-06-14, #wikimedia-office). Option 1 was accepted and the RfC entered Final Comment Period.

I'm removing Option 2 for ease of readability and preserving it below for posteriority:

We are considering two options (with a preference for option 1):

Option 1: a MediaWiki API module with a RESTBase proxy

The service is implemented as action API module(s) in a MediaWiki extension. Summaries are fetched from RESTBase. Requests are proxied through a minimal RESTBase-based service to add REST semantics and versioning.
(Alternatively, fetching the summaries might be handled in the RESTBase part if that turns out to be significantly more performant.)

Pros:

  • Can reuse existing MediaWiki logic and infrastructure (auth, DB handling, API helpers etc)
  • Can take advantage of extra action API functionality if needed (batching, generators, OAuth etc)
  • All the elements (creating a MediaWiki extension, writing action API modules, creating a RESTBase service for wrangling MediaWiki API calls) are well understood, achievable, and easy to estimate.
  • Can be more easily reused by third parties (if they don’t need REST format)

Cons:

  • Two codebases to maintain

Option 2: A RESTBase service using MediaWiki for authentication

The service is implemented in RESTBase. To authenticate requests, it makes a call to the globaluserinfo API in MediaWiki and copies all cookies from the client request / copies Set-Cookie headers >back to the response.

Pros:

  • Probably more performant (the MW API setup costs will be incurred anyway because of the authentication needs, but other work could be done in parallel)
  • Might take advantage of features that are not available in MediaWiki due to legacy limitations (e.g. MySQL clustering)

Cons:

  • Not a good match to Reading Infrastructure staffing (@Tgr has little experience with Node)
Tgr updated the task description. (Show Details)Jun 15 2017, 10:39 AM
Anomie added a subscriber: Anomie.Jun 15 2017, 2:21 PM

Since @Tgr asked me to look at the Action API part of this, here's what I see at first read-through of that section (and the DB schema section).

  • list=readinglists to get all lists of the current user
    • rlprop=entries to enable/disable returning the entries together with the list
    • rllimit, rlentrylimit
    • list=readinglists&rllist={list_id} to get the entries of a single list
    • list=readinglists&rlproject={project}&rltitle={title} to get the lists containing a given page

This seems like you're having one module trying to do too many things, like the now-deprecated list=deletedrevs. I'd recommend instead:

  • meta=readinglists to return the user's lists, but not the entries.
    • rlproject and rltitle would be here, as described.
    • It would only need one rllimit parameter.
    • rlprop would be limited to data from the reading_list database table.
  • list=readinglistentries to return the entries in one or more lists.
    • rlelist would be a required parameter. You could easily make it multi-valued.
    • It would only need one rlelimit parameter.
    • rleprop would control what data about each entry is returned, mainly from the reading_list_entry table.
    • It could be used as a generator. Note if multiple lists are specified for grlelist there would be no way to tell which list contained each generated title; I mention this to set expectations.

Let's also consider the database interaction here.

  • I assume reading_list.rl_id and reading_list_entry.rle_id are the PKs, and we're relying on InnoDB's implicit appending of the PK as we do in core.
  • Your proposal without rlprop=entry, and my proposal for meta=readinglists, both without the rlproject or rltitle filters, would seem to make a query like SELECT ... FROM reading_lists WHERE rl_user_id = ? AND rl_deleted = 0 ORDER BY rl_id. That seems well-supported by the indexes provided.
  • Both proposals with the rlproject or rltitle filters would seem to make a query like SELECT ... FROM reading_list_entry JOIN reading_lists ON (rle_list_id = rl_id) WHERE rle_user_id = ? AND rle_deleted = 0 AND rle_project = ? AND rle_title = ? AND rl_deleted = 0 ORDER BY rl_id.
    • The (rle_user_id, rle_project, rle_title) index would mostly work for that, although it'd fall over like T97797 if someone has a lot of entries for the project+title with rle_deleted != 0 or rl_deleted != 0.
    • A (rle_user_id, rle_deleted,rle_project, rle_title,rle_list_id) index would fit the query a bit better. In that case, you'd also change it to ORDER BY rle_list_id.
    • For that matter, an index on (rl_deleted) (or (rl_deleted,rl_id) if you don't want to rely in implicit appending) would fit the JOIN here better.
  • Your proposal with rlprop=entries would seem to make a query like SELECT ... FROM reading_lists JOIN reading_list_entry ON (rle_list_id = rl_id) WHERE rl_user_id = ? AND rl_deleted = 0 AND rle_deleted = 0 ORDER BY rl_id, something.
    • If "something" is rle_date_updated, rle_id, it has an index to mostly support it. Same potential T97797 issue though. If "something" is anything else, it seems to be lacking useful indexes.
  • My list=readinglistentries would make a query like SELECT ... FROM reading_list_entry WHERE rle_list_id IN (?) AND rle_deleted = 0 ORDER BY rle_list_id, something.
    • Same comment as the previous bullet regarding indexes.
    • It'd also do a SELECT rl_id FROM reading_list WHERE rl_user_id = ? AND rl_deleted = 0 AND rl_id IN (?) beforehand to validate the input list IDs and issue warnings for any that don't exist.

I see you have a reading_list_sort_index table in there too, but I have no idea what you might be intending to use it for. Any likely query trying to use it would probably filesort, or would have to fetch and filter a huge number of rows.

  • list=readinglistchanges&rlcsince={date} to get changed/deleted entries/lists

Similar action API modules would have rlcstart, rlcend, and rlcdir parameters rather than a single rlcsince. But your DB schema doesn't really support that since it doesn't have a "recentchanges"-like table.

I'm not sure what queries you're intending here, so it's hard to say how well they match up with the schema. My best guess is that you'd do two queries:

  • SELECT ... FROM reading_list WHERE rl_user_id = ? AND rl_date_updated >= ? ORDER BY rl_date_updated, rl_id to get the changed lists and their metadata.
  • Then, using the IDs from the previous query, SELECT ... FROM reading_list_entry WHERE rle_list_id IN (?) AND rle_date_updated >= ? ORDER BY rle_list_id, rle_date_updated, rle_id to fetch the changed entries.

That seems supported by the indexes.

  • action=readinglists&command=[setup|teardown|create|update|delete|createentry|deleteentry|order] for all the write operations on lists. (This is slightly against action API conventions which would put all "commands" as separate actions, which IMO would make the help interfaces unnecessarily hard to use.)

It's not a bad plan if you make your command be submodules and use ApiModuleManager, then the help should work out logically. Existing examples to look at:

  • ApiMain's use of it for action. Unfortunately ApiMain has a lot of extra code you'd need to ignore since it's the main entry point.
  • ApiQuery's use of it for prop, list, or meta. The code here is complicated by these all being multi-valued parameters, and ApiQueryBase is complicated by a lot of methods for building SQL queries.
  • ApiFlow's use for submodule is most like what you'd be doing here, but it's complicated by having query-type submodules in addition to edit-type submodules so it has a bit of a mess in ApiFlowBasePost and ApiFlowBaseGet.

BTW, there's nothing saying you'd have to have a ApiReadingListBase class, although it can be helpful for type checking and for any commonalities in the commands.

"setup" and "teardown" seem like odd commands though. Why do these exist, instead of things being automagically set up for a user to use reading lists?

  • action=readinglistentries&command=[create|delete|order] for all the write operations on list entries.

This seems redundant to action=readinglists&command=[createentry|deleteentry|order].

I saw the post on wikitech-l regarding this Rfc, but we haven't gotten much feedback there. I strongly suggest we create a subtask for this work to reach out to communities, particularly our more active communities, for some feedback. Early and frequent communications is a best practice for things like this. :)

If ya'll would like some liaison support on this, in the related task please use the following as a template to clearly define the scope of your request. This way the CL team can help prioritize work.

  • What is the problem?
  • How does success of this task look like? How do we know when we are done?
  • Is there any goal, program, project, team related with this request? Please provide links and use the corresponding Phabricator tags when available.

@Ckoerner thanks for the offer of support! As @Tgr mentioned the product side, specifically the Android team would probably most benefit from your assistance in fostering conversation with the community. I'll set up a task for that! (Also note: there is a lot of discussion around Reading Lists on the Readers team as a whole - but Android is the first client to be moving forward with implementing sync)

  • What is your expected timeline from start to end? Is there a hard deadline?

We will be starting development very soon - expect the Android app to start beta testing next quarter. The timeline for rolling out to production will be set based on that.

daniel moved this task from Request IRC meeting to Last Call on the TechCom-RfC board.
daniel added a subscriber: daniel.

This RFC has entered the Last Call period. If no new and pertinent concerns remain unaddressed by July 5, this RFC will be approved for implementation.

@jcrespo Any final requests regarding the DB schema?

Any final requests regarding the DB schema?

Not really, as long as the https://www.mediawiki.org/wiki/Development_policy#Database_policy is followed. The main issue here is that there is currently no resources assigned for this, so implementation will be held to the higher of standards in terms of performance expected.

On July 5th, this RFC has been approved for implementation after passing the a Last Call period without any pertinent issues being raised.

OUt of curiosity: what's the reasoning behind using namespace+title to refer to pages, instead of the page ID? IS this a product level decision? Both approaches have advantages, the behavior differs when pages are moved/renamed.

@daniel we went back and forth on this a bit. I originally proposed ids, but in the end we went with titles… basically it came down to a few issues:

  1. page ids are not unique across wikis. So we would still need 2 properties.
  2. page ids, like titles, can change (when a page is deleted and re-added).

So basically, it is just "pick your poison". And in that case, it seems going with convention is the better option.

I personally would love to go with IDs for this and all features. But I think before it makes sense we should to come up with a way to give unique ids to all pages across all WMF projects that don't change when moved, deleted, re-added.

Tgr added a comment.Jul 10 2017, 6:13 PM

ReadingLists are meant to be cross-domain (pages from multiple wikis in the same list) so the API wouldn't be able to turn page IDs into titles. We could generate URLs from the page IDs directly but the UX for that is not so good currently (see T73578 for example).

It is my intention to, at some point, to convert titles into first-class entities, give them some ids and that way reduce our storage needs and query latency (specially for *link tables). Not happening soon, though.

@daniel we went back and forth on this a bit. I originally proposed ids, but in the end we went with titles… basically it came down to a few issues:

  1. page ids are not unique across wikis. So we would still need 2 properties.

Since both are per-wiki, the cross-wiki nature is not an argument for or against IDs.

  1. page ids, like titles, can change (when a page is deleted and re-added).

As far as I know, this is no longer true. I think it was fixed a year or two ago. But even it if was true, it happens orders of magnitude less often than page moves.

I personally would love to go with IDs for this and all features. But I think before it makes sense we should to come up with a way to give unique ids to all pages across all WMF projects that don't change when moved, deleted, re-added.

It's important to note that the two options are not semantically equivalent. They cause different behavior when pages are moved. Whether IDs or titles are used should be decided based on which behavior is desired. Technical considerations should be secondary to that decision (though of course they cannot be ignored).

Tgr added a comment.Jul 11 2017, 1:05 PM

ID change on page undeletion was fixed a while ago (T28123). Deletion and normal recreation still changes the ID (and attempts to use the old ID yield a badtitle error). Revision undeletion can also mess with the ID, although that's probably too fringe to care about.

Pages moves only result in different behavior if the page is moved without leaving a redirect, or moved multiple times without fixing all redirects, or the redirect gets edited away; those are also rare, and the correct behavior depends on user intent and is impossible to guess.

In any case, I think all those issues are minor compared to the fact that we don't support page ID based retrieval very well (nothing in the RESTBase ecosystem supports it, for example).

@Tgr thanks… I forgot to mention the summary lookup in Cassandra which doesn't support ids

Tgr added a comment.Jul 20 2017, 9:55 PM

@Anomie: thanks for the review! Sorry for taking so long to get back, got distracted by another project.

! In T164990#3351439, @Anomie wrote:
This seems like you're having one module trying to do too many things, like the now-deprecated list=deletedrevs.

Mostly I just wanted to be able to get both lists and entries in a single request to speed things up. But I guess it's too early for trying to optimize things, and using list entries as generators is certainly a nice idea (even if less useful on wikifarms since lists can include pages from other wikis).

  • The (rle_user_id, rle_project, rle_title) index would mostly work for that, although it'd fall over like T97797 if someone has a lot of entries for the project+title with rle_deleted != 0 or rl_deleted != 0.
  • A (rle_user_id, rle_deleted,rle_project, rle_title,rle_list_id) index would fit the query a bit better. In that case, you'd also change it to ORDER BY rle_list_id.

We'll probably limit the number of lists / entries a single user to something like hundreds or thousands, and deleted rows will be hard-deleted after a certain time so T97797-type issues probably wouldn't come up, but you are right that there is no point in not including the deleted fields in the index when it exists for this single use case anyway.

  • Your proposal with rlprop=entries would seem to make a query like SELECT ... FROM reading_lists JOIN reading_list_entry ON (rle_list_id = rl_id) WHERE rl_user_id = ? AND rl_deleted = 0 AND rle_deleted = 0 ORDER BY rl_id, something.
    • If "something" is rle_date_updated, rle_id, it has an index to mostly support it. Same potential T97797 issue though. If "something" is anything else, it seems to be lacking useful indexes.

Probably just rle_id. We want deterministic ordering for continuation (although in practice we might limit the number of lists / items strictly enough that there will be no need), beyond that the client has no reason to care. This mode will be used for initial syncing of all data. Or maybe join+sort on reading_list_sort_index, for the benefit of web clients.

I see you have a reading_list_sort_index table in there too, but I have no idea what you might be intending to use it for. Any likely query trying to use it would probably filesort, or would have to fetch and filter a huge number of rows.

It's so that users can reorder tables. Since reordering typically affects lots of rows (e.g. the user drags the bottom list to the top -> all sortkeys change), putting them into a separate table and using mass delete / multi-row insert seemed less trouble than trying to do a multi-row update with different values for each row. The query will be like SELECT ... FROM reading_list JOIN reading_list_sortkey ON rl_id = rls_rl_id WHERE rl_user_id = ? ORDER BY rls_index. Since the number of lists/entries for a single user will probably be limited, filesorting is not a huge concern. If we do end up allowing lots of rows per user, we could denormalize rl_user_id into reading_list_sort_index (and rle_rl_id into the similar reading_list_entry_sort_index, which I forgot to include here but will also be needed), then the select could use a single index. But allowing many lists/entries would open up all kinds of other issues (e.g. apps with limited storage and memory would have to take precautions not to receive a huge number of lists) so limiting seems like less hassle.

  • list=readinglistchanges&rlcsince={date} to get changed/deleted entries/lists

Similar action API modules would have rlcstart, rlcend, and rlcdir parameters rather than a single rlcsince. But your DB schema doesn't really support that since it doesn't have a "recentchanges"-like table.

I'm not sure what queries you're intending here, so it's hard to say how well they match up with the schema. My best guess is that you'd do two queries:

  • SELECT ... FROM reading_list WHERE rl_user_id = ? AND rl_date_updated >= ? ORDER BY rl_date_updated, rl_id to get the changed lists and their metadata.
  • Then, using the IDs from the previous query, SELECT ... FROM reading_list_entry WHERE rle_list_id IN (?) AND rle_date_updated >= ? ORDER BY rle_list_id, rle_date_updated, rle_id to fetch the changed entries.

This is for syncing data that has changed between multiple devices. So it will be more like rle_list_id IN (?) OR rle_date_updated >= ?, and ordering doesn't really matter (beyond continuation support) - the client wants to get all changes it has not seen yet.

"setup" and "teardown" seem like odd commands though. Why do these exist, instead of things being automagically set up for a user to use reading lists?

One reason is that users should have a default list without having to explicitly create it, we don't want to spam the database by creating it for all users (or all users who check their lists), and having separate setup/teardown commands is apparently easier than pushing the burden of creating/deleting the default list on them. Also, teardown (ie. delete all data and disable the service for the given user) is needed for legal reasons.

  • action=readinglistentries&command=[create|delete|order] for all the write operations on list entries.

This seems redundant to action=readinglists&command=[createentry|deleteentry|order].

Yeah, that was a copyedit mistake. Initially I figured those three could be grouped together since they have somewhat similar parameters, but using a single readinglists with submodules seems nicer.

This is for syncing data that has changed between multiple devices. So it will be more like rle_list_id IN (?) OR rle_date_updated >= ?, and ordering doesn't really matter (beyond continuation support) - the client wants to get all changes it has not seen yet.

An OR like that probably can't have any good index support in MySQL/MariaDB. To avoid a scan over all rle rows by the user, you'd need to split it into two queries.

One reason is that users should have a default list without having to explicitly create it, we don't want to spam the database by creating it for all users (or all users who check their lists), and having separate setup/teardown commands is apparently easier than pushing the burden of creating/deleting the default list on them.

Setup when the user first accesses it would handle that, I'd think. But, meh.

Also, teardown (ie. delete all data and disable the service for the given user) is needed for legal reasons.

I'll have to take your word on that. But it could be named better than "teardown" in that case.

Tgr closed this task as Resolved.Aug 25 2017, 1:19 AM

Closing, nothing else left to do here. The OR issue from the previous comment was fixed, the naming and usefulness of setup/teardown is tracked in T171913: Fix technical debt in ReadingLists extension.

Restricted Application added a subscriber: jeblad. · View Herald TranscriptAug 25 2017, 1:19 AM