Create and enforce a limit for number of Reading Lists and Reading Lists entries for each user
Closed, Resolved · Public

Description

To make sure we do not overload the new extension and database tables, let's set some sane limits on the number of items users can store.

For the first pass, let's set the limits to:
100 lists
1000 items per list

After the first client starts beta testing, we can adjust as needed.

Event Timeline

Following up on a conversation in the RI meeting today:

@Tgr let me know that there is discussion on T174126: Security review for the ReadingLists extension about the limits, with a push to reduce the number of items per list from 1000 to 100. This seems low to me for the use case.

@JMinor can you chime in here with a product view on this limit?

The relevant recommendations by @jcrespo back from the RfC discussion:

This has to be massively reduced in scope - you can argue the details, this is just an example that would make sense to me: think 10 lists per user, 100 items per list, functionality disabled by default (opt-in), storage purged if not used after X number of weeks/months (whatever covers 99% of the current uses). This doesn't mean it cannot grow in the future, but I would start very, very small given the resource restrictions.

(T164990#3264314)

Promise you will not use more than 10-20GB in total with the current resources and you will get me happy.

(T164990#3267805)

I apologize for losing track of that. The second is IMO not problematic even with iOS and desktop included (see T164990#3262187). The first seems pretty harsh, especially the lists-per-user limit. I wonder if there is some other way to limit DB load? Like limiting paging to 10 lists / 100 entries per request?

(Note that these restrictions are until T177462 - ie. somewhere between half a year and one year.)

Sure, I mentioned paging somewhere as a way to go over the limits (paging here means technically they are multiple lists, but it is transparent for the user).

@jcrespo sorry didn't see this last response here…

So just to be clear: if we want to have larger lists, that is ok as long as:

  1. We page the responses at 100 items at a time so the DB is not overloaded
  2. Promise to stay below the 10-20 GB size threshold

Is this correct? Thanks again

I know this is a stupid limit, but honestly, we are starting with -1 pieces of hardware :-D ! A different conversation will be when proper dedicated (or shared, but separate) hardware is used, especially given that at the start there will be lots of writes. Having a small but configurable "chunk" will help ease the limited resources. I personally do not care how large the lists are from the user's perspective, as long as they are never accessed in larger chunks from the perspective of the database (which is different from how the user actually sees them).
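
To make the "chunk" idea concrete, here is a minimal sketch of what chunked access could look like at the query level: a seek-style (keyset) condition instead of an OFFSET, so the database only ever touches one chunk per query. Table and column names are illustrative assumptions, not necessarily the extension's actual schema.

```
-- Read one chunk of a list, resuming from the last row returned by the
-- previous chunk (keyset/seek paging rather than OFFSET, so the database
-- never scans past rows it has already served).
SELECT rle_id, rle_title
FROM reading_list_entry
WHERE rle_rl_id = 42            -- the list being read
  AND rle_id > :last_seen_id    -- resume point from the previous chunk
ORDER BY rle_id
LIMIT 100;                      -- configurable chunk size
```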

An example of a failure is the watchlist issue: designed to hold no more than a few hundred items, it now has single users with hundreds of thousands, creating all kinds of performance issues because it must be read in a single pass - and that is with dedicated resources (special large watchlist replicas); imagine that on smaller, non-dedicated resources.

There are three issues with paging that cannot be handled trivially by changing API result size limits:

  • paging is currently offset-based, so it doesn't really reduce the DB resultset size. I've already committed to fixing that before moving to production; it was just a hack to get things out for beta testing quickly.
  • getting all lists / all entries of a list is a reading_list + reading_list_sortkey (reading_list_entry + reading_list_entry_sortkey) query where the first table is used for filtering (user/list + deleted flag) and the second for sorting. That means a filesort, so again paging probably won't help much (see the query sketch after this list). Some options:
    • go back on T164808 and put the sortkeys in the same table
    • just drop sorting of results altogether (it wasn't in the specification since this is used for sync; just seemed like a nice thing to do).
  • the get/set order endpoints are not paged (and have the same problems as above). Some options:
    • go back on T164808 and put the sortkeys in the same table, then page (con: updating would mean a thousand UPDATE queries; paging updates seems a little scary)
    • just use a serialized string for sortkeys and treat it as client data: a piece of metadata the service stores without any knowledge that it is related to sorting (pro: order reads/writes become trivial single-row queries; con: it will increase the size of list rows, validation/keeping it in sync with changes becomes difficult, and it's probably not very helpful for desktop (but by that time we should have more servers))
    • denormalize data so that no joins are needed (copy user into reading_list_sortkey; copy list id into reading_list_entry_sortkey; either copy deleted flags or just hard-delete the sortkeys). These would make rows small (3-4 ints) so presumably operating on a thousand rows at once wouldn't be a problem.
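
To make the filesort problem and the denormalization option concrete, here is a rough sketch of the two query shapes. All table and column names are illustrative assumptions; the real schema may differ.

```
-- Current shape: reading_list_entry filters (list id + deleted flag), the
-- sortkey table provides the ordering. Sorting on a column from the joined
-- table forces a filesort, so paging alone does not reduce the work much.
SELECT e.rle_id
FROM reading_list_entry e
JOIN reading_list_entry_sortkey s ON s.rles_rle_id = e.rle_id
WHERE e.rle_rl_id = 42
  AND e.rle_deleted = 0
ORDER BY s.rles_index;

-- Denormalized shape: copy the list id into the sortkey table (and hard-delete
-- its rows), so an index on (rles_rl_id, rles_index) can both filter and sort
-- without a join or a filesort. Rows stay tiny (a few ints each).
SELECT s.rles_rle_id
FROM reading_list_entry_sortkey s
WHERE s.rles_rl_id = 42
ORDER BY s.rles_index;
```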

Hello, since this involves the syncing of existing lists, we can already provide some evidence-based guidance on reasonable usage limits.

The average number of pages per list is ~25 on Android, so most users would be fine. However, there is a not-insignificant number of users (2.5% of all reading list users) who have more than 100 articles, but only ~160 users with more than 1000. This is why we supported 1000 as the reasonable cap; 100 is a pretty drastic cut.

Change 389411 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/ReadingLists@master] Add list number and list size limits

https://gerrit.wikimedia.org/r/389411

Change 389411 merged by jenkins-bot:
[mediawiki/extensions/ReadingLists@master] Add list number and list size limits

https://gerrit.wikimedia.org/r/389411

This now works as stated in the task description; the problems in T168984#3662141 will go away with T180402 (except for the offset-based paging, which is a simple fix), so marking as resolved. The stricter limits that @jcrespo asked for will have to be implemented via paging.

@jcrespo checking the list size has been implemented with a straightforward SELECT COUNT(*) (with a covering index, and counts guaranteed to be <1000), which will be the only query where the more aggressive result size limits you mentioned in T164990#3264314 are impossible to implement. Any concerns about that? I know InnoDB is not very performant with counts, but here the resultset is small and won't have many pending transactions (due to only list owners being allowed to read/write list data), so I assumed that should not be a problem.

(For reading_list_entry the <1000 resultset is not actually guaranteed, since the index does not include the deleted flag, and we only limit the number of non-deleted items on the list. That will be fixed in the next batch of schema changes.)
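
For reference, a minimal sketch of the limit check described above, assuming a covering index over the list id and deleted flag; the table, column, and index names are illustrative, not necessarily the extension's actual schema.

```
-- List-size check: with a covering index on (rle_rl_id, rle_deleted), InnoDB
-- can answer this from the index alone, touching at most ~1000 rows per list.
SELECT COUNT(*)
FROM reading_list_entry
WHERE rle_rl_id = 42        -- the list being written to
  AND rle_deleted = 0;      -- only non-deleted entries count toward the limit
```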

SELECT COUNT(*)

I cannot say; use the handler counters to check how many rows are accessed. If it is half the table, that is really bad; if it is around 1000 or 2000, your queries are efficient and well indexed.
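
Checking that could look roughly like the following, using the standard MySQL/MariaDB session status counters; the query in the middle stands in for the count described above, with the same illustrative names as before.

```
-- Reset the session counters, run the query under test, then see how many rows
-- the storage engine actually touched.
FLUSH STATUS;

SELECT COUNT(*)
FROM reading_list_entry
WHERE rle_rl_id = 42 AND rle_deleted = 0;

-- Handler_read_key / Handler_read_next approximate the rows accessed: numbers
-- in the low thousands mean the query is well indexed; numbers near the table
-- size mean a scan.
SHOW SESSION STATUS LIKE 'Handler_read%';
```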