
Compare sortkeys vs. linked lists for Reading List API providing sorting features
Closed, ResolvedPublic

Description

The Reading List Service API will provide support for storing the order of a list of items in the backend and allowing the user to update that order. The updating operations are: moving a given list item to a given position (i.e. drag & drop sort) and sorting the whole list at once (e.g. alphabetically). The second one might or might not be needed.

In the discussion, two implementations were brought up:

  1. Use a sort key (such as an integer weight). Have MySQL order by that field. On single-item-update, select a new sortkey accordingly (e.g. (weight of previous item + weight of next item) / 2). There will be some rare cases where that would not work (e.g. the previous and next items already have consecutive weights), in which case the sortkeys for the whole list need to be renormalized. (A sketch of this is shown below the list.)
  2. Use linked lists. Always fetch all items of the list from the DB, sort and page them in the application logic. On single-item-update, just update the links for the item and its old and new neighbours.
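
A minimal sketch of option 1, assuming a hypothetical reading_list_item table with an integer sortkey column (all names here are illustrative, not an actual schema):

-- Hypothetical schema for option 1; names are illustrative only
CREATE TABLE reading_list_item (
  item_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  list_id BIGINT UNSIGNED NOT NULL,
  title VARBINARY(255) NOT NULL,
  sortkey INT NOT NULL,
  KEY list_sortkey (list_id, sortkey)
);

-- Moving item 42 between two neighbours: take the midpoint of their sortkeys
-- (previous neighbour = 150, next neighbour = 200 in this example).
-- If the neighbours already have consecutive sortkeys, the whole list has to be
-- renormalized (e.g. rewritten as 100, 200, 300, ...) and the move retried.
UPDATE reading_list_item
SET sortkey = (150 + 200) DIV 2
WHERE item_id = 42;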

Event Timeline

Option 1 means touching one row in most updates, and all rows in some cases. Option 2 means touching three rows per update (can be done in a single query, although it will be a bit ugly).

Given the requirement that a newly added item needs to appear first, there would be no updates performed in option 2 (excluding the obvious INSERT that needs to happen no matter what). When reordering, there would need to be 3 updates, as pointed out above: the item that preceded the moved item before the reordering, the item that precedes it after the reordering, and the moved item itself. If the item is being moved to the top of the list, then only 2 updates are needed.
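
To make those row counts concrete, here is a sketch of option 2, assuming a hypothetical next_item_id column used as the forward link (NULL for the tail; names are illustrative):

-- Move item 42 so that it follows item 7; three rows are touched.

-- 1. The old predecessor of item 42 skips over it.
UPDATE reading_list_item old_prev
JOIN reading_list_item moved ON old_prev.next_item_id = moved.item_id
SET old_prev.next_item_id = moved.next_item_id
WHERE moved.item_id = 42;

-- 2. Item 42 takes over item 7's current successor.
UPDATE reading_list_item moved
JOIN reading_list_item new_prev ON new_prev.item_id = 7
SET moved.next_item_id = new_prev.next_item_id
WHERE moved.item_id = 42;

-- 3. Item 7 now points at item 42.
UPDATE reading_list_item
SET next_item_id = 42
WHERE item_id = 7;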

Option 2 also means needing to pull the entire list into memory and sort it in memory every time the list is requested. This can introduce additional overhead for long lists where the results will be paged, because if a user requests page 2 of the results, then we need to pull the entire list into memory, sort it, and then pull out the 2nd page of results. The same must be done for page 3, page 4, etc…

Option 1 relies on MySQL to return the list in order, without the overhead of in-memory sorting, and it can efficiently return a new page by using a cursor.
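
For example, assuming the sortkey column from option 1 (names illustrative), a page can be fetched directly with a cursor-style query:

-- Continue from the last sortkey returned on the previous page (175 here)
SELECT item_id, title, sortkey
FROM reading_list_item
WHERE list_id = 1234 AND sortkey > 175
ORDER BY sortkey
LIMIT 50;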

Option 2 also means needing to pull the entire list into memory and then sorting in memory every time the list is requested. This can be problematic for long lists where the results will be paged, because if a user requests page 2 of the results, then we need to pull the entire list into memory, sort, and then pull out the 2nd page of results. The same must be done for page 3, page 4, etc…

I wouldn't qualify sorting even thousands of entries as problematic, really. It can be done in a millisecond or so, so it's not comparable with the latencies for connecting to the server, etc. For comparison, the MW API has an approximate start-up time of tens of milliseconds regardless of the call being made (there is no such overhead for RESTBase, but that's beside the point here).

@mobrovac good point. I updated the comment to remove "problematic" and put in my real intention, which is just to show that there is extra work being performed in these situations.

I shouldn't have made a judgement on whether that extra work actually has a meaningful negative impact… I'm not actually sure.

In memory/in the application, you use whatever you want and whatever takes less space. For persistence, the way to do this (not because it is a rule, it is just how millions of applications do it, unless you want to do a very special list implementation) is to have a "sort_index" column and index it. When retrieving a list, you filter by a list id or similar and order by that sort key, so you add an index on (list_id, sort_index) and can do SELECT col1, col2, ... FROM my_table WHERE list_id = 'my_list_id' ORDER BY sort_index [with optional LIMIT];, of course possibly containing extra joins, etc. Sometimes sort + join misbehaves, and there are some things to do there, but that is a known issue with our MariaDB version that already has a quick fix. For paging, if needed, you use the last sort_index used.
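
A minimal sketch of that layout, reusing the hypothetical names from the query above:

-- Composite index so the filter and the ORDER BY can both be served from the index
ALTER TABLE my_table ADD KEY list_sort (list_id, sort_index);

-- Next page, if paging is needed: continue from the last sort_index already returned
SELECT col1, col2 FROM my_table
WHERE list_id = 'my_list_id' AND sort_index > 30
ORDER BY sort_index
LIMIT 20;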

I understand the need for linked lists in memory, but references are technically not possible in a relational model, so it is not possible to do that (directly). Reading and writing the whole list would be preferred unless you have thousands or millions of entries, in which case there are both UI and database challenges (and I would like to see how you "solve" the UI issues in that case first) - and then we can try to work out less simple models. If I were to start optimizing, I would start by storing references to titles rather than the strings themselves, but that is probably still out of scope. Then, the next optimizing step would be to continue using a pseudo-stack model (you delete from anywhere, but always insert physically at the bottom), but using a separate table for the indexing so you minimize writes/physical moves (you only write lists of integers each time). But I would encourage the simplest model first - if something MediaWiki-sized has taught me anything, it is that the bottleneck will be in a different place than expected, so I would avoid premature optimization.

As usual, I do not have enough context to give an informed opinion, so I may be wrong here (I am assuming many things).

and I would like to see how you "solve" the UI issues in that case first

That was not a challenge - for example, one option for editing a large list of results is to page it in blocks when shown on the frontend - in that case, the database should follow the same model of editing in "blocks" at a time. For large batch edits like "delete all items" etc., an asynchronous job queue could be set up. Again, I am still speculating a lot here 0:-)

With the above, you may have seen that I do not much like the "sort key" method you propose - but if we are going "dirty" and doing that, we would put a float instead of an int so we would have an infinity* of values in-between [*where infinity = log2(10^23) approx.] and it would fail "softly" - the page table uses a similar approach for its random ordering. The linked list is a big no because you have O(log n ~ 1) writes and O(n^2 to n log n) reads; that is very good in memory, but it is really bad in the context of MySQL persistence or traditional storage systems (except maybe graph databases).
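
As an illustration of that float variant (a sketch only; the sort_table name, the ids and the values are made up):

-- Insert a new item between neighbours whose sort_index values are 1.0 and 2.0
INSERT INTO sort_table (list_id, item_id, sort_index)
VALUES (1234234, 999, (1.0 + 2.0) / 2);   -- ends up with sort_index 1.5

-- Repeatedly splitting the same gap eventually exhausts the precision of the float,
-- at which point two items end up with equal keys and their relative order becomes
-- arbitrary (the "soft" failure) until the list is renormalized.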

We actually have a float as one of the solutions in our doc:
https://docs.google.com/document/d/1dZcpGWi8cuuMmXduANYt37pZq6B8Z66_x9dMBNADTFU/edit#

But we didn't bring it over.

If you have the time, this may provide more context and maybe help you with your recommendations.

Thanks for the input!

I just gave it a quick look, so sorry if I am wrong here - you did an incredibly good analysis/design job there, but I am missing more details about non-functional requirements and limitations - and it is in those details where the devil is. Lists like the watchlist have traditionally been lacking in features, and they have issues that the current model doesn't allow - or allows only with problems - mostly size and editing limitations. If this is a new service, maybe you can impose such limits from the beginning based on user feedback (do people need more than 100 lists? 1000 lists? More than 100000 items in a single list? etc. Should we prioritize read performance or write performance?) and that will automatically tell you which way to go. Specifically for storage and performance reasons, those decisions are incredibly important.

With the above, you may have seen that I do not much like the "sort key" method you propose - but if we are going "dirty" and doing that, we would put a float instead of an int so we would have an infinity* of values in-between [*where infinity = log2(10^23) approx.] and it would fail "softly" - the page table uses a similar approach for its random ordering.

That would fail quietly (log2(10^23) ~= 76, so it's not something we can just assume will never happen) and result in erratic behavior that is really hard to debug. CPU time is cheaper than developer time. And in any case, a float takes up the same space as an int, so you don't really win anything and you lose ease of understanding.

@jcrespo Marko actually suggested a sane upper limit on the number of reading lists, in the hundreds (maybe 500). We haven't discussed a limit on list size, but that could be added as well if you have a number you think is appropriate.

And yeah, some details like that are still being added… please feel free to add specific comments to the document as well.

As far as prioritizing reads or writes… I'm thinking that reads will be more common (especially when getting the list of lists), so I think that would be the most important. But I will defer to @Tgr on most of the implementation details.

With the above, you may have seen that I do not much like the "sort key" method you propose

How is your recommendation in T164808#3248466 different from the sort keys? Are you suggesting using consecutive numbers for the index field, and updating all list items when the user reorders something?

Would that involve lots of single-row UPDATEs in a transaction, or one huge update with SET index = CASE id WHEN <id1> THEN <index1>...? Or is there a better way to do it?
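
For reference, a sketch of what that single-statement variant could look like (the table and column names and the id values are purely illustrative):

-- Rewrite the whole ordering for one list in a single statement
UPDATE my_table
SET sort_index = CASE item_id
    WHEN 234563    THEN 1
    WHEN 4325435   THEN 2
    WHEN 543454432 THEN 3
    ELSE sort_index            -- leave any unlisted rows in the list untouched
END
WHERE list_id = 1234234;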

I do not feel too strongly about a particular implementation - there are all kinds of possibilities:

a) Use references to minimize the amount of data sent and written

DELETE FROM sort_table WHERE list_id = X;
INSERT INTO sort_table (list_id, item_id, sort_index) VALUES (1234234, 234563, 1), (1234234, 4325435, 2), (1234234, 543454432, 3)...

b) Denormalize, as it will always be written and read in bulk (OK for a small number of items, but you cannot do anything other than read and write in full batches):

UPDATE lists SET list_order = '234563,4325435,543454432' WHERE list_id = 1234234;

c) Implement an LSM on top of a relational system:

INSERT INTO user_log (user, action, list, value1, value2) VALUES (1, 'interchange_positions', 123143, 4323424, 543454432);

+ a batch job to purge actions and normalize results

These are just 3 examples of many possibilities: (a) would allow easier reads and is more flexible; (b) may be faster but is very inflexible; (c) has the fastest writes, but it would be a nightmare to maintain, and users would not see changes immediately (well, maybe with some smart caching). I do not know which model will be preferred for what you want to accomplish.

Sorry, I am not sure if I am helping or getting you more confused. Sorry.

Thanks. The first option seems simple enough, and avoiding premature optimization is good advice. I am going with that for now; we can discuss more in the RfC (T164990).