
MediaWiki does not support consistent pagination on non-unique fields
Open, Needs TriagePublic

Description

Problem
MediaWiki's IndexPager uses an offset URL parameter to determine where the next page should start. This strategy is described well in this blog post.

The strategy has a lot of benefits. Primarily, it is much faster than OFFSET-based pagination and prevents the user from getting duplicated results as new data is added to the table while they are paginating. However, there are some drawbacks. The biggest one is that it only works properly on unique database columns: if the sorting column is non-unique, it shows the user duplicate content, which is the very problem it seeks to solve.
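For concreteness, here is a rough sketch of both behaviours against MediaWiki's revision table. This is illustrative SQL only, not the query IndexPager actually builds:

```sql
-- Seek/keyset pagination on a unique column (rev_id): the offset URL
-- parameter carries the last rev_id of the previous page, so paging is
-- index-friendly and stable even while new rows are being inserted.
SELECT rev_id, rev_timestamp
FROM revision
WHERE rev_id > 12345              -- value taken from the offset parameter
ORDER BY rev_id
LIMIT 51;                         -- one extra row to detect a next page

-- The same pattern on a non-unique column (rev_timestamp) misbehaves:
-- rows sharing the boundary timestamp are either skipped (with >) or
-- shown again on the next page (with >=).
SELECT rev_id, rev_timestamp
FROM revision
WHERE rev_timestamp >= '20200101000000'
ORDER BY rev_timestamp
LIMIT 51;
```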

During the development of T237300 it became clear that there is a need to be able to sort a paginated table of data on a non-unique field.

Proposed Solution
Obviously it is better for developers to use the seek method whenever possible (and that should be the default).

However, MediaWiki could provide the option of allowing sorting on non-unique fields by using SQL OFFSET. This is only a major performance problem if the user goes beyond a reasonable number (50?) of pages. To limit the performance impact, this feature should be limited to the UI and not be available in the API. It should also be limited to read-only database queries so that the replicas are used. This feature could additionally be restricted to authenticated users.
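As a sketch of what that would look like at the SQL level (illustrative only; the page-depth cap would be enforced in PHP before the query is built):

```sql
-- OFFSET-based pagination on a non-unique column. Correctness no longer
-- depends on uniqueness, but the database must scan and discard the first
-- OFFSET rows, so cost grows linearly with page depth.
SELECT rev_id, rev_timestamp
FROM revision
ORDER BY rev_timestamp
LIMIT 50 OFFSET 2500;   -- page 51 at 50 rows per page; a cap of ~50 pages
                        -- keeps the number of discarded rows bounded
```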

Alternative Solutions
T244492 will add support for using a multi-column offset, which could be combined with a primary key to achieve pagination on a non-unique field. However, one of our use cases is to paginate a query with a GROUP BY clause, which leaves no unambiguous primary key to use as a tiebreaker.
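To make the contrast concrete, here is an illustrative sketch of the multi-column approach and of the kind of grouped query it cannot help with. The aggregation query is a made-up CheckUser-style example, not the actual query from T237300:

```sql
-- Multi-column keyset: adding the primary key as a tiebreaker makes the
-- (rev_timestamp, rev_id) tuple unique, so seek pagination stays correct
-- even though rev_timestamp alone is not unique.
SELECT rev_id, rev_timestamp
FROM revision
WHERE (rev_timestamp, rev_id) > ('20200101000000', 12345)
ORDER BY rev_timestamp, rev_id
LIMIT 51;

-- With a GROUP BY, no row-level primary key survives the grouping, so there
-- is nothing unique to use as the tiebreaker when sorting on an aggregate.
SELECT cuc_ip, COUNT(*) AS actions, MAX(cuc_timestamp) AS last_seen
FROM cu_changes
GROUP BY cuc_ip
ORDER BY last_seen;     -- last_seen is not unique across groups
```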

Work Around
We could extend IndexPager in the CheckUser extension and implement SQL OFFSET-based pagination within the extension itself. This would localize the pattern to only where it is truly needed.

Event Timeline

dbarratt updated the task description.

I see T244579 is done: the change is merged so it presumably should be closed. Does that answer all use cases in T236225, or is there something remaining that still needs OFFSET?

As part of T237300 it appears that we do need to use an OFFSET-based sort, since the query contains a GROUP BY and the primary key is ambiguous. We've kicked the can on this a bit until this task can be resolved, since there seem to be some objections to doing it this way.

Obviously this isn't an ideal solution. I'm not advocating general usage of the pattern, but in this use case there doesn't seem to be an alternative. Or at least not one that I am aware of.

The product we're discussing (an expansion to CheckUser) deals with potentially large chunks of data from the database, which triggers a number of performance problems. Regardless of which method we use, we have to come up with a good way to deal with that. After discussing with the team and talking things over with @Catrope for advice on how CheckUser has historically extracted potentially large chunks of information to process, I think we have a slightly changed strategy, and a plan.

Specifically, a couple of points about this product:

  • We've agreed as a team that we will not use OFFSET.
  • We've looked into how the current CheckUser operation works, and have identified potential ways for us to solve our current problem with the new product.
  • The product is flexible, and our product manager has told us that if we need to adjust some of the outputs to fit the engineering, that's doable.

With all that in mind, I think we're going to attempt first to explore the method that does not involve using OFFSET with GROUP BY for this product.

> explore the method that does not involve using OFFSET with GROUP BY for this product.

What is that method? The only alternative I can think of is to aggregate the data and store it in a new table, thus removing the need for a GROUP BY. This would be similar to the way user.user_editcount works. If that is the preferred solution, then fantastic. 🙂 It felt like a little too much overhead on writes, but I suppose it's no more overhead than updating the user's edit count.
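If that is indeed the direction, a minimal sketch of the idea might look like this. All table and column names here are hypothetical, invented purely for illustration; the real design would live in the CheckUser schema:

```sql
-- Hypothetical summary table, maintained on write in the same spirit as
-- user.user_editcount: the aggregation happens incrementally, so reads need
-- no GROUP BY and get a genuinely unique key to paginate on.
CREATE TABLE cu_ip_summary (
    cuis_ip        VARBINARY(255) NOT NULL PRIMARY KEY,
    cuis_actions   INT UNSIGNED   NOT NULL DEFAULT 0,
    cuis_last_seen BINARY(14)     NOT NULL
);

-- On each logged action (write path, ideally as a deferred update):
INSERT INTO cu_ip_summary (cuis_ip, cuis_actions, cuis_last_seen)
VALUES ('192.0.2.1', 1, '20200101000000')
ON DUPLICATE KEY UPDATE
    cuis_actions   = cuis_actions + 1,
    cuis_last_seen = VALUES(cuis_last_seen);

-- Reads then use ordinary seek pagination on the unique primary key:
SELECT cuis_ip, cuis_actions, cuis_last_seen
FROM cu_ip_summary
WHERE cuis_ip > '192.0.2.1'       -- offset parameter from the previous page
ORDER BY cuis_ip
LIMIT 51;
```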