Page MenuHomePhabricator

Special:BlockList paginates on a non-unique column
Closed, ResolvedPublic

Description

SpecialBlockListPager paginates using the timestamp as an offset to get the next or previous page. When paginating on a non-unique column, results can go missing between pages. (Briefly: if the limit has been reached for this page, but there are still more rows with the same value as in the final row of this page, those extra rows won't show up on the next page either, because the next page will be conditioned on having a greater value in that column.)

Not only are timestamps not unique for blocks; but blocks and autoblocks that are created at the same time have the same timestamp, so it is common for blocks to share timestamps. Whenever such blocks span two pages, those that didn't make it onto this page will be lost. (Though they can still be seen by changing the page length, filtering, going backwards from the next page, etc.)

The ipb_address column should be unique in practice; however, paginating on this column instead of timestamp would mean that we would order by username/IP, which is less useful than ordering by timestamp.

If IndexPager could paginate on multiple columns that are unique in combination, then we could paginate on the combination of timestamp and ipb_address. We would order by timestamp then address, and no rows would go missing between pages.

Since we are working on this problem for CheckUser, we should consider upstreaming it and fixing Special:BlockList at the same time.

Event Timeline

Tchanders created this task.Feb 3 2020, 6:56 PM
Restricted Application added subscribers: MGChecker, Aklapper. · View Herald TranscriptFeb 3 2020, 6:56 PM

This may be a non-starter for now, because the solution involves exposing the IP of autoblocks in the URL, as the offset.

Perhaps some time in the future, once we have a system for hiding IPs, we'll be able to revisit this task.

@Tchanders now that T244492 has been resolved I imagine this can be done?

This may be a non-starter for now, because the solution involves exposing the IP of autoblocks in the URL, as the offset.

I think you could always use the Block ID and that would prevent that?

@dbarratt Good point.

If we were certain that timestamp always increased with ID, we could even paginate on the single ipb_id column. It seems this is not the case (see DatabaseBlock::updateTimestamp), so we can paginate on timestamp and ID together.

Change 572861 had a related patch set uploaded (by Tchanders; owner: Tchanders):
[mediawiki/core@master] BlockListPager: Paginate on timestamp and block ID

https://gerrit.wikimedia.org/r/572861

If we were certain that timestamp always increased with ID, we could even paginate on the single ipb_id column. It seems this is not the case (see DatabaseBlock::updateTimestamp), so we can paginate on timestamp and ID together.

Yeah the timestamp is non-unique as well, so you could (in theory I suppose) have two different blocks with the same timestamp.

If we were certain that timestamp always increased with ID, we could even paginate on the single ipb_id column. It seems this is not the case (see DatabaseBlock::updateTimestamp), so we can paginate on timestamp and ID together.

Yeah the timestamp is non-unique as well, so you could (in theory I suppose) have two different blocks with the same timestamp.

Which is the whole point of this task... I really need to drink my ☕️ before responding. :)

Change 572861 merged by jenkins-bot:
[mediawiki/core@master] BlockListPager: Paginate on timestamp and block ID

https://gerrit.wikimedia.org/r/572861

dmaza added a subscriber: dmaza.Feb 21 2020, 7:37 PM

I asked on the patch but just in case.. why can't we use ipb_id alone?

dmaza added a comment.Feb 21 2020, 7:44 PM

@dmaza It's explained in T244157#5892494

I don't understand. Why do we care about timestamp when ipb_id is unique and auto-increment?

@dmaza It's explained in T244157#5892494

I don't understand. Why do we care about timestamp when ipb_id is unique and auto-increment?

See T244157#5892494

sequence numbers must not be guarantee (from the DBMS) to be in sequence. (That maybe true for mysql or oracle, but it is not for DB2 - but that is not supported by mediawiki)

It also possible to have maintenance scripts creating old blocks etc.

dmaza added a comment.Feb 21 2020, 7:49 PM

See T244157#5892494

sequence numbers must not be guarantee (from the DBMS) to be in sequence. (That maybe true for mysql or oracle, but it is not for DB2 - but that is not supported by mediawiki)

It also possible to have maintenance scripts creating old blocks etc.

Good point. Got it.

Umherirrender closed this task as Resolved.Feb 21 2020, 7:51 PM
Umherirrender triaged this task as Medium priority.
dom_walden added a subscriber: dom_walden.

I wanted to test that on Special:BlockList:

  • No results (blocks) were missing or duplicated across pages
  • The order of results was "correct".

I used a script to go page-by-page through each limit (20, 50, 100, 250) until 500 results are reached. I then compared these results with a single page of 500 results.

I also compared to the API, so I could also check the results are being returned in the same order as the API.

I only did this for the first 500 results. I did not experiment with the other parameters (e.g. type of block) or going in reverse (from oldest block to newest).

Version: https://en.wikipedia.beta.wmflabs.org MediaWiki 1.35.0-alpha (46776c6) 07:27, 24 February 2020

@dom_walden That's really thorough. Thank you.