Page MenuHomePhabricator

updatePageTriageQueue doesn't remove old records if the page doesn't exist
Open, Needs TriagePublic

Description

(in the context of looking for data for something else)

I found the following row in pagetriage_page (among many others):

+--------------+---------------+--------------+----------------+-------------------+-----------------------+-----------------------+
| ptrp_page_id | ptrp_reviewed | ptrp_deleted | ptrp_created   | ptrp_tags_updated | ptrp_reviewed_updated | ptrp_last_reviewed_by |
+--------------+---------------+--------------+----------------+-------------------+-----------------------+-----------------------+
|     35126318 |             3 |            0 | 20091017203259 | 20121225230954    | 20121225230954        |               1115773 |
+--------------+---------------+--------------+----------------+-------------------+-----------------------+-----------------------+

This should have been removed long ago by updatePageTriageQueue, but that script's query includes [ 'page' => [ 'INNER JOIN', 'ptrp_page_id = page_id' ] ]. Trying to find what page that is reveals that it no longer exists, but it can be found in the archive table:

MariaDB [enwiki_p]> SELECT * FROM archive WHERE ar_page_id = 35126318 LIMIT 1;
+-------+--------------+-----------------------------------------+---------+---------------+-----------+----------------+---------------+----------+-----------+------------+--------+------------+--------------+---------------------------------+
| ar_id | ar_namespace | ar_title                                | ar_text | ar_comment_id | ar_actor  | ar_timestamp   | ar_minor_edit | ar_flags | ar_rev_id | ar_deleted | ar_len | ar_page_id | ar_parent_id | ar_sha1                         |
+-------+--------------+-----------------------------------------+---------+---------------+-----------+----------------+---------------+----------+-----------+------------+--------+------------+--------------+---------------------------------+
| 27744 |            4 | WikiProject_Abandoned_Drafts/Robb_Alvey |    NULL |          NULL | 134752204 | 20121224180606 |             0 |     NULL | 529626966 |          0 |   5353 |   35126318 |    529103543 | et1ousxecir2c1jkbsgu6hp4vu2i91d |
+-------+--------------+-----------------------------------------+---------+---------------+-----------+----------------+---------------+----------+-----------+------------+--------+------------+--------------+---------------------------------+

I propose that any record for a page that doesn't exist in the page table be automatically deleted after a wait period. On toolforge replicas, a quick search:
SELECT * FROM pagetriage_page WHERE ptrp_page_id NOT IN (SELECT page_id FROM page);
Returned 14609 such results.

For reference, SELECT COUNT(*) FROM pagetriage_page; returned 153539, meaning that these rows for deleted pages reflect 9.5% of the entire table.

Should this be added to the updatePageTriageQueue script, or should a new script be added?

Event Timeline

Restricted Application added projects: Growth-Team, User-DannyS712. · View Herald TranscriptJun 30 2020, 3:21 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Marostegui added a subscriber: Marostegui.

This is more a MW issue than a DB itself issue, however, if in the end it makes sense to delete things, let's make sure it is done in batches and with a wait for replication as DELETEs are expensive.