Page MenuHomePhabricator

Orphaned revisions with rev_page pointing to nonexistent page
Open, Needs TriagePublicBUG REPORT

Description

Revisions on enwiki that have a non-zero rev_page pointing to a page that doesn't exist:

587041661
637192080
639225502
649996142

By itself this doesn't seem to cause any user-facing problems, but if it means we get different results based on whether we're JOINing on the page table. T33864: Make sure revision queries JOIN on page in most places suggests orphaned revisions are known to happen.

Original report for XTools

The yearly and monthly bar charts should show every live edit. However, I have noticed that if you add together every bar from the yearly bar chart, that total is different from the total reported at the top of the page as the "live" total. Specifically, for mine - https://xtools.wmcloud.org/ec/en.wikipedia/Enterprisey - the bars total up to 31988 but the total reported at the top is 31992. I have no idea where the four extra edits are coming from.

XTools version: 3.18.6-bfa68078

Event Timeline

MusikAnimal subscribed.
SELECT COUNT(rev_id) FROM revision_userindex WHERE rev_actor = 75321;

gives 31992 as you report.

For year counts, we actually just sum the month values. Here's a simplified version of the query:

SELECT YEAR(rev_timestamp) AS `year`,
    MONTH(rev_timestamp) AS `month`,
    page_namespace AS `namespace`,
    COUNT(rev_id) AS `count`
FROM revision_userindex JOIN page ON (rev_page = page_id)
WHERE rev_actor = 75321
GROUP BY YEAR(rev_timestamp), MONTH(rev_timestamp), `namespace`

I then imported that data into a spreadsheet and summed the count column, and indeed it gives 31988.

The queries look correct to me. My first guess was this could be an issue with the subqueries the Toolforge replicas make behind the scenes, or perhaps something with the revision_userindex view, but I'm seeing the same results when I run the same queries on production. Hmm…

SELECT COUNT(rev_id)
FROM revision_userindex
JOIN page ON rev_page = page_id
WHERE rev_actor = 75321;

returns 31988. So the JOIN on the page table is what's causing the inconsistency.

On enwiki, the page IDs for those four edits don't seem to exist in the page table: (query https://quarry.wmcloud.org/query/79659)

SELECT
  *
FROM
  revision_userindex
  LEFT JOIN `page` p ON rev_page = page_id
WHERE
  rev_actor = 75321 AND
  p.page_id IS NULL
ORDER BY rev_timestamp ASC
rev_id,rev_page,rev_comment_id,rev_actor,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,page_id,page_namespace,page_title,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model,page_lang
587041661,41422445,187501682,75321,20131221021132,0,0,68,0,mubb4bnz42to8ix30ctypafjird48f9,,,,,,,,,,,,
637192080,44637671,200767530,75321,20141208171824,0,0,44,0,k6tl059vrpxmkz70m7dsvtxxf1do1z8,,,,,,,,,,,,
639225502,44808986,201281406,75321,20141222193010,0,0,59,0,rbnicsooya67s53te7euj0a4py5lxk9,,,,,,,,,,,,
649996142,45584209,204115633,75321,20150305144034,0,0,52,0,exl00qndkvhzu759yb0mgl7ffmkudrt,,,,,,,,,,,,

However, on testwiki, I've also found four edits that are "missing" like this, not from me, but from AGK (randomly chosen from ListAdmins), but these have rev_page values of 0 instead: (query https://quarry.wmcloud.org/query/79657)

SELECT
  *
FROM
  revision_userindex
  LEFT JOIN `page` p ON rev_page = page_id
WHERE
  rev_actor = 67 AND
  p.page_id IS NULL
ORDER BY rev_timestamp ASC
rev_id,rev_page,rev_comment_id,rev_actor,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,page_id,page_namespace,page_title,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model,page_lang
164253,0,48049,67,20120422202503,1,0,49,0,0ldr1bwfa2dk0in08ye1irt2ff6cgof,,,,,,,,,,,,
164246,0,48043,67,20120423102405,1,0,47,0,fah2f2w356lmfutqtthrbsa95pbadbg,,,,,,,,,,,,
164245,0,48041,67,20120423103411,1,0,50,0,fpn7y39fucdv6m3qlc3vycq0xml4et7,,,,,,,,,,,,
164247,0,48045,67,20120423103413,1,0,48,0,5frf7wwec0jiyqh3yf68re7d0rqg8cn,,,,,,,,,,,,

Update on those enwiki edits: a friendly oversighter (suppress for everyone here) has confirmed that "You have either not specified any target revision on which to perform this function, or the specified revision does not exist, or you are attempting to hide the current revision." appears when they attempt to query if anything's interesting about those edits from an oversighter perspective. I'm not sure how accurate that message is if I got the revid from the revision table! :p

SELECT COUNT(rev_id)
FROM revision_userindex
LEFT OUTER JOIN page ON rev_page = page_id
WHERE rev_actor = 75321;

that gives the correct count, and doesn't seem to have a large performance impact. That's probably fine for counting, but say if we wanted to iterate over the results, we can't show the page title. There's still a data integrity issue here…

MariaDB [enwiki_p]> SELECT page_namespace, page_title FROM page WHERE page_id IN (41422445, 44637671, 44808986, 45584209);
Empty set (0.002 sec)

Here's the edit summaries:

MariaDB [enwiki_p]> SELECT comment_text FROM comment WHERE comment_id IN (187501682, 200767530, 201281406, 204115633);
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| comment_text                                                                                                                                                                                                       |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| APerson moved page [[Talk:Artificial neural network/archive]] to [[Talk:Artificial neural network/Archive]]: As per [[H:ARC]], standard archive naming style                                                       |
| APerson moved page [[Draft:Fig Tree Books]] to [[Fig Tree Books]]: Publishing accepted [[Wikipedia:Articles for creation|Articles for creation]] submission ([[WP:AFCHRW|afch-rewrite]] 0.9)                       |
| APerson moved page [[User talk:Ddmteam]] to [[Draft talk:Symbiote (company)]]: Preferred location for AfC submissions                                                                                              |
| APerson moved page [[User:Peaceray/sandbox/Template:Infobox beach]] to [[Template:Infobox beach]]: Publishing accepted [[Wikipedia:Articles for creation|Articles for creation]] submission ([[WP:AFCH|AFCH]] 0.9) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.012 sec)

Focusing in on the first example, we have a rev_page of 41422445. There is no page with such ID, however we can determine it was at some point Talk:Artificial_neural_network/archive by querying logging:

mysql:research@dbstore1008.eqiad.wmnet [enwiki]> SELECT log_namespace, log_title FROM logging WHERE log_page = 41422445;
+---------------+-----------------------------------+
| log_namespace | log_title                         |
+---------------+-----------------------------------+
|             1 | Artificial_neural_network/archive |
+---------------+-----------------------------------+
1 row in set (0.001 sec)

The current ID for this page is 41422454, so whatever happened was in short duration after the original move. The log shows another move with a redirect left behind. So, it would seem something went wrong when moving the page.

The other examples from enwiki appear to now point to a deleted page, but similarly the data corruption appears to have happened during a page move prior to deletion.

MusikAnimal renamed this task from Live count doesn't match total from yearly/monthly bar chart to Orphaned revisions with rev_page pointing to nonexistent page.Jan 16 2024, 4:22 AM
MusikAnimal updated the task description. (Show Details)

However, on testwiki, I've also found four edits that are "missing" like this, not from me, but from AGK (randomly chosen from ListAdmins), but these have rev_page values of 0 instead: (query https://quarry.wmcloud.org/query/79657)

A rev_page with ID 0 seems to be a known issue (T309185, T18674).

Anyway, for XTools, I think I'll just take the advice of T33864 and always JOIN on page. From what I can tell, the orphaned revisions found in this task are simply not supposed to exist, so it's safe to just skip them.

This is fixed in XTools now, and will go out with the next release (soonish). Removing XTools as the Wikimedia-database-issue is still valid.