
No way to get the ID of a deleted page from deletion logs
Closed, Resolved (Public)

Description

Using the "recentchanges" option, I can get the title of a deleted page but there is no way to learn the corresponding ID. Is there any workaround to this?


Version: unspecified
Severity: minor

Details

Reference
bz26122

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:21 PM
bzimport set Reference to bz26122.

Deleted pages, being deleted, don't have page IDs.

How come deleted pages have names but not IDs?

(In reply to comment #3)

How come deleted pages have names but not IDs?

Because they don't exist any more, they've been deleted.

(In reply to comment #4)

(In reply to comment #3)

How come deleted pages have names but not IDs?

Because they don't exist any more, they've been deleted.

You do not seem to get what I mean here? The name of the deleted page is included in the API response, but pageID (now old) is not. What is the use of delete logs without having the ability to get pageID as well as pagename?

(In reply to comment #5)

You do not seem to get what I mean here? The name of the deleted page is
included in the API response, but pageID (now old) is not. What is the use of
delete logs without having the ability to get pageID as well as pagename?

What would you do with that page ID? The page is no longer known by that ID in the page table or anywhere else.

(In reply to comment #6)

(In reply to comment #5)

You do not seem to get what I mean here? The name of the deleted page is
included in the API response, but pageID (now old) is not. What is the use of
delete logs without having the ability to get pageID as well as pagename?

What would you do with that page ID? The page is no longer known by that ID in
the page table or anywhere else.

I need the pageID as my logging script checks to see if a certain page is deleted and I do not prefer to use pagenames for efficiency reasons. The (old) pageID must be included in the API response for delete logs.

(In reply to comment #7)

I need the pageID as my logging script checks to see if a certain page is
deleted and I do not prefer to use pagenames for efficiency reasons. The (old)
pageID must be included in the API response for delete logs.

If you want to do an existence check, use the title, not the page ID.

(In reply to comment #8)

(In reply to comment #7)

I need the pageID as my logging script checks to see if a certain page is
deleted and I do not prefer to use pagenames for efficiency reasons. The (old)
pageID must be included in the API response for delete logs.

If you want to do an existence check, use the title, not the page ID.

It results in space inefficiency. Any legitimate reason for not including the old pageID in the API response for recent changes?

Bryan.TongMinh wrote:

(In reply to comment #9)

(In reply to comment #8)

(In reply to comment #7)

I need the pageID as my logging script checks to see if a certain page is
deleted and I do not prefer to use pagenames for efficiency reasons. The (old)
pageID must be included in the API response for delete logs.

If you want to do an existence check, use the title, not the page ID.

It results in space inefficiency. Any legitimate reason for not including the
old pageID in the API response for recent changes?

Because it is not stored anywhere. MediaWiki forgets the page id once a page is deleted.

(In reply to comment #10)

(In reply to comment #9)

(In reply to comment #8)

(In reply to comment #7)

I need the pageID as my logging script checks to see if a certain page is
deleted and I do not prefer to use pagenames for efficiency reasons. The (old)
pageID must be included in the API response for delete logs.

If you want to do an existence check, use the title, not the page ID.

It results in space inefficiency. Any legitimate reason for not including the
old pageID in the API response for recent changes?

Because it is not stored anywhere. MediaWiki forgets the page id once a page is
deleted.

It is stored in ar_page_id as far as I am aware.

(In reply to comment #11)

(In reply to comment #10)

(In reply to comment #9)

(In reply to comment #8)

(In reply to comment #7)

I need the pageID as my logging script checks to see if a certain page is
deleted and I do not prefer to use pagenames for efficiency reasons. The (old)
pageID must be included in the API response for delete logs.

If you want to do an existence check, use the title, not the page ID.

It results in space inefficiency. Any legitimate reason for not including the
old pageID in the API response for recent changes?

Because it is not stored anywhere. MediaWiki forgets the page id once a page is
deleted.

It is stored in ar_page_id as far as I am aware.

There is no such key in the response array.

(In reply to comment #12)

It is stored in ar_page_id as far as I am aware.

There is no such key in the response array.

He meant the ar_page_id field in the database.

And yes, it's stored there, but it doesn't really have any meaning. It's not even used for restoring the page (although it should be; there's a separate bug about that).

(In reply to comment #13)

(In reply to comment #12)

It is stored in ar_page_id as far as I am aware.

There is no such key in the response array.

He meant the ar_page_id field in the database.

And yes, it's stored there, but it doesn't really have any meaning. It's not
even used for restoring the page (although it should be; there's a separate bug
about that).

The topic is related to an API request, so there is no need to say ar_page_id stores it since it is unreachable. This is a major issue and I cannot get why it is too hard to include this in the response. The unmeaningful pageID key (0 for deleted pages) is there, but old page ID is not. That is a shame!

(In reply to comment #14)

The topic is related to an API request, so there is no need to say ar_page_id
stores it since it is unreachable.

It's actually a very useful thing to say -- it indicates to the other developers that yes, there is a way internally to get that information, and therefore that's how an implementation would be built to expose the information.

Please bear with us; these discussions sometimes take a while to get everyone on the same page!

A general note however: the original page_id number isn't locked to a page title as such, but is rather a property of each individual page *revision* that's been deleted.

There may be multiple past page IDs that have belonged to a given title. In fact, the same deleted page ID may be associated with multiple past titles, if it's had individual revisions deleted while it's been in the system and it's been renamed over time.

So depending on exactly what sort of request you're pulling, it might or might not be appropriate or straightforward to pass back a page ID from archive.ar_page_id.
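To make the many-to-many relationship concrete, here is a minimal sketch using an in-memory SQLite table. The column names follow the archive table discussed here, but the schema is heavily simplified and the rows are invented for illustration; it is not the real MediaWiki schema.

```python
import sqlite3

# Simplified stand-in for the archive table (assumption: only the columns
# needed for this illustration; the rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE archive ("
    " ar_rev_id INTEGER PRIMARY KEY,"
    " ar_namespace INTEGER, ar_title TEXT, ar_page_id INTEGER)"
)
conn.executemany(
    "INSERT INTO archive VALUES (?, ?, ?, ?)",
    [
        (1, 0, "Foo", 100),
        (2, 0, "Foo", 100),
        (3, 0, "Foo", 250),  # page recreated under the same title, deleted again
        (4, 0, "Bar", 100),  # revision deleted while page 100 was titled "Bar"
    ],
)


def page_ids_for_title(namespace, title):
    """Return every distinct old page ID recorded for a title."""
    rows = conn.execute(
        "SELECT DISTINCT ar_page_id FROM archive"
        " WHERE ar_namespace = ? AND ar_title = ? ORDER BY ar_page_id",
        (namespace, title),
    )
    return [row[0] for row in rows]


def titles_for_page_id(page_id):
    """Return every distinct title under which a page ID has deleted revisions."""
    rows = conn.execute(
        "SELECT DISTINCT ar_title FROM archive"
        " WHERE ar_page_id = ? ORDER BY ar_title",
        (page_id,),
    )
    return [row[0] for row in rows]
```

With these rows, the title "Foo" maps to two page IDs and page ID 100 maps to two titles, which is why a single "old page ID" per title is not always well defined.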

superyetkin -- can you give an example of a recentchanges API request that you're doing, so we can confirm what format style is there and think about how it might be able to fit in there?

(In reply to comment #15)

There may be multiple past page IDs that have belonged to a given title. In
fact, the same deleted page ID may be associated with multiple past titles, if
it's had individual revisions deleted while it's been in the system if it's
been renamed over time.

So depending on exactly what sort of request you're pulling, it might or might
not be appropriate or straightforward to pass back a page ID from
archive.ar_page_id.

Given these considerations, deleted page IDs are utterly useless AFAICT. If you have a use case other than "I want to save 20 bytes of bandwidth by not using titles", I'd love to hear it.

A good use case would be performing cleanup on a page that was deleted without worrying about accidentally deleting a different page of the same name, since something pulling data from recentchanges would not be synchronous. Using title as primary key could end up referring to the wrong page, for instance if pages have been shuffled around and some of the pages deleted and recreated.

(It may be that this isn't a really suitable use for the recentchanges API; this sort of thing was definitely an issue in developing the OAI extension, which is actually designed to serialize the latest actions on wiki pages into a stream you can pull to build an up-to-date replica.)

Backing up to more general cases -- there *is* a log_page field which holds an optional page ID for referenced items in log events. For deletion events, this currently seems to store a 0. Storing the pre-deletion page ID, and exposing that through API requests for log events, probably makes a lot of sense.

Possibly that would/could/should be accessible through recentchanges as well, or possibly not.

(In reply to comment #16)

(In reply to comment #15)

There may be multiple past page IDs that have belonged to a given title. In
fact, the same deleted page ID may be associated with multiple past titles, if
it's had individual revisions deleted while it's been in the system if it's
been renamed over time.

So depending on exactly what sort of request you're pulling, it might or might
not be appropriate or straightforward to pass back a page ID from
archive.ar_page_id.

Given these considerations, deleted page IDs are utterly useless AFAICT. If you
have a use case other than "I want to save 20 bytes of bandwidth by not using
titles", I'd love to hear it.

I cannot believe it! You are saying using page IDs just does not make sense? To answer your question: YES, I want to utilize space efficiency and prefer to use page IDs instead of pagenames. Satisfied?

Can you answer my previous question that if there is a logical reason why a defunct pageid (0 for deleted pages) is returned and not the old ID? Can you not see that my request is ONLY for delete logs?

(In reply to comment #18)

Can you answer my previous question that if there is a logical reason why a
defunct pageid (0 for deleted pages) is returned and not the old ID?

Yes, because:

  • it's quite possible that there are multiple page IDs, or none at all (if the page never existed)
  • there's barely any use for it/them, if at all
  • we'd have to go through extra trouble (query an additional table) to get it

Can you
not see that my request is ONLY for delete logs?

I can see that just fine; I have eyes, you know. It doesn't matter much what it's for.

(In reply to comment #15)

(In reply to comment #14)

The topic is related to an API request, so there is no need to say ar_page_id
stores it since it is unreachable.

It's actually a very useful thing to say -- it indicates to the other
developers that yes, there is a way internally to get that information, and
therefore that's how an implementation would be built to expose the
information.

What is the use for saying it if that piece of information is not accessible?
This is a bug report about the API, so we do not need to know it is stored in
the database, right? I do not think there is a single "developer" who thinks it
is not stored in the database!

(In reply to comment #19)

(In reply to comment #18)

Can you answer my previous question that if there is a logical reason why a
defunct pageid (0 for deleted pages) is returned and not the old ID?

Yes, because:

  • it's quite possible that there are multiple page IDs, or none at all (if the page never existed)

  • there's barely any use for it/them, if at all
  • we'd have to go through extra trouble (query an additional table) to get it

Can you
not see that my request is ONLY for delete logs?

I can see that just fine; I have eyes, you know. It doesn't matter much what
it's for.

We are not talking about pages that did not exist here. My query is about the delete logs, so this is off-topic. You do not seem to get what we are after. You just cannot force people to store pages with name instead of ID. This is illogical.

Bryan.TongMinh wrote:

(In reply to comment #20)

(In reply to comment #15)

(In reply to comment #14)

The topic is related to an API request, so there is no need to say ar_page_id
stores it since it is unreachable.

It's actually a very useful thing to say -- it indicates to the other
developers that yes, there is a way internally to get that information, and
therefore that's how an implementation would be built to expose the
information.

What is the use for saying it if that piece of information is not accessible?
This is a bug report about the API, so we do not need to know it is stored in
the database, right? I do not think there is a single "developer" who thinks it
is not stored in the database!

This is a discussion forum not only to request features, but also on how to implement them. The fact that the deleted page id is stored in the database is not as obvious as it may seem to your misinformed mind.

Also, please try to be respectful in your responses here; your current tone is absolutely disrespectful and arrogant towards the developers who are considering whether or not your request has merit. I might add to this that people are much more likely to fulfill a civil and respectful request, than one like yours.

If you are unable to do so, go somewhere else and live with the fact that your requested feature may not be implemented.

(In reply to comment #22)

(In reply to comment #20)

(In reply to comment #15)

(In reply to comment #14)

The topic is related to an API request, so there is no need to say ar_page_id
stores it since it is unreachable.

It's actually a very useful thing to say -- it indicates to the other
developers that yes, there is a way internally to get that information, and
therefore that's how an implementation would be built to expose the
information.

What is the use for saying it if that piece of information is not accessible?
This is a bug report about the API, so we do not need to know it is stored in
the database, right? I do not think there is a single "developer" who thinks it
is not stored in the database!

This is a discussion forum not only to request features, but also on how to
implement them. The fact that the deleted page id is stored in the database is
not as obvious as it may seem to your misinformed mind.

Also, please try to be respectful in your responses here; your current tone is
absolutely disrespectful and arrogant towards the developers who are
considering whether or not your request has merit. I might add to this that
people are much more likely to fulfill a civil and respectful request, than one
like yours.

If you are unable to do so, go somewhere else and live with the fact that your
requested feature may not be implemented.

I am also a developer and know what respect means, so you are the last person to "teach" me what moral values mean. However, I can tell that your tone does not seem friendly, so you need to watch your words before they come to your mouth.

I am not begging for anything here, but it is just unreasonable to include the meaningless pageid parameter for deleted pages in the API response. Also, you cannot explain how you would use the archive ID because it has no meaning for API requests.

This is a place of business where people do work for the common benefit; please limit yourself to polite, on-topic discussion or your Bugzilla account will be suspended.

Reassigning priority to "minor". Reopening bug; the Bugzilla account of the submitter has been blocked so that productive comments can still be added.

Here's a sample API recentchanges query that pulls log entries:

http://en.wikipedia.org/w/api.php?action=query&list=recentchanges&rctype=log&rcprop=loginfo

A page deletion comes up like this:

<rc type="log" logid="33091010" logtype="delete" logaction="delete">
  <param />
</rc>

which indeed isn't super detailed.

(The log_params field is not used for these entries, nor apparently is there special handling as there is for move entries in the API output format.)
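For anyone scripting this, the same request can be assembled programmatically. This is only a sketch that rebuilds the URL quoted above from its parameters; the function name and the default endpoint constant are illustrative choices, and no network request is made.

```python
from urllib.parse import urlencode

# Endpoint taken from the sample query in this thread.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def recentchanges_log_url(endpoint=API_ENDPOINT):
    """Build the recentchanges query that returns only log entries."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rctype": "log",       # restrict recent changes to log events
        "rcprop": "loginfo",   # include logid/logtype/logaction details
    }
    return endpoint + "?" + urlencode(params)


url = recentchanges_log_url()
```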

There's also the logevents query, which provides a different format:

http://en.wikipedia.org/w/api.php?action=query&list=logevents&lelimit=200

<item logid="33091102" pageid="0" ns="10" title="Template:Le Tourment Vert" type="delete" action="delete" user="WOSlinker" timestamp="2010-12-01T21:35:01Z" comment="[[WP:CSD#T3|T3]]: Unused, redundant template" />

This looks like it would directly expose the log_page ID value if it were stored at delete logging time.
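As a quick check of what a client actually sees, the sample item above can be parsed with Python's standard XML library. This is a sketch against the single element quoted in this comment, not against a live response; the helper name is made up.

```python
import xml.etree.ElementTree as ET

# The <item> element quoted above, reproduced verbatim.
SAMPLE_ITEM = (
    '<item logid="33091102" pageid="0" ns="10" '
    'title="Template:Le Tourment Vert" type="delete" action="delete" '
    'user="WOSlinker" timestamp="2010-12-01T21:35:01Z" '
    'comment="[[WP:CSD#T3|T3]]: Unused, redundant template" />'
)


def summarize_log_item(xml_text):
    """Extract the fields relevant to this bug from one logevents item."""
    item = ET.fromstring(xml_text)
    return {
        "logid": int(item.get("logid")),
        "pageid": int(item.get("pageid")),
        "title": item.get("title"),
        "action": item.get("action"),
    }


summary = summarize_log_item(SAMPLE_ITEM)
```

For this deletion entry the `pageid` field comes back as 0, so the title is the only usable identifier.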

Created attachment 7882
Partial patch: records old page_id into log_page on page deletions

This is a quick patch which moves the resetting of the article id on the title object from before to after the log entry saving in Article::doArticleDelete().

With this change in, the old page ID now gets stored into log_page in the logging table record instead of it recording 0.

However, the API logevents query does not appear to be using that value in its output; in fact it appears to pull whatever the current page ID for the logged title is, regardless of what's recorded as log_page. (Eg, if you create a new page with the same title, logevents shows you the page ID of the *new* page on all log entries for the old page, even those that recorded a different, older page ID.)
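The difference between the two lookups can be reproduced with a small SQLite sketch. The tables below are simplified stand-ins for page and logging with invented rows, and the title-based query is only an approximation of the behavior described above, not the actual query ApiQueryLogEvents runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                   page_namespace INTEGER, page_title TEXT);
CREATE TABLE logging (log_id INTEGER PRIMARY KEY, log_type TEXT,
                      log_namespace INTEGER, log_title TEXT, log_page INTEGER);
-- Page 100 titled "Foo" was deleted; the patched code recorded 100 in log_page.
INSERT INTO logging VALUES (1, 'delete', 0, 'Foo', 100);
-- Later a new, unrelated page was created under the same title.
INSERT INTO page VALUES (250, 0, 'Foo');
""")

# Lookup by title: reports whatever page currently holds the title (0 if none).
BY_TITLE = """
SELECT log_id, COALESCE(page_id, 0) FROM logging
LEFT JOIN page ON page_namespace = log_namespace AND page_title = log_title
WHERE log_type = 'delete'
"""

# Lookup by the ID recorded at deletion time.
BY_LOG_PAGE = "SELECT log_id, log_page FROM logging WHERE log_type = 'delete'"

by_title = conn.execute(BY_TITLE).fetchall()
by_log_page = conn.execute(BY_LOG_PAGE).fetchall()
```

In this toy data the title-based lookup attributes the old deletion to page 250, while log_page still points at the deleted page 100.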


(In reply to comment #26)

Created attachment 7882
Partial patch: records old page_id into log_page on page deletions

This is a quick patch which moves the resetting of the article id on the title
object from before to after the log entry saving in Article::doArticleDelete().

Have you tested this in the UI too? At first sight it looks like this would produce a blue link rather than a red link in Special:Log and RC.

However, the API logevents query does not appear to be using that value in its
output; in fact it appears to pull whatever the current page ID for the logged
title is, regardless of what's recorded as log_page. (Eg, if you create a new
page with the same title, logevents shows you the page ID of the *new* page on
all log entries for the old page, even those that recorded a different, older
page ID.)

That's an interesting bug, lemme look at that.


*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*

sumanah wrote:

Inferring from discussion that Brion's patch still needs further review, so, +need-review. And Roan, Brion, did you open a new bug re the API logevents issue regarding page ID vs log_page, or fix it?

sumanah wrote:

Brad, can you take a look at this?

Hi folks. I just came across this bug. I figured it might be helpful if I added my own use-case.

I'm a research scientist working in the analytics team at the WMF. I'm working with the Growth team to redesign page creation for newcomers and we'd like to understand how page creation/deletion worked historically. To do that, I'm trying to reconstruct the history of page creations, deletions and moves.

In order to track edits to pages that lead up to deletion, I'm using a combination of the Wikipedia API, the archive table and the logging table. Because of this bug, I'm unable to join archive revisions with their "delete" event without matching both titles and a range of timestamps (which is slow to say the least). This isn't just more difficult and time consuming, it's also much more error prone.

Storing the ID of the page at time of deletion in the log_page field would resolve this issue for me.
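Roughly, the join this would enable looks like the sketch below. It uses simplified in-memory versions of the archive and logging tables with invented rows, and it assumes log_page has already been populated with the pre-deletion page ID, which is exactly what this bug asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE archive (ar_rev_id INTEGER PRIMARY KEY, ar_namespace INTEGER,
                      ar_title TEXT, ar_page_id INTEGER, ar_timestamp TEXT);
CREATE TABLE logging (log_id INTEGER PRIMARY KEY, log_type TEXT, log_action TEXT,
                      log_namespace INTEGER, log_title TEXT, log_page INTEGER,
                      log_timestamp TEXT);
-- Two different pages were created and deleted under the same title.
INSERT INTO archive VALUES (1, 0, 'Foo', 100, '20100101000000');
INSERT INTO archive VALUES (2, 0, 'Foo', 100, '20100102000000');
INSERT INTO archive VALUES (3, 0, 'Foo', 250, '20110101000000');
INSERT INTO logging VALUES (10, 'delete', 'delete', 0, 'Foo', 100, '20100103000000');
INSERT INTO logging VALUES (11, 'delete', 'delete', 0, 'Foo', 250, '20110102000000');
""")


def revisions_by_deletion():
    """Group archived revision IDs under the delete event for their page ID."""
    rows = conn.execute("""
        SELECT log_id, ar_rev_id FROM logging
        JOIN archive ON ar_page_id = log_page
        WHERE log_type = 'delete' AND log_action = 'delete'
        ORDER BY log_id, ar_rev_id
    """)
    grouped = {}
    for log_id, rev_id in rows:
        grouped.setdefault(log_id, []).append(rev_id)
    return grouped
```

Both deletions share the title "Foo", yet the ID join separates them without any timestamp-range matching. A page that was deleted, restored and deleted again would still match more than one delete event, so timestamps may remain useful as a tiebreaker in that case.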

Change 113523 had a related patch set uploaded by leucosticte:
Implement way to get the ID of a deleted page from deletion logs. WikiPage::doDeleteArticleReal will tell ManualLogEntry::insert() what the page_id is, so it can be stored in log_page; then ApiQueryLogEvents will provide that data.

https://gerrit.wikimedia.org/r/113523

I just assigned this to myself today, and I've been working on it. In the future, please coordinate.

(In reply to Matthew Flaschen from comment #33)

I just assigned this to myself today, and I've been working on it. In the
future, please coordinate.

Yeah, the assignment thing is kind of a no-win situation sometimes, because when I try to coordinate, some people will say, "NO! I'm working on it!" Then six months later of people bugging them periodically, still nothing. But I'll note you down as a non-cookie-licker for future reference.

Change 113523 abandoned by leucosticte:
Implement way to get the ID of a deleted page from deletion logs. WikiPage::doDeleteArticleReal will tell ManualLogEntry::insert() what the page_id is, so it can be stored in log_page; then ApiQueryLogEvents will provide that data.

Reason:
matt's writing his own patch

https://gerrit.wikimedia.org/r/113523

Change 113525 had a related patch set uploaded by Mattflaschen:
WIP: Store the page_id in the logging table for deletions.

https://gerrit.wikimedia.org/r/113525

I no longer consider this a draft, and would appreciate reviews. The commit should explain itself pretty well in the commit message and release notes, but here are a couple notes:

  • This is the first usage of log_page in ApiQueryLogEvents. It always pulls the pageid out of log_page for deletions.

Before, as Brion noted, it would use the page_id as of query time (if the page even exists then), which was wrong, since it was unrelated to the deletion action. When the page didn't exist at query time (which of course is common), it would use 0 since the join found nothing.

  • Special:Log works fine; it does not affect which links are red.
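From the client side, consuming the new value might look like the sketch below. The response dictionary is a hand-written example shaped like a JSON-format logevents result, not captured output, and the assumption that entries logged before this change still report 0 follows from the earlier observation that log_page held 0 for deletions.

```python
import json

# Hand-written sample (assumption): one deletion logged after the change,
# one older entry that still carries pageid 0.
SAMPLE_RESPONSE = json.loads("""
{"query": {"logevents": [
  {"logid": 1, "pageid": 100, "ns": 0, "title": "Foo",
   "type": "delete", "action": "delete"},
  {"logid": 2, "pageid": 0, "ns": 0, "title": "Bar",
   "type": "delete", "action": "delete"}
]}}
""")


def deleted_page_ids(response):
    """Map log ID to recorded page ID, skipping entries that recorded 0."""
    result = {}
    for event in response.get("query", {}).get("logevents", []):
        if event.get("type") == "delete" and event.get("pageid", 0) != 0:
            result[event["logid"]] = event["pageid"]
    return result
```

Callers would still need a title-based fallback for any entry that reports 0.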

Change 113525 merged by jenkins-bot:
Store page_id in logging table for deletions and make queryable

https://gerrit.wikimedia.org/r/113525