Page MenuHomePhabricator

[REQUEST] Get page creation time data and relationship between quality and time of deletion
Closed, ResolvedPublic

Description

Requested by @Samwalton9

Would it be possible to pull the average (or distribution) of times between a page being created and being tagged/deleted on en.wiki? Need to get the page's initial creation time.
Bonus question: I wonder if we could investigate the qualities of a page that correlate with how long it survives before being tagged for deletion (e.g. page length, references, Wiki formatting)

Event Timeline

Sam -the closest I have gotten to finding any info on the deleted pages was using logging.log_id to query archive and revision using archive.ar_page_id = logging.log_id and logging.log_id=revision.rev_page. (I added log_id to your query and made a spreadsheet with results here.

I found records in revision and archive that are dated far into the future of the original log_timestamp so I'm not 100% confident that these are all relevant. Set up a meeting with me if you have questions or want to discuss. Only people with Admin access can see deleted pages and it might be worth getting that so you can verify what's coming out of the tables. I will email you two csvs of sample data, I ran queries against 141 log_ids.

I don't know if there is a way to get page quality info from these pages, looks like info is scarce once a page is deleted.

The query I used for archive:

SELECT
  ar_page_id,
  log_id, -- redundant, just checking to make sure it matches
  ar_title,
  ar_timestamp,
  log_timestamp,
  log_title,
  comment_text
FROM archive
LEFT JOIN `comment` ON ar_comment_id = `comment`.comment_id
LEFT JOIN logging ON logging.log_id=archive.ar_id
WHERE ar_id in (61304953,
61304963,
61340243)"

query for revision:

SELECT
  rev_id,
  rev_page,
  log_id,
  comment_text,
  rev_timestamp,
  log_timestamp,
  log_title,
  rev_deleted,
  rev_len,
  rev_parent_id 
FROM revision
LEFT JOIN revision_comment_temp rct ON revision.rev_id = rct.revcomment_rev
LEFT JOIN `comment` ON rct.revcomment_comment_id = `comment`.comment_id
LEFT JOIN logging ON logging.log_id=revision.rev_page
WHERE rev_page IN (61304953,
61304963,
61340243)

Thanks for this!

I found records in revision and archive that are dated far into the future of the original log_timestamp so I'm not 100% confident that these are all relevant

Yeah I'm a little confused by the resulting csv files. The top entry is for the Shawntefaye'Aonna page (https://en.wikipedia.org/wiki/Shawntefaye'Aonna). The rev_id in the csv, however, is for this edit: https://en.wikipedia.org/w/index.php?diff=906744315, which appears unrelated.

Only people with Admin access can see deleted pages

Is there a way to query the databases for deleted content? I have en.wiki administrator access on my personal Wikipedia account but that's not so helpful for these kinds of data queries. It would be neat if we could go back and grab the page creation times for deleted pages directly.

SNowick_WMF lowered the priority of this task from Medium to Low.Dec 1 2020, 6:09 PM
SNowick_WMF moved this task from Next 2 weeks to Blocked on the Product-Analytics (Kanban) board.

@Samwalton9 closing this out, I wasn't able to get more data on the deleted pages. There is a table wmf.mediawiki_page_history that may have archived page data, it looks like the snapshot 2020-10 has page data going back to 1999-11-30.