
Fix broken Forgotten Articles report
Closed, Resolved · Public

Description

Referring to: https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Forgotten_articles

I also came across this: https://en.wikipedia.org/wiki/Special:AncientPages - itself a dormant page. Most of the linked pages are disambiguation pages.

Event Timeline

Niharika claimed this task.
Niharika raised the priority of this task to Medium.
Niharika updated the task description.
Niharika added subscribers: kaldari, Legoktm, Krenair and 3 others.

Code is at https://github.com/Niharika29/database-reports.

Looks pretty good to me. I think it would also be useful to include:

  • The total number of edits for each article
  • A description of the list at the top of the page

> Looks pretty good to me. I think it would also be useful to include:
>
>   • The total number of edits for each article
>   • A description of the list at the top of the page

{{done}} https://test.wikipedia.org/wiki/Forgotten_articles

It doesn't look like the existing code actually filters out disambiguation pages (although the report description says it does). To exclude disambiguation pages from your existing query, you'll want to add the page_props table, left join on page_id = pp_page and pp_propname = 'disambiguation', and add the where condition pp_page IS NULL (I think).
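For illustration only, the exclusion could look roughly like this in the report script (the query is sketched as a Python string; the aliases, namespace filter, and trailing conditions are placeholders, not the script's actual code):

query = """
SELECT p.page_id, p.page_title
FROM page AS p
LEFT JOIN page_props AS pp
  ON pp.pp_page = p.page_id AND pp.pp_propname = 'disambiguation'
WHERE p.page_namespace = 0
  AND pp.pp_page IS NULL  -- no 'disambiguation' page prop, so not a dab page
  -- ... plus the existing 'forgotten articles' conditions ...
"""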

Unfortunately, this is difficult to test on test.wiki, since making a page a disambiguation page will cause it to no longer be a forgotten article due to the recent edit.

Also, I would advise against running this every day since it's going to be an expensive query and the output isn't likely to change very often. How about once a week? There's a chance this script may run out of memory when running against the en.wiki tables, so you may have to increase the memory allocation in the cron job.

Also, Yuvi says that as long as your query takes 20 minutes or less, it should be OK. If it takes longer than that to run on en.wiki, we may want to look at ways to break it into smaller chunks.

The underscores in the article titles need to be changed to spaces.

Also, it isn't necessary to include the "Data as of: 17:27, 08 September 2015 (UTC)" since this is already in the signature above it.

> The underscores in the article titles need to be changed to spaces.

Doesn't seem trivial. No obvious benefits either.

Can't you just pass all the article titles through a regex to replace the underscores with spaces? The underscore versions are the database names, while the space versions are the canonical names. Only the canonical names should actually be output as links on wikipages. The benefit is that they are easier to read.
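For example, a hypothetical helper in the report script (the name title_clean is made up for illustration):

def title_clean(title):
    # Convert the database form ('Forgotten_article') to the canonical
    # display form ('Forgotten article') before writing wikitext.
    return title.replace('_', ' ')  # or: re.sub('_', ' ', title)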

> Can't you just pass all the article titles through a regex to replace the underscores with spaces? The underscore versions are the database names, while the space versions are the canonical names. Only the canonical names should actually be output as links on wikipages. The benefit is that they are easier to read.

All the other reports follow the same format as well. Is it alright if I create another task for this as a "nice to have" and fix other broken reports first?

> All the other reports follow the same format as well. Is it alright if I create another task for this as a "nice to have" and fix other broken reports first?

None of the reports that we are generating should use the raw database names as links; it just isn't correct. Only canonical names should be used in wikitext, and it looks sloppy otherwise. In MediaWiki's Title class, the conversion is handled with a single line:

$s = strtr( $s, '_', ' ' );

If you need to do the conversion within MySQL, you can use REPLACE:

SELECT REPLACE( p.page_title, '_', ' ' ) ...

Just make sure that you use REPLACE consistently on all fields that are page titles, otherwise comparisons won't work correctly.

If you really, really think it's going to be difficult, I can defer it to a separate card, but I have a hard time believing it would be that difficult.

Done. The job is running on Labs and will take a few hours to finish. In the meantime, you can check it on testwiki here: https://test.wikipedia.org/wiki/Wikipedia:Database_reports/Forgotten_articles

It wasn't difficult. I was mistaken. :)

@NiharikaKohli: Looks like there's a bug. At https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Forgotten_articles, the list is good until it gets to 2005 Australian Production Car Championship. All the entries after that have a bogus 'Last edited' date.

@NiharikaKohli: You can actually make the linkify() function even simpler:

return '[[' + title_clean + ']]'

...instead of...

return '[[' + title + ' | ' + title_clean + ']]'

... since you can use the canonical title as a link directly.
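Putting both suggestions together, the helper could end up looking something like this (a sketch, not the repository's actual code):

def linkify(title):
    # Canonical (space-separated) titles work directly as wikilink targets,
    # so the piped '[[db_title | display_title]]' form isn't needed.
    title_clean = title.replace('_', ' ')
    return '[[' + title_clean + ']]'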

I wonder if the bug mentioned above is actually a bug. Perhaps there are deleted revisions for those articles or some other reason that the page_touched value doesn't match the timestamp of the last revision on the history page.

Maybe @Jalexander would know.

@kaldari, I think I know what's up. I'm using the page_touched value (from the page table) instead of looking up the page's last revision and using its rev_timestamp. The two values are a little different.

page_touched:
> This timestamp is updated whenever the page changes in a way requiring it to be re-rendered, invalidating caches. Aside from editing this includes permission changes, creation or deletion of linked pages, and alteration of contained templates.

and

rev_timestamp:
> Holds the timestamp of the edit.

This is because the query was becoming far too time-consuming with the extra join; it already runs for about 5 hours 23 minutes. I ran some tests to see how much the two values usually differ, and the difference was negligible for all the pages I checked, so I let it be. Ideally, though, the query should join the revision table and display the last revision's timestamp instead.
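If the extra join ever becomes affordable, it would be roughly this shape (just a sketch, not the current query; joining on page_latest points straight at the newest revision and avoids a MAX()/GROUP BY over the whole revision table):

query = """
SELECT p.page_title, r.rev_timestamp AS last_edited
FROM page AS p
JOIN revision AS r ON r.rev_id = p.page_latest
-- ... plus the existing 'forgotten articles' filters ...
"""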

Let me know which you'd prefer.

@NiharikaKohli: Hmm, let's just change the 'Last edited' column to say 'Last touched', and add a note at the top of the page explaining what that means.

Bot request approved. Re-activated the cron job and updated the edit summary.