Page MenuHomePhabricator

A small enhancement to Special:Shortpages
Closed, DeclinedPublic

Description

Author: mathwizard1232

Description:
This is my first time using bugzilla, so if I did something wrong, let me know.
I wrote a small bit of code that I think will make Special:Shortpages more
useful. It checks the text of the pages and excludes those which are redirects,
soft redirects, links to the wiktionary definition (basically soft redirects),
or the template for copyvios. I've tested the code on my sandbox wiki, so I'm
fairly confidant it's stable. The two problems with it that I know of are that
the count at the top is wrong ("Showing below up to 8 results starting with #1."
lists the number before filtering, so will actually be lower) and the related
problem that instead of having 1000 results, it has 1000 - results filtered. I'm
not entirely sure how to fix these yet, but I'll get there.


Version: 1.6.x
Severity: enhancement
URL: http://en.wikipedia.org/wiki/Special:Shortpages

Details

Reference
bz5177

Event Timeline

bzimport raised the priority of this task from to Lowest.Nov 21 2014, 9:09 PM
bzimport set Reference to bz5177.

mathwizard1232 wrote:

Diff for patch

I think this is the correct format.

attachment SpecialShortpages.php.patch ignored as obsolete

robchur wrote:

Performance murder.

Excluding based on redirect status, if done, should be done based on the
page_is_redirect field. Pulling from page text is expensive, and "#REDIRECT" is
not the only possible form so it would not be correct.

If done, this should probably be done in the query rather than a filter
afterwards. This may require changes to the indexes on the table to be done
efficiently.

Excluding based on size, if done, should be done based on a cap to the
page_len field in the query.

The filters for specific templates look a bit dodgy; also they're very specific to
one web site and would be incorrect for everybody else in the world. If done, it
may or may not be cheaper and more accurate to check the templatelinks
table rather than loading page text out of indirected, compressed external
storage.

mathwizard1232 wrote:

(In reply to comment #3)

Excluding based on redirect status, if done, should be done based on the
page_is_redirect field. Pulling from page text is expensive, and "#REDIRECT" is
not the only possible form so it would not be correct.

Yeah, that makes sense, I hadn't thought of that.

If done, this should probably be done in the query rather than a filter
afterwards. This may require changes to the indexes on the table to be done
efficiently.

Hm, I'll see if I can find out how to do that in SQL, I don't know that well (or
at all).

Excluding based on size, if done, should be done based on a cap to the
page_len field in the query.

Okay, I'll put that, once I look up how to do that in MySQL.

The filters for specific templates look a bit dodgy; also they're very

specific to

one web site and would be incorrect for everybody else in the world. If done, it
may or may not be cheaper and more accurate to check the templatelinks
table rather than loading page text out of indirected, compressed external
storage.

Yeah, that's true, I have been thinking of wikipedia. I've been thinking about
it, and I think it would probably be enough to exclude any page that includes a
template, that would be much more general, and I think that would help find the
right one.

Performance murder.

Granted, I have no clue how long this would take to run, but since they are
cached for a long time, it would only matter once. I think it would be able to
run during an off-time, I don't think it'd take *too* long. It does take time to
grab the text, but all of these pages will be less than 1k, and the majority
will be around 15 bytes, so I don't think that would be overly time consuming,
at least if only run once in a while.

robchur wrote:

"It does take time to grab the text, but all of these pages will be less than
1k, and the majority will be around 15 bytes"

Er, well, the SQL being run on Special:Shortpages is what's used to cache up the
content on larger installations like the Wikimedia cluster. So the cron job to
cache the things would hit all the pages, thus you're still looking at an
unpleasant load.

Clean implementation in MySQL is probably not too much of a problem, however and
wouldn't be a bad thing; we can justify adding a short time to the execution by
the improved "algorithm" we'd be using and the more accurate result set.

mathwizard1232 wrote:

New patch

Okay, I realized that this script already excludes redirects, so I removed that
from my patch. Added limit to size in the SQL, and generalized script so all
pages with "{{" are excluded. Number problem as above is still present.

attachment SpecialShortpages2.php.patch ignored as obsolete

mathwizard1232 wrote:

Anyone have any comments on my new version of the patch?

robchur wrote:

(In reply to comment #7)

Anyone have any comments on my new version of the patch?

Would still be too expensive to look at the text of each like that. Consider
doing something with the templatelinks table to test for transclusions in the
page; this means your query is more efficient and doesn't pull redundant data
off the MySQL server in the first place.

robchur wrote:

Closing due to a number of reasons.

  1. Inefficient, performance-killing patch.
  2. There is an effective limit on page size provided by the limit on results

when the page is being cached.

  1. That a page contains a template doesn't mean it's a "long" page.