Page MenuHomePhabricator

DBQ-7 Shortpages without templates and a single contributor
Closed, ResolvedPublic

Description

This issue was converted from https://jira.toolserver.org/browse/DBQ-7.
Summary: Shortpages without templates and a single contributor
Issue type: Task - A task that needs to be done.
Priority: Minor
Status: Done
Assignee: Andreas Gasser <andreas_gasser@yahoo.com>


From: MZMcBride <mzmcbride@gmail.com>

Date: Sun, 06 Jan 2008 01:22:25

I'd like a query run that can get a list of all shortpages on en.wiki that don't contain templates and only have a single contributor. Erwin provided me with the query he uses (below), however I've been told that it can probably be optimized or written in a more efficient manner. Thanks!

/usr/bin/mysql -hsql-s1 --skip-column-names enwiki_p > /home/erwin85/public_html/dbq/shortpages_enwiki.txt <<EOF
SELECT CONCAT(ns_name, ':', page_title, ' - ', page_len)
FROM page
LEFT JOIN toolserver.namespace
ON page_namespace = ns_id
WHERE page_len < 50
AND page_is_redirect = 0
AND page_namespace NOT IN (2, 3)
AND (SELECT COUNT(1) FROM templatelinks WHERE tl_from = page_id) = 0
AND (SELECT COUNT(DISTINCT rev_user_text) FROM revision WHERE rev_page = page_id) = 1
AND dbname = 'enwiki_p'
ORDER BY page_len ASC
EOF


Version: unspecified
Severity: minor

Details

Reference
bz59260

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 2:19 AM
bzimport set Reference to bz59260.

From: DaB. <dab@ts.wikimedia.org>

Date: Sun, 06 Jan 2008 01:52:10

SELECT page_namespace,page_title,page_len
FROM revision,page
LEFT JOIN templatelinks
ON tl_from = page_id
WHERE page_len<50
AND page_is_redirect = 0
AND page_namespace NOT IN (2, 3)
AND rev_page=page_id
AND tl_from IS NULL
GROUP BY rev_user
HAVING count(DISTINCT rev_user)<2
ORDER BY page_len
LIMIT 20;

seems to do what you want (the join with toolserver.namespace is missing, that's trivial ![][1]). But perhaps Duesentrieb knows a better way ![][2]

Sincerly,
DaB.

P.S: on dewp the query needs 8 min 5.23 sec and 0.31 sec on cswp.

[1]: https://jira.toolserver.org/images/icons/emoticons/wink.gif
[2]: https://jira.toolserver.org/images/icons/emoticons/smile.gif

From: Daniel Kinzler <daniel@brightbyte.de>

Date: Sun, 06 Jan 2008 14:16:01

The result is at http://tools.wikimedia.de/~daniel/misc/shortpages-enwiki.txt (39441), details at http://wiki.ts.wikimedia.org/view/Query_service/short_pages_enwiki.

I used the original query, only cleaned it up a bit. I got rid of the first subquery, but as it turns out, mysql was already smart enough to optimize it away anyway.

@DaB: avoiding the second subquery at the cost of running a GROUP on all revisions on enwiki probably makes things worse, not better. Running group on such a huge table is expensive, and running a simple subquery on a set of relatively few pages is probably faster - this is true because page_len < 50 is a pretty strong limitation; if we had no such strong resatriction, grouping would probably be faster.


From: DaB. <dab@ts.wikimedia.org>

Date: Sun, 06 Jan 2008 23:28:51

mm, your query is a little bit slower then mine. Your query needs 10 min 26.73 sec on dewp and 1 min 24.52 sec on cswp. Any idea, why?

(That's only a question to learn ![][1])

[1]: https://jira.toolserver.org/images/icons/emoticons/smile.gif

From: Daniel Kinzler <daniel@brightbyte.de>

Date: Mon, 07 Jan 2008 00:14:00

when i tested, it was a bit faster ![][1] the problem is: caches. to get a reliable benchmark, you would have to try each on a cold cache. if you run either qiery twice, the second go is much faster. this may also make the other query faster. also, some query run by someone else just before you tried may have loaded the relevant index into the cache, etc...

[1]: https://jira.toolserver.org/images/icons/emoticons/smile.gif

From: Andreas Gasser <andreas_gasser@yahoo.com>

Date: Tue, 28 Jun 2011 10:09:48

http://toolserver.org/

This bug was imported as RESOLVED. The original assignee has therefore not been
set, and the original reporters/responders have not been added as CC, to
prevent bugspam.

If you re-open this bug, please consider adding these people to the CC list:
Original assignee: (none)
CC list: b@mzmcbride.com, wikimedia-bugzilla@dabpunkt.eu, daniel.kinzler@wikimedia.de