Page MenuHomePhabricator

Improving Pi bot's database query for unconnected pages
Closed, ResolvedPublic

Description

I run regular SQL queries for unconnected pages on a few wikis via Pi bot, per:
https://bitbucket.org/mikepeel/wikicode/src/master/query_enwp_articles_no_wikidata.py
Specifically by running this query:

SELECT page_title FROM page WHERE page_namespace=0 AND page_is_redirect=0 AND page_id NOT IN (SELECT page_id FROM page JOIN page_props ON page_id=pp_page WHERE page_namespace=0 AND pp_propname='wikibase_item')

I recently learnt that EXPECTED_UNCONNECTED_PAGE exists as a magic word, seemingly introduced by T97577.

I would like to exclude pages that use this magic word from the database query, so that the bot runs more optimally - is this possible somehow?

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

I hadn't, that link is useful, thanks. I'm not sure I can parse it in terms of a change to the SQL query that I'm running though - can you help with that?

The query that special page makes is:

SELECT  page_id AS `value`,page_namespace AS `namespace`,page_title AS `title`,0 AS `page_num_iwlinks`  FROM `page` LEFT JOIN `page_props` ON ((page_id = pp_page) AND pp_propname IN ('wikibase_item','expectedUnconnectedPage') )   WHERE (page_namespace IN (0,4,10,12,14,100,828,2300,2302)) AND page_is_redirect = 0 AND (pp_propname IS NULL)  ORDER BY value DESC LIMIT 51

HTH

Mike_Peel renamed this task from Improving the database query for unconnected pages to Improving Pi bot's database query for unconnected pages.Oct 10 2022, 8:23 PM

@Ladsgroup @Lucas_Werkmeister_WMDE OK, reading through the other task (and quickly renaming this one, sorry for that confusion!), the big change is the new page_prop 'unexpectedUnconnectedPage', which I could simply query here. I need the page title rather than the ID. So probably something like this would work? (changing, e.g., -0 to -14 for categories):

SELECT page_title FROM page WHERE page_id IN (SELECT page_id FROM page JOIN page_props ON page_id=pp_page WHERE page_namespace=-0 AND pp_propname='unexpectedUnconnectedPage')

Does that make sense? Or is there a better way of doing this query?

I think it can be a simpler JOIN. and also, you should use the pp_sortkey so that the namespace filtering can already take place within the page_props index:

SELECT page_title FROM page JOIN page_props ON page_id = pp_page WHERE pp_propname = 'unexpectedUnconnectedPage' AND pp_sortkey = -0;

Then this should be pretty efficient.

(For anyone else reading along: the “other task” is T300770 and will hopefully be made public soon.)

Mike_Peel claimed this task.

Thanks, implemented now, and this seems to be working well!