Event Timeline
I tried this and other variations, but can't quite figure out how to get output. If I extend Maintenance, I get a 'class not found' error for MediaWikiServices. If I extend FormlessAction (like Dump.php), I don't get any output from the CLI (php fixSearch.php), and I get a 500 error if I access the file via browser (https://wiki.freephile.org/w/maintenance/fixSearch.php).
<?php

require_once 'Maintenance.php';

class SearchMaint extends FormlessAction {
	public function onView() {
		// Disable regular results
		$this->getOutput()->disable();
		$response = $this->getRequest()->response();
		$response->header( 'Content-type: application/json; charset=UTF-8' );
		$config = MediaWikiServices::getInstance()->getConfigFactory()->makeConfig( 'CirrusSearch' );
		$conn = new Connection( $config );
		$searcher = new Searcher( $conn, 0, 0, $config, [], $wgUser );
		$db = wfGetDB( DB_REPLICA );
		$titles = $db->selectFieldValues( 'page', 'page_title' );
		foreach ( $titles as $text ) {
			$title = Title::newFromDBkey( $text );
			$docId = $config->makeId( $title->getArticleID() );
			$esSources = $searcher->get( [ $docId ], true );
			if ( !$esSources->isOK() ) {
				echo '{"$title->getText()":"bad"}';
				return null;
			} else {
				echo '{"$title->getText()":"OK"}';
				//echo $title->getText() . " is OK";
				return null;
			}
		}
	}

	/**
	 * @return string
	 */
	public function getName() {
		return 'searchdump';
	}

	/**
	 * @return bool
	 */
	public function requiresWrite() {
		return false;
	}

	/**
	 * @return bool
	 */
	public function requiresUnblock() {
		return false;
	}
}

$maintClass = "SearchMaint";
require_once RUN_MAINTENANCE_IF_MAIN;
If you extend Maintenance, you need to include the maintenance script entry point. It also won't be web-browsable.
Doing it as an action like that would require extra registration and stuff...
Also if you return in a loop, it won’t do more than one iteration
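To illustrate that point: a `return` inside the `foreach` exits the whole method on the first iteration, so only one title is ever checked. A minimal sketch (the variable names here are placeholders, not real CirrusSearch APIs) of keeping the loop going is to use `continue` instead:

```php
<?php
// Sketch only: $titles and the "bad" test are placeholders.
$titles = [ 'Alpha', 'Beta', 'Gamma' ];

foreach ( $titles as $title ) {
	if ( $title === 'Beta' ) {
		echo "$title is bad\n";
		continue; // keep iterating; a `return` here would stop after the first match
	}
	echo "$title is OK\n";
}
// Emits a line for every title instead of stopping at the first one.
```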
Try something like this. Put it in a file called purgeUnindexedPages.php at extensions/CirrusSearch/maintenance.
Run it like you would any other MediaWiki maintenance script.
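For example (the install path below is an assumption; adjust it to your own setup):

```shell
# From the MediaWiki install root (path assumed):
cd /var/www/mediawiki

# Invoke the script directly with the PHP CLI, as with any maintenance script:
php extensions/CirrusSearch/maintenance/purgeUnindexedPages.php
```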
<?php

namespace CirrusSearch;

use CirrusSearch\Maintenance\Maintenance;
use Title;
use WikiPage;
use MediaWiki\MediaWikiServices;

$IP = getenv( 'MW_INSTALL_PATH' );
if ( $IP === false ) {
	$IP = __DIR__ . '/../../..';
}
require_once "$IP/maintenance/Maintenance.php";
require_once __DIR__ . '/../includes/Maintenance/Maintenance.php';

class PurgeUnindexedPages extends Maintenance {
	public function execute() {
		global $wgUser;
		$config = MediaWikiServices::getInstance()->getConfigFactory()->makeConfig( 'CirrusSearch' );
		$conn = new Connection( $config );
		$searcher = new Searcher( $conn, 0, 0, $config, [], $wgUser );
		$db = wfGetDB( DB_REPLICA );
		// ORDER BY belongs in the options parameter, not the conditions
		$res = $db->select(
			'page',
			[ 'page_namespace', 'page_title' ],
			[],
			__METHOD__,
			[ 'ORDER BY' => 'page_id' ]
		);
		foreach ( $res as $row ) {
			$title = Title::makeTitleSafe( $row->page_namespace, $row->page_title );
			if ( !$title ) {
				// makeTitleSafe() returns null for invalid titles
				continue;
			}
			$docId = $config->makeId( $title->getArticleID() );
			$esSources = $searcher->get( [ $docId ], true );
			if ( !$esSources->isOK() ) {
				$page = new WikiPage( $title );
				$page->doEditContent(
					$page->getContent(),
					'This changes nothing',
					EDIT_UPDATE,
					false,
					$wgUser
				);
				$this->output( $title->getText() . " is unindexed. Null editing\n" );
			}
		}
	}
}

$maintClass = PurgeUnindexedPages::class;
require_once RUN_MAINTENANCE_IF_MAIN;
Looking at the Elasticsearch contents of known "good" articles versus known "bad" articles (ones without indexed content), we find that all articles are marked "OK", but bad articles do not have any 'value', so I switched the conditional to check for an empty 'value'.
e.g.
bad article Beneficial nematodes
Status Object
(
    [cleanCallback] => 
    [ok:protected] => 1
    [errors:protected] => Array ( )
    [value] => Array ( )
    [success] => Array ( )
    [successCount] => 0
    [failCount] => 0
)
good article Property:LastName
LastName Status Object
(
    [cleanCallback] => 
    [ok:protected] => 1
    [errors:protected] => Array ( )
    [value] => Array
        (
            [0] => Elastica\Result Object
                (
                    [_hit:protected] => Array
                        (
                            [_index] => wiki_meta_content_first
                            [_type] => page
                            [_id] => 54
                            [_score] => 1
                            [_source] => Array
                                (
                                    [version] => 108
                                    [wiki] => wiki_meta
                                    [namespace] => 102
                                    [namespace_text] => Property
                                    [title] => LastName
                                    [timestamp] => 2018-10-31T17:12:09Z
                                    [category] => Array ( )
                                    [external_link] => Array ( )
                                    [outgoing_link] => Array ( )
                                    [template] => Array ( )
                                    [text] => This is a property of type Text.
                                    [source_text] => This is a property of type [[Has type::Text]].
                                    [text_bytes] => 46
                                    [content_model] => wikitext
                                    [language] => en
                                    [heading] => Array ( )
                                    [opening_text] => 
                                    [auxiliary_text] => Array ( )
                                    [defaultsort] => 
                                    [redirect] => Array ( )
                                    [incoming_links] => 0
                                )
                        )
                )
        )
    [success] => Array ( )
    [successCount] => 0
    [failCount] => 0
)
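Given the dumps above, a sketch of the changed conditional (assuming the same `$esSources` Status object used in the script) is to inspect the Status value rather than relying on `isOK()` alone:

```php
<?php
// Sketch: $esSources is the Status returned by $searcher->get( [ $docId ], true ).
// Both good and bad pages report OK, so inspect the value instead.
$sources = $esSources->getValue();
if ( !$esSources->isOK() || !count( $sources ) ) {
	// No indexed content for this page: treat it as "bad" and null-edit it.
	$this->output( $title->getText() . " is unindexed. Null editing\n" );
}
```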
I also discovered that some (~100/564) of the edits were destructive (they showed up in RecentChanges) instead of being so-called 'null edits'. It turned out this was due to a character-encoding issue in some content, not the null-edit action itself.
The paste above now represents a beefed-up script with more checks and features (e.g. --dry-run).
@Reedy Thanks for the review and updates.
Is there a way to configure this tool (Phabricator Paste) to show better diffs? The tool itself makes it appear that you did nothing but move a paragraph; you really can't tell what changed because it's a "large" non-specific diff. Using Meld, or plain GNU diff, shows the actual single-line changes you made. That information is helpful because it teaches and reinforces coding conventions; if the tool doesn't convey it to users, then it's working against us. (I was curious, so I took the extra effort to download copies of the files and compare them.) I checked the Phabricator book (https://secure.phabricator.com/book/phabricator/), but there's no reference to the Paste application at all. I could check the sources, obviously, but I'm not running an instance of Phabricator, so I'm not at all familiar with the application's configuration and sources.
Also, as it stands, this script will generate false positives by listing any redirect titles as pages that have no indexed content. How can we filter out redirects?
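One way to filter out redirects is to add a condition on the `page_is_redirect` column of MediaWiki's `page` table in the script's `select()` call, e.g.:

```php
<?php
// Sketch: restrict the query to non-redirect pages.
// page_is_redirect is a standard column of the core `page` table.
$res = $db->select(
	'page',
	[ 'page_namespace', 'page_title' ],
	[ 'page_is_redirect' => 0 ], // skip redirects
	__METHOD__,
	[ 'ORDER BY' => 'page_id' ]
);
```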
This script also cannot read private namespaces, but I think that is a problem unique to my wiki (an old experiment); thus I don't think we need a namespace filter.