Paste P7943

Unindexed pages

Authored by Reedy on Dec 28 2018, 2:53 AM.
<?php
/**
* A maintenance script that allows you to see if any documents in an
* Elasticsearch index which are marked 'OK' actually do not have content
* in their associated entry within Elasticsearch. This can happen if the
* index is somehow corrupted (during upgrade?). If this type of corruption
* happens, you'll experience a lack of search results for content that is known
* to be in the wiki.
*
* Elasticsearch will not index that 'missing' content until a new edit is
* made to the page marked as 'OK'. This script
* is possibly the only way (short of some jedi es query) for a MediaWiki admin
* to see which wiki articles are affected by this "black hole" of search.
*
* To avoid changing any content, we attempt to make a 'null edit' for pages
* affected by the situation. These edits do appear in RecentChanges, and can be
* managed with other MediaWiki administrative tools.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License along
* with this program; if not, write to the Free Software Foundation, Inc.,
* 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
* http://www.gnu.org/copyleft/gpl.html
*
* @file
* @ingroup Maintenance
*/
/**
* This is a maintenance script for CirrusSearch.
*/
namespace CirrusSearch;
use CirrusSearch\Maintenance\Maintenance;
use Title;
use WikiPage;
use MediaWiki\MediaWikiServices;
$IP = getenv( 'MW_INSTALL_PATH' );
if ( $IP === false ) {
	$IP = __DIR__ . '/../../..';
}
require_once "$IP/maintenance/Maintenance.php";
require_once __DIR__ . '/../includes/Maintenance/Maintenance.php';
/**
* Maintenance script to check articles to see if they are indexed by Elasticsearch.
*
* @ingroup Maintenance
*/
class PurgeUnindexedPages extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Check articles to see if they are indexed by Elasticsearch' );
		// $name, $description, $required = false, $withArg = false, $shortName = false, $multiOccurrence = false
		$this->addOption(
			'dry-run',
			'Do not perform any corrections/edits with "-n or --dry-run"',
			false,
			false,
			'n'
		);
		$this->addOption(
			'verbose',
			'List the titles with "-v or --verbose"',
			false,
			false,
			'v'
		);
		$this->setBatchSize( 1000 ); // parent method; adds --batch-size option
	}
	public function execute() {
		global $wgUser;

		$lastId = 0; // highest page_id seen so far; the next batch starts after it
		$numArticles = 0; // how many articles are there?
		$numBad = 0; // how many bad articles did we find?
		// do not edit anything
		$dryRun = $this->hasOption( 'dry-run' );
		// print a list (and stats)
		$verbose = $this->hasOption( 'verbose' );

		$config = MediaWikiServices::getInstance()->getConfigFactory()->makeConfig( 'CirrusSearch' );
		$conn = new Connection( $config );
		$searcher = new Searcher( $conn, 0, 0, $config, [], $wgUser );

		$db = wfGetDB( DB_REPLICA );
		do {
			// Page through the table by page_id. Titles repeat across namespaces,
			// so paging by page_title alone can skip rows at a batch boundary.
			// $table, $vars, $conds = '', $fname = __METHOD__, $options = [], $join_conds = []
			$res = $db->select(
				'page',
				[ 'page_id', 'page_namespace', 'page_title' ],
				[ 'page_id > ' . (int)$lastId ],
				__METHOD__,
				[ 'ORDER BY' => 'page_id', 'LIMIT' => $this->getBatchSize() ]
			);
			foreach ( $res as $row ) {
				$numArticles++;
				$lastId = (int)$row->page_id;
				$title = Title::makeTitleSafe( $row->page_namespace, $row->page_title );
				if ( $title === null ) {
					$this->output( "unable to create title object from " .
						"{$row->page_namespace}: {$row->page_title}\n" );
					continue;
				}
				$docId = $config->makeId( $title->getArticleID() );
				$esSources = $searcher->get( [ $docId ], true );
				// We erroneously relied on if ( !$esSources->isOK() )
				// until it was discovered that
				// the bad articles were already marked 'OK'.
				// An empty value means Elasticsearch returned no document for the page.
				if ( !$esSources->getValue() ) {
					$numBad++;
					if ( $verbose ) {
						$this->output( $title->getText() . "\n" );
					}
					if ( !$dryRun ) {
						$page = new WikiPage( $title );
						$page->doEditContent( $page->getContent(), 'This changes nothing', EDIT_UPDATE, false, $wgUser );
						$this->output( $title->getText() . " fixed\n" );
					}
				}
			}
		} while ( $res->numRows() );
		$this->output( "Found $numBad hidden articles out of $numArticles.\n\n" );
	}
}
$maintClass = PurgeUnindexedPages::class;
require_once RUN_MAINTENANCE_IF_MAIN;

Event Timeline

Reedy created this paste. Dec 28 2018, 2:53 AM
Reedy edited the content of this paste. Dec 28 2018, 2:57 AM

I tried this and other variations, but can't quite figure out how to get output. If I extend Maintenance, I get a 'class not found' error for MediaWikiServices. If I extend FormlessAction (like Dump.php), I don't get any output from the CLI (php fixSearch.php), and I get a 500 error if I access the file via browser (https://wiki.freephile.org/w/maintenance/fixSearch.php).

<?php

require_once 'Maintenance.php';

class SearchMaint extends FormlessAction {

	public function onView() {
		// Disable regular results
		$this->getOutput()->disable();

		$response = $this->getRequest()->response();
		$response->header( 'Content-type: application/json; charset=UTF-8' );

		$config = MediaWikiServices::getInstance()->getConfigFactory()->makeConfig( 'CirrusSearch' );
		$conn = new Connection( $config );
		$searcher = new Searcher( $conn, 0, 0, $config, [], $wgUser );

		$db = wfGetDB( DB_REPLICA );
		$titles = $db->selectFieldValues( 'page', 'page_title' );

		foreach ( $titles as $text ) {
			$title = Title::newFromDBkey( $text );

			$docId = $config->makeId( $title->getArticleID() );
			$esSources = $searcher->get( [ $docId ], true );
			if ( !$esSources->isOK() ) {
				echo '{"$title->getText()":"bad"}';
				return null;
			} else {
				echo '{"$title->getText()":"OK"}';
				//echo $title->getText() . " is OK";
				return null;
			}
		}
	}
	/**
	 * @return string
	 */
	public function getName() {
		return 'searchdump';
	}

	/**
	 * @return bool
	 */
	public function requiresWrite() {
		return false;
	}

	/**
	 * @return bool
	 */
	public function requiresUnblock() {
		return false;
	}
}

$maintClass = "SearchMaint";
require_once RUN_MAINTENANCE_IF_MAIN;
Reedy added a comment. Edited Dec 28 2018, 5:24 PM

If you extend Maintenance, you need to include the maintenance script entry point. A maintenance script also won't be reachable from a web browser.

Doing it as an action like that would require extra registration and so on.

Also, if you return inside a loop, it won't run more than one iteration.
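
To illustrate the last point: a `return` inside `foreach` ends the whole method after the first title, and a single-quoted PHP string does not interpolate `$title->getText()`. A minimal sketch of recording one result per title and encoding the lot once after the loop might look like this. `checkTitles()` and the `$isIndexed` callback are hypothetical stand-ins for the Elasticsearch lookup, not CirrusSearch API.

```php
<?php

/**
 * Build a title => status map for every title, then encode it once.
 *
 * @param string[] $titles Page titles to check
 * @param callable $isIndexed Hypothetical stand-in for the search lookup;
 *  receives a title and returns true when the page has indexed content
 * @return string JSON object mapping each title to "OK" or "bad"
 */
function checkTitles( array $titles, callable $isIndexed ): string {
	$report = [];
	foreach ( $titles as $title ) {
		// Record the result and keep looping; a return here would end
		// the function after the first title.
		$report[$title] = $isIndexed( $title ) ? 'OK' : 'bad';
	}
	// json_encode takes care of quoting and escaping the titles.
	// JSON_FORCE_OBJECT keeps the output an object even for an empty list.
	return json_encode( $report, JSON_FORCE_OBJECT );
}

echo checkTitles(
	[ 'Main Page', 'Beneficial nematodes' ],
	function ( string $title ): bool {
		// Pretend that only this one page is missing from the index.
		return $title !== 'Beneficial nematodes';
	}
) . "\n";
```

In the real script the callback would wrap the `$searcher->get()` call, and the output would go through `$this->output()` rather than `echo`.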

Reedy added a comment. Edited Dec 28 2018, 6:13 PM

Try something like this. Put it in a file called purgeUnindexedPages.php in extensions\CirrusSearch\maintenance

Run it like you would any other MediaWiki maintenance script.

<?php

namespace CirrusSearch;

use CirrusSearch\Maintenance\Maintenance;
use Title;
use WikiPage;
use MediaWiki\MediaWikiServices;

$IP = getenv( 'MW_INSTALL_PATH' );
if ( $IP === false ) {
	$IP = __DIR__ . '/../../..';
}
require_once "$IP/maintenance/Maintenance.php";
require_once __DIR__ . '/../includes/Maintenance/Maintenance.php';

class PurgeUnindexedPages extends Maintenance {

	public function execute() {
		global $wgUser;

		$config = MediaWikiServices::getInstance()->getConfigFactory()->makeConfig( 'CirrusSearch' );
		$conn = new Connection( $config );
		$searcher = new Searcher( $conn, 0, 0, $config, [], $wgUser );

		$db = wfGetDB( DB_REPLICA );
		// ORDER BY belongs in the options argument (5th), not in the conditions (3rd)
		$res = $db->select( 'page', [ 'page_namespace', 'page_title' ], [], __METHOD__, [ 'ORDER BY' => 'page_id' ] );

		foreach ( $res as $row ) {
			$title = Title::makeTitleSafe( $row->page_namespace, $row->page_title );

			$docId = $config->makeId( $title->getArticleID() );
			$esSources = $searcher->get( [ $docId ], true );
			if ( !$esSources->isOK() ) {
				$page = new WikiPage( $title );
				$page->doEditContent( $page->getContent(), 'This changes nothing', EDIT_UPDATE, false, $wgUser );
				$this->output( $title->getText() . " is unindexed. Null editing\n" );
			}
		}
	}
}

$maintClass = PurgeUnindexedPages::class;
require_once RUN_MAINTENANCE_IF_MAIN;
Reedy added a comment. Dec 28 2018, 7:55 PM

Minor update because of namespacing.

This comment was removed by freephile.

Looking at the Elasticsearch contents of known "good" articles vs. known "bad" articles (ones without index content), we find that all articles are marked "OK", but bad articles have an empty 'value'. So I switched the conditional to check for an empty 'value'.

e.g.
bad article Beneficial nematodes

Status Object
(
    [cleanCallback] => 
    [ok:protected] => 1
    [errors:protected] => Array
        (
        )

    [value] => Array
        (
        )

    [success] => Array
        (
        )

    [successCount] => 0
    [failCount] => 0
)

good article Property:LastName

LastName
Status Object
(
    [cleanCallback] => 
    [ok:protected] => 1
    [errors:protected] => Array
        (
        )

    [value] => Array
        (
            [0] => Elastica\Result Object
                (
                    [_hit:protected] => Array
                        (
                            [_index] => wiki_meta_content_first
                            [_type] => page
                            [_id] => 54
                            [_score] => 1
                            [_source] => Array
                                (
                                    [version] => 108
                                    [wiki] => wiki_meta
                                    [namespace] => 102
                                    [namespace_text] => Property
                                    [title] => LastName
                                    [timestamp] => 2018-10-31T17:12:09Z
                                    [category] => Array
                                        (
                                        )

                                    [external_link] => Array
                                        (
                                        )

                                    [outgoing_link] => Array
                                        (
                                        )

                                    [template] => Array
                                        (
                                        )

                                    [text] => This is a property of type Text.
                                    [source_text] => This is a property of type [[Has type::Text]].
                                    [text_bytes] => 46
                                    [content_model] => wikitext
                                    [language] => en
                                    [heading] => Array
                                        (
                                        )

                                    [opening_text] => 
                                    [auxiliary_text] => Array
                                        (
                                        )

                                    [defaultsort] => 
                                    [redirect] => Array
                                        (
                                        )

                                    [incoming_links] => 0
                                )

                        )

                )

        )

    [success] => Array
        (
        )

    [successCount] => 0
    [failCount] => 0
)
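
The two dumps differ only in `value`, so the check has to look at the payload rather than the OK flag. A minimal sketch of that decision follows. `FakeStatus` is a hypothetical stand-in that mimics only the two fields seen in the dumps above; it is not MediaWiki's `Status` class.

```php
<?php

/**
 * Hypothetical stand-in for the Status object dumped above.
 */
class FakeStatus {
	/** @var bool */
	private $ok;
	/** @var array */
	public $value;

	public function __construct( bool $ok, array $value ) {
		$this->ok = $ok;
		$this->value = $value;
	}

	public function isOK(): bool {
		return $this->ok;
	}
}

/**
 * A page counts as unindexed when the lookup returned no documents,
 * whatever the OK flag says.
 */
function isUnindexed( FakeStatus $status ): bool {
	return count( $status->value ) === 0;
}

$bad = new FakeStatus( true, [] );
$good = new FakeStatus( true, [ [ '_id' => 54, 'title' => 'LastName' ] ] );

var_dump( $bad->isOK() );          // the bad page is still marked OK
var_dump( isUnindexed( $bad ) );   // the empty value is what gives it away
var_dump( isUnindexed( $good ) );
```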
freephile added a comment. Edited Dec 30 2018, 5:17 PM

I also discovered that some of the edits (~100 of 564) were destructive, as seen in RecentChanges, instead of being so-called 'null edits'. It turns out that this was due to a character encoding issue in some content, not to the null edit action itself.

The paste above now represents a beefed-up script with more checks and features (e.g. --dry-run).

freephile edited the content of this paste. Dec 31 2018, 6:41 PM
freephile added a project: CirrusSearch.
Reedy edited the content of this paste. Jan 1 2019, 5:50 PM
freephile added a comment. Edited Jan 2 2019, 2:03 PM

@Reedy Thanks for the review and updates.

Is there a way for this tool (Phabricator Paste) to be configured to show better diffs? The tool itself makes it appear that you did nothing but move a paragraph - but you really can't tell what changed because it's a "large" non-specific diff. Using Meld, or just plain GNU diff shows the actual single-line changes you made. This information is helpful because it teaches and reinforces coding conventions. If the tool doesn't convey that information to users, then it's working against us. (I was curious so I took the extra effort to download copies of the files and compare them.) I checked the Phab book (https://secure.phabricator.com/book/phabricator/) but there's no reference to the 'paste' application at all. I could check the sources obviously; but I'm not running an instance of Phabricator so I'm not at all familiar with the application configuration and sources.

Also, as it stands, this script will generate false positives by listing any redirect titles as pages that have no indexed content. How can we filter out redirects?

This script also cannot read private namespaces, but I think that is a problem unique to my wiki (an old experiment); thus I don't think we need a namespace filter.
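
On the redirect question: the `page` table has a `page_is_redirect` column, so one option is to exclude redirects in the query conditions. A sketch under that assumption follows. `buildBatchQuery()` is a hypothetical helper that only assembles the arguments for `$db->select()`, and it pages by `page_id` for illustration. It does not address how CirrusSearch itself treats redirects.

```php
<?php

/**
 * Assemble select() arguments for one batch of pages.
 *
 * @param int $lastId Highest page_id seen so far (0 for the first batch)
 * @param int $batchSize Maximum rows per batch
 * @param bool $skipRedirects Exclude rows where page_is_redirect = 1
 * @return array [ table, fields, conds, options ]
 */
function buildBatchQuery( int $lastId, int $batchSize, bool $skipRedirects = true ): array {
	// A numeric key holds a raw SQL fragment; a string key is field => value.
	$conds = [ 'page_id > ' . $lastId ];
	if ( $skipRedirects ) {
		$conds['page_is_redirect'] = 0;
	}
	return [
		'page',
		[ 'page_id', 'page_namespace', 'page_title' ],
		$conds,
		[ 'ORDER BY' => 'page_id', 'LIMIT' => $batchSize ],
	];
}

list( $table, $fields, $conds, $options ) = buildBatchQuery( 0, 1000 );
// Inside the maintenance script these would be passed on as:
// $res = $db->select( $table, $fields, $conds, __METHOD__, $options );
print_r( $conds );
```

Keeping the filter behind a flag makes it easy to compare counts with and without redirects during a --dry-run.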