
generateSitemap.php should remove __NOINDEX__ pages added via $wgNamespaceRobotPolicies or $wgDefaultRobotPolicies in LocalSettings.php
Open, Lowest, Public

Description

The following patch for the maintenance script generateSitemap.php, from https://gerrit.wikimedia.org/r/c/620746, works (it removes noindex pages from the sitemap file) only for the behavior switch magic word (__NOINDEX__). It does not remove pages marked 'noindex' via LocalSettings.php from the generated sitemap file.

I think there should be a solution to this: if there were not, Wikipedia would have a problem excluding talk pages from its sitemap, which I do not think it has: https://en.wikipedia.org/wiki/Wikipedia:Controlling_search_engine_indexing

Now, the wiki in question is noindex by default, because $wgDefaultRobotPolicies = true; is set in LocalSettings.php. Pages that are to be indexed have {{INDEX}} added to them. Thus the desired solution is to generate the sitemap only for pages that contain __INDEX__ or {{INDEX}}, or whose HTML output indicates 'index'.

diff --git a/maintenance/generateSitemap.php b/maintenance/generateSitemap.php
index 6060567..bc5e865 100644
--- a/maintenance/generateSitemap.php
+++ b/maintenance/generateSitemap.php

@@ -305,15 +305,27 @@
 	 * @return IResultWrapper
 	 */
 	private function getPageRes( $namespace ) {
-		return $this->dbr->select( 'page',
+		return $this->dbr->select(
+			[ 'page', 'page_props' ],
 			[
 				'page_namespace',
 				'page_title',
 				'page_touched',
-				'page_is_redirect'
+				'page_is_redirect',
+				'pp_propname',
 			],
 			[ 'page_namespace' => $namespace ],
-			__METHOD__
+			__METHOD__,
+			[],
+			[
+				'page_props' => [
+					'LEFT JOIN',
+					[
+						'page_id = pp_page',
+						'pp_propname' => 'noindex'
+					]
+				]
+			]
 		);
 	}
 
@@ -335,7 +347,13 @@
 			$fns = $contLang->getFormattedNsText( $namespace );
 			$this->output( "$namespace ($fns)\n" );
 			$skippedRedirects = 0; // Number of redirects skipped for that namespace
+			$skippedNoindex = 0; // Number of pages with __NOINDEX__ switch for that NS
 			foreach ( $res as $row ) {
+				if ( $row->pp_propname === 'noindex' ) {
+					$skippedNoindex++;
+					continue;
+				}
+
 				if ( $this->skipRedirects && $row->page_is_redirect ) {
 					$skippedRedirects++;
 					continue;
@@ -380,6 +398,10 @@
 				}
 			}
 
+			if ( $skippedNoindex > 0 ) {
+				$this->output( "  skipped $skippedNoindex page(s) with __NOINDEX__ switch\n" );
+			}
+
 			if ( $this->skipRedirects && $skippedRedirects > 0 ) {
 				$this->output( "  skipped $skippedRedirects redirect(s)\n" );
 			}
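
The patch above only looks at the 'noindex' page property, so it cannot account for a default or per-namespace policy. As a rough sketch of the decision the description asks for, the following standalone PHP resolves an effective policy per page. Everything here is an assumption for illustration: isIndexable() and policyHasNoindex() are hypothetical helpers that are not part of generateSitemap.php, the policy values are assumed to be comma-separated strings such as 'noindex,nofollow', and the per-page switch is assumed to arrive as a page property named 'index' or 'noindex'. The sketch ignores any other setting that may influence indexing.

```php
<?php
/**
 * Hypothetical helper: true if a robot policy string contains 'noindex'.
 * Assumes policies look like 'noindex,nofollow' or 'index,follow'.
 */
function policyHasNoindex( string $policy ): bool {
	$parts = array_map( 'trim', explode( ',', strtolower( $policy ) ) );
	return in_array( 'noindex', $parts, true );
}

/**
 * Hypothetical helper: decide whether a page belongs in the sitemap.
 *
 * @param int $namespace Namespace number of the page
 * @param string|null $pageProp 'index', 'noindex', or null when the page has no switch
 * @param string $defaultPolicy Wiki-wide default policy string (assumed format)
 * @param array $namespacePolicies Map of namespace number => policy string (assumed format)
 */
function isIndexable(
	int $namespace,
	?string $pageProp,
	string $defaultPolicy,
	array $namespacePolicies
): bool {
	// A per-page switch wins over the configured policies.
	if ( $pageProp === 'index' ) {
		return true;
	}
	if ( $pageProp === 'noindex' ) {
		return false;
	}

	// Otherwise a per-namespace policy overrides the wiki-wide default.
	$policy = $namespacePolicies[$namespace] ?? $defaultPolicy;
	return !policyHasNoindex( $policy );
}

// Example: a wiki that is noindex by default, as in this task.
$default = 'noindex,nofollow';
$perNamespace = [ 4 => 'index,follow' ]; // illustrative value only

var_dump( isIndexable( 0, 'index', $default, $perNamespace ) ); // page carrying the index switch
var_dump( isIndexable( 0, null, $default, $perNamespace ) );    // page with no switch
```

For the script to feed such a check, the query would presumably need to return 'index' as well as 'noindex' properties, since the join in the patch matches 'noindex' only.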

Event Timeline

Godman renamed this task from Current patches for the maintenance script generateSitemap.php doesn't removed NOINDEX pages that are added via $wgNamespaceRobotPolicies or $wgDefaultRobotPolicies in LocalSettings.php to Current patches for the maintenance script generateSitemap.php doesn't REMOVED NOINDEX pages that are added via $wgNamespaceRobotPolicies or $wgDefaultRobotPolicies in LocalSettings.php FROM SITEMAP. Nov 3 2020, 2:24 PM
Godman updated the task description.
Godman updated the task description.
Godman triaged this task as High priority. Nov 3 2020, 4:18 PM
Aklapper raised the priority of this task from High to Needs Triage. Nov 3 2020, 4:43 PM
Aklapper removed a project: Patch-For-Review.
Aklapper updated the task description.

Hi @Godman, thanks for taking a look at the code!

You are very welcome to use developer access to submit the proposed code changes as a Git branch directly into Gerrit which makes it easier to review and provide feedback. If you don't want to set up Git/Gerrit, you can also use the Gerrit Patch Uploader. Thanks again!

I'm also resetting the task priority.

(Removing Community-Tech as it is up to teams what they'd like to have on their workboard.)

Aklapper renamed this task from Current patches for the maintenance script generateSitemap.php doesn't REMOVED NOINDEX pages that are added via $wgNamespaceRobotPolicies or $wgDefaultRobotPolicies in LocalSettings.php FROM SITEMAP to generateSitemap.php should remove __NOINDEX__ pages added via $wgNamespaceRobotPolicies or $wgDefaultRobotPolicies in LocalSettings.php. Nov 16 2020, 12:29 PM
Aklapper triaged this task as Lowest priority.
Pppery subscribed.

Removing Wikimedia-maintenance-script-run as I don't think Wikimedia wikis use sitemaps anymore.