Maniphest T198965

Create XML sitemaps so search engine crawlers can crawl more effectively
Closed, Invalid (Public)

Assigned To

None

Authored By

	• Deskana
	Jul 6 2018, 1:06 PM

Description

Outcome from 2018 SEO project with Go Fish Digital:

Most of our sites do not have XML sitemaps. Sitemaps are important so that crawlers know how to crawl our site and understand how pages relate to each other. A sitemap also guarantees that pages are not missed by the crawler, since it'll look at everything that's in the sitemap. A lack of a sitemap is normally a gigantic problem, but Wikimedia sites are so popular that we've gotten away with not having one.

Many of our sites are gigantic, and XML sitemaps can only contain a maximum of 50,000 items. To compensate for this, a kind of "sitemap of sitemaps" can be created, which would be more than sufficient to contain everything. A sitemap like this should be created.
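The "sitemap of sitemaps" idea can be illustrated with a minimal sketch. It assumes the standard sitemaps.org XML format; the function names, the `sitemap-00001.xml` file naming, and the `example.org` URLs are hypothetical and are not taken from any Wikimedia tooling.

```python
from xml.sax.saxutils import escape

# Per-file URL limit cited in the task description.
MAX_URLS_PER_SITEMAP = 50_000
# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemap(urls):
    """Render one <urlset> document listing the given page URLs."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n'
        f"{entries}</urlset>\n"
    )


def build_sitemap_index(sitemap_urls):
    """Render the <sitemapindex> document that points at each sitemap file."""
    entries = "".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>\n" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n'
        f"{entries}</sitemapindex>\n"
    )


def build_all(page_urls, base_url):
    """Return {filename: xml_text} for every sitemap file plus the index."""
    files = {}
    locations = []
    for i, part in enumerate(chunk(page_urls), start=1):
        name = f"sitemap-{i:05d}.xml"  # hypothetical naming scheme
        files[name] = build_sitemap(part)
        locations.append(f"{base_url}/{name}")
    files["sitemap-index.xml"] = build_sitemap_index(locations)
    return files


if __name__ == "__main__":
    # Hypothetical page URLs, for illustration only.
    pages = [f"https://example.org/wiki/Page_{n}" for n in range(120_001)]
    out = build_all(pages, "https://example.org/sitemaps")
    print(sorted(out))
```

Only the index URL would need to be advertised to crawlers; each page appears in exactly one of the files the index points to.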

Details

	Subject	Repo	Branch	Lines +/-
	sitemaps: Generalize varnish rule for sitemaps, to apply to all domains	operations/puppet	production	+2 -2

Related Objects

Status	Assigned	Task
Resolved	Krinkle	T198970 Epic: Implement SEO improvements suggested by Go Fish Digital
Invalid	None	T198965 Create XML sitemaps so search engine crawlers can crawl more effectively
Resolved	mpopov	T202643 Determine if creation of Italian Wikipedia sitemaps increased traffic from search engines
Resolved	• Imarlier	T205495 Enable $wgMFNoindexPages for beta
Resolved	• Imarlier	T206496 Create sitemaps for Indonesian, Portuguese, Punjabi, Dutch, and Korean Wikipedias
Resolved	mpopov	T209720 Determine impact of sitemaps on search traffic to Indonesian, Portuguese, Punjabi, Dutch, and Korean Wikipedias

Event Timeline

• Deskana created this task. Jul 6 2018, 1:06 PM

Restricted Application added a subscriber: Aklapper. Jul 6 2018, 1:06 PM

• Deskana triaged this task as Medium priority. Jul 6 2018, 1:06 PM

• Deskana moved this task from Tag to 2018 SEO project outcomes on the SEO board.

• Deskana added a parent task: T198970: Epic: Implement SEO improvements suggested by Go Fish Digital. Jul 6 2018, 1:26 PM

Google doesn't spider our site, AFAIK. They subscribe to notifications directly from the ChangeProp service AFAIK and fetch Parsoid-format content directly from RESTBase when it changes.

Maybe sitemaps would help non-Google search engines. What's our traffic level from non-Google spiders?

(FWIW: a sitemap which listed every article title on our projects would be *huge*. The list of all titles *just on english wikipedia* is 240MB *compressed*. See https://dumps.wikimedia.org/enwiki/20180701/ -- and that 240MB file would have to be rewritten (and re-fetched) every time a new article was created.)

• MZMcBride subscribed. Jul 21 2018, 4:26 AM

I think the relationship between Wikimedia Foundation Inc. and Go Fish Digital has every appearance of being highly problematic, as I outlined here: https://lists.wikimedia.org/pipermail/wikimedia-l/2018-July/090737.html. And I don't think XML site maps are going to help Wikipedia's search engine optimization. Any discussion of improving Wikipedia's SEO has long been taken as a joke given Wikipedia's existing ridiculously high placement in Google search results. That said...

In T198965#4438049, @cscott wrote:

(FWIW: a sitemap which listed every article title on our projects would be *huge*. The list of all titles *just on english wikipedia* is 240MB *compressed*. See https://dumps.wikimedia.org/enwiki/20180701/ -- and that 240MB file would have to be rewritten (and re-fetched) every time a new article was created.)

Who cares about this 240MB figure? XML site maps are limited to 50,000 URLs, so it would never be a single file anyway. And even if it were a single file, we're talking about companies that crawl and cache sizable portions of the Web. What's another 240MB or more? Modern computers are fully capable of querying large sets of data, dynamically updating URL end-points, splitting large sets of data into chunks, and downloading lots of data.

There are plenty of reasons to look upon this task skeptically, but the technical component of this is pretty straightforward and lightly documented at https://www.mediawiki.org/wiki/Manual:Sitemap.

• Deskana mentioned this in T199252: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest. Jul 30 2018, 11:27 AM

• Deskana mentioned this in T202643: Determine if creation of Italian Wikipedia sitemaps increased traffic from search engines. Aug 23 2018, 4:32 PM

• Deskana added a subtask: T202643: Determine if creation of Italian Wikipedia sitemaps increased traffic from search engines. Aug 29 2018, 2:09 PM

Change 456169 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] sitemaps: Generalize varnish rule for sitemaps, to apply to all domains

https://gerrit.wikimedia.org/r/456169

gerritbot added a project: Patch-For-Review. Aug 29 2018, 3:56 PM

In T198965#4438037, @cscott wrote:

Google doesn't spider our site, AFAIK. They subscribe to notifications directly from the ChangeProp service AFAIK and fetch Parsoid-format content directly from RESTBase when it changes.

Maybe sitemaps would help non-Google search engines. What's our traffic level from non-Google spiders?

Google does spider our sites, though relatively infrequently. When we ran into the issue with it.wikipedia.org (T199252) following the July 5 protest, it appeared that Google had crawled about 700,000 pages while the redirect was active. They re-crawled most of those pages between 30 and 35 days later.

Meanwhile, ChangeProp does not appear to be used by their search index -- if it was, content edits between July 5 and reindexing in early August would have resulted in bad cached results from it.wikipedia.org being replaced, but it doesn't appear that they were. ChangeProp is definitely being used by Google Assistant, the info box that appears on the search results page, and other locations on their sites, but the search index itself does not appear to be updated.

Based on observed behavior, Google's spider uses the sitemap as a starting point for its crawling. In the event that the spider discovers additional pages on the site that are indexable, but that are not contained in the sitemap, it will still index those pages.

ovasileva added a subtask: T205495: Enable $wgMFNoindexPages for beta. Sep 26 2018, 6:08 PM

• Imarlier closed subtask T205495: Enable $wgMFNoindexPages for beta as Resolved. Oct 2 2018, 6:15 PM

ovasileva added a subtask: T206496: Create sitemaps for Indonesian, Portuguese, Punjabi, Dutch, and Korean Wikipedias. Oct 9 2018, 12:10 AM

ovasileva added a subtask: T206497: Enable $wgMFNoindexPages for: Italian, Dutch, Korean, Arabic, Chinese, and Hindi Wikipedias. Oct 9 2018, 12:23 AM

• Imarlier changed the status of subtask T206496: Create sitemaps for Indonesian, Portuguese, Punjabi, Dutch, and Korean Wikipedias from Open to Stalled. Oct 9 2018, 3:17 PM

Change 456169 abandoned by Imarlier:
sitemaps: Generalize varnish rule for sitemaps, to apply to all domains

Reason:
Opening a new CR with the specific sites for which we're generating sitemaps

https://gerrit.wikimedia.org/r/456169

mpopov closed subtask T202643: Determine if creation of Italian Wikipedia sitemaps increased traffic from search engines as Resolved. Oct 10 2018, 3:42 PM

@ovasileva @mpopov I'm ready to move forward with this whenever you guys are. Just let me know.

ovasileva closed subtask T209720: Determine impact of sitemaps on search traffic to Indonesian, Portuguese, Punjabi, Dutch, and Korean Wikipedias as Resolved. Jun 20 2019, 7:41 PM

Maintenance_bot removed a project: Patch-For-Review. Jun 20 2019, 8:10 PM

ovasileva closed subtask T206497: Enable $wgMFNoindexPages for: Italian, Dutch, Korean, Arabic, Chinese, and Hindi Wikipedias as Resolved. Oct 7 2019, 10:20 AM

ovasileva removed a subtask: T206497: Enable $wgMFNoindexPages for: Italian, Dutch, Korean, Arabic, Chinese, and Hindi Wikipedias.

VulpesVulpes825 mentioned this in T54429: Canonical URL should include language variant. Jul 5 2020, 8:03 AM

VulpesVulpes825 mentioned this in T108443: Google doesn't honor canonical URLs of zh.wiki.

Aklapper added a project: WMF-General-or-Unknown. Nov 27 2021, 4:15 PM

As with most suggestions from Go Fish, this is invalid and should not be worked on for any of the stated reasons. There may be other tasks that overlap for other reasons, but let's close this out.

In addition, we did (unfortunately) spend significant amounts of time trying this out. Perf Team has since decommissioned that experiment, as there was no statistically significant evidence of SEO improvement.

AndyRussG mentioned this in T298723: Bing Webmaster Tools access request for Andrew Green. Feb 4 2022, 6:12 PM

Krinkle mentioned this in T332101: determine whether https://sitemaps.wikimedia.org still serves a purpose. Mar 16 2023, 8:11 AM

https://sitemaps.wikimedia.org has been deleted today from Varnish, ATS and DNS.

see T332101 and T338064
