
Create XML sitemaps so search engine crawlers can crawl more effectively
Open, Normal, Public

Description

Outcome from 2018 SEO project with Go Fish Digital:

Most of our sites do not have XML sitemaps. Sitemaps are important so that crawlers know how to crawl our sites and understand how pages relate to each other. A sitemap also helps ensure that pages are not missed by the crawler, since it'll look at everything that's in the sitemap. The lack of a sitemap is normally a gigantic problem, but Wikimedia sites are so popular that we've gotten away with not having one.

Many of our sites are gigantic, and XML sitemaps can only contain a maximum of 50,000 items. To compensate for this, a kind of "sitemap of sitemaps" can be created, which would be more than sufficient to contain everything. A sitemap like this should be created.
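The "sitemap of sitemaps" described above could be sketched roughly as follows. This is a minimal illustration, not how any Wikimedia tooling actually does it: the function names, the `sitemap-N.xml` file naming, and the `example.org` URLs are all assumptions made for the example. It splits a list of page URLs into files of at most 50,000 entries and builds one index file that points at each of them.

```python
from xml.sax.saxutils import escape

# Per-file limit cited in the task description.
MAX_URLS_PER_SITEMAP = 50_000
# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield successive lists of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemap(urls):
    """Return the XML text of one sitemap file listing the given page URLs."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n'
        f"{entries}"
        "</urlset>\n"
    )


def build_sitemap_index(sitemap_urls):
    """Return the XML text of an index file pointing at each sitemap file."""
    entries = "".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>\n" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n'
        f"{entries}"
        "</sitemapindex>\n"
    )


def build_all(page_urls, base_url):
    """Return (index_xml, {filename: sitemap_xml}) for the given page URLs.

    The file naming scheme (sitemap-1.xml, sitemap-2.xml, ...) is hypothetical.
    """
    files = {}
    for n, part in enumerate(chunk(page_urls), start=1):
        files[f"sitemap-{n}.xml"] = build_sitemap(part)
    index_xml = build_sitemap_index([f"{base_url}/{name}" for name in files])
    return index_xml, files
```

For instance, calling `build_all` on 120,000 URLs would produce three sitemap files (50,000, 50,000, and 20,000 entries) plus one index listing all three.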

Event Timeline

Deskana created this task. Jul 6 2018, 1:06 PM
Restricted Application added a subscriber: Aklapper. Jul 6 2018, 1:06 PM
Deskana triaged this task as Normal priority. Jul 6 2018, 1:06 PM
Deskana moved this task from Tag to 2018 SEO project outcomes on the SEO board.
cscott added a subscriber: cscott. Jul 19 2018, 12:40 PM

Google doesn't spider our site, AFAIK. They subscribe to notifications directly from the ChangeProp service AFAIK and fetch Parsoid-format content directly from RESTBase when it changes.

Maybe sitemaps would help non-Google search engines. What's our traffic level from non-Google spiders?

(FWIW: a sitemap which listed every article title on our projects would be *huge*. The list of all titles *just on English Wikipedia* is 240MB *compressed*. See https://dumps.wikimedia.org/enwiki/20180701/ -- and that 240MB file would have to be rewritten (and re-fetched) every time a new article was created.)

I think the relationship between Wikimedia Foundation Inc. and Go Fish Digital has every appearance of being highly problematic, as I outlined here: https://lists.wikimedia.org/pipermail/wikimedia-l/2018-July/090737.html. And I don't think XML site maps are going to help Wikipedia's search engine optimization. Any discussion of improving Wikipedia's SEO has long been taken as a joke given Wikipedia's existing ridiculously high placement in Google search results. That said...

> (FWIW: a sitemap which listed every article title on our projects would be *huge*. The list of all titles *just on English Wikipedia* is 240MB *compressed*. See https://dumps.wikimedia.org/enwiki/20180701/ -- and that 240MB file would have to be rewritten (and re-fetched) every time a new article was created.)

Who cares about this 240MB figure? XML site maps are limited to 50,000 URLs, so it would never be a single file anyway. And even if it were a single file, we're talking about companies that crawl and cache sizable portions of the Web. What's another 240MB or more? Modern computers are fully capable of querying large sets of data, dynamically updating URL end-points, splitting large sets of data into chunks, and downloading lots of data.

There are plenty of reasons to look upon this task skeptically, but the technical component of this is pretty straightforward and lightly documented at https://www.mediawiki.org/wiki/Manual:Sitemap.

Change 456169 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[operations/puppet@production] sitemaps: Generalize varnish rule for sitemaps, to apply to all domains

https://gerrit.wikimedia.org/r/456169

> Google doesn't spider our site, AFAIK. They subscribe to notifications directly from the ChangeProp service AFAIK and fetch Parsoid-format content directly from RESTBase when it changes.
> Maybe sitemaps would help non-Google search engines. What's our traffic level from non-Google spiders?

Google does spider our sites, though relatively infrequently. When we ran into the issue with it.wikipedia.org (T199252) following the July 5 protest, it appeared that Google had crawled about 700,000 pages while the redirect was active. They re-crawled most of those pages between 30 and 35 days later.

Meanwhile, ChangeProp does not appear to be used by their search index -- if it were, content edits between July 5 and the reindexing in early August would have replaced the bad cached results from it.wikipedia.org, but that doesn't appear to have happened. ChangeProp is definitely being used by Google Assistant, the info box that appears on the search results page, and other locations on their sites, but the search index itself does not appear to be updated from it.

Based on observed behavior, Google's spider uses the sitemap as a starting point for its crawling. In the event that the spider discovers additional pages on the site that are indexable, but that are not contained in the sitemap, it will still index those pages.
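To make that behavior concrete, here is a rough sketch of a crawler that seeds its queue from a sitemap and still picks up pages it only discovers through links. This is purely illustrative and says nothing about how Google's spider is actually implemented; `crawl`, `urls_from_sitemap`, and the `get_links` callback are hypothetical names invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import deque

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Extract the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


def crawl(sitemap_xml, get_links):
    """Visit pages breadth-first, starting from the URLs in the sitemap.

    `get_links(url)` is a hypothetical callback that returns the links found
    on a page. Pages that are linked but absent from the sitemap are still
    queued and visited, after the pages ahead of them in the queue.
    """
    frontier = deque(urls_from_sitemap(sitemap_xml))
    seen = set(frontier)
    visited = []
    while frontier:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

In this sketch the sitemap only determines where the crawl starts and in what order; it does not restrict which pages end up visited.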

Change 456169 abandoned by Imarlier:
sitemaps: Generalize varnish rule for sitemaps, to apply to all domains

Reason:
Opening a new CR with the specific sites for which we're generating sitemaps

https://gerrit.wikimedia.org/r/456169

Imarlier added subscribers: ovasileva, mpopov.

@ovasileva @mpopov I'm ready to move forward with this whenever you guys are. Just let me know.