
determine whether https://sitemaps.wikimedia.org still serves a purpose
Closed, ResolvedPublic

Description

https://sitemaps.wikimedia.org/ is one of the micro sites hosted on our legacy miscweb cluster.

In our admin module we have a group of shell users called sitemap-admins who can upload files to it.

Judging by the timestamps, though, the last upload happened in 2018.

This task is to question whether sitemaps.wikimedia.org still serves a purpose and whether it should stay around.

If yes, the question is whether the sitemaps need an update.

If no, we would remove the entire virtual host and the admin group.

And if the answer is really "this needs to stay around but also doesn't get updates anymore", we would keep hosting it and move it to the replacement miscweb machines or the miscweb k8s service.

Event Timeline

Dzahn added subscribers: LSobanski, Peter, aaron, Krinkle.

Adding the users who have uploader privileges per:

sitemaps-admins:
  gid: 805
  description: People who upload files to sitemaps.wikimedia.org
  members: [krinkle, phedenskog, aaron]

@Krinkle @Peter @aaron and cc: @LSobanski

My first thought was that sitemaps made sense back in the days when dinosaurs roamed the internet. Then I looked at https://developers.google.com/search/docs/crawling-indexing/sitemaps/overview to refresh my memory of why they exist. The main purpose is to enable web crawlers to find pages which aren't reachable by following the link tree from your main page.
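For context, the sitemap protocol itself is very simple: an XML list of URLs with optional metadata such as last-modified dates. Below is a minimal, hedged sketch of the kind of file a generator would emit; the URL is a hypothetical example, not taken from the actual Wikimedia sitemaps.

# Minimal sketch of sitemaps.org protocol output; the URL is a hypothetical example.
from xml.sax.saxutils import escape

urls = [
    ("https://nl.wikipedia.org/wiki/Voorbeeld", "2018-07-01"),
]

lines = ['<?xml version="1.0" encoding="UTF-8"?>',
         '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
for loc, lastmod in urls:
    lines.append("  <url><loc>%s</loc><lastmod>%s</lastmod></url>" % (escape(loc), lastmod))
lines.append("</urlset>")

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")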

On Wikipedia, those kinds of pages are called orphans. We have (at least on enwiki) [[Category:Orphaned articles]], which does get crawled. For example, [[Abdulkareem Mohammad Jamiu]] is a recent orphan, and Google has already found it. So it seems to me there's no need for a sitemap.

I'm less familiar with projects outside of enwiki, so I can't speak for how this works on those.

@RoySmith Thank you for the feedback! Appreciate the details on orphaned pages and the example to test.

We have (at least on enwiki) [[Category:Orphaned articles]], which does get crawled. For example, [[Abdulkareem Mohammad Jamiu]] is a recent orphan, and Google has already found it. So it seems to me there's no need for a sitemap.

I'm less familiar with projects outside of enwiki, so I can't speak for how this works on those.

Wikipedia runs on the MediaWiki platform, which automatically provides https://en.wikipedia.org/wiki/Special:Lonelypages. This is effectively the same as that category, but it exists on all wikis.
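As an aside (a hedged sketch, not from the original discussion): the same list is also exposed programmatically through the standard MediaWiki action API via list=querypage, so it can be checked on any wiki. The wiki, limit, and user agent below are arbitrary examples.

# Sketch: fetch entries from Special:Lonelypages via the MediaWiki action API.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "list": "querypage",
        "qppage": "Lonelypages",
        "qplimit": 10,
        "format": "json",
    },
    headers={"User-Agent": "sitemaps-task-example/0.1"},  # example UA
    timeout=30,
)
resp.raise_for_status()
for row in resp.json()["query"]["querypage"]["results"]:
    print(row["title"])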

For search engines we also provide something even faster, namely Special:RecentChanges and EventStreams. Google and other search engines use these to find out about our edits and articles in real time, so they don't have to rely on a periodic recursive site crawl.

I wrote a blog post about this recently at https://timotijhof.net/posts/2022/internet-archive-crawling/, showing how Internet Archive (also) crawls Wikipedia and its outgoing links based on EventStreams.
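For illustration, a hedged sketch of such a consumer: the recentchange feed is served as server-sent events from stream.wikimedia.org, and the SSE parsing below is deliberately simplified, assuming each event's JSON payload arrives as a single data: line.

# Sketch: follow the Wikimedia EventStreams recentchange feed (SSE over HTTP).
# Simplified parsing: assumes each event's JSON arrives as one "data:" line.
import json
import requests

URL = "https://stream.wikimedia.org/v2/stream/recentchange"

with requests.get(URL, stream=True,
                  headers={"User-Agent": "sitemaps-task-example/0.1"}) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        change = json.loads(line[len("data: "):])
        # e.g. report only newly created pages on one wiki
        if change.get("wiki") == "enwiki" and change.get("type") == "new":
            print(change["title"], change["meta"]["uri"])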

From the task description:

sitemaps.wikimedia.org is one of the micro sites […].

we have a group of shell users called sitemap-admins who can upload files to it.

[…] the last upload happened in 2018.

Indeed, there are no files at https://sitemaps.wikimedia.org/ newer than 2018, and it exists only for a small number of wikis. It was commissioned by the Product department as part of SEO-related experiments under T198965, and carried out as a personal-interest project by @Imarlier (formerly in the Performance Team).

The impact on search referral traffic in the months that followed was analyzed by @mpopov later that year and published at https://www.mediawiki.org/wiki/Reading/Search_Engine_Optimization/Sitemaps_test

This extended study is a follow-up to the inconclusive analysis of an earlier effort on Italian Wikipedia. We generated sitemaps for Indonesian, Korean, Dutch, Punjabi, and Portuguese Wikipedias and analyzed search-referred traffic to those wikis. Our thorough analysis and statistical models yielded inconclusive results.

As such, the experiment was eventually abandoned and not continued. See also the closing at T198965#7576738.

Lacking any other owner, I believe it's within Perf Team scope (having inherited it from Ian, and with us as the only server access contacts) to approve immediate decommission of the sitemaps microsite, along with the Varnish proxy rule that exposes it at URLs such as https://nl.wikipedia.org/sitemap.xml, which are already not advertised in robots.txt.

If and when future experimentation with this feature is desired, the main cost will be dusting off the sitemap generation scripts and upload logic, which have already been turned off and become unmaintained. The static file host and frontend forward rule would be easy to re-create at any time.

Mentioned in SAL (#wikimedia-operations) [2023-03-22T18:12:38Z] <mutante> rsyncing /srv/org/wikimedia/sitemaps files for https://sitemaps.wikimedia.org from old to new machines. most other things are auto-deployed by puppet or puppet running initial scap or automatic rsync.. this is not. rsync -av /srv/org/wikimedia/sitemaps/ rsync://miscweb2003.codfw.wmnet/miscapps-srv/org/wikimedia/sitemaps/ T331896 - but also see T332101

@Krinkle your detailed response was awesome and much appreciated!

As part of SRE sprint week we are trying to upgrade the underlying VMs that host this service, among others, to Debian Bullseye. I noticed the sitemaps files are one of the few things that need a manual data transfer.

It wasn't hard to do, because rsyncd, the firewall, etc. are all set up by Puppet, but it is still a manual command.

So this is moving to new machines, and I have copied the files over.

Dzahn triaged this task as Medium priority.Mar 23 2023, 5:43 PM
LSobanski lowered the priority of this task from Medium to Low.
LSobanski moved this task from Work in Progress to Backlog on the collaboration-services board.

Silly question: @Krinkle Given that robots.txt has a Disallow: /wiki/Special: rule, how do search engines read the LonelyPages or RecentChanges pages? As far as I can tell these pages aren't being indexed. Am I missing something?

@Krinkle Could you please answer the question I had above when you have a chance?

I can't tell how search engines are expected to find unlinked articles if the pages displaying these articles are in the Special: namespace, which happens to be robots.txt-disallowed. What am I missing?

Sorry, I missed your question originally. I feel this question is somewhat off-topic for this task. But I understand that if we found some pages lacking search index coverage, sitemaps might indeed be among the possible solutions. There are actually several layers to answering your question. I imagine the last one is of most interest, but I'll start unwrapping this onion from the outside first in order to offer context and connect the dots.

The entries in robots.txt for /wiki/Special: are a Wikipedia-specific optimisation on our servers, to protect us against continuous floods of unwanted bot traffic. If you open a link like https://en.wikipedia.org/wiki/Special:RecentChanges, the authoritative HTML response from the MediaWiki software also includes <meta name="robots" content="noindex,nofollow">. Hence the robots.txt entry itself is merely an optimisation that allows bots to discover this instruction without needing to request each matching URL from our servers.
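To illustrate those two layers (a hedged sketch, not part of the original reply): the robots.txt hint can be checked with the standard library parser, while the authoritative signal is the meta robots tag in the HTML response itself; the string check below is deliberately naive.

# Sketch: compare the robots.txt hint with the authoritative meta robots tag.
import urllib.robotparser
import requests

url = "https://en.wikipedia.org/wiki/Special:RecentChanges"

# Layer 1: robots.txt, an optimisation hint that saves crawlers a request.
rp = urllib.robotparser.RobotFileParser("https://en.wikipedia.org/robots.txt")
rp.read()
print("robots.txt allows crawl:", rp.can_fetch("*", url))

# Layer 2: the meta robots tag in the HTML response (the authoritative signal).
html = requests.get(url, headers={"User-Agent": "sitemaps-task-example/0.1"},
                    timeout=30).text
print("noindex present:", 'name="robots"' in html and "noindex" in html)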

The reason this optimisation is significant is that many of our special pages are effectively query generators (base class QueryPage) which offer near-infinite URL permutations through custom filters, limits, pagination, and offsets. On top of that, the most advanced configurations of these (such as the filters on Special:RecentChanges) are known to be slow and are essentially for power users only (it's there, but it's not meant to be fast). I'll explain below why these pages (should) have no value for our search index coverage. But that's not why we excluded them. We exclude them because it's of value to us not to respond to the majority of well-intended bots that merely follow arbitrary links. These query pages typically have either no caching or very short-lived caching, and are unlikely to get a cache hit either way given their long tail of variants. Without the robots.txt optimisation, bots would interpret each permutation as its own unique URL that is assumed indexable until requested from our servers, only to learn that indeed the Nth variation, too, is not indexable.

Our built-in search engine at Special:Search also falls under this principle. Although this one is fast, capable, and tightly constrained nowadays, it remains (probably?) of negative value for the user experience if our search engine result pages were to show up as results in other search engines. Could we distinguish between robots-follow and robots-index? Yes. But there are a few issues with doing this:

  • It would cause problems due to the expensive nature of most special pages (PM me for examples of restricted tickets that expand on this).
  • It would be insufficient. We know from our own community of bots on Toolforge trying to poll recent changes that this strategy would leave large gaps. The edit frequency on the largest wikis, e.g. Wikipedia and Wikidata, produces far more than one page's worth of activity, even at a crawling rate of once every minute. I expect search engines to generally crawl less often than this, unless custom logic is deployed for a given domain, which brings me to the next point.
  • We already have RCFeed as a scalable strategy for search engines, archivers, and other interested parties to keep up specifically with Wikipedia and other public WMF wikis. The current incarnation of RCFeed in prod uses HTTP-SSE EventStreams (previously, we offered a custom XML feed, T82353). To my knowledge, Google, Internet Archive, and other crawlers use this as their primary means of discovering creations of and revisions to articles, as well as discovering new outgoing external links. I wrote a blog post about how Archive.org crawls Wikipedia. I imagine this is combined with information from Wikidata, which may point to specific Wikipedia URLs from contextual graph points even if for any reason they were not yet indexed.

The practical outcome is that when a Wikipedia article is created or updated, its content is reflected in Google within a minute or two. I've repeatedly experienced this myself as an editor: when I create a page for a fairly fringe subject, it also immediately becomes the top result (presumably due to Wikipedia's heavy domain rank, or its present-day equivalent).

For third-party MediaWiki installs, if their natural link graph is insufficient for search engines, they can indeed set up sitemaps. Sitemaps are a built-in MediaWiki feature for this reason (T4320), but we do not deploy them for Wikipedia. I believe that for a site as large as Wikipedia, polling sitemaps would involve longer delays than we prefer, and require a lot of over-indexing to detect changes. Interested parties that are okay with lower update frequencies can presumably use our monthly dumps instead, which are in a sense like sitemaps, but with the content included!

T300171#8897348 says this is already decided? Is it?

Yes, a few comments up in T332101#8701245 I documented the decision:

Indeed, there are no files at https://sitemaps.wikimedia.org/ newer than 2018 […]
[…], the experiment was abandoned and not continued. […]

Lacking any other owner, I believe it's within Perf Team scope ([…] as the only server access contacts) to approve immediate decommission of the sitemaps […], which are already not advertised in robots.txt. […]

Ok, thanks for clarifying and closing this, Krinkle. Will continue with the decom'ing. Cheers

Change 926605 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] trafficserver: remove map for sitemaps.wikimedia.org

https://gerrit.wikimedia.org/r/926605

Change 926611 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] varnish: remove rewrites and tests for sitemaps.wikimedia.org

https://gerrit.wikimedia.org/r/926611

Change 926611 merged by Ssingh:

[operations/puppet@production] varnish: remove rewrites and tests for sitemaps.wikimedia.org

https://gerrit.wikimedia.org/r/926611

Change 926605 merged by Dzahn:

[operations/puppet@production] trafficserver: remove map for sitemaps.wikimedia.org

https://gerrit.wikimedia.org/r/926605