
Browsers may remember redirects for techblog.wikimedia.org
Open, Medium, Public

Description

Until 2020-03-25 (T246507: Setup DNS to direct techblog.wikimedia.org to new Wordpress VIP hosting), the hostname techblog.wikimedia.org was a CNAME to a server in the Wikimedia internal server farm. On that server, Apache returned 301 Moved Permanently responses directing the visiting browser to https://blog.wikimedia.org/c/technology (T181878: techblog.wikimedia.org should redirect to blog.wikimedia.org/c/technology).

We now have DNS for techblog.wikimedia.org set up as a CNAME to techblog-wikimedia-org.go-vip.net. On that server we are running a new WordPress site. This works great if your local browser has not cached the prior 301 redirect. However, modern browsers try to remember 301 redirects in their local cache to reduce round trips to the remote servers. This means that some people will still be sent to the old https://blog.wikimedia.org/c/technology location rather than to the new blog if they follow a link that is exactly the same as one they followed when last visiting a path under techblog.wikimedia.org.
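To make the caching behavior described above a little more concrete, here is a rough sketch of the kind of decision a browser cache makes when it sees a redirect. It is only an approximation under stated assumptions: the function name `redirect_may_be_remembered` is hypothetical, only the `Cache-Control` header is considered (`Expires` and other headers are ignored), and real browsers apply their own heuristics on top of the HTTP caching rules.

```python
REDIRECT_STATUSES = (301, 302, 303, 307, 308)
# Permanent redirects are cacheable by default in the HTTP specs.
CACHEABLE_BY_DEFAULT = (301, 308)


def redirect_may_be_remembered(status, cache_control=""):
    """Guess whether a browser may reuse a redirect without asking the server.

    Hypothetical helper for illustration; not how any specific browser works.
    """
    if status not in REDIRECT_STATUSES:
        return False

    # Parse "name=value, name" style directives into a dict.
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip('"')

    # no-store forbids storing; no-cache forces a revalidation round trip.
    if "no-store" in directives or "no-cache" in directives:
        return False

    # An explicit max-age overrides the per-status default.
    max_age = directives.get("max-age")
    if max_age is not None and max_age.isdigit():
        return int(max_age) > 0

    return status in CACHEABLE_BY_DEFAULT
```

Under these assumptions, a bare 301 like the one the old Apache server sent is the case most likely to stick in a local cache, while a 302 without explicit freshness information would normally be re-requested.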

Event Timeline

There really is not an easy technical solution that we can apply for everyone on this. Mostly I wrote this task so we could point people at it if they are experiencing this problem. The steps they could take are:

  1. Clear the local browser cache entirely, or at least for the techblog.wikimedia.org host.
  2. Add something to the URL that makes it unique (and thus avoids matching the local cache) and that the new WordPress site will ignore. A common technique for this is appending a query string as a "cache busting" token. For example: https://techblog.wikimedia.org/?T248598 should not match local cache for anyone anywhere.
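The cache-busting step can be sketched in a few lines of Python. This is a minimal illustration, not something the blog or this task provides: the helper name `add_cache_buster` is hypothetical, and it assumes the target site ignores unknown query parameters, as described above.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def add_cache_buster(url, token):
    """Return url with token appended to the query string.

    Hypothetical helper: the token only needs to make the URL differ
    from any URL the browser has previously cached a redirect for.
    """
    parts = urlsplit(url)
    safe_token = quote(token, safe="")
    # Keep any existing query parameters and put the token after them.
    query = f"{parts.query}&{safe_token}" if parts.query else safe_token
    return urlunsplit(parts._replace(query=query))


print(add_cache_buster("https://techblog.wikimedia.org/", "T248598"))
# prints: https://techblog.wikimedia.org/?T248598
```

Any existing query string and fragment are preserved, so the same helper works for links to individual posts as well as the front page.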
bd808 triaged this task as Medium priority. Mar 27 2020, 3:34 PM
bd808 added a project: Upstream.

This cached 301 behavior may also extend to web crawlers. We should try to keep an eye on Google and other search engine activity related to new blog posts to see if that is true. There may be interventions that can be done through proprietary tools for particular crawlers to act as the equivalent of cache clearing.


GoogleBot at least seems to be doing OK with this. I searched for "Saying no to proprietary code in production is hard work: the GPU chapter" and, as hoped, the top hit is https://techblog.wikimedia.org/2020/04/06/saying-no-to-proprietary-code-in-production-is-hard-work-the-gpu-chapter/


I tried this with Bing, DuckDuckGo, Yahoo, AOL search, Ask.com, Baidu and Yandex.

It seems they all have the post as the top result, apart from Baidu and Yandex, which both haven't indexed that post but have indexed others (you can see them if you search for 'site:techblog.wikimedia.org').