
Decom www.$lang hostnames/redirects
Closed, Resolved (Public)

Description

We generically create www subdomain names for all desktop language domain names at https://github.com/wikimedia/operations-dns/blob/master/templates/helpers/langlist.tmpl#L5

None of these match our SSL cert wildcards, so they're insecure redirects that we should probably delete from DNS and from the Apache redirects, but I'd like to get some impact input before just blindly killing them on my own...
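
To illustrate why the wildcards don't help here: a certificate wildcard stands in for exactly one DNS label. The sketch below assumes a cert name of `*.wikipedia.org` (the actual cert names aren't listed in this task), and `wildcard_matches` is a hypothetical helper, not anything from our tooling. It only handles a full-label `*` in the leftmost position.

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Return True if a certificate name covers the given hostname.

    A leading "*" label matches exactly one hostname label, never several.
    """
    p_labels = pattern.lower().rstrip(".").split(".")
    h_labels = hostname.lower().rstrip(".").split(".")
    # A wildcard cannot absorb extra labels, so the label counts must agree.
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        # The wildcard label must match a non-empty label; the rest must be equal.
        return bool(h_labels[0]) and p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


# Assumed cert name, for illustration only.
print(wildcard_matches("*.wikipedia.org", "en.wikipedia.org"))
print(wildcard_matches("*.wikipedia.org", "www.en.wikipedia.org"))
```

The second call is the www.$lang case: four labels against a three-label pattern, so there is no match.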

Event Timeline

BBlack raised the priority of this task to Needs Triage.
BBlack updated the task description.
BBlack added projects: acl*sre-team, Traffic, HTTPS.
BBlack added subscribers: BBlack, Aklapper.

Yes, related. This is an actual task though (as in an action to take), rather than just a related problem-statement.

Change 218909 had a related patch set uploaded (by BBlack):
Remove www.$lang DNS T102815

https://gerrit.wikimedia.org/r/218909

Noted on irc, tons of results in:

01:40 < Mjbmr> https://www.google.com/search?q=site:www.en.wikipedia.org

I wonder why Google ignores the fact that those URLs redirect and that the target content carries rel=canonical? Can we get Google to get rid of these before we kill them?

Those links seem to be all over the place (especially in print, some people think it's better to put www. in front of every domain to make it clear that it's a website …):

Maybe we need to ask Google to re-crawl these URLs: https://support.google.com/webmasters/answer/6065812?hl=en. Probably we shouldn't merge https://gerrit.wikimedia.org/r/#/c/219121/1, since Google wrote [1]:

Don't use the robots.txt file for canonicalization purposes.

[1] https://support.google.com/webmasters/answer/139066?hl=en

Well, we could revisit the detailed advice in https://support.google.com/webmasters/answer/139066?hl=en from above and audit that we're not doing something obscure wrong, but... AFAIK addresses like http://www.en.wikipedia.org/ have for a long time (perhaps forever) been pure 301s to content with correct rel=canonical, so I think we've been playing by the rules and it's getting us nowhere... Also, there are other search engines to consider, and for some of them robots.txt may be the only answer.

While taking a quick peek: perhaps not directly related? But I did notice our canonical URLs only seem to canonicalize the protocol and domain, not the rest of the URL. e.g. accessing via https://en.wikipedia.org/wiki/Foobar and https://en.wikipedia.org/?title=Foobar produce different rel=canonical values (re-using the path of the request URL).
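
A rough sketch of what full canonicalization of those two request forms could look like. This is not MediaWiki's actual logic: it assumes the `/wiki/$1` article path and the `title` query parameter seen in the two URLs above, and `canonical_url` is a hypothetical name.

```python
from urllib.parse import parse_qs, quote, unquote, urlsplit


def canonical_url(request_url: str) -> str:
    """Map both request forms of a page onto one https /wiki/ URL."""
    parts = urlsplit(request_url)
    query_title = parse_qs(parts.query).get("title")
    if query_title:
        # Form 2: /?title=Foobar
        title = query_title[0]
    elif parts.path.startswith("/wiki/"):
        # Form 1: /wiki/Foobar (decode so both forms are re-encoded the same way)
        title = unquote(parts.path[len("/wiki/"):])
    else:
        raise ValueError("no page title found in " + request_url)
    title = title.replace(" ", "_")
    return "https://" + parts.hostname + "/wiki/" + quote(title, safe=":/()")


print(canonical_url("https://en.wikipedia.org/wiki/Foobar"))
print(canonical_url("https://en.wikipedia.org/?title=Foobar"))
```

Both calls yield the same URL, which is the property the current rel=canonical values lack.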

Getting rid of them completely would be ideal, but if it turns out to not be possible, why can't we just handle those and 301 them to the proper www-less https URL?

It doesn't change things a whole lot from today (which also redirects them, just to www-less first and then HTTPS after). The core issue is that we can't HSTS the hostname without a valid cert, so those links will always be insecure redirects wherever they exist (in browser history, search results, etc).
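
For concreteness, the single-hop variant would map a www.$lang URL straight to its www-less HTTPS form instead of going through two redirects. A minimal sketch of that mapping follows; `redirect_target` is a hypothetical helper, not the Apache config, and it assumes that only hosts with at least four labels (www.$lang.$project.$tld) are affected, so that www.wikipedia.org itself is left alone.

```python
from urllib.parse import urlsplit, urlunsplit


def redirect_target(url: str):
    """Return the one-hop 301 target for a www.$lang URL, or None if not applicable."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Assumption: only www.<lang>.<project>.<tld> style names are rewritten.
    if not host.startswith("www.") or host.count(".") < 3:
        return None
    bare_host = host[len("www."):]
    # Drop the www label and upgrade the scheme in the same step.
    return urlunsplit(("https", bare_host, parts.path or "/", parts.query, ""))


print(redirect_target("http://www.en.wikipedia.org/wiki/Jurassic_World"))
```

Even with a single hop, the first request still goes out over plain HTTP, which is the HSTS limitation described above.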

@Krinkle said elsewhere:

And by default the www.en.wikipedia entries are not included https://www.google.co.uk/search?q=site:wikipedia.org+intitle:%22ANAPROF+2003%22
I couldn't find any index hits from www.en.wikipedia.org that come up when looking for the title only. So it seems deduplicated properly. They just expose it when looking for that specific site

If we don't think "real" searches are hitting this, it may be possible to simply remove them, then.

I took a quick 2-minute log of such hits on just one prominent frontend cache (totally statistically invalid, but still...), with the URL-Path, Host-header, and Referer noted, and counted 15 distinct hits, almost all with Yahoo Search as the referer. Extrapolating loosely based on that machine's percentage of total reqs, we're looking at something on the order of 1.25 reqs/sec globally, or 0.003125% of all text-cluster requests.
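
The extrapolation can be spelled out as below. Only the 15 hits, the 2-minute window, the 1.25 reqs/sec and the 0.003125% figures come from the comment above; the cache's share of traffic and the total request rate are not stated there and are just back-computed from those numbers.

```python
hits = 15            # requests seen in the sample
window_s = 120       # 2-minute capture on one frontend cache

local_rate = hits / window_s                 # reqs/sec on this one cache
global_rate = 1.25                           # reqs/sec, as estimated above
share_of_text_cluster = 0.003125 / 100       # 0.003125% as a fraction

# Values implied by the figures above (derived, not measured):
implied_cache_fraction = local_rate / global_rate
implied_total_rate = global_rate / share_of_text_cluster

print(f"local rate: {local_rate} req/s")
print(f"implied share of traffic on this cache: {implied_cache_fraction:.0%}")
print(f"implied total text-cluster rate: {implied_total_rate:.0f} req/s")
```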

The 15 logged requests (number on the left is effectively a request-id in this context):

root@cp1065:~# time varnishlog -c -n frontend -m 'RxHeader:Host: www\.en\.wiki' |egrep --line-buffered '(Host:|RxURL|Referer:)'
   86 RxURL        c /wiki/Monica_Potter
   86 RxHeader     c Host: www.en.wikipedia.org
   86 RxHeader     c Referer: http://search.yahoo.com/search?p=monica+potter&fr=iphone&.tsrc=apple&pcarrier=AT%26T&pmcc=310&pmnc=410
   52 RxURL        c /wiki/Jurassic_World
   52 RxHeader     c Host: www.en.wikipedia.org
   52 RxHeader     c Referer: https://search.yahoo.com/
   28 RxURL        c /wiki/Billy_Redden
   28 RxHeader     c Host: www.en.wikipedia.org
   28 RxHeader     c Referer: https://search.yahoo.com/
   46 RxURL        c /wiki/Here_Comes_the_Boom
   46 RxHeader     c Host: www.en.wikipedia.org
   46 RxHeader     c Referer: https://search.yahoo.com/
   72 RxURL        c /wiki/What_to_Expect_When_You%27re_Expecting_%28film%29
   72 RxHeader     c Host: www.en.wikipedia.org
   72 RxHeader     c Referer: https://search.yahoo.com/
   29 RxURL        c /wiki/Dwayne_Johnson
   29 RxHeader     c Host: www.en.wikipedia.org
   29 RxHeader     c Referer: https://search.yahoo.com/
   66 RxURL        c /wiki/Special:RecentChangesLinked/Patsnap
   66 RxHeader     c Host: www.en.wikipedia.org
   51 RxURL        c /wiki/The_Music_Man_%281962_film%29
   51 RxHeader     c Host: www.en.wikipedia.org
   51 RxHeader     c Referer: https://search.yahoo.com/
   70 RxURL        c /wiki/Myrna_Loy
   70 RxHeader     c Referer: http://search.yahoo.com/search?p=Myrna%20Loy&fr=iphone&.tsrc=apple&pcarrier=Sprint&pmcc=310&pmnc=120&d=%7B%22dn%22%3A%22yk%22%2C%22subdn%22%3A%22person%22%2C%22ykid%22%3A%225bb3f8c8-c73c-44d0-a883-76e74c900b61%22%7D
   70 RxHeader     c Host: www.en.wikipedia.org
   17 RxURL        c /wiki/John_Leguizamo
   17 RxHeader     c Host: www.en.wikipedia.org
   17 RxHeader     c Referer: https://search.yahoo.com/
   25 RxURL        c /wiki/Jurassic_Park_III
   25 RxHeader     c Host: www.en.wikipedia.org
   25 RxHeader     c Referer: https://search.yahoo.com/
  156 RxURL        c /wiki/Jeff_Goldblum
  156 RxHeader     c Host: www.en.wikipedia.org
  156 RxHeader     c Referer: https://search.yahoo.com/
  112 RxURL        c /wiki/Richard_Attenborough
  112 RxHeader     c Host: www.en.wikipedia.org
  112 RxHeader     c Referer: http://search.yahoo.com/search?p=richard+attenborough&fr=iphone&.tsrc=apple&pcarrier=Verizon&pmcc=311&pmnc=480
   51 RxURL        c /wiki/Media_Lab_(disambiguation)
   51 RxHeader     c Host: www.en.wikipedia.org
   41 RxURL        c /wiki/Jurassic_World
   41 RxHeader     c Host: www.en.wikipedia.org
   41 RxHeader     c Referer: https://search.yahoo.com/
^C

real	2m0.246s
user	0m8.852s
sys	0m0.416s
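
A small sketch for tallying output like the above. The sample lines are copied from the log; `parse_requests` is a hypothetical helper. It starts a new record at each RxURL line rather than keying on the number in the left column, since that number gets reused (51 appears twice above). It assumes lines from different requests are not interleaved, which holds for this paste but may not hold in general.

```python
SAMPLE = """\
   86 RxURL        c /wiki/Monica_Potter
   86 RxHeader     c Host: www.en.wikipedia.org
   86 RxHeader     c Referer: http://search.yahoo.com/search?p=monica+potter&fr=iphone&.tsrc=apple&pcarrier=AT%26T&pmcc=310&pmnc=410
   66 RxURL        c /wiki/Special:RecentChangesLinked/Patsnap
   66 RxHeader     c Host: www.en.wikipedia.org
   51 RxURL        c /wiki/The_Music_Man_%281962_film%29
   51 RxHeader     c Host: www.en.wikipedia.org
   51 RxHeader     c Referer: https://search.yahoo.com/
   51 RxURL        c /wiki/Media_Lab_(disambiguation)
   51 RxHeader     c Host: www.en.wikipedia.org
"""


def parse_requests(text):
    """Group filtered varnishlog lines into one dict per request."""
    requests = []
    for line in text.splitlines():
        fields = line.split(None, 3)   # id, tag, side, payload
        if len(fields) < 4:
            continue
        _req_id, tag, _side, payload = fields
        if tag == "RxURL":
            # Each RxURL line opens a new request record.
            requests.append({"url": payload})
        elif tag == "RxHeader" and requests:
            name, _, value = payload.partition(": ")
            requests[-1][name.lower()] = value
    return requests


requests = parse_requests(SAMPLE)
from_yahoo = [r for r in requests if "search.yahoo.com" in r.get("referer", "")]
print(len(requests), "requests,", len(from_yahoo), "referred by Yahoo Search")
```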

So, working off of @faidon's per-domain-count used in the donate ticket:

faidon@oxygen:~$ egrep '"www\.[a-z]{2}\.' per-domain-count |head -10
  10750 "www.zh.wikipedia.org"
   4398 "www.en.wikipedia.org"
   1996 "www.ru.wikipedia.org"
   1779 "www.en.wiktionary.org"
   1175 "www.de.wikipedia.org"
    871 "www.it.wikipedia.org"
    694 "www.nl.wikipedia.org"
    664 "www.pl.wikiquote.org"
    544 "www.sv.wikipedia.org"
    321 "www.it.wikipedia.com"

In the enwiki case, this amounts to 0.002% of the main domain hits. In the zh case, which seems to be the worst, it's more like 0.16%. Either way, it's not significant traffic, and AFAIK we never link or expose these anywhere, although Yahoo seems to occasionally refer users (at a very low rate).
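
The percentages are just www hits divided by the hits for the corresponding main domain. The main-domain counts aren't quoted in this task, so the sketch below only parses the www lines from the listing above and leaves the denominator as a parameter; `parse_counts` and `share_percent` are hypothetical helpers, and the numbers in the last line are made up purely to show the calculation.

```python
LISTING = """\
  10750 "www.zh.wikipedia.org"
   4398 "www.en.wikipedia.org"
   1996 "www.ru.wikipedia.org"
"""


def parse_counts(text):
    """Parse lines of the form '<count> "<hostname>"' into a dict."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        counts[fields[1].strip('"')] = int(fields[0])
    return counts


def share_percent(www_hits, main_hits):
    """www.$lang hits as a percentage of the main domain's hits."""
    return 100.0 * www_hits / main_hits


counts = parse_counts(LISTING)
print(counts["www.en.wikipedia.org"])

# Made-up numbers, only to illustrate the formula: 2 hits out of 100000.
print(share_percent(2, 100_000), "%")
```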

Basically, I don't know what else we can do here except go ahead and shut these off (patch ref'd earlier). They're as dead as we can make them in terms of exposure already.

BBlack claimed this task.