
Improve access to local language wikis by fixing bug in generation of hreflang tags in <head> of article pages
Open, Stalled, Medium, Public

Assigned To
None
Authored By
Stu
Mar 19 2015, 4:11 PM
Referenced Files
F201448: en-wikipedia-org_20150718T214126Z_SearchAnalytics.csv
Jul 20 2015, 5:17 PM
F201427: sitemap-builder-script.php
Jul 20 2015, 5:05 PM
F201428: sitemap.xml
Jul 20 2015, 5:05 PM
F108675: IMG_2849.JPG
Apr 3 2015, 6:24 PM
F108673: IMG_2849.JPG
Apr 3 2015, 6:23 PM
F108671: IMG_2849.JPG
Apr 3 2015, 6:23 PM
F108666: IMG_2849.JPG
Apr 3 2015, 6:20 PM
F102857: hreflangblockhack.diff
Mar 22 2015, 9:42 PM
Tokens
"Love" token, awarded by santhosh.

Description

Following up on https://lists.wikimedia.org/pipermail/wikitech-l/2015-February/080661.html. Quoting:

MediaWiki emits the hreflang attribute on language links, but only as part
of the links in the body, and not in the <head> as recommended by Google
[1]. The result of this is that Google (and possibly other search engines)
doesn't interpret the hreflang attribute for purposes of prioritizing
search results in the user's own language.

From a contact at Google we asked about this: "we currently don't use those
annotations on the links, we need to see the hreflang link-elements in the
head in order to understand that connection. The important parts there are
the we need to have them in the "head", we need to have them confirmed from
the other versions (so DE needs to refer to EN, and EN to DE -- it can't be
one-sided), and it needs to be between the canonical URLs. (...)  I imagine
if you just added the cross-links as you have them in the sidebar as
link-elements to the head then you'd be covered."

This of course would add some additional payload to pages with lots of
language links, but could help avoid results like [2] where the English
language version of an article is #1 and the Indonesian one makes no
appearance at all. Results vary greatly and it's hard to say how big a
problem this is, but even if it boosts discoverability of content in the
user's language by only 10% or so, that would still be a pretty big win for
local content.

Our end goal is that all the variants of a page have the exact same hreflang block in their <head> e.g. for the Menlo article all three variants on the three different language wikipedias would have this exact block:

<link rel="alternate" hreflang="en" href="//en.wikipedia.org/wiki/Menlo,_County_Galway" />
<link rel="alternate" hreflang="ga" href="//ga.wikipedia.org/wiki/Mionloch" />
<link rel="alternate" hreflang="ru" href="//ru.wikipedia.org/wiki/%D0%9C%D0%B5%D0%BD%D0%BB%D0%BE%D1%85_(%D0%B7%D0%B0%D0%BF%D0%B0%D0%B4%D0%BD%D1%8B%D0%B9)" />

I'm new to MediaWiki development but did a bit of hacking on the easy parts of this in /includes/OutputPage.php, trying to reuse the array generated for the interwiki language links in the sidebar. The attached diff uses dummy data for now, since I was working locally. It's just a start; there is a lot left to do, such as getting the current page's info so that it is listed as an alternate too, doing this efficiently, and lots of testing. I'm at the limits of my knowledge, so I will need some help and pointers.
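As a rough illustration of the end goal described above (not the attached diff, and not MediaWiki code), the identical block could be rendered from a language-to-URL map along these lines. The function name `hreflang_block` and the dictionary layout are assumptions made for this sketch; the URLs are the ones from the Menlo example.

```python
from html import escape


def hreflang_block(alternates):
    """Render one <link rel="alternate"> line per language variant.

    `alternates` maps a language code to the URL of that variant and is
    assumed to include the current page's own language. Sorting by language
    code means every variant emits exactly the same block.
    """
    return "\n".join(
        '<link rel="alternate" hreflang="%s" href="%s" />'
        % (escape(lang, quote=True), escape(url, quote=True))
        for lang, url in sorted(alternates.items())
    )


# Hypothetical input; in practice this would come from the sidebar language links.
menlo = {
    "ga": "//ga.wikipedia.org/wiki/Mionloch",
    "en": "//en.wikipedia.org/wiki/Menlo,_County_Galway",
}
print(hreflang_block(menlo))
```

Because the output depends only on the set of alternates and not on which variant is being rendered, each language edition would produce the same block, which is what the reciprocity requirement needs.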



Event Timeline


Ugh, I really have scalability concerns with potentially putting several KBs worth of language links into the <head>.

Those are all bytes we need to download before we can even start building the DOM. On a spotty UMTS connection this might be noticeable on the Barack Obama page.

Fair point. The <hreflang> tags for the Barack Obama page, with its 200+ language alternates, work out to a few KB once gzipped. The page does have 1+ MB of HTML though, so I'm not sure how meaningful that is.

An alternative would be to construct sitemap files that include all the language variants. Google added support for this about 3 years ago. https://support.google.com/webmasters/answer/2620865?hl=en That would be a lot of sitemaps to build correctly and deploy across all the different language versions of Wikipedia though.

My gut is that, because it's quicker to implement, we go ahead in the <head> for now to see the impact. If it ends up making the difference we think it'll make, we can add switching the language links over to sitemaps to our list of performance improvement projects.

To quantify the performance impact, the page about Obama is now 1335065 bytes (1.3M). Compressed (I tested with gzip -c) is 237451 bytes. Of that size, language links constitute 43756 bytes uncompressed. If we add hreflangs, it would be no more space than language links take - since those have the same data plus language name. I did a quick and rough script to generate hreflang links from existing links and it came out as 22919 bytes. So inserting it into the original article text becomes 1357967 bytes (small differences due to whitespace reformatting) and compressed it's 240158 bytes - or 2707 bytes difference, or 1% increase of the compressed size.

2K/1% increase is not nothing, but also probably not something critical if it improves the results.
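For anyone who wants to repeat this measurement on other pages, here is a minimal sketch. The byte counts are the ones quoted above; the helper names are made up for illustration, and `gzip.compress` at its default level is only assumed to be comparable to `gzip -c`.

```python
import gzip


def gzip_size(data: bytes) -> int:
    """Size in bytes of `data` after gzip compression (default level)."""
    return len(gzip.compress(data))


def compressed_overhead(before_bytes: int, after_bytes: int):
    """Return (extra bytes, percentage increase) between two compressed sizes."""
    extra = after_bytes - before_bytes
    return extra, 100.0 * extra / before_bytes


# Compressed sizes quoted above for the Obama article, without and with hreflang links.
extra, percent = compressed_overhead(237451, 240158)
print(extra, round(percent, 2))
```

To measure a different page, `gzip_size()` can be applied to the HTML with and without the generated block and the two results passed to `compressed_overhead()`.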

Hi Stas -- what is the impact on smaller pages? Is the ratio of hreflangs to text larger on these articles?

-Toby

Also remember that the <head> tag is very influential on how the page is being parsed. This is not just content that needs to be downloaded, it's content that needs to be downloaded BEFORE the browser can start parsing and rendering the body. With the suggested patch, it's even content that needs to be downloaded before it can start the request to download the stylesheet.

And soon we might have 'social media' cards and I don't know what else in there (there's a couple more tickets with possible <head> elements that are being proposed). I'm not saying it will have impact, my point is that I don't understand the impact, but that I do see risk. Since the performance team is actively trying to optimize the hell out of the download and render sequences, they should probably weigh in on it.

@Tnegrin: I imagine that depends on the ratio between how many language links we have and how big the article is. It won't be bigger than the language links and it should compress pretty well, but we're talking about 100 bytes per language, uncompressed, or about 10 bytes per language compressed.

@TheDJ I understand your concern, but if we want Google to handle the language links, I don't see too many ways around it. Given our scale, I'm not sure how sitemaps would work. The <head> issue is definitely a concern, so suggestions on this are welcome. Maybe make it agent-dependent? I.e. show it to Google but not to mobile clients, maybe not in mobile layouts?

I am also thinking maybe this should be a preference and then mobile etc. could control this preference so that when we're running on mobile (or maybe somehow otherwise detect we don't want these links) we could change this preference.

Change 218770 had a related patch set uploaded (by Smalyshev):
Disable generation of hreflang headers for mobile

https://gerrit.wikimedia.org/r/218770

Check out https://gerrit.wikimedia.org/r/#/c/198696/ and https://gerrit.wikimedia.org/r/#/c/218770/ - this adds an option to turn it off globally (it's off by default) and by extension like MobileFrontend. This should make handling the overhead easier.

@Smalyshev isn't this what we have a performance team for ?

I don't see why we HAVE to have google handle these links... We can also try talking to them.

  • hreflang attributes on <link rel=alternate> tags in <HEAD> are different from hreflang attribute on <a href=...> tags. Per RFC 5988, the hreflang attribute on <a href...> tags simply indicates the language of the linked resource. One is "this link is in Klingon", another - "if you want Klingon version of this page, look here". So we're not wrong to be setting it.
  • When used with <link rel=alternate> tags, the hreflang attribute implies a translated version of the page. However, articles in different languages are rarely direct translations of one another. So using it for interwiki links may be inconsistent with the specification.
In T93213#1389249, @ori wrote:

So using it for interwiki links may be inconsistent with the specification.

See comments upthread and in initial discussion that this whole ticket began with a comment from John Mueller at Google's Webmaster Tools team about the addition of <hreflang> block to link different language articles on a topic: "That sounds like a great use of hreflang!" So we can be pretty confident this is consistent with best practice.

In T93213#1389405, @Stu wrote:

See comments upthread and in initial discussion that this whole ticket began with a comment from John Mueller at Google's Webmaster Tools team about the addition of <hreflang> block to link different language articles on a topic: "That sounds like a great use of hreflang!" So we can be pretty confident this is consistent with best practice.

Ok, fair point. I'd be fine with giving this a shot, then. Let's scope this to a small handful of articles at first, as you suggest. This would give the performance team time to analyze the performance impact of this change before it rolls out to all pages everywhere, but get the ball going on evaluating the impact this has on traffic.

Regarding performance: one thing we could consider doing is to generate the sidebar interlanguage links in JavaScript, post-load, by inspecting the hreflang link tags in <head>. This would neutralize the impact of this change on the total page load size.

Small handful of articles meaning some smaller-sized wiki or part of bigger-sized wiki? About the latter I'm not sure how to do it, for the former I guess it's not hard.

Small handful of articles meaning some smaller-sized wiki or part of bigger-sized wiki? About the latter I'm not sure how to do it, for the former I guess it's not hard.

Well, the links have to be symmetrical -- see the section on "Missing confirmation links" under "Common Mistakes" in https://support.google.com/webmasters/answer/189077?hl=en . You could limit it to a half-dozen reasonably-popular articles in a crude way by just testing for those particular titles in the code and submitting the patch to the production branch rather than master. Or by temporarily staging it in an extension instead.
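A small sketch of what checking for "missing confirmation links" could look like when picking test articles. The data layout (page URL mapped to its language alternates) and the function name are assumptions for illustration only.

```python
def missing_confirmations(annotations):
    """Find alternates that do not link back to the page that references them.

    `annotations` maps a page URL to its {language: alternate URL} block.
    Returns a sorted list of (page, alternate) pairs with no return link.
    """
    missing = []
    for page, alternates in annotations.items():
        for alt_url in alternates.values():
            if alt_url == page:
                continue  # a self-reference needs no confirmation
            if page not in annotations.get(alt_url, {}).values():
                missing.append((page, alt_url))
    return sorted(missing)


EN = "//en.wikipedia.org/wiki/Menlo,_County_Galway"
GA = "//ga.wikipedia.org/wiki/Mionloch"

# Hypothetical one-sided deployment: only the English page carries the full block.
one_sided = {EN: {"en": EN, "ga": GA}, GA: {"ga": GA}}
print(missing_confirmations(one_sided))
```

An empty result means every annotated pair is confirmed in both directions, which is the condition a limited rollout would have to satisfy across all the wikis involved.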

Hmm true. We'll need to think how to do such local testing, just one wiki does nothing. I'll try to make something.

BTW, as someone speaking multiple languages, having google always trying to hide English results is really annoying for me. I have Google configured to return results in multiple languages (an option that very few users of Google use probably), so I really do wonder how this would affect my search results...

Made an extension that implements the same behavior and allows limiting it to specific pages, for ease of testing:
https://github.com/smalyshev/Hreflang-extension

In T93213#1389496, @ori wrote:

Small handful of articles meaning some smaller-sized wiki or part of bigger-sized wiki? About the latter I'm not sure how to do it, for the former I guess it's not hard.

Well, the links have to be symmetrical -- see the section on "Missing confirmation links" under "Common Mistakes" in https://support.google.com/webmasters/answer/189077?hl=en . You could limit it to a half-dozen reasonably-popular articles in a crude way by just testing for those particular titles in the code and submitting the patch to the production branch rather than master. Or by temporarily staging it in an extension instead.

Yeah - a simple in_array with a list of 10 pages in a ton of languages would probably be fine. You propose doing it in an extension or release branch because you don't want such silly hacks in master? We could do something fancier but it's probably not worth it.

You propose doing it in an extension or release branch because you don't want such silly hacks in master? We could do something fancier but it's probably not worth it.

I'm simply giving you a green light for kludging it, in the interest of getting something out quickly so we can assess its impact and prioritize further work. You don't have to kludge it if you think you can devise a clean and elegant implementation from the start.

In addition to HTML tags and XML site maps, one additional way language headers can be expressed is via Link: HTTP headers. The nice thing about using headers is that headers can easily be scrubbed from responses by Varnish, so we could mitigate the increase in response size for clients by sending the header to search bots only. @Stu, could you possibly ping John Mueller to ask if targeting Googlebot in this fashion (i.e., by sending it Link: headers that we don't send to other clients) is fair game?
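For reference, a sketch of what such a header value could look like, using the general `Link:` syntax of a URI in angle brackets followed by `rel` and `hreflang` parameters. The function name is invented, absolute `https://` URLs are an assumption, and nothing here reflects how Varnish or MediaWiki would actually emit it.

```python
def link_header(alternates):
    """Build a single Link: header line listing all language alternates.

    `alternates` maps a language code to an absolute URL. Each entry is
    rendered as <URL>; rel="alternate"; hreflang="xx", comma-separated.
    """
    parts = [
        '<%s>; rel="alternate"; hreflang="%s"' % (url, lang)
        for lang, url in sorted(alternates.items())
    ]
    return "Link: " + ", ".join(parts)


# Hypothetical input reusing the Menlo example with absolute URLs.
header = link_header({
    "ga": "https://ga.wikipedia.org/wiki/Mionloch",
    "en": "https://en.wikipedia.org/wiki/Menlo,_County_Galway",
})
print(header)
```

The payload is essentially the same size as the `<head>` block, so the gain is only in being able to strip it for clients that do not need it.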

@Stu, could you possibly ping John Mueller to ask if targeting Googlebot in this fashion (i.e., by sending it Link: headers that we don't send to other clients) is fair game?

@ori -- I sent an email with cc to you and @Wwes. Feel free to reply all in case I missed anything. It's good to get you and Wes connected with John anyhow so I'm not slowing communication down.

Please also update here when we find what's up with Link: one way or another.

Heard some feedback from the team at Google:

It looks like you guys aren't big fans of our sitemaps' page? :)

I think you can avoid yourselves a lot of headaches around speed by using the actual sitemap to inform us of the alternate pages: https://support.google.com/webmasters/answer/2620865?hl=en

That seems brittle to me in production, but perhaps solvable. There's a 50,000 URL limit for each sitemap file, and with ~35 million articles and lots of connections among 288 different URLs e.g. en.wikipedia.org and de.wikipedia.org we'd need to generate maybe 10,000 different sitemap files and create a sitemap index, then figure out if we’ve verified all 288 different wikipedias because those 15k sitemaps are all cross-site. Then we’d need to build something to maintain those and republish regularly.

Regardless of that, a sitemap would be a super convenient answer for a test. @ori, if you haven't done a production test already, how about doing it with a sitemap file? We can keep debating the best production solution.

I looked at using sitemaps. It won't be easy. The way Google sets up sitemaps for hreflang requires a complete redundant list of alternates for each page. This leads to large sitemaps.

The attached script (F201427: sitemap-builder-script.php) generates a first pass of a sitemap (F201428: sitemap.xml) for a single topic -- Pluto -- which has something like 167 different alternates on different language wikipedias. The sitemap has to list all 167 alternates for each one of those 167 alternates, which leads to a sitemap with 167*167 or 27,889 links. Since Google limits sitemap files to 50,000 links each, our millions of article topics would likely require hundreds of thousands of unique, regularly updated sitemap files.
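To make the quadratic growth concrete, here is a sketch of a sitemap in the `xhtml:link` format from the Google help page linked earlier. This is not the attached PHP script; the language codes and URLs are synthetic placeholders, and only the count of 167 comes from the Pluto example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_sitemap(alternates):
    """Build a <urlset> where every variant lists every variant as an alternate."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for _, page_url in sorted(alternates.items()):
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = page_url
        # The full list of alternates is repeated under each <url> entry.
        for lang, alt_url in sorted(alternates.items()):
            ET.SubElement(
                url,
                "{%s}link" % XHTML_NS,
                {"rel": "alternate", "hreflang": lang, "href": alt_url},
            )
    return urlset


# Synthetic stand-ins for the 167 language variants of one topic.
pluto = {"l%03d" % i: "https://l%03d.example.org/wiki/Pluto" % i for i in range(167)}
urlset = build_sitemap(pluto)
link_count = len(urlset.findall(".//{%s}link" % XHTML_NS))
print(link_count)
```

`ET.tostring(urlset, encoding="unicode")` would serialize the result; for a topic with n variants the file carries n entries and n*n alternate links.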

At this point I think we should just deploy the existing patch to add the hreflang alternates to the head for a set of high traffic articles so we can see the impact. We can measure the search visibility result within 2-3 days by seeing if the Google Search Console shows an increase in local language wikis being shown in search results (e.g. for hiwiki). Then we can make an informed judgement about the performance tradeoffs and whether it's worth the investment in sitemaps or trying to make @ori 's googlebot-only-http-header hack work.

I think we should do more like 100 articles, rather than just 5 or 10, so the impact is obvious looking at the Google Search Console stats.

F201448 (en-wikipedia-org_20150718T214126Z_SearchAnalytics.csv) is a .csv of the top 500 most-clicked enwiki articles in Google searches last week; we could start with the top 100.

One last note. I see that sr and zh wikipedia pages are currently generating hreflang tags for each language's alternate scripts (e.g. today's zh featured article). Probably should exclude those languages from this test.

If we go with @Smalyshev 's extension approach, let's be sure to also remove the code that generates the incorrect x-default tag on pages. IIRC Google just ignores hreflang tags if it gets conflicting or incorrect ones on a page, so if we leave the wrong one in, it would nullify our fix.

@Stu I'm not sure I can do that as an extension, but we definitely can make a patch that just removes the x-default part. That would be an easy one to get through, since it shouldn't break anything; it only stops us from doing the wrong thing.

In T93213#1454487, @Stu wrote:

It looks like you guys aren't big fans of our sitemaps' page? :)

In T93213#1465459, @Stu wrote:

I looked at using sitemaps. ...The sitemap has to list all 167 alternates for each one of those 167 alternates, which leads to a sitemap with 167*167 or 27,889 links. Since Google limits sitemap files to 50,000 links each, our millions of article topics would likely require hundreds of thousands of unique regularly updated sitemap files.

And that's why we are no fans :)

@Smalyshev, from a few spot checks it looks like the bad x-default line isn't being generated on article pages anymore but is still showing up in other namespaces e.g. on https://en.wikipedia.org/wiki/Wikipedia:Contact_us or https://en.wikipedia.org/wiki/Portal:Contents or https://en.wikipedia.org/wiki/Help:Contents.

@Stu I see x-default header on https://zh.wikipedia.org/wiki/%E7%A5%9E%E5%A5%87%E8%83%B8%E7%BD%A9 - not sure if it should be there. Don't see it on https://en.wikipedia.org/wiki/Portal:Contents though. Maybe it depends on some settings?

I see x-default header on https://zh.wikipedia.org/wiki/%E7%A5%9E%E5%A5%87%E8%83%B8%E7%BD%A9

@Smalyshev, that's interesting. Yep, it looks like zh and sr generate hreflang alternates for different scripts, I think with this code. AFAIK that implementation is pretty good, and if they do some auto-redirecting then x-default could belong. See https://phabricator.wikimedia.org/T54429.

Don't see it on https://en.wikipedia.org/wiki/Portal:Contents though

I see it when logged out / incognito but not when logged in. That makes some sense -- in theory we only need to show hreflang tags to logged out users since search engines are always logged out.

cc @Krinkle who I saw on a few related commits e.g. https://phabricator.wikimedia.org/rMW208983b6d1f5383b4d58b31ca092e87605e05e0d

in theory we only need to show hreflang tags to logged out users since search engines are always logged out.

That's a good point. Probably need to add it to the extension too.

Can someone summarise where we are at now? Preferably by updating the description. I have a patch open for mobile that I'm keen to get resolved one way or the other.

Are there open questions that need answering? Are we sure hreflang tags in the head are the best solution?

Change 228418 had a related patch set uploaded (by Smalyshev):
T93213: drop x-default tag

https://gerrit.wikimedia.org/r/228418

@Stu - so what is our status here - are we going with tags in <head> or Link: header?

I've also made a patch that drops x-default: https://gerrit.wikimedia.org/r/#/c/228418/

Change 228418 had a related patch set uploaded (by Santhosh):
Fix the wrong usage of x-default tag

https://gerrit.wikimedia.org/r/228418

The Google help page says that rel=alternate hreflang=x should only be used for cases where the content is fully translated, in order to "serve the correct language or regional URL in Search results". To me, this implies that "incorrect" languages will be completely suppressed from the search results. So if you are searching in Welsh, you will only see the Welsh version of a given page, even if the English one is 100 times larger.

Wikipedia is quite a big and important site, so I don't think we should use it for this sort of experimentation unless we are really sure that we know what we are doing. We should either have Google staff in the loop, or use a test site to confirm that the results are limited to ranking rather than complete removal of search results.

As for x-default being broken, I don't think there is enough information on the Google help page to determine that our usage of it is incorrect. I think it is likely that x-default is used as a search result when the client's desired language can't be determined, in which case the existing usage is correct, as long as rel=alternate continues to be used only for variants. But just because I think it's likely doesn't mean it is true. Again, I think we should either ask relevant Google engineers or run experiments to find out for ourselves.

We should either have Google staff in the loop

Repeating comment at https://phabricator.wikimedia.org/T93213#1389405:

See comments upthread and in initial discussion that this whole ticket began with a comment from John Mueller at Google's Webmaster Tools team about the addition of <hreflang> block to link different language articles on a topic: "That sounds like a great use of hreflang!" So we can be pretty confident this is consistent with best practice.

@Stu, just because google staff says that it's good for them doesn't mean that it's good for us. And kicking something off with a cheerleading comment, isn't the same as being actively in the loop either.

To me, this implies that "incorrect" languages will be completely suppressed from the search results. So if you are searching in Welsh, you will only see the Welsh version of a given page, even if the English one is 100 times larger.

Especially this is something that I still haven't seen a proper answer on by Google and which would seem highly undesirable to me. We would have to get an answer on this.

And lastly, I still don't see why, from a technical perspective, Google needs these in the HEAD. It might be what the standard demands, but perhaps our use case should lead to a re-evaluation of the standard.

Jdlrobson changed the task status from Open to Stalled. Oct 1 2015, 5:30 PM

I don't know what's going on here anymore.

AFAIK, Brion's comment in 2009 still applies. Nice to work on this but better implement a minimal fix of T3433 first.

Smalyshev lowered the priority of this task from Medium to Low. Mar 10 2016, 10:44 PM

Change 218770 abandoned by Jdlrobson:
Disable generation of hreflang headers for mobile

Reason:
Abandoning due to inactivity on patch and bug. Feel free to restore this at a later date.

https://gerrit.wikimedia.org/r/218770

Change 198696 abandoned by Santhosh:
Generate hreflang tags in <head> of article pages

Reason:
Abandoning since the bug is stalled.

https://gerrit.wikimedia.org/r/198696

Abandoning since the bug is stalled.

I don't quite understand this rationale...

I don't quite understand this rationale...

Nor do I.

See also T141506#2517800 (for several Wikipedias, the number of hreflang tags as counted by Google Search Console dropped to basically zero recently)

I've been consulting with Go Fish Digital, a search engine optimisation firm, and this is something they recommended to us. In particular, they said to remove the hreflang from the links visible to users in the sidebar, and then add hreflang tags in the <head> section.

Note that AFAIK Google is using Parsoid-format HTML for indexing, not our PHP front end output. Also: they have their own dedicated pipeline for Wikipedia content. Not sure that adding hreflang tags to the PHP output will do anything at all. We used to emit interlanguage links in Parsoid output ( https://www.mediawiki.org/wiki/Specs/HTML/1.6.x#Language_links ) but I think most (all?) of these went away when we switched to using Wikidata for interlanguage links.

It would be worth talking to google about their pipeline; it may be easier to have them directly integrate info from wikidata into their search engine -- T198946 asks to have them better link the wikidata item for a page, and the interlanguage links are properties on that item.

Change 228418 abandoned by DCausse:
[mediawiki/core@master] Fix the wrong usage of x-default tag

Reason:
bug stalled

https://gerrit.wikimedia.org/r/228418