
Page gets redirected randomly to former blackout page
Closed, DeclinedPublic

Description

On March 25th at 0:15, the Catalan Wikipedia redirected all pages to a blackout page (https://ca.wikipedia.org/wiki/Viquip%C3%A8dia:Comunicat_24_de_mar%C3%A7) by changing MediaWiki:Common.js (see https://ca.wikipedia.org/w/index.php?title=MediaWiki:Common.js&action=history). Later that day, the change was undone, and the redirect (to a different blackout page) was performed via Meta's Central Notice instead. The whole thing was undone on March 26th.

However, the Pageviews tool still shows around 1700 hits per day on the first blackout page. See https://tools.wmflabs.org/pageviews/?project=ca.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Viquip%C3%A8dia:Comunicat_24_de_mar%C3%A7. I thought it might be a malfunction of the Pageviews tool, but I actually watched a friend of mine (on an iPhone, with Safari) try to access an article after a Google search and get the blackout page. The same person accessed other articles seconds later without any problems, yet almost a week later, that first article was still blacked out.

Any idea on what may be happening?

Event Timeline

Joutbis created this task.May 15 2019, 7:10 PM
Restricted Application added a subscriber: Aklapper. May 15 2019, 7:10 PM
ArielGlenn added a subscriber: ArielGlenn.

Sounds like we need a redo of T199252, but for the Catalan Wikipedia. Performance team folks spearheaded this the last time; do they want to take it on again? Adding their project to get their take on it.

Adding @BBlack also because at least for a while he knew the state of these site maps. (Feel free to take yourself off this if that's no longer true.)

Vgutierrez added a subscriber: Vgutierrez.EditedMay 16 2019, 6:58 AM

This issue can be reproduced by searching for "lliga de campions 2017" on Google in a mobile browser; the first result pointing to ca.wikipedia.org is https://ca.m.wikipedia.org/wiki/Viquip%C3%A8dia:Comunicat_24_de_mar%C3%A7

Gilles added a subscriber: Gilles.May 20 2019, 8:17 PM

Ian Marlier participated in the site maps project as his own personal initiative, but it has always been out of scope for the Performance Team. His knowledge of that project left with him, so we're no better equipped than anyone else to do something about this.

Gilles moved this task from Inbox to Radar on the Performance-Team board.May 20 2019, 8:17 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.
Krinkle added a subscriber: Krinkle.

I believe Readers took over the site maps rollout and the analysis in the Google Search Console, but I'm not sure. Tagging Readers-Infra as a first guess, but please re-triage to another Reading team as needed.

Krinkle triaged this task as High priority.May 20 2019, 8:36 PM
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.
Krinkle moved this task from To Triage to Active Situation on the Wikimedia-Incident board.

This doesn't look like anything RI has worked on. @Tgr @phuedx does this ring a bell to you?

I'm currently the product owner for sitemaps and other SEO-focused changes, but the technical work has been split across multiple teams, with Ian, as mentioned earlier, leading the technical work on sitemaps. Over the past few months, @mpopov has been working on the analysis (T209720: Determine impact of sitemaps on search traffic to Indonesian, Portuguese, Punjabi, Dutch, and Korean Wikipedias) of the sitemaps which Ian deployed back in November. We were planning to schedule any further deployments or technical work after the results are in, although this definitely seems like a special case.

phuedx added a comment.EditedMay 21 2019, 3:39 PM

> Ian Marlier participated in the site maps project as his own personal initiative, but it has always been out of scope for the Performance Team. His knowledge of that project left with him, so we're no better equipped than anyone else to do something about this.

Indeed. Reading through the history of T199252 and doing a little digging in operations/puppet, I think this is the current state of the sitemaps… project:

For the record, the blackout page is still the second most viewed page on the mobile web, right after the main page.

mpopov added a comment.EditedMay 28 2019, 3:55 PM

Can confirm that the traffic to that page is almost entirely search-referred visits.

Neither my past itwiki analysis nor my current initial analysis shows sitemaps having any significant effect either way, so if a sitemap would fix the issue then we should be good to go with it. Alternatively, mobile-web search-referred traffic to that page is on a downward trend, with big drops every now and then as Google does indexing passes, so we may be fine just letting it go.

We may want to think of a solution the community can employ for these kinds of blackouts that doesn't require a sitemap generation & deployment after the fact. Just a thought.

Is anybody working on that?

> We may want to think of a solution the community can employ for these kinds of blackouts that doesn't require a sitemap generation & deployment after the fact. Just a thought.

Definitely something that should be looked at. I haven't dug into how these blackouts are executed, but I suspect it's site JS doing a redirect (possibly a 301 "Permanent" in effect?) that search engines canonicalize. Someone should probably look into whatever common method site admins use for this and suggest a different technique that at least looks more temporary to search engines. Or maybe we should write an extension we can deploy that gives these administrators an easy way to do it correctly, without these unintended long-term consequences?

For that matter, sitemaps themselves should probably get a more holistic solution implemented within the scope of MediaWiki. I don't think the current sitemaps.wikimedia.org solution was ever intended to be permanent, just expedient. Google needs the data to come from the same domain (e.g. https://it.wikipedia.org/sitemap), which is why those Varnish hacks are in place to internally rewrite the requests to the separate sitemaps server that hosts their static (but updatable) content. In the long run, if we want sitemaps, it would seem better to have MW itself serve them at their native URIs, with maintenance scripts to update the sitemap content periodically on the appservers themselves.
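For reference, the sitemap payload itself is plain XML per the sitemaps.org protocol, so generating one is straightforward. A minimal illustrative sketch (the function name and escaping are my own; this is not the actual generator that was deployed):

```javascript
// Illustrative sketch: build a sitemap XML document for a list of
// article URLs, following the sitemaps.org 0.9 protocol. Real page
// titles would need full XML escaping; this covers only &, <, >.
function buildSitemap(urls) {
  function escapeXml(s) {
    return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
  }
  var entries = urls.map(function (u) {
    return '  <url><loc>' + escapeXml(u) + '</loc></url>';
  });
  return '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    entries.join('\n') + '\n' +
    '</urlset>';
}
```

The protocol also caps each file at 50,000 URLs, so a full wiki needs a sitemap index pointing at multiple files, which is part of what makes generation and deployment non-trivial at Wikipedia scale.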

In the short term, for this particular case, we (SRE/Traffic) can definitely update that Varnish config to allow some more sitemap rewrites, and can sort out any trivial hurdles in getting sitemap data uploaded to the sitemaps server, but I have no idea how to generate/validate the actual sitemap data itself.

Tgr added a comment.May 30 2019, 1:34 PM

Can you even trigger a 301 from JavaScript? I don't think it's possible.

Krinkle added a comment.EditedMay 30 2019, 3:36 PM

> We may want to think of a solution the community can employ for these kinds of blackouts that doesn't require [..]

> [..] I suspect it's site JS doing a redirect (possibly 301 "Permanent" in effect?) that search engines canonicalize. Someone should probably look into whatever common method site admins are using for this [..]

We already researched this last year. In short:

  1. JavaScript overlays. This is how we did the big SOPA blackout on English Wikipedia. We just made the page invisible with CSS and overlaid the blackout page dynamically. No change in the server response. No redirect, reload, or forwarding of any kind, and no dynamic removal or modification of content; we only added a full-screen modal. This meant that even smart search engines that execute and wait for JS would still find the complete article content where it normally would be. The page just has an interstitial modal (like cookie banners do), except that we didn't offer a way to close it.
  2. JavaScript reloads. This is what Italian Wikipedia did last year (details at T199252). It basically means MediaWiki:Common.js on the wiki executed the following code: window.location.href = 'https://example/wiki/Project:Black_out_page';. While completely non-standard and undocumented (to my knowledge), it turned out that at least Google, when executing the JS in its crawler, took this to be logically equivalent to a redirect and decided to index the next navigation and its URL instead, triggering behaviour in Google's crawler similar to when <link rel=canonical> is used or a 301 redirect is encountered.
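The overlay approach above needs no navigation at all. A minimal illustrative sketch (not the actual SOPA code; the element id and message text are made up):

```javascript
// Illustrative sketch of an overlay-style blackout. The article content
// stays in the DOM untouched, so crawlers that execute JS still see it;
// readers only see the full-screen interstitial drawn on top.
function showBlackoutOverlay(doc) {
  var overlay = doc.createElement('div');
  overlay.id = 'blackout-overlay'; // made-up id
  overlay.style.cssText =
    'position:fixed;top:0;left:0;width:100%;height:100%;' +
    'background:#000;color:#fff;z-index:10000;padding:2em;';
  overlay.textContent =
    'This wiki is blacked out in protest. No content has been removed.';
  doc.body.appendChild(overlay);
  return overlay;
}

// In MediaWiki:Common.js this would be invoked as:
//   showBlackoutOverlay(document);
```

Because the server response and the page URL never change, there is nothing for a crawler to canonicalize, which is exactly why this variant avoided the sticky-redirect problem.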

> maybe we should write an extension we can deploy that gives these administrators an easy way to do it correctly

Already requested and researched. See T198890#4550160. Blocked on having a use case, priority, and resourcing.


This task is about fixing the mess on Italian Wikipedia, because the sitemaps work helped but ultimately did not fix it. Google still has a seemingly sticky, outdated memory of many articles as being "redirected" to this blackout campaign, despite that not having been the case for over a year.

Krinkle closed this task as Declined.Nov 8 2019, 6:24 AM

I assume the lack of interest means the bad routing of some Google results for Italian Wikipedia to last year's blackout page will not be further investigated, and will instead slowly but surely (hopefully) recover on its own.

I'm closing this as such. However, it's a strong signal for future efforts in this area: do not follow the methodology itwiki used.