
Increasing server errors on Wikipedia.org portal page
Closed, DeclinedPublic

Description

It looks like a large number of Googlebot requests have been receiving server errors recently, and we need to make sure it's not something on our end.

The errors seem to be limited to the desktop experience. A few of the URLs that are throwing them (mostly 429 responses, with a few 503s as well):

https://www.wikipedia.org/wiki/Women's_National_Basketball_Association 429 error
https://www.wikipedia.org/wiki/Bulyea_Heights,_Edmonton 429 error
https://www.wikipedia.org/wiki/Sweet_Pea_(song) 429 error
https://www.wikipedia.org/wiki/Stanley:_The_Search_for_Dr._Livingston 429 error
https://www.wikipedia.org/wiki/Michael_Green_(writer) 503 error

We've also noticed a recent change in Google referrals: when a user searches for "wikipedia", Google now offers a direct in-Google search option rather than directing the user to www.wikipedia.org.

Our dashboard shows a recent decline as well:

portal-search_engine_referrals-oct2017.png (498×887 px, 84 KB)

Event Timeline

From @Jdrewniak:

This is odd to say the least.

We haven't made any functional changes to www.wikipedia.org since August 24 when we fixed a minor bug on the search suggestion feature.
https://phabricator.wikimedia.org/diffusion/WPOR/history/master/

Let's go through this line by line.

Lines 2 - 7
The first few of these links are translation files from our asset folder that we use to serve the translated page, like
https://www.wikipedia.org/portal/wikipedia.org/assets/l10n/de-857bf7dc.json

These files are often deleted and replaced with new files whenever a translation changes.
So the file above would have been deleted and replaced with this file:
https://www.wikipedia.org/portal/wikipedia.org/assets/l10n/de-99bbc3a1.json

We've done a lot to make sure these translation files are invalidated & updated properly in Varnish, so I have no idea why Googlebot would be requesting these files. Googlebot might be caching an older version of the index.html file, which is responsible for fetching the translation files. From the spreadsheet though, it looks like this has only happened 4 times within a short span of time in September, so I don't think that's the big issue here.
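The filenames above carry a content hash (e.g. de-857bf7dc.json), so any change to a translation produces a new URL and the old one 404s. A minimal sketch of how such a cache-busting name can be derived; the exact hash algorithm and length are assumptions, not necessarily what the portal's build actually uses:

```python
import hashlib
import json

def hashed_asset_name(lang, translations):
    """Derive a cache-busting filename from the file's content,
    in the style of 'de-857bf7dc.json'. The md5/8-hex-char scheme
    here is illustrative, not the portal's actual build step."""
    payload = json.dumps(translations, sort_keys=True).encode("utf-8")
    digest = hashlib.md5(payload).hexdigest()[:8]
    return f"{lang}-{digest}.json"

# Same content always yields the same name; changed content yields a new one.
name = hashed_asset_name("de", {"search": "Suche"})
```

This is why a crawler holding a stale index.html keeps asking for URLs that no longer exist.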

(I have no idea why it would request a file like "index-non-existingX.js" though...)

Lines 8 - 19
The links starting with "https://www.wikipedia.org/search-redirect.php?" are curious because they span a greater length of time, from May to October. These links are how we do search on the portal. A full search URL has parameters like "search" for the query, and "language" for the Wikipedia language version.
https://www.wikipedia.org/search-redirect.php?family=wikipedia&language=en&search=poop&language=de&go=Go
The URLs in the spreadsheet however, are malformed, and some are literally truncated, ending with "dot dot dot" like this one:
line10: https://www.wikipedia.org/search-redirect.php?family=wikipedia...
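A quick sketch of what a well-formed search-redirect URL looks like versus the truncated ones in the spreadsheet, using the parameter names from the example above (the is_well_formed helper is mine, just for illustration):

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://www.wikipedia.org/search-redirect.php"

def search_url(query, language="en", family="wikipedia"):
    # Assemble a redirect URL like the portal's search box produces;
    # parameter names are taken from the example in this thread.
    params = {"family": family, "language": language,
              "search": query, "go": "Go"}
    return BASE + "?" + urlencode(params)

def is_well_formed(url):
    # Sanity check for the malformed/truncated spreadsheet URLs:
    # a usable link must carry both 'search' and 'language'.
    qs = parse_qs(urlsplit(url).query)
    return "search" in qs and "language" in qs

good = search_url("poop", language="de")
bad = "https://www.wikipedia.org/search-redirect.php?family=wikipedia"
```

The truncated URLs fail exactly this kind of check, which is consistent with something re-using the form incorrectly.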

Maybe Google (or someone else) is trying to use the same search input we have on the portal in an app or different webpage, and they're just using it wrong?

Line 20 - 1019
The thing that sticks out about these links is that they all returned a 429 status code. That means "Too many requests". That's our server saying "stop it". This is usually caused by something spammy, and the fact that they all occurred between October 27th and 29th leads me to believe this was some sort of Bot-gone-wild incident. The fact that these links all come from www.wikipedia.org doesn't necessarily mean that we put them there. There could be apps or browser extensions that modify the page and do weird things.

Next steps
It's hard to tell why someone or something would be trying to access these links without more information.

We could ask one of our analysts (Mikhail, Chelsy, I see you on this thread) to look at the webrequest analytics for the 429 requests, to see if they indicate bot activity or not, like if they all come from the same IP or have the same user-agent.
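The analysis suggested above boils down to grouping the 429s by client. A toy sketch of that aggregation; the record field names (ip, user_agent, status) are illustrative, not the actual webrequest schema:

```python
from collections import Counter

def top_clients(requests, status=429, n=5):
    """Count which (ip, user_agent) pairs produced the given status.
    A single dominant pair across thousands of 429s would point to
    one bot; a broad spread would suggest something page-level."""
    hits = Counter((r["ip"], r["user_agent"])
                   for r in requests if r["status"] == status)
    return hits.most_common(n)

# Tiny illustrative sample (RFC 5737 documentation IPs, made-up agent).
sample = [
    {"ip": "203.0.113.7", "user_agent": "ExampleBot/1.0", "status": 429},
    {"ip": "203.0.113.7", "user_agent": "ExampleBot/1.0", "status": 429},
    {"ip": "198.51.100.2", "user_agent": "Mozilla/5.0", "status": 200},
]
```

In practice this would run against the webrequest data in Hive rather than in-memory dicts, but the question being asked is the same.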

Also, we could look into building a sitemap.xml for Google, listing all the URLs that we want Google to crawl: https://support.google.com/webmasters/answer/183668?hl=en#addsitemap That might at least tell Google to stop crawling the old translation files, and maybe even get it to ignore the illegitimate links.
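For reference, a sitemap is just an XML file in the sitemaps.org format. A minimal sketch of generating one from a URL list (the build_sitemap helper is hypothetical, not existing portal tooling):

```python
from xml.etree.ElementTree import Element, SubElement, tostring

def build_sitemap(urls):
    # Emit a minimal sitemap.xml per the sitemaps.org 0.9 protocol,
    # containing only a <loc> entry per page we want crawled.
    urlset = Element("urlset",
                     xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url in urls:
        loc = SubElement(SubElement(urlset, "url"), "loc")
        loc.text = url
    return tostring(urlset, encoding="unicode")

xml = build_sitemap(["https://www.wikipedia.org/"])
```

Anything not listed (like the retired translation-file URLs) is then implicitly de-prioritized for crawling.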

We'll keep this open for a bit longer, just to see if these errors come back. Thanks, @Jdrewniak !

debt lowered the priority of this task from High to Low.Nov 2 2017, 3:48 PM
debt moved this task from Backlog to Done on the Discovery-Portal-Sprint board.

Looks like things have settled down for now, closing this as declined, since there isn't any work for us to do.

portal-search_referral-numbers_7Nov2017.png (500×1 px, 123 KB)