Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster)
Open, High, Public

Description

There are currently intermittent page load issues on the German Wikipedia. These can be caused by HTML documents loading slowly, especially on special pages, or resources loaded through the resource loader. I am seeing page load times of upwards of 150 seconds.

The issue affects different users with different browsers and internet service providers. The problem has appeared before, for example on 6th/7th June 2019. At the time, the explanation given on WP:Fragen zur Wikipedia was that it was related to ongoing software deployments.

Update: Evidence so far suggests that users in several European countries are affected, including Switzerland, Germany, Finland and France. This would probably point to some problem in the Amsterdam cluster.

Event Timeline

Pruem renamed this task from Sometimes some pages load slowly on de.wp in Europe (due to some factor outside of Wikimedia cluster) to Sometimes some pages load slowly for European users (due to some factor outside of Wikimedia cluster).Thu, Jun 20, 3:08 AM
Pruem updated the task description.
CDanis added a subscriber: ema.Thu, Jun 20, 3:24 AM

My guess is that the beginning of this problem correlates with the beginning of the fetch failures in the first graph panel here:
https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-7d&to=now

I think @ema was taking a look at some thread exhaustion within Varnish in esams? Although based on https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?panelId=16&fullscreen&orgId=1&from=now-7d&to=now I am not sure that this is the same problem.

Some additional observations:

I'm now half-way in the process of rebooting all cache nodes with a patched kernel fixing CVE-2019-11477 and related vulnerabilities. Once I'm done with that, we can re-enable SACK and see if that changes anything. CC @MoritzMuehlenhoff.

ema moved this task from Triage to Caching on the Traffic board.Thu, Jun 20, 9:13 AM
Zache added a subscriber: Zache.Thu, Jun 20, 9:15 AM
Tomybrz updated the task description.Thu, Jun 20, 9:15 AM
Tomybrz added a subscriber: Tomybrz.
Tgr added a subscriber: Tgr.Thu, Jun 20, 9:19 AM

Not limited to dewiki / Germany (unsurprisingly); there have been a bunch of reports from huwiki/Hungary as well.

I saw many slow en wiki page loads yesterday, including missing skins. (Logged in user, Europe.)

@Yann: There is nothing new in that thread? Feel free to add {{tracked|Txxxxxx}} on-wiki in such cases.

I'm interested in seeing this problem solved. What I can say is that a common template (https://www.mediawiki.org/wiki/Template:Graph:Lines), whichever wiki it is used on (de, fr, it, etc.), does not render the graph if it is linked to a Wikidata query.
What is funny is that if you hit "preview", the graph is viewable. But after saving the changes, no graph appears, only a blank square.
Thanks for your work!

Seems to be better. Solved by the rebooting mentioned by @ema ? I don't experience that slowness anymore.

Wurgl added a comment.Thu, Jun 20, 2:03 PM

I agree with @Gestumblindi!

I just tried and everything is less than 3 seconds.

PM3 added a comment.Thu, Jun 20, 2:39 PM

+1, everything running smoothly now, including API queries.

PM3 renamed this task from Sometimes some pages load slowly for European users (due to some factor outside of Wikimedia cluster) to Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster).Thu, Jun 20, 2:40 PM
ema added a comment.Fri, Jun 21, 4:57 AM

@CDanis Your guess seems spot on, fetch errors went to essentially 0 at 2019-06-20T10:00 and after that we started getting the first reports that the problem was fixed. See also the graph of "Resource temporarily unavailable - straight insufficient bytes" errors: https://logstash.wikimedia.org/goto/4f9a6fdf4665b439713b7e40226512dd. "Resource temporarily unavailable" is EAGAIN, indicating a timeout.

Speculation: @Wurgl reported response times of ~180 seconds. There are 3 varnishes between esams users and the application layer (esams varnish-fe and varnish-be, eqiad varnish-be). A theory for what might have happened is thus that responses affected by the issue got delayed by a 60s timeout happening 3x. Cache reboots for (unrelated) kernel/varnish upgrades solved whatever weird state varnish got itself into.
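The arithmetic behind this speculation can be sketched as follows. The three layers and the 60-second value come from the comment above. The model itself (each layer waiting out one full timeout, one after another) and all names in the code are illustrative assumptions, not taken from the actual Varnish configuration.

```python
# Hypothetical model: every cache layer hits its timeout once, in sequence.
LAYERS = ["esams varnish-fe", "esams varnish-be", "eqiad varnish-be"]
TIMEOUT_SECONDS = 60  # per-layer timeout assumed in the speculation above


def worst_case_delay(layers, timeout_seconds):
    """Total delay if each layer waits out one full timeout, back to back."""
    return len(layers) * timeout_seconds


total = worst_case_delay(LAYERS, TIMEOUT_SECONDS)
print(f"{len(LAYERS)} layers x {TIMEOUT_SECONDS}s = {total}s")
```

Under these assumptions the total matches the ~180 seconds reported, which is consistent with the theory but does not confirm it.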

Wurgl added a comment.Fri, Jun 21, 5:35 AM

@ema It was not a single timeout. A chunk of data (the start of the page) was rendered by the browser very fast, but then it got stuck. From that moment on, the browser added the rest line by line: sometimes data arrived character by character, sometimes in larger chunks. When I scrolled down to the end of the page, I could read the data faster than it arrived. As I said in the IRC channel: slow like an acoustic coupler.

Jheald added a subscriber: Jheald.Fri, Jun 21, 10:41 AM

^^ I saw exactly what Wurgl reports, browsing long pages on en-wiki (eg the Fram case) in London. Now seems okay.

Pruem added a comment.Sat, Jun 22, 4:32 AM

Shall we set this to resolved then? The active discussion on how to prevent this in the future seems to be taking place under another ticket.

Krinkle closed this task as Resolved.Sat, Jun 22, 2:07 PM
Krinkle assigned this task to ema.
Wurgl reopened this task as Open.Sun, Jun 23, 10:18 AM

Sorry to reopen that issue, but the behaviour is back :-(

I see the slowness again. But when running that test loop, it shows up in just one of 100 tries: again 3 minutes, 1.783 seconds. The other 99 tries all take between 1.4 and 2.9 seconds.
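A test loop of the kind described here could look roughly like the sketch below. It is not the script actually used in this task: the function names, the 300-second request timeout, and the 10-second "slow" threshold are all assumptions chosen for illustration, and the URL is left to the caller.

```python
import time
import urllib.request


def fetch(url):
    """Download the full response body for a caller-supplied URL."""
    with urllib.request.urlopen(url, timeout=300) as response:
        response.read()


def time_requests(url, tries=100, fetch_fn=fetch, clock=time.monotonic):
    """Return the duration in seconds of each of `tries` sequential fetches."""
    durations = []
    for _ in range(tries):
        start = clock()
        fetch_fn(url)
        durations.append(clock() - start)
    return durations


def summarize(durations, slow_threshold=10.0):
    """Count fetches slower than `slow_threshold` and report the range."""
    slow = [d for d in durations if d > slow_threshold]
    return {
        "tries": len(durations),
        "slow": len(slow),
        "fastest": min(durations),
        "slowest": max(durations),
    }
```

Calling `summarize(time_requests(url))` with a page URL of your choice would then show how many of the 100 fetches were outliers and how far the slowest one was from the rest.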

I just experienced this too. One minute it’s fast, the next it’s really slow.

PM3 added a comment.Sun, Jun 23, 10:56 AM

I don't experience any problems since Thursday. Today also everything is running smoothly.

One out of 100 could be just some normal network congestion or temporary local system lag.

Wurgl added a comment.Sun, Jun 23, 2:04 PM

It is strange, really strange. I have seen that slowness three times within a few minutes on my watchlist, but now I do not see it anymore? Did the servers take a break for a cup of tea?

Vort added a subscriber: Vort.Sun, Jun 23, 5:39 PM

I did not read all the comments, but I want to say that this problem is much older than several weeks.
It has occurred for years, but only very rarely.

ema added a comment.Mon, Jun 24, 4:31 PM

We believe that Varnish fetch failures might be related to this issue; the investigation is ongoing in T226375.

@Community-Relations Just a heads-up in case you've heard anything around this in the communities related to slow page views or time outs. These kinds of reports are usually out of our control and all unrelated to each other, but many of the reports in the past 7 days turn out to be related to each other and in an area we are responsible for.

It should have mostly settled by now, but look out for anything new :)

Gilles moved this task from Inbox to Radar on the Performance-Team board.Mon, Jun 24, 8:00 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.

(The task is titled "European users", but more precisely it affects users whose Wikipedia connections we route to the Amsterdam cluster. This includes many areas one might think of as being outside Europe.)

ema renamed this task from Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) to Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster).Tue, Jun 25, 8:07 AM
Lofhi added a subscriber: Lofhi.Tue, Jun 25, 4:11 PM
Zache added a comment.Thu, Jun 27, 7:52 AM

Reported again on fiwiki by three users. One also commented that the mobile site was working better.

Can we get approximate times for these last reports, @Zache ? (UTC if possible)

Ejs-80 added a comment.EditedThu, Jun 27, 8:14 AM

@ArielGlenn, two fiwiki users reported this between 03:00 and 03:35 (UTC), and one user (who said that only the mobile phone interface worked) at 07:49, 27 June 2019 (UTC).

The 03:00-03:35 incidents were likely related to network issues we were experiencing during that time. The later report seems like a separate incident.

Elitre added a subscriber: Elitre.Thu, Jun 27, 10:38 AM

Ciao, sorry for not replying earlier - in the future, we recommend using the Specialists Support tag and exploring better options for quick interventions at https://office.wikimedia.org/wiki/Community_Relations#Support :)

Personally, I haven't seen anything other than https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#was_en.WP_out_of_service_just_now? (VPT is a good place to look for reports as it's often also used by non en.wp editors.)

If there's anything specific we should ask around and any wiki you want targeted with such requests, please let us know. TYVM!

It is something we can ask in Tech News if needed.

I'm writing the next issue now: how would you describe the problem in one or two short sentences?

To clarify, Trizek isn't asking me.

Risker added a subscriber: Risker.Fri, Jun 28, 3:44 AM
ema added a comment.Fri, Jun 28, 6:24 AM

@Trizek-WMF: personally, I've tried and failed for days to reproduce the issue. My understanding is that occasionally some page loads for logged-in users take a very long time to complete, with characters slowly showing up on the screen.

I'm not sure if people are still encountering this.

Pruem added a comment.EditedFri, Jun 28, 7:22 AM

@Trizek-WMF: It would help if you stated what exactly you are interested in: the problem resolution from a technical perspective or the end user experience. The end user experience is well summed up in Gestumblindi's above comment of Tue, Jun 18, 23:05 (the ''acoustic coupler'' metaphor). At least that is how I experienced it.

Vort added a comment.Fri, Jun 28, 12:02 PM

I reproduced the problem just now, on ruwiki.

TheDJ added a subscriber: TheDJ.EditedFri, Jun 28, 12:20 PM

It just happened to me again as well. Unfortunately I didn't have the inspector open when loading the page, so I couldn't capture any of the HTTP headers. The next call was just fine. This occurred around 28 Jun 2019 12:17:11 GMT.

Vort added a comment.Fri, Jun 28, 12:56 PM

Here is a screenshot of the laggy connection:

As you can see, the packets are 622 bytes in size and arrive once per second, which means a speed of about 0.5 KiB/s.
There are no retransmissions, and the network connection works fine on my side.
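The throughput figure can be checked with a short calculation. This sketch assumes that 622 bytes is the full size of each packet and that exactly one packet arrives per second; the comment does not say whether the 622 bytes include protocol headers, so the payload rate may be somewhat lower.

```python
BYTES_PER_KIB = 1024


def throughput_kib_per_s(packet_bytes, packets_per_second):
    """Data rate in KiB/s for fixed-size packets arriving at a steady rate."""
    return packet_bytes * packets_per_second / BYTES_PER_KIB


rate = throughput_kib_per_s(622, 1)
print(f"{rate:.2f} KiB/s")  # 622 / 1024, roughly 0.61
```

Counting whole packets gives roughly 0.6 KiB/s, the same order of magnitude as the approximate figure quoted above and far below any normal broadband rate.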

Aka removed a subscriber: Aka.Fri, Jun 28, 1:59 PM

This was published in https://meta.wikimedia.org/wiki/Tech/News/2019/27 . Anything else required from Specialists here? Thanks.