Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster)
Closed, Resolved, Public

Description

There are currently intermittent page load issues on the German Wikipedia. These can be caused by HTML documents loading slowly, especially on special pages, or resources loaded through the resource loader. I am seeing page load times of upwards of 150 seconds.

The issue affects different users with different browsers and internet service providers. The problem has appeared before, for example on 6th/7th June 2019. At the time, it was explained on WP:Fragen zur Wikipedia as being related to ongoing software deployments.

Update: Evidence so far suggests that users in several European countries are affected, including Switzerland, Germany, Finland and France. This would probably point to some problem in the Amsterdam cluster.

Event Timeline

Some additional observations:

Screenshot from 2019-06-20 11-03-39.png (1×1 px, 161 KB)

I'm now half-way in the process of rebooting all cache nodes with a patched kernel fixing CVE-2019-11477 and related vulnerabilities. Once I'm done with that, we can re-enable SACK and see if that changes anything. CC @MoritzMuehlenhoff.

Not limited to dewiki / Germany (unsurprisingly); there have been a bunch of reports from huwiki/Hungary as well.

I saw many slow en wiki page loads yesterday, including missing skins. (Logged in user, Europe.)

@Yann: There is nothing new in that thread? Feel free to add {{tracked|Txxxxxx}} on-wiki in such cases.

Interested in seeing this problem solved. What I can say is that a common template (https://www.mediawiki.org/wiki/Template:Graph:Lines), on whichever wiki it is used (de, fr, it, etc.), does not render the graph if it is linked to a Wikidata query.
The funny thing is that if you hit "preview", the graph is viewable. But after saving the changes, no graph is shown, only a blank square.
Thanks for your work!

Seems to be better. Solved by the rebooting mentioned by @ema ? I don't experience that slowness anymore.

Agreed with @Gestumblindi!

I just tried and everything is less than 3 seconds.

+1, everything running smoothly now, including API queries.

PM3 renamed this task from Sometimes some pages load slowly for European users (due to some factor outside of Wikimedia cluster) to Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster). Jun 20 2019, 2:40 PM

My guess is that the beginning of this problem correlates with the beginning of the fetch failures in the first graph panel here:
https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-7d&to=now

@CDanis Your guess seems spot on, fetch errors went to essentially 0 at 2019-06-20T10:00 and after that we started getting the first reports that the problem was fixed. See also the graph of "Resource temporarily unavailable - straight insufficient bytes" errors: https://logstash.wikimedia.org/goto/4f9a6fdf4665b439713b7e40226512dd. "Resource temporarily unavailable" is EAGAIN, indicating a timeout.

Speculation: @Wurgl reported response times of ~180 seconds. There are 3 varnishes between esams users and the application layer (esams varnish-fe and varnish-be, eqiad varnish-be). A theory for what might have happened is thus that responses affected by the issue got delayed by a 60s timeout happening 3x. Cache reboots for (unrelated) kernel/varnish upgrades solved whatever weird state varnish got itself into.
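
The arithmetic behind this theory can be written out as a minimal sketch. The 60-second per-hop timeout is the value speculated above, not a confirmed Varnish setting, and the helper name `worst_case_delay` is made up for illustration.

```python
# Per-hop timeout in seconds; this is the value speculated above,
# not a confirmed Varnish configuration setting.
TIMEOUT_SECONDS = 60

# Cache layers between esams users and the application layer, as listed above.
LAYERS = ["esams varnish-fe", "esams varnish-be", "eqiad varnish-be"]


def worst_case_delay(layers, timeout_seconds):
    """Total delay if every layer waits for one full timeout, one after another."""
    return len(layers) * timeout_seconds


print(worst_case_delay(LAYERS, TIMEOUT_SECONDS))  # 180
```

Under these assumptions the total matches the reported ~180 seconds, which is what makes the theory plausible; it does not show that this is what actually happened.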

@ema It was not a single timeout. A chunk of data (the start of the page) was rendered by the browser very fast, but then it got stuck. From that moment on the browser added line by line; sometimes data arrived character by character, sometimes in larger chunks. When I scrolled down to the end of the page, I could read the data faster than it arrived. As I said in the IRC channel: slow like an acoustic coupler.

^^ I saw exactly what Wurgl reports, browsing long pages on en-wiki (eg the Fram case) in London. Now seems okay.

Shall we set this to resolved then? The active discussion on how to prevent this in the future seems to be taking place under another ticket.

Krinkle assigned this task to ema.

Sorry to reopen that issue, but the behaviour is back :-(

I see the slowness again. But when running that test loop, just one in 100 tries was slow: again 3 minutes (3 min 1.783 s). All 99 other tries took between 1.4 and 2.9 seconds.
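
For anyone who wants to repeat this kind of measurement, a test loop along these lines might look like the following minimal sketch. It is not the script used above: the helper names (`fetch`, `time_requests`, `summarize`), the User-Agent string, and the 10-second threshold for "slow" are all assumptions for illustration.

```python
import time
import urllib.request
from typing import Callable, Dict, List


def fetch(url: str) -> int:
    """Download the URL completely and return the number of bytes read."""
    # The User-Agent value is a placeholder; use one that identifies you.
    request = urllib.request.Request(
        url, headers={"User-Agent": "slow-page-test/0.1 (example)"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return len(response.read())


def time_requests(
    url: str, tries: int, fetcher: Callable[[str], int] = fetch
) -> List[float]:
    """Fetch the URL `tries` times and return each duration in seconds."""
    durations = []
    for _ in range(tries):
        start = time.monotonic()
        fetcher(url)
        durations.append(time.monotonic() - start)
    return durations


def summarize(durations: List[float], slow_threshold: float = 10.0) -> Dict[str, float]:
    """Summarize the run; the 10 s threshold for 'slow' is an arbitrary assumption."""
    slow = [d for d in durations if d > slow_threshold]
    return {
        "tries": len(durations),
        "slow": len(slow),
        "fastest": min(durations),
        "slowest": max(durations),
    }


# Example usage (performs real network requests, so it is left commented out;
# the URL is only a placeholder for whichever page shows the problem):
# results = time_requests("https://de.wikipedia.org/wiki/Spezial:Version", tries=100)
# print(summarize(results))
```

Recording the UTC time of each slow try alongside the duration would make it easier to correlate outliers with server-side graphs.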

I just experienced this too. One minute it’s fast, the next it’s really slow.

I don't experience any problems since Thursday. Today also everything is running smoothly.

One out of 100 could be just some normal network congestion or temporary local system lag.

It is strange, really strange. I have seen that slowness three times within a few minutes on my watchlist, but now I do not see it anymore? Did the servers take a break for a cup of tea?

Did not read all comments, but want to say that this problem is way older than several weeks.
It occurred for years, but was very rare.

We believe that Varnish fetch failures might be related to this issue; investigation is ongoing in T226375.

@Community-Relations Just a heads-up in case you've heard anything around this in the communities related to slow page views or time outs. These kinds of reports are usually out of our control and all unrelated to each other, but many of the reports in the past 7 days turn out to be related to each other and in an area we are responsible for.

It should have mostly settled by now, but look out for anything new :)

(The task is titled "European users", but more precisely it affects users whose Wikipedia connections we route to the Amsterdam cluster. This includes many areas one might think of as being outside Europe.)

ema renamed this task from Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) to Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster). Jun 25 2019, 8:07 AM

Reported again on fiwiki by three users. One also commented that the mobile site was working better.

Can we get approximate times for these last reports, @Zache ? (UTC if possible)

@ArielGlenn, two fiwiki users reported this between 03:00 and 03:35 (UTC), and one user (who said that only the mobile phone interface worked) at 7:49, 27 June 2019 (UTC).

The 03-03:35 incidents were likely related to network issues we were experiencing during that time. The later report seems like a separate incident.

Ciao, sorry for not replying earlier - in the future, we recommend using the Specialists Support tag and exploring better options for quick interventions at https://office.wikimedia.org/wiki/Community_Relations#Support :)

Personally, I haven't seen anything other than https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#was_en.WP_out_of_service_just_now? (VPT is a good place to look for reports as it's often also used by non en.wp editors.)

If there's anything specific we should ask around and any wiki you want targeted with such requests, please let us know. TYVM!

It is something we can ask in Tech News if needed.

I'm writing the next issue now: how would you describe the problem in one or two short sentences?

To clarify, Trizek isn't asking me.

@Trizek-WMF: personally, I've tried and failed for days to reproduce the issue. My understanding is that occasionally some page loads for logged-in users take a very long time to complete, with characters slowly showing up on the screen.

I'm not sure if people are still encountering this.

@Trizek-WMF: It would help if you stated what exactly you are interested in: the problem resolution from a technical perspective or the end user experience. The end user experience is well summed up in Gestumblindi's above comment of Tue, Jun 18, 23:05 (the "acoustic coupler" metaphor). At least that is how I experienced it.

Problem was reproduced by me just now, on ruwiki.

Just happened to me again as well. Unfortunately I didn't have the inspector open when loading the page, so I couldn't check any of the HTTP headers. The next call was just fine. This occurred around 28 Jun 2019 12:17:11 GMT.

Here is a screenshot of laggy connection:

lags.png (441×968 px, 121 KB)

As you can see, packets are 622 bytes in size, received every second, which means speed of 0.5 KiB/s.
And no retransmissions. Network connection works fine from my side.
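
The rate quoted above can be checked with a short calculation. This is only a sketch: whether the 622 bytes include protocol headers is not visible from the comment, so the 66-byte header overhead (Ethernet 14 + IPv4 20 + TCP with timestamp option 32) is an assumption, and the function name is made up for illustration.

```python
def throughput_kib_per_s(bytes_per_packet: int, packets_per_second: float) -> float:
    """Convert a packet size and packet rate into KiB per second."""
    return bytes_per_packet * packets_per_second / 1024


# Packet size and rate as read from the screenshot description above.
PACKET_BYTES = 622
PACKETS_PER_SECOND = 1

# Assumed header overhead (Ethernet 14 + IPv4 20 + TCP with timestamps 32);
# this is a guess, not something taken from the capture.
ASSUMED_HEADER_BYTES = 66

# Rate if all 622 bytes count as transferred data.
print(round(throughput_kib_per_s(PACKET_BYTES, PACKETS_PER_SECOND), 2))  # 0.61

# Rate if the assumed headers are subtracted to estimate payload only.
payload_bytes = PACKET_BYTES - ASSUMED_HEADER_BYTES
print(round(throughput_kib_per_s(payload_bytes, PACKETS_PER_SECOND), 2))  # 0.54
```

Either way the result is in the range of half a kibibyte per second, consistent with the roughly 0.5 KiB/s figure given above.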

This was published in https://meta.wikimedia.org/wiki/Tech/News/2019/27 . Anything else required from Specialists here? Thanks.

Looks like this was fixed many months ago.