
Resources and pages occasionally take seconds to respond or fail
Closed, Duplicate (Public)

Description

I'm frequently getting unstyled pages or pages which take many seconds to finish loading on the English Wikipedia and other wikis in the last few days. Some requests just fail:

Screenshot_20180307_094811.png (542×1 px, 157 KB)

The load.php URL that returned a 503 in the example is below (note that it reflects some personal preferences, for instance lang=en-gb):

https://en.wikipedia.org/w/load.php?debug=false&lang=en-gb&modules=ext.3d%2CeventLogging%2CnavigationTiming%2Cpopups%2CwikimediaEvents%7Cext.centralNotice.choiceData%2Cdisplay%2CgeoIP%2CimpressionDiet%2CkvStore%2CkvStoreMaintenance%2CstartUp%7Cext.centralauth.ForeignApi%7Cext.centralauth.centralautologin.clearcookie%7Cext.cite.a11y%7Cext.echo.api%2Cinit%7Cext.eventLogging.subscriber%7Cext.popups.images%7Cext.uls.common%2Ccompactlinks%2Ceventlogger%2Cinit%2Cinterface%2Cpreferences%2Cwebfonts%7Cext.visualEditor.desktopArticleTarget.init%7Cext.visualEditor.supportCheck%2CtargetLoader%2CtempWikitextEditorWidget%2Ctrack%2Cve%7Cext.wikimediaEvents.loggedin%7Cjquery.accessKeyLabel%2CbyteLength%2CcheckboxShiftClick%2Cclient%2Ccookie%2CgetAttrs%2Chidpi%2ChighlightText%2Cmw-jump%2Cspinner%2Csuggestions%2CtextSelection%7Cjquery.uls.data%7Cmediawiki.ForeignApi%2CRegExp%2CString%2CTitle%2CUri%2Capi%2Ccldr%2Ccookie%2Cexperiments%2CjqueryMsg%2Clanguage%2Cnotify%2CsearchSuggest%2Cstorage%2Ctemplate%2Cuser%2Cutil%7Cmediawiki.ForeignApi.core%7Cmediawiki.api.options%2Cuser%2Cwatch%7Cmediawiki.editfont.styles%7Cmediawiki.language.data%2Cinit%7Cmediawiki.libs.pluralruleparser%7Cmediawiki.page.ready%2Cstartup%7Cmediawiki.page.watch.ajax%7Cmediawiki.template.mustache%2Cregexp%7Cmediawiki.ui.button%2Cicon%7Cmmv.bootstrap%2Chead%7Cmmv.bootstrap.autostart%7Coojs%2Csite%7Cschema.UniversalLanguageSelector%7Cuser.defaults%7Cwikibase.client.linkitem.init&skin=monobook&version=1kmh2ku
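For anyone trying to see which modules such a request actually covers, here is a minimal Python sketch that unpacks the `modules` parameter. It assumes the packed notation used by ResourceLoader as I understand it: groups are separated by `|`, and within a group everything up to the last dot is a prefix shared by the comma-separated suffixes. The function names `expand_modules` and `modules_from_url` are made up for this illustration and are not part of MediaWiki.

```python
from typing import List
from urllib.parse import urlsplit, parse_qs


def expand_modules(packed: str) -> List[str]:
    """Expand a packed module list such as 'ext.3d,eventLogging|oojs,site'."""
    names: List[str] = []
    for group in packed.split("|"):
        if "," not in group:
            # A single module name, nothing to expand.
            names.append(group)
            continue
        # Split at the last dot: the left part is the shared prefix,
        # the right part is the comma-separated list of suffixes.
        prefix, dot, suffixes = group.rpartition(".")
        if not dot:
            # No dot at all: plain comma-separated names without a prefix.
            names.extend(group.split(","))
        else:
            names.extend(f"{prefix}.{suffix}" for suffix in suffixes.split(","))
    return names


def modules_from_url(url: str) -> List[str]:
    """Read the 'modules' query parameter of a load.php URL and expand it."""
    query = parse_qs(urlsplit(url).query)  # parse_qs also decodes %2C and %7C
    return expand_modules(query["modules"][0])


if __name__ == "__main__":
    # Shortened, hypothetical URL in the same shape as the one above.
    url = ("https://en.wikipedia.org/w/load.php?debug=false&lang=en-gb"
           "&modules=ext.3d%2CeventLogging%7Coojs%2Csite%7Cuser.defaults"
           "&skin=monobook")
    for name in modules_from_url(url):
        print(name)
```

Running the same function on the full URL above gives the complete list of modules that were bundled into the failing request, which can help when comparing it against other failing requests.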

I did not see obvious network/traffic issues such as packet loss. The issue also doesn't seem to be confined to CSS, as I thought until yesterday (T181877 is about that).

Event Timeline

Joe triaged this task as High priority. Mar 7 2018, 8:04 AM
Joe added a project: SRE.
jcrespo subscribed.

This has been handled by the traffic engineers, and no further problems should happen in the next hours/days. Apparently one edge server was having load issues:

https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&from=1520240936821&to=1520413736822&panelId=3&fullscreen

I am not an expert on those, so I can't give proper details. Probably some follow-up is needed, either to prevent this from happening again or to add checks that detect it earlier.
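In case the dashboard link stops resolving, the time window it covers can be recovered from the URL itself. A small Python sketch, assuming only that `from` and `to` are Unix timestamps in milliseconds (the usual Grafana convention); `grafana_window` is a name made up for this example:

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit, parse_qs


def grafana_window(url: str):
    """Return the (start, end) of a Grafana link's time range as UTC datetimes."""
    query = parse_qs(urlsplit(url).query)
    # Integer division drops the milliseconds and avoids float rounding.
    start = datetime.fromtimestamp(int(query["from"][0]) // 1000, tz=timezone.utc)
    end = datetime.fromtimestamp(int(query["to"][0]) // 1000, tz=timezone.utc)
    return start, end


if __name__ == "__main__":
    url = ("https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches"
           "?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=text"
           "&var-server=All&from=1520240936821&to=1520413736822"
           "&panelId=3&fullscreen")
    start, end = grafana_window(url)
    print(start.isoformat(), "->", end.isoformat())
    # 2018-03-05T09:08:56+00:00 -> 2018-03-07T09:08:56+00:00
```

So the linked panel shows the 48 hours ending on the morning of 7 March 2018 (UTC), shortly after this task was filed.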

The host cp3033 was close to its weekly varnish-be scheduled restart, and I've handled it by manually restarting the varnish backend instance. We used to have frequent issues in text-esams during the EU morning, which were mitigated by disabling varnish_be<->varnish_be max_connections as described here: https://phabricator.wikimedia.org/T175803#3663509

Looking at various varnish statistics, it seems that the problem here is due to the well-known limitations of the file storage engine (objects not being expunged from varnish-be cache, mbox lag spiking, esams<->eqiad connections piling up, ...).

I suspect this ticket, the above-referenced T175803, and T181315 are all inter-related or possibly pointing at the same underlying issue, just from different perspectives.