nuke_limit often reached on esams varnish frontends
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	VladimirAlexiev
	Feb 13 2019, 9:33 AM

Description

I'm getting some entity RDF from WD and manipulating it in some way.
WD sometimes randomly cuts off the RDF.
Eg this command sometimes shows different last lines, and the very last line is cut short:

curl -sL https://www.wikidata.org/entity/Q61707895.nt | tail -10

I noticed this when formatting as turtle (I filter out non-English prop names/descriptions).
(Grab the prefixes from https://www.wikidata.org/entity/Q61707895.ttl and save to prefixes.ttl)
Eg these commands, run consecutively, reported errors at different lines, because the RDF was cut short.

curl -sL https://www.wikidata.org/entity/Q61707895.nt | cat prefixes.ttl - | riot --formatted ttl -syntax ttl - > artefact1.ttl
11:19:38 ERROR riot                 :: [line: 4509, col: 105] Not a hexadecimal character (end of file)

curl -sL https://www.wikidata.org/entity/Q61707895.nt | cat prefixes.ttl - | riot --formatted ttl -syntax ttl - > artefact1.ttl
11:19:41 ERROR riot                 :: [line: 4221, col: 83] Broken IRI (End of file)

Right now (a few minutes later) WD returns the full RDF, and I'm glad for it.
But it's troubling that such simple/fundamental operation may be cut off.

Details

	Subject	Repo	Branch	Lines +/-
	varnish: text/misc frontend size-based cutoffs	operations/puppet	production	+15 -3

Customize query in gerrit

Related Objects

Mentioned In: T213214: Visual Editor gets stuck opening article (net::ERR_SPDY_PROTOCOL_ERROR 200/Loading failed for the <script> with source ...)
Mentioned Here: T215389: esams cache layer mangles downloads of specific url
P8155 (An Untitled Masterwork)
P8154 (An Untitled Masterwork)

Event Timeline

VladimirAlexiev created this task.Feb 13 2019, 9:33 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 13 2019, 9:33 AM

Lydia_Pintscher triaged this task as High priority.Mar 3 2019, 5:04 PM

Lydia_Pintscher added a project: Wikidata-Campsite.

Lydia_Pintscher moved this task from Incoming to Ready to estimate on the Wikidata-Campsite board.

I did a little investigation into this this morning.

Personally I couldn't reproduce.
I couldn't find any indication of mediawiki/wikibase doing anything wrong in logstash.
Special:EntityData which serves this URI is cachable in varnish so I decided to have a dig around in the webrequest logs for the time period that I believe we are talking about (13th Feb).

The sequence of cache hits looks a little odd to me with the first 9 or 10 cache hits all returning differing response sizes (from my knowledge I don't know why that would happen)
Once ached in the back layer of varnish the response size looks to be complete and consistent.

P8154 (An Untitled Masterwork)

1	dt sequence cache_status response_size x_cache
2
3	X 2019-02-13T09:13:43 3421889674 miss 753664 cp1087 pass, cp3041 miss, cp3040 miss
4	10 2019-02-13T09:15:36 3422739696 hit-local 594787 cp1087 pass, cp3041 hit/1, cp3040 miss
5	2 2019-02-13T09:17:54 3423779520 hit-local 551123 cp1087 pass, cp3041 hit/2, cp3040 miss
6	3 2019-02-13T09:18:10 3423899544 hit-local 462177 cp1087 pass, cp3041 hit/3, cp3040 miss
7	9 2019-02-13T09:18:29 3424049060 hit-local 385824 cp1087 pass, cp3041 hit/4, cp3040 miss
8	4 2019-02-13T09:19:36 3424555932 hit-local 551652 cp1087 pass, cp3041 hit/5, cp3040 miss
9	14 2019-02-13T09:19:39 3424578180 hit-local 516680 cp1087 pass, cp3041 hit/6, cp3040 miss
10	6 2019-02-13T09:20:07 3424782353 hit-local 648154 cp1087 pass, cp3041 hit/7, cp3040 miss
11	16 2019-02-13T09:20:30 3424962802 hit-local 447385 cp1087 pass, cp3041 hit/8, cp3040 miss
12	11 2019-02-13T09:27:01 3427951526 hit-local 1065589 cp1087 pass, cp3041 hit/9, cp3040 miss
13	7 2019-02-13T09:27:05 3427980813 hit-front 1065589 cp1087 pass, cp3041 hit/9, cp3040 hit/1
14	13 2019-02-13T09:27:09 3428007224 hit-front 1065589 cp1087 pass, cp3041 hit/9, cp3040 hit/2
15	17 2019-02-13T09:27:15 3428060797 hit-front 1065589 cp1087 pass, cp3041 hit/9, cp3040 hit/3
16	8 2019-02-13T09:27:37 3428226459 hit-front 1065589 cp1087 pass, cp3041 hit/9, cp3040 hit/4
17	15 2019-02-13T09:50:41 3438961924 hit-front 1065589 cp1087 pass, cp3041 hit/9, cp3040 hit/5
18	12 2019-02-13T09:52:41 3439917553 hit-front 1065589 cp1087 pass, cp3041 hit/9, cp3040 hit/6
19	1 2019-02-13T10:08:37 3447470735 hit-front 1065589 cp1087 pass, cp3041 hit/9, cp3040 hit/7

Data gathered using P8155

Had a little chat in IRC on operations and going to poke ops to have a bit of a closer look.

Addshore added a project: User-Addshore.Mar 5 2019, 10:58 AM

I haven't done any investigation yet, but it sounds similar to T215389.

• ema renamed this task from Wikidata sometimes cuts off entity RDF to Some esams<->eqiad varnish backend connections closed by peer .Mar 7 2019, 10:06 AM

• ema merged a task: T215389: esams cache layer mangles downloads of specific url.

• ema added subscribers: akosiaris, Stashbot, ayounsi and 2 others.

Varnishlog of the varnish backend instance serving the request in esams reports the following:

-   ReqMethod      GET
-   ReqURL         /wiki/Special:EntityData/Q61707895.nt?cachebust=1551953444
-   ReqProtocol    HTTP/1.1
[...]
-   Timestamp      Fetch: 1551953445.508907 0.426942 0.426942
-   RespProtocol   HTTP/1.1
-   RespStatus     200
[...]
-   Debug          "Write error, retval = -1, len = 2437, errno = Connection reset by peer"
-   Timestamp      Resp: 1551953445.709184 0.627218 0.200252

Same error for the request reported in T215389:

-   ReqMethod      GET
-   ReqURL         /blubber/linux-amd64/blubber?cachebust=1551952223
-   ReqProtocol    HTTP/1.1
[...]
-   Timestamp      Fetch: 1551952223.603492 0.085144 0.085144
-   RespProtocol   HTTP/1.1
-   RespStatus     200
[...]
-   Debug          "Write error, retval = -1, len = 1362, errno = Connection reset by peer"
-   Timestamp      Resp: 1551952223.928025 0.409676 0.324517

Tickets merged as this very much looks like the same issue.

Those connection resets on the varnish backend layer happen when frontend caches are full and varnish cannot make space for a newly fetched object body:

--  ExpKill        LRU_Cand p=0x7f7bbc64f740 f=0x0 r=1
--  ExpKill        LRU x=980813893
# [...] the above happening 50 times, our current setting for nuke_limit
--  ExpKill        LRU_Exhausted
--  FetchError     Could not get storage
--  BackendClose   11217 vcl-ed50cc64-c0ad-4266-89a4-9e4539972e1a.be_cp3033_esams_wmnet
# varnish frontend closing the connection to varnish backend

Interestingly, the problem is not reproducible with larger objects, as varnish autonomously decides they're too large and does not cache them (see the "pass" entry in X-Cache):

$ curl -L --resolve releases.wikimedia.org:443:91.198.174.192 --http1.1 -v -o /dev/null https://releases.wikimedia.org/parsoid/parsoid_0.10.0all_all.deb?x=$RANDOM 2>&1 | egrep "Content-Length|X-Cache:"
< Content-Length: 46716380
< X-Cache: cp1077 pass, cp3033 miss, cp3033 pass

We are currently limiting the maximum object size on cache_upload frontends to 256K. The assumption was that cache_text would not really benefit from the cutoff given that the dataset is made of smaller objects (compared to upload). Let's add the limit to cache_text too.

Change 494937 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: apply frontend size-based cutoff to text too

https://gerrit.wikimedia.org/r/494937

gerritbot added a project: Patch-For-Review.Mar 7 2019, 3:02 PM

• ema renamed this task from Some esams<->eqiad varnish backend connections closed by peer to nuke_limit often reached on esams varnish frontends.Mar 7 2019, 3:03 PM

In T216006#5008346, @ema wrote:

Interestingly, the problem is not reproducible with larger objects, as varnish autonomously decides they're too large and does not cache them

False! We do have misc VCL code in place to avoid caching objects larger than 10M.

Change 494937 merged by Ema:
[operations/puppet@production] varnish: text/misc frontend size-based cutoffs

https://gerrit.wikimedia.org/r/494937

CDanis mentioned this in T213214: Visual Editor gets stuck opening article (net::ERR_SPDY_PROTOCOL_ERROR 200/Loading failed for the <script> with source ...).Mar 12 2019, 6:07 PM

Mentioned in SAL (#wikimedia-operations) [2019-03-13T08:51:14Z] <ema> cp3030: wipe frontend cache to get rid of large objects T216006

The issue should be fixed. @VladimirAlexiev, @Addshore: can you confirm?

ayounsi unsubscribed.Mar 15 2019, 5:45 PM

Addshore closed this task as Resolved.Mar 19 2019, 7:59 AM

Addshore assigned this task to • ema.

Addshore moved this task from incoming to monitoring on the Wikidata board.

It's fixed!

nuke_limit often reached on esams varnish frontendsClosed, ResolvedPublicActions

Description

Details

Related Objects

Event Timeline

nuke_limit often reached on esams varnish frontends
Closed, ResolvedPublic
Actions