Page MenuHomePhabricator

nuke_limit often reached on esams varnish frontends
Closed, ResolvedPublic

Description

I'm getting some entity RDF from WD and manipulating it in some way.
WD sometimes randomly cuts off the RDF.
Eg this command sometimes shows different last lines, and the very last line is cut short:

curl -sL https://www.wikidata.org/entity/Q61707895.nt | tail -10

I noticed this when formatting as turtle (I filter out non-English prop names/descriptions).
(Grab the prefixes from https://www.wikidata.org/entity/Q61707895.ttl and save to prefixes.ttl)
Eg these commands, run consecutively, reported errors at different lines, because the RDF was cut short.

curl -sL https://www.wikidata.org/entity/Q61707895.nt | cat prefixes.ttl - | riot --formatted ttl -syntax ttl - > artefact1.ttl
11:19:38 ERROR riot                 :: [line: 4509, col: 105] Not a hexadecimal character (end of file)

curl -sL https://www.wikidata.org/entity/Q61707895.nt | cat prefixes.ttl - | riot --formatted ttl -syntax ttl - > artefact1.ttl
11:19:41 ERROR riot                 :: [line: 4221, col: 83] Broken IRI (End of file)

Right now (a few minutes later) WD returns the full RDF, and I'm glad for it.
But it's troubling that such simple/fundamental operation may be cut off.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 13 2019, 9:33 AM
Lydia_Pintscher triaged this task as High priority.Mar 3 2019, 5:04 PM
Lydia_Pintscher moved this task from Incoming to Ready to estimate on the Wikidata-Campsite board.
Addshore added a project: Operations.EditedMar 5 2019, 10:56 AM
Addshore added subscribers: ema, Addshore.

I did a little investigation into this this morning.

Personally I couldn't reproduce.
I couldn't find any indication of mediawiki/wikibase doing anything wrong in logstash.
Special:EntityData which serves this URI is cachable in varnish so I decided to have a dig around in the webrequest logs for the time period that I believe we are talking about (13th Feb).

The sequence of cache hits looks a little odd to me with the first 9 or 10 cache hits all returning differing response sizes (from my knowledge I don't know why that would happen)
Once ached in the back layer of varnish the response size looks to be complete and consistent.

1 dt sequence cache_status response_size x_cache
2
3X 2019-02-13T09:13:43 3421889674 miss 753664 cp1087 pass, cp3041 miss, cp3040 miss
410 2019-02-13T09:15:36 3422739696 hit-local 594787 cp1087 pass, cp3041 hit/1, cp3040 miss
52 2019-02-13T09:17:54 3423779520 hit-local 551123 cp1087 pass, cp3041 hit/2, cp3040 miss
63 2019-02-13T09:18:10 3423899544 hit-local 462177 cp1087 pass, cp3041 hit/3, cp3040 miss
79 2019-02-13T09:18:29 3424049060 hit-local 385824 cp1087 pass, cp3041 hit/4, cp3040 miss
84 2019-02-13T09:19:36 3424555932 hit-local 551652 cp1087 pass, cp3041 hit/5, cp3040 miss
914 2019-02-13T09:19:39 3424578180 hit-local 516680 cp1087 pass, cp3041 hit/6, cp3040 miss
106 2019-02-13T09:20:07 3424782353 hit-local 648154 cp1087 pass, cp3041 hit/7, cp3040 miss
1116 2019-02-13T09:20:30 3424962802 hit-local 447385 cp1087 pass, cp3041 hit/8, cp3040 miss
1211 2019-02-13T09:27:01 3427951526 hit-local 1065589 cp1087 pass, cp3041 hit/9, cp3040 miss
137 2019-02-13T09:27:05 3427980813 hit-front 1065589 cp1087 pass, cp3041 hit/9, cp3040 hit/1
1413 2019-02-13T09:27:09 3428007224 hit-front 1065589 cp1087 pass, cp3041 hit/9, cp3040 hit/2
1517 2019-02-13T09:27:15 3428060797 hit-front 1065589 cp1087 pass, cp3041 hit/9, cp3040 hit/3
168 2019-02-13T09:27:37 3428226459 hit-front 1065589 cp1087 pass, cp3041 hit/9, cp3040 hit/4
1715 2019-02-13T09:50:41 3438961924 hit-front 1065589 cp1087 pass, cp3041 hit/9, cp3040 hit/5
1812 2019-02-13T09:52:41 3439917553 hit-front 1065589 cp1087 pass, cp3041 hit/9, cp3040 hit/6
191 2019-02-13T10:08:37 3447470735 hit-front 1065589 cp1087 pass, cp3041 hit/9, cp3040 hit/7

Data gathered using P8155

Had a little chat in IRC on operations and going to poke ops to have a bit of a closer look.

ema moved this task from Triage to Caching on the Traffic board.

I haven't done any investigation yet, but it sounds similar to T215389.

ema renamed this task from Wikidata sometimes cuts off entity RDF to Some esams<->eqiad varnish backend connections closed by peer .Mar 7 2019, 10:06 AM
ema added subscribers: akosiaris, Stashbot, ayounsi and 2 others.
ema added a comment.Mar 7 2019, 10:18 AM

Varnishlog of the varnish backend instance serving the request in esams reports the following:

-   ReqMethod      GET
-   ReqURL         /wiki/Special:EntityData/Q61707895.nt?cachebust=1551953444
-   ReqProtocol    HTTP/1.1
[...]
-   Timestamp      Fetch: 1551953445.508907 0.426942 0.426942
-   RespProtocol   HTTP/1.1
-   RespStatus     200
[...]
-   Debug          "Write error, retval = -1, len = 2437, errno = Connection reset by peer"
-   Timestamp      Resp: 1551953445.709184 0.627218 0.200252

Same error for the request reported in T215389:

-   ReqMethod      GET
-   ReqURL         /blubber/linux-amd64/blubber?cachebust=1551952223
-   ReqProtocol    HTTP/1.1
[...]
-   Timestamp      Fetch: 1551952223.603492 0.085144 0.085144
-   RespProtocol   HTTP/1.1
-   RespStatus     200
[...]
-   Debug          "Write error, retval = -1, len = 1362, errno = Connection reset by peer"
-   Timestamp      Resp: 1551952223.928025 0.409676 0.324517

Tickets merged as this very much looks like the same issue.

ema added a comment.EditedMar 7 2019, 3:02 PM

Those connection resets on the varnish backend layer happen when frontend caches are full and varnish cannot make space for a newly fetched object body:

--  ExpKill        LRU_Cand p=0x7f7bbc64f740 f=0x0 r=1
--  ExpKill        LRU x=980813893
# [...] the above happening 50 times, our current setting for nuke_limit
--  ExpKill        LRU_Exhausted
--  FetchError     Could not get storage
--  BackendClose   11217 vcl-ed50cc64-c0ad-4266-89a4-9e4539972e1a.be_cp3033_esams_wmnet
# varnish frontend closing the connection to varnish backend

Interestingly, the problem is not reproducible with larger objects, as varnish autonomously decides they're too large and does not cache them (see the "pass" entry in X-Cache):

$ curl -L --resolve releases.wikimedia.org:443:91.198.174.192 --http1.1 -v -o /dev/null https://releases.wikimedia.org/parsoid/parsoid_0.10.0all_all.deb?x=$RANDOM 2>&1 | egrep "Content-Length|X-Cache:"
< Content-Length: 46716380
< X-Cache: cp1077 pass, cp3033 miss, cp3033 pass

We are currently limiting the maximum object size on cache_upload frontends to 256K. The assumption was that cache_text would not really benefit from the cutoff given that the dataset is made of smaller objects (compared to upload). Let's add the limit to cache_text too.

Change 494937 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: apply frontend size-based cutoff to text too

https://gerrit.wikimedia.org/r/494937

ema renamed this task from Some esams<->eqiad varnish backend connections closed by peer to nuke_limit often reached on esams varnish frontends.Mar 7 2019, 3:03 PM
ema added a comment.Mar 7 2019, 5:05 PM

Interestingly, the problem is not reproducible with larger objects, as varnish autonomously decides they're too large and does not cache them

False! We do have misc VCL code in place to avoid caching objects larger than 10M.

Change 494937 merged by Ema:
[operations/puppet@production] varnish: text/misc frontend size-based cutoffs

https://gerrit.wikimedia.org/r/494937

Mentioned in SAL (#wikimedia-operations) [2019-03-13T08:51:14Z] <ema> cp3030: wipe frontend cache to get rid of large objects T216006

ema added a comment.Mar 15 2019, 5:44 PM

The issue should be fixed. @VladimirAlexiev, @Addshore: can you confirm?

ayounsi removed a subscriber: ayounsi.Mar 15 2019, 5:45 PM
Addshore closed this task as Resolved.Mar 19 2019, 7:59 AM
Addshore assigned this task to ema.
Addshore moved this task from incoming to monitoring on the Wikidata board.