
Connection closed while downloading PDF of articles
Closed, Resolved · Public

Description

After opening the left sidebar, clicking "Download as PDF" and then the blue "Download" button, Firefox tells me that the connection was closed during the transfer.
I can reproduce it with wget. The same happens for every page I tried on frwiki. It doesn't seem to affect enwiki, eswiki or ruwiki.
Could it be related to the Vector changes ("Desktop improvements") deployed a month ago, specifically on frwiki?

$  LANG=en_US wget https://fr.wikipedia.org/api/rest_v1/page/pdf/Test
--2020-10-24 00:10:51--  https://fr.wikipedia.org/api/rest_v1/page/pdf/Test
Resolving fr.wikipedia.org (fr.wikipedia.org)... 2620:0:862:ed1a::1, 91.198.174.192
Connecting to fr.wikipedia.org (fr.wikipedia.org)|2620:0:862:ed1a::1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1228803 (1.2M) [application/pdf]
Saving to: 'Test.5'

Test.5                                               48%[======================================================>                                                            ] 581.95K  1.34MB/s    in 0.4s    

2020-10-24 00:10:52 (1.34 MB/s) - Connection closed at byte 595912. Retrying.

--2020-10-24 00:10:53--  (try: 2)  https://fr.wikipedia.org/api/rest_v1/page/pdf/Test
Connecting to fr.wikipedia.org (fr.wikipedia.org)|2620:0:862:ed1a::1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1228803 (1.2M) [application/pdf]
Saving to: 'Test.5'

Test.5                                               75%[=====================================================================================>                             ] 902.21K  1.33MB/s    in 0.7s    

2020-10-24 00:10:53 (1.33 MB/s) - Connection closed at byte 923864. Retrying.

--2020-10-24 00:10:55--  (try: 3)  https://fr.wikipedia.org/api/rest_v1/page/pdf/Test
Connecting to fr.wikipedia.org (fr.wikipedia.org)|2620:0:862:ed1a::1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1228803 (1.2M) [application/pdf]
Saving to: 'Test.5'

Test.5                                               48%[======================================================>                                                            ] 582.21K  1.36MB/s    in 0.4s    

2020-10-24 00:10:56 (1.36 MB/s) - Connection closed at byte 923864. Retrying.
...

Can reproduce on hewiki, another wiki where "Desktop improvements" is enabled:

framawiki@quarry-web-01:/tmp$ LANG=en_US wget https://he.wikipedia.org/api/rest_v1/page/pdf/Paris
--2020-10-23 22:19:58--  https://he.wikipedia.org/api/rest_v1/page/pdf/Paris
Resolving he.wikipedia.org (he.wikipedia.org)... 208.80.154.224, 2620:0:861:ed1a::1
Connecting to he.wikipedia.org (he.wikipedia.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2851110 (2.7M) [application/pdf]
Saving to: 'Paris'

Paris                                                28%[===============================>                                                                                   ] 789.17K  --.-KB/s    in 0.1s    

2020-10-23 22:20:01 (5.56 MB/s) - Read error at byte 808105/2851110 (A TLS packet with unexpected length was received.). Retrying.

--2020-10-23 22:20:02--  (try: 2)  https://he.wikipedia.org/api/rest_v1/page/pdf/Paris
Connecting to he.wikipedia.org (he.wikipedia.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2851110 (2.7M) [application/pdf]
Saving to: 'Paris'

Paris                                                27%[==============================>                                                                                    ] 758.45K  --.-KB/s    in 0.03s   

2020-10-23 22:20:02 (27.8 MB/s) - Connection closed at byte 808105. Retrying.

--2020-10-23 22:20:04--  (try: 3)  https://he.wikipedia.org/api/rest_v1/page/pdf/Paris
Connecting to he.wikipedia.org (he.wikipedia.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2851110 (2.7M) [application/pdf]
Saving to: 'Paris'

Paris                                                32%[====================================>                                                                              ] 911.67K  --.-KB/s    in 0.04s   

2020-10-23 22:20:04 (22.1 MB/s) - Connection closed at byte 933550. Retrying.

--2020-10-23 22:20:07--  (try: 4)  https://he.wikipedia.org/api/rest_v1/page/pdf/Paris
Connecting to he.wikipedia.org (he.wikipedia.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2851110 (2.7M) [application/pdf]
Saving to: 'Paris'

Paris                                                31%[==================================>                                                                                ] 866.85K  --.-KB/s    in 0.03s   

2020-10-23 22:20:08 (30.6 MB/s) - Connection closed at byte 933550. Retrying.
...

Note the different error message: Read error at byte 808105/2851110 (A TLS packet with unexpected length was received.). Retrying.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Oct 23 2020, 10:23 PM
Framawiki updated the task description. · Oct 23 2020, 10:23 PM
Urbanecm added a subscriber: Urbanecm.

Can reproduce from toolforge:

urbanecm@tools-sgebastion-07  ~/tmp
$ wget https://fr.wikipedia.org/api/rest_v1/page/pdf/Paris
--2020-10-24 01:41:54--  https://fr.wikipedia.org/api/rest_v1/page/pdf/Paris
Resolving fr.wikipedia.org (fr.wikipedia.org)... 208.80.154.224, 2620:0:861:ed1a::1
Connecting to fr.wikipedia.org (fr.wikipedia.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9145887 (8.7M) [application/pdf]
Saving to: ‘Paris.2’

Paris.2                                                6%[======>                                                                                                               ] 541.61K  --.-KB/s    in 0.02s

2020-10-24 01:41:54 (31.0 MB/s) - Connection closed at byte 554611. Retrying.

--2020-10-24 01:41:55--  (try: 2)  https://fr.wikipedia.org/api/rest_v1/page/pdf/Paris
Connecting to fr.wikipedia.org (fr.wikipedia.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9145887 (8.7M) [application/pdf]
Saving to: ‘Paris.2’

Paris.2                                               18%[====================>                                                                                                 ]   1.59M  --.-KB/s    in 0.06s

2020-10-24 01:41:56 (26.0 MB/s) - Connection closed at byte 1672698. Retrying.

--2020-10-24 01:41:58--  (try: 3)  https://fr.wikipedia.org/api/rest_v1/page/pdf/Paris
Connecting to fr.wikipedia.org (fr.wikipedia.org)|208.80.154.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9145887 (8.7M) [application/pdf]
Saving to: ‘Paris.2’

Paris.2                                               15%[=================>                                                                                                    ]   1.36M  --.-KB/s    in 0.04s

2020-10-24 01:41:58 (30.5 MB/s) - Connection closed at byte 1672698. Retrying.

^C
urbanecm@tools-sgebastion-07  ~/tmp
$

Adding Traffic as well to look into this.

Restricted Application added a project: Operations. · Oct 23 2020, 11:42 PM

See also this Grafana dashboard showing a decrease in daily PDF renders by Proton from 80k to 20k since the beginning of August.

CDanis added a subscriber: CDanis. · Oct 24 2020, 4:33 PM

See also this Grafana dashboard showing a decrease in daily PDF renders by Proton from 80k to 20k since the beginning of August.

Looks like that's when Proton migrated to Kubernetes:
https://grafana.wikimedia.org/d/llIEd7MMz/proton?viewPanel=68&orgId=1&from=now-30d&to=now

The dashboard linked by @Framawiki is probably overdue for deletion.

CDanis edited projects, added serviceops, Proton; removed Traffic. · Oct 24 2020, 4:37 PM

See also this Grafana dashboard showing a decrease in daily PDF renders by Proton from 80k to 20k since the beginning of August.

Looks like that's when Proton migrated to Kubernetes:
https://grafana.wikimedia.org/d/llIEd7MMz/proton?viewPanel=68&orgId=1&from=now-30d&to=now

The dashboard linked by @Framawiki is probably overdue for deletion.

Yes, I've just deleted it, alongside the old mobileapps dashboard.

A look at the various panels of https://grafana.wikimedia.org/d/llIEd7MMz/proton?orgId=1&from=now-7d&to=now doesn't raise any particular suspicions. There are no limits and no throttling. Besides, issues in those areas would exhibit symptoms everywhere, not just on specific wikis.

There is a 6x to 7x increase in new jobs starting 2020-10-26 15:20 UTC, but that's well after this task was reported, and despite the seemingly huge relative increase the actual numbers are very low (from 0.1 to ~0.7 rps), so it's irrelevant.

LGoto assigned this task to Jgiannelos. · Oct 28 2020, 3:46 PM
LGoto triaged this task as Low priority. · Wed, Nov 4, 4:41 PM

I can also reproduce this on prod:
GETting the prod endpoint to render a PDF for the article "Dog" raises the connection error: https://phabricator.wikimedia.org/P13234

On my local env, querying the service directly, it works fine:
Local env: https://phabricator.wikimedia.org/P13235
proton service @ eqiad: https://phabricator.wikimedia.org/P13236

Jgiannelos added a comment. Edited · Fri, Nov 6, 12:34 PM

Something that could be interesting for debugging purposes: on e.g. en.wikipedia.org, two requests (one successful and one failing) both return 200 as the status code.
This means the issue may not show up in any metric based on HTTP status codes (40x, 50x errors).
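Since both outcomes return 200, any check has to validate the body itself. A minimal sketch of such a check (a hypothetical helper, not the actual script used later in this task) compares the Content-Length header against the bytes actually received and looks for the PDF header and trailer markers:

```python
# Sketch: classify a PDF response as valid/invalid even though the
# HTTP status is 200 in both cases. Hypothetical helper; assumes
# header names have been lowercased.
def classify_pdf_response(status, headers, body):
    if status != 200:
        return "invalid"
    # If Content-Length is present, the body must match it exactly;
    # a truncated transfer leaves fewer bytes than advertised.
    cl = headers.get("content-length")
    if cl is not None and int(cl) != len(body):
        return "invalid"
    # A well-formed PDF starts with %PDF- and ends with an %%EOF marker.
    if not body.startswith(b"%PDF-"):
        return "invalid"
    if b"%%EOF" not in body[-1024:]:
        return "invalid"
    return "valid"
```

A truncated transfer then classifies as invalid even though the status line said 200 OK.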

Jgiannelos added a comment. Edited · Fri, Nov 6, 2:14 PM

I wrote a quick script to fetch some random pages and check whether the rendered PDF is valid. Here are the runs for the en, es and fr Wikipedias:
https://phabricator.wikimedia.org/P13239
https://phabricator.wikimedia.org/P13240
https://phabricator.wikimedia.org/P13241

Out of a sample of 20 random pages:

en.wikipedia.org has 16 valid/4 invalid PDF renders
es.wikipedia.org has 11 valid/9 invalid PDF renders
fr.wikipedia.org has 4 valid/16 invalid PDF renders

No issues when running the same against my local env or the prod service pointed at e.g. fr.wikipedia.org

So it seems this is not specific to frwiki. Is there perhaps some correlation between page size and failure rate? Or between failure rate and response time?

@akosiaris More from debugging on this issue:

  • Querying the RESTBase service directly for a PDF render of an article doesn't reproduce the issue
  • I couldn't find any correlation between valid/invalid PDF renders and response time or size
    • Values were overlapping in both cases
  • Using wget:
    • For an article that at first fails to properly render a PDF (with a connection closed error),
    • if I let it retry GETting the endpoint 4-5 times, at some point it successfully downloads the file.
    • The next request on the same URL is successful at the first try

I guess, based on the first finding, the issue is in a layer above restbase, closer to the proxies

sdkim removed Jgiannelos as the assignee of this task. · Tue, Nov 10, 4:40 PM
sdkim added subscribers: Jgiannelos, sdkim.

Given that we (Product Infra) are not finding issues in our service-level investigation, we're asking SRE to take a look as the next step toward resolving this. @akosiaris, would you be a good person to assign?

akosiaris added a comment. Edited · Tue, Nov 10, 5:06 PM

@akosiaris More from debugging on this issue:

  • Querying the RESTBase service directly for a PDF render of an article doesn't reproduce the issue

Looking at the pastes above, I must say I don't see a direct call to the RESTBase service (restbase.discovery.wmnet, restbase.svc.eqiad.wmnet or restbase.svc.codfw.wmnet). There is a single wget against the proton service, but I don't think that gives us enough data. However, I think we do need to test this more.

  • I couldn't find any correlation between valid/invalid PDF renders and response time or size

I've had trouble with that as well. I've modified your script in P13257 (thanks for it btw) and got the following stats for 1000 requests spread across a number of wikis via Python's random.choice().

I've run it against proton.svc.eqiad.wmnet and against the public endpoints as well (<lang>.wikipedia.org). Results below.

For the public endpoints:

{200: {'cl-fails-bytes': 217,
       'cl-matches-bytes': 783,
       'invalid': 217,
       'valid': 783},
 'ar': {'cl-fails-bytes': 72, 'cl-matches-bytes': 4, 'invalid': 72, 'valid': 4},
 'bg': {'cl-fails-bytes': 20,
        'cl-matches-bytes': 71,
        'invalid': 20,
        'valid': 71},
 'de': {'cl-fails-bytes': 9, 'cl-matches-bytes': 75, 'invalid': 9, 'valid': 75},
 'el': {'cl-fails-bytes': 10,
        'cl-matches-bytes': 91,
        'invalid': 10,
        'valid': 91},
 'en': {'cl-fails-bytes': 18,
        'cl-matches-bytes': 78,
        'invalid': 18,
        'valid': 78},
 'es': {'cl-fails-bytes': 11,
        'cl-matches-bytes': 81,
        'invalid': 11,
        'valid': 81},
 'fr': {'cl-fails-bytes': 48,
        'cl-matches-bytes': 37,
        'invalid': 48,
        'valid': 37},
 'it': {'cl-fails-bytes': 10,
        'cl-matches-bytes': 86,
        'invalid': 10,
        'valid': 86},
 'no': {'cl-fails-bytes': 8, 'cl-matches-bytes': 93, 'invalid': 8, 'valid': 93},
 'simple': {'cl-fails-bytes': 3,
            'cl-matches-bytes': 77,
            'invalid': 3,
            'valid': 77},
 'tr': {'cl-fails-bytes': 8, 'cl-matches-bytes': 90, 'invalid': 8, 'valid': 90}}

For proton.svc.eqiad.wmnet:

{200: {'invalid': 0, 'valid': 999},
 404: {'invalid': 1, 'valid': 0},
 'ar': {'invalid': 0, 'valid': 85},
 'bg': {'invalid': 0, 'valid': 76},
 'de': {'invalid': 1, 'valid': 90},
 'el': {'invalid': 0, 'valid': 102},
 'en': {'invalid': 0, 'valid': 104},
 'es': {'invalid': 0, 'valid': 87},
 'fr': {'invalid': 0, 'valid': 82},
 'it': {'invalid': 0, 'valid': 96},
 'no': {'invalid': 0, 'valid': 85},
 'simple': {'invalid': 0, 'valid': 93},
 'tr': {'invalid': 0, 'valid': 99}}

Clarifications:

  • cl-matches-bytes means that Content-Length and the actual bytes in the response match; cl-fails-bytes means the opposite
  • Interestingly, proton returns transfer-encoding: chunked responses, which obviously don't have a Content-Length. So for the internal service cl-matches-bytes makes no sense and is not included.
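The per-status and per-wiki buckets above can be produced with a small fold over the probe results. This is a sketch only, assuming each probe is reduced to a (lang, status, is_valid, cl_matches) tuple; it is not the actual P13257 script:

```python
from collections import defaultdict

# Sketch: aggregate probe results into the buckets shown above.
# Assumes each probe yields (lang, http_status, is_valid, cl_matches);
# this tuple shape is hypothetical, not the actual P13257 script.
def aggregate(probes):
    stats = defaultdict(lambda: {"valid": 0, "invalid": 0,
                                 "cl-matches-bytes": 0, "cl-fails-bytes": 0})
    for lang, status, is_valid, cl_matches in probes:
        for key in (status, lang):  # bucket by status code and by wiki
            stats[key]["valid" if is_valid else "invalid"] += 1
            stats[key]["cl-matches-bytes" if cl_matches else "cl-fails-bytes"] += 1
    return dict(stats)
```

Each probe is counted twice, once under its status code and once under its wiki, which is why the 200 row sums across all languages.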

Points from the stats above:

  • The internal service endpoint performed well. No invalid PDFs were served, aside from a 404 (which can't lead to a valid PDF anyway)
  • For the public endpoints, all requests were successful, returning 200s. However, 21.7% of them led to invalid PDFs.
  • That exact same 21.7% also had Content-Length set to a size that did not match the actual bytes of the response. That's consistent with abruptly terminated connections.
  • This impacted all wikis; however, frwiki and arwiki were disproportionately affected for some reason. It's not yet clear why.

    • Values were overlapping in both cases
  • Using wget:
    • For an article that at first fails to properly render a PDF (with a connection closed error),
    • if I let it retry GETting the endpoint 4-5 times, at some point it successfully downloads the file.

That last paste is pretty weird. Every single response returns a different number of bytes before the connection is forcibly closed. This needs more looking into.

Where are all these run from, btw? Judging by the speeds, a local DSL?

    • The next request on the same URL is successful at the first try

I guess, based on the first finding, the issue is in a layer above restbase, closer to the proxies

That is quite possibly true, but I'd like to run tests against restbase too, not just the service. I'll modify the script for that.

I've also run the same tests against restbase.svc.eqiad.wmnet in P13257 and got the following:

{200: {'invalid': 0, 'valid': 996},
 400: {'invalid': 3, 'valid': 0},
 404: {'invalid': 1, 'valid': 0},
 'ar': {'invalid': 0, 'valid': 101},
 'bg': {'invalid': 0, 'valid': 89},
 'de': {'invalid': 2, 'valid': 78},
 'el': {'invalid': 0, 'valid': 104},
 'en': {'invalid': 1, 'valid': 93},
 'es': {'invalid': 0, 'valid': 84},
 'fr': {'invalid': 0, 'valid': 84},
 'it': {'invalid': 0, 'valid': 103},
 'no': {'invalid': 1, 'valid': 100},
 'simple': {'invalid': 0, 'valid': 90},
 'tr': {'invalid': 0, 'valid': 70}}

Minimal failures: one 404 and three 400s. So restbase is behaving OK as well. That leaves the edge caches powering the public endpoints.

akosiaris added subscribers: ema, Vgutierrez, BBlack, ssingh.

Given that we (Product Infra) are not finding issues in our service-level investigation, we're asking SRE to take a look as the next step toward resolving this. @akosiaris, would you be a good person to assign?

Since the investigation above points to the edge caches as the probable cause, it's now up to the Traffic team. I am adding them to the task and letting them take over.

Interestingly, proton returns transfer-encoding: chunked responses, which obviously don't have a Content-Length. So for the internal service cl-matches-bytes makes no sense and is not included.

Just pointing out that Transfer-Encoding is a hop-by-hop HTTP header, and hence it's restbase that switches to a non-chunked response and sets the required Content-Length header
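For readers unfamiliar with the distinction: a chunked body carries its length chunk-by-chunk instead of in a single Content-Length header. A minimal decoder sketch (illustrative only; real HTTP stacks handle this internally) shows the framing that restbase strips off before re-framing the response with a Content-Length:

```python
# Sketch: decode an HTTP/1.1 chunked transfer-encoded body.
# Illustrative only -- real HTTP stacks do this internally.
def decode_chunked(data: bytes) -> bytes:
    out, pos = b"", 0
    while True:
        # Each chunk starts with its size in hex, terminated by CRLF.
        crlf = data.index(b"\r\n", pos)
        size = int(data[pos:crlf], 16)
        if size == 0:  # a zero-size chunk marks the end of the body
            return out
        start = crlf + 2
        out += data[start:start + size]
        pos = start + size + 2  # skip the chunk's trailing CRLF
```

For example, decode_chunked(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n") yields b"Wikipedia"; only after reading the terminating zero-size chunk does the receiver know the total length.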

A few more tests. The TL;DR: Varnish 6 is probably at fault, but with a question mark.

Test results are below, here's a summary of my observations.

  • Talking directly to the ATS backend causes no issues; almost all requests result in PDFs that are fine. A small percentage (0.2% to 0.3%) returns 503s, but that is not pertinent to this task.
  • Talking directly to varnish on two esams hosts (cp3052 and cp3054) that run different varnish versions (5 vs 6) showcases the issue. Varnish 5 again shows a 99.9% success rate and well-formed PDFs; varnish 6 shows barely a 58.6% success rate.
  • Interestingly, talking to ATS-TLS directly on the aforementioned hosts, contrary to my expectations, also showcases the issue on both hosts. Regardless of varnish version, the success rate is ~63%.
  • 9 of the 11 wikis on my list (['en', 'es', 'el', 'it', 'fr', 'simple', 'de', 'ar', 'bg', 'no', 'tr']) exhibit a failure rate between 10% and 31%. However, frwiki and arwiki exhibit far more, on the order of ~60% and ~90% respectively. I have no explanation (I doubt varnish even remotely cares about the language of the content in a PDF), but it might make it easier to reproduce this while debugging.

All of the tests were run from deploy1001.

Talking straight to port 443 on cp3052 (varnish 5) and cp3054 (varnish 6)

Talking to the HTTPS port on cp hosts directly from deploy1001, 1000 requests, same as for the other tests. That means:

client => ATS-TLS => varnish => ATS-backend => restbase => proton

cp3052 (varnish 5)

{
  "en": {
    "invalid": 29,
    "valid": 51,
    "cl-fails-bytes": 29,
    "cl-matches-bytes": 51
  },
  "it": {
    "invalid": 26,
    "valid": 79,
    "cl-fails-bytes": 26,
    "cl-matches-bytes": 79
  },
  "de": {
    "invalid": 20,
    "valid": 65,
    "cl-fails-bytes": 20,
    "cl-matches-bytes": 65
  },
  "200": {
    "invalid": 357,
    "valid": 641,
    "cl-fails-bytes": 357,
    "cl-matches-bytes": 641
  },
  "ar": {
    "invalid": 94,
    "valid": 1,
    "cl-fails-bytes": 94,
    "cl-matches-bytes": 1
  },
  "el": {
    "invalid": 30,
    "valid": 71,
    "cl-fails-bytes": 29,
    "cl-matches-bytes": 72
  },
  "400": {
    "invalid": 2,
    "valid": 0,
    "cl-fails-bytes": 0,
    "cl-matches-bytes": 2
  },
  "es": {
    "invalid": 24,
    "valid": 62,
    "cl-fails-bytes": 24,
    "cl-matches-bytes": 62
  },
  "no": {
    "invalid": 9,
    "valid": 74,
    "cl-fails-bytes": 8,
    "cl-matches-bytes": 75
  },
  "bg": {
    "invalid": 31,
    "valid": 64,
    "cl-fails-bytes": 31,
    "cl-matches-bytes": 64
  },
  "simple": {
    "invalid": 15,
    "valid": 77,
    "cl-fails-bytes": 15,
    "cl-matches-bytes": 77
  },
  "tr": {
    "invalid": 17,
    "valid": 75,
    "cl-fails-bytes": 17,
    "cl-matches-bytes": 75
  },
  "fr": {
    "invalid": 64,
    "valid": 22,
    "cl-fails-bytes": 64,
    "cl-matches-bytes": 22
  }
}

cp3054 (varnish 6)

{
  "es": {
    "cl-fails-bytes": 30,
    "invalid": 30,
    "valid": 65,
    "cl-matches-bytes": 65
  },
  "en": {
    "cl-fails-bytes": 30,
    "invalid": 30,
    "valid": 72,
    "cl-matches-bytes": 72
  },
  "200": {
    "cl-fails-bytes": 378,
    "invalid": 378,
    "valid": 622,
    "cl-matches-bytes": 622
  },
  "bg": {
    "cl-fails-bytes": 35,
    "invalid": 35,
    "valid": 47,
    "cl-matches-bytes": 47
  },
  "el": {
    "cl-fails-bytes": 20,
    "invalid": 20,
    "valid": 61,
    "cl-matches-bytes": 61
  },
  "simple": {
    "cl-fails-bytes": 10,
    "invalid": 10,
    "valid": 69,
    "cl-matches-bytes": 69
  },
  "tr": {
    "cl-fails-bytes": 28,
    "invalid": 28,
    "valid": 61,
    "cl-matches-bytes": 61
  },
  "no": {
    "cl-fails-bytes": 12,
    "invalid": 12,
    "valid": 78,
    "cl-matches-bytes": 78
  },
  "fr": {
    "cl-fails-bytes": 62,
    "invalid": 62,
    "valid": 19,
    "cl-matches-bytes": 19
  },
  "ar": {
    "cl-fails-bytes": 115,
    "invalid": 115,
    "valid": 2,
    "cl-matches-bytes": 2
  },
  "it": {
    "cl-fails-bytes": 19,
    "invalid": 19,
    "valid": 73,
    "cl-matches-bytes": 73
  },
  "de": {
    "cl-fails-bytes": 17,
    "invalid": 17,
    "valid": 75,
    "cl-matches-bytes": 75
  }
}

Talking straight to HTTP port 80 on cp3052 (varnish 5) and cp3054 (varnish 6)

Talking to port 80 on cp hosts directly from deploy1001, 1000 requests, same as for the other tests. That means:

client => varnish => ATS-backend => restbase => proton

cp3052 (varnish 5)

{
  "el": {
    "cl-matches-bytes": 95,
    "valid": 95,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "de": {
    "cl-matches-bytes": 78,
    "valid": 78,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "200": {
    "cl-matches-bytes": 999,
    "valid": 999,
    "cl-fails-bytes": 1,
    "invalid": 1
  },
  "it": {
    "cl-matches-bytes": 88,
    "valid": 88,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "no": {
    "cl-matches-bytes": 92,
    "valid": 92,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "en": {
    "cl-matches-bytes": 102,
    "valid": 102,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "tr": {
    "cl-matches-bytes": 107,
    "valid": 107,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "bg": {
    "cl-matches-bytes": 70,
    "valid": 70,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "es": {
    "cl-matches-bytes": 87,
    "valid": 87,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "ar": {
    "cl-matches-bytes": 87,
    "valid": 87,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "fr": {
    "cl-matches-bytes": 90,
    "valid": 90,
    "cl-fails-bytes": 1,
    "invalid": 1
  },
  "simple": {
    "cl-matches-bytes": 103,
    "valid": 103,
    "cl-fails-bytes": 0,
    "invalid": 0
  }
}

cp3054 (varnish 6)

{
  "no": {
    "cl-fails-bytes": 15,
    "invalid": 15,
    "cl-matches-bytes": 61,
    "valid": 61
  },
  "ar": {
    "cl-fails-bytes": 82,
    "invalid": 82,
    "cl-matches-bytes": 3,
    "valid": 3
  },
  "de": {
    "cl-fails-bytes": 19,
    "invalid": 19,
    "cl-matches-bytes": 60,
    "valid": 60
  },
  "el": {
    "cl-fails-bytes": 35,
    "invalid": 35,
    "cl-matches-bytes": 59,
    "valid": 59
  },
  "fr": {
    "cl-fails-bytes": 74,
    "invalid": 74,
    "cl-matches-bytes": 31,
    "valid": 31
  },
  "simple": {
    "cl-fails-bytes": 13,
    "invalid": 13,
    "cl-matches-bytes": 81,
    "valid": 81
  },
  "200": {
    "cl-fails-bytes": 414,
    "invalid": 414,
    "cl-matches-bytes": 586,
    "valid": 586
  },
  "en": {
    "cl-fails-bytes": 29,
    "invalid": 29,
    "cl-matches-bytes": 43,
    "valid": 43
  },
  "it": {
    "cl-fails-bytes": 24,
    "invalid": 24,
    "cl-matches-bytes": 85,
    "valid": 85
  },
  "tr": {
    "cl-fails-bytes": 31,
    "invalid": 31,
    "cl-matches-bytes": 58,
    "valid": 58
  },
  "bg": {
    "cl-fails-bytes": 56,
    "invalid": 56,
    "cl-matches-bytes": 41,
    "valid": 41
  },
  "es": {
    "cl-fails-bytes": 36,
    "invalid": 36,
    "cl-matches-bytes": 64,
    "valid": 64
  }
}

Talking straight to port 3128 on cp3052 (varnish 5) and cp3054 (varnish 6)

Talking to the ATS backend port on cp hosts directly from deploy1001, 1000 requests, same as for the other tests. That means:

client => ATS-backend => restbase => proton

cp3052

{
  "fr": {
    "valid": 83,
    "cl-matches-bytes": 84,
    "cl-fails-bytes": 0,
    "invalid": 1
  },
  "el": {
    "valid": 79,
    "cl-matches-bytes": 79,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "simple": {
    "valid": 80,
    "cl-matches-bytes": 80,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "de": {
    "valid": 105,
    "cl-matches-bytes": 105,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "200": {
    "valid": 998,
    "cl-matches-bytes": 998,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "it": {
    "valid": 94,
    "cl-matches-bytes": 94,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "ar": {
    "valid": 113,
    "cl-matches-bytes": 113,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "tr": {
    "valid": 90,
    "cl-matches-bytes": 90,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "no": {
    "valid": 83,
    "cl-matches-bytes": 84,
    "cl-fails-bytes": 0,
    "invalid": 1
  },
  "503": {
    "valid": 0,
    "cl-matches-bytes": 2,
    "cl-fails-bytes": 0,
    "invalid": 2
  },
  "es": {
    "valid": 77,
    "cl-matches-bytes": 77,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "bg": {
    "valid": 96,
    "cl-matches-bytes": 96,
    "cl-fails-bytes": 0,
    "invalid": 0
  },
  "en": {
    "valid": 98,
    "cl-matches-bytes": 98,
    "cl-fails-bytes": 0,
    "invalid": 0
  }
}

cp3054

{
  "no": {
    "valid": 87,
    "invalid": 0,
    "cl-fails-bytes": 0,
    "cl-matches-bytes": 87
  },
  "el": {
    "valid": 93,
    "invalid": 1,
    "cl-fails-bytes": 0,
    "cl-matches-bytes": 94
  },
  "200": {
    "valid": 997,
    "invalid": 0,
    "cl-fails-bytes": 0,
    "cl-matches-bytes": 997
  },
  "simple": {
    "valid": 100,
    "invalid": 1,
    "cl-fails-bytes": 0,
    "cl-matches-bytes": 101
  },
  "tr": {
    "valid": 91,
    "invalid": 0,
    "cl-fails-bytes": 0,
    "cl-matches-bytes": 91
  },
  "bg": {
    "valid": 89,
    "invalid": 0,
    "cl-fails-bytes": 0,
    "cl-matches-bytes": 89
  },
  "ar": {
    "valid": 101,
    "invalid": 1,
    "cl-fails-bytes": 0,
    "cl-matches-bytes": 102
  },
  "es": {
    "valid": 86,
    "invalid": 0,
    "cl-fails-bytes": 0,
    "cl-matches-bytes": 86
  },
  "de": {
    "valid": 74,
    "invalid": 0,
    "cl-fails-bytes": 0,
    "cl-matches-bytes": 74
  },
  "fr": {
    "valid": 90,
    "invalid": 0,
    "cl-fails-bytes": 0,
    "cl-matches-bytes": 90
  },
  "503": {
    "valid": 0,
    "invalid": 3,
    "cl-fails-bytes": 0,
    "cl-matches-bytes": 3
  },
  "it": {
    "valid": 93,
    "invalid": 0,
    "cl-fails-bytes": 0,
    "cl-matches-bytes": 93
  },
  "en": {
    "valid": 93,
    "invalid": 0,
    "cl-fails-bytes": 0,
    "cl-matches-bytes": 93
  }
}

Jgiannelos added a comment. Edited · Mon, Nov 16, 8:27 AM

@akosiaris More from debugging on this issue:

  • Querying the RESTBase service directly for a PDF render of an article doesn't reproduce the issue

Looking at the pastes above, I must say I don't see a direct call to the RESTBase service (restbase.discovery.wmnet, restbase.svc.eqiad.wmnet or restbase.svc.codfw.wmnet). There is a single wget against the proton service, but I don't think that gives us enough data. However, I think we do need to test this more.

I used the same scripts pointing to restbase.svc.eqiad.wmnet and proton.svc.eqiad.wmnet, but since they didn't show any issues I didn't paste the whole run output.

Where are all these run from, btw? Judging by the speeds, a local DSL?

Yes, a local cable connection, but I think I had similar results from the deployment node too.

Interestingly, proton returns transfer-encoding: chunked responses, which obviously don't have a Content-Length. So for the internal service cl-matches-bytes makes no sense and is not included.

Just pointing out that Transfer-Encoding is a hop-by-hop HTTP header, and hence it's restbase that switches to a non-chunked response and sets the required Content-Length header

That's something I was also investigating, and we might want to address it at the application level (maybe when we migrate to the API gateway), but given that the test results at the restbase level were OK, I am not sure whether it contributes to the problem.

@akosiaris More from debugging on this issue:

  • Querying the RESTBase service directly for a PDF render of an article doesn't reproduce the issue

Looking at the pastes above, I must say I don't see a direct call to the RESTBase service (restbase.discovery.wmnet, restbase.svc.eqiad.wmnet or restbase.svc.codfw.wmnet). There is a single wget against the proton service, but I don't think that gives us enough data. However, I think we do need to test this more.

I used the same scripts pointing to restbase.svc.eqiad.wmnet and proton.svc.eqiad.wmnet, but since they didn't show any issues I didn't paste the whole run output.

Where are all these run from, btw? Judging by the speeds, a local DSL?

Yes, a local cable connection, but I think I had similar results from the deployment node too.

Yeah, that's consistent with my tests. Results pasted above are from the deployment node.

Interestingly, proton returns transfer-encoding: chunked responses, which obviously don't have a Content-Length. So for the internal service cl-matches-bytes makes no sense and is not included.

Just pointing out that Transfer-Encoding is a hop-by-hop HTTP header, and hence it's restbase that switches to a non-chunked response and sets the required Content-Length header

That's something I was also investigating, and we might want to address it at the application level (maybe when we migrate to the API gateway), but given that the test results at the restbase level were OK, I am not sure whether it contributes to the problem.

I don't think it contributes to the problem either. As far as I can tell, the upstream components (restbase and the edge caches) behave just fine in this regard. I was just pointing it out for future reference in case anyone else goes down the rabbit hole.

akosiaris raised the priority of this task from Low to High. · Wed, Nov 18, 10:07 AM

More and more duplicates are being merged into this one, and the stats from the tests above suggest a mean failure rate of ~20%, which is a lot. Bumping priority to High.

I haven't been able to repro this on a public endpoint from my own home connection, even using the random-fetcher script, but that would all be against one cache in codfw.

Since this appears to be a systemic problem rather than content-specific, I've switched to testing the original frwiki Test URI internally. My most interesting result so far has come from issuing a command like this:

sudo cumin A:cp-text 'wget -S -q --tries=1 https://fr.wikipedia.org/api/rest_v1/page/pdf/Test?asdf=xyz 2>headers'

... except I vary the asdf=xyz part uniquely for each execution, to ensure the content can't be cached at any of the traffic layers (I'm not yet sure whether it affects RB caching). It's much harder to tell what's going on if you let your request be cacheable: even where failure is rampant, eventually a lucky success gets stored into the caches, which hides the failure from that point forward.

What this does is make a single-shot attempt (that --tries=1 is important or wget can fool you by retrying failed transfers up to 20 times) through the public HTTPS port of every text cache node, locally from that node (so there's zero latency or network error involved at the client edge).
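Generating such unique cache-busting URLs can also be automated; a sketch (the parameter name asdf is arbitrary, exactly as in the command above):

```python
import uuid

# Sketch: build a cache-busting variant of the test URL so that no
# traffic layer can serve (or store) a previously cached response.
# The parameter name "asdf" is arbitrary, as in the cumin test above.
def cache_busted(url: str) -> str:
    sep = "&" if "?" in url else "?"
    return f"{url}{sep}asdf={uuid.uuid4().hex}"
```

Every call yields a distinct URL, so each request is guaranteed to be a cold-cache fetch at every layer.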

The pattern that emerges in the results is that all the caches on the codfw side of our datacenter set (codfw, ulsfo, and eqsin) are always successful, while most of the eqiad side (eqiad and esams) is not. Usually roughly one node (sometimes zero, sometimes two) in eqiad or esams succeeds, but all the rest fail with some partial transfer. Three typical success sets, shown here in cumin form:

cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[5007-5012].eqsin.wmnet,cp3064.esams.wmnet,cp[4027-4032].ulsfo.wmnet
cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp1089.eqiad.wmnet,cp[5007-5012].eqsin.wmnet,cp[4027-4032].ulsfo.wmnet
cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp1083.eqiad.wmnet,cp[5007-5012].eqsin.wmnet,cp[4027-4032].ulsfo.wmnet

I'm not yet sure what this means, but it's interesting! The split mirrors how the application-layer stuff would be split across codfw vs eqiad for an active/active service at the dns-disc level.

BBlack added a comment.EditedWed, Nov 18, 1:13 PM

I'm not exactly sure as to why the pattern above emerged, but now I don't think it's relevant at all, just an artifact of the global distribution of various kinds of traffic.

Having traced a few live failures through the stack, I think it's fairly clear that this is fundamentally a varnishd nuke_limit problem. Various internal behavioral changes from Varnish 5 to 6 may have altered how nuke_limit issues play out, but probably the key driver is that we substantially raised the cap on file sizes that can be stored in the frontend cache storage in T266040 -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/635852 .

To go into a little more depth: the frontend caches have a limited pool of memory, ~200G in total, which is not sufficient to hold the entire set of hot objects for their full TTL, so the forcible eviction of older items to make room for new ones is inevitable. The varnishd nuke_limit parameter caps how many cache objects can be evicted during an attempt to store one new object. It currently defaults to 50, meaning that if the frontend tries to cache a 1MB file but the cache is filled with a ton of 16KB objects, storing the 1MB file will commonly fail, because evicting 50x 16KB objects can't free up a full 1MB. If varnishd can't find enough storage to store a copy, it also fails the transfer to the client. Because we stream responses for performance (forwarding the first bytes through to the user before the cache has even received the last bytes from the next layer), there's no way to take back the "200 OK" we've already sent by the time this failure interrupts the content stream.
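A quick back-of-the-envelope check of the arithmetic above (the 16KB victim size is the illustrative figure from the paragraph, not a measured value):

```shell
# How many evictions does varnishd need to free room for a 1 MiB
# object when the cache is full of 16 KiB objects? (ceiling division)
new_obj_kib=1024
victim_kib=16
nuke_limit=50
needed=$(( (new_obj_kib + victim_kib - 1) / victim_kib ))
echo "evictions needed: $needed, nuke_limit: $nuke_limit"
# 64 evictions needed but only 50 allowed, so the insert fails and the
# in-flight client transfer gets aborted mid-stream.
```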

Before the change referenced above, one of the limits on this problem was our VCL choosing to not cache objects >= 256KB in size in the frontend. Now that we allow much larger files (including these PDFs), there's a greater spread of object sizes being cached, and thus a greater risk of reaching nuke_limit.

There's some interesting history on nuke_limit issues in https://github.com/varnishcache/varnish-cache/issues/1764 and https://github.com/varnishcache/varnish-cache/issues/2480 . The TL;DR is that the current mechanism is known to be far from ideal. Some recommend simply setting nuke_limit to 9999999, but nuking too much per request can obviously cause problems of its own (evicting that many objects can take a while, and can have a significant impact on hitrate and churn).

A few paths forward from here worth considering are:

  • We can raise nuke_limit to 1000. This is the configuration we previously had on our backend varnishes, so it might work well enough in the frontends today, given the recent increase in large_objects_cutoff to match the backend cutoff. I've tried this manually on a single eqiad node (live change via varnishadm) and it did make my test PDF fetches stop misbehaving.
  • We could also reduce large_objects_cutoff a bit as a compromise between these two ongoing tickets. It's currently at 1GB, and maybe something smaller (e.g. 64MB) could reduce edge cases here while still providing a fair amount of the intended benefits.
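For reference, the single-node live test mentioned in the first bullet uses standard varnishadm runtime-parameter commands (a sketch; "-n frontend" selects our separately-named frontend instance, which is a site-specific assumption):

```shell
# Raise nuke_limit live on the frontend varnishd instance. This is
# non-persistent (it reverts on daemon restart), which is why the
# change also needs to be puppetized to stick.
sudo varnishadm -n frontend param.set nuke_limit 1000
# Confirm the new value took effect:
sudo varnishadm -n frontend param.show nuke_limit
```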

Change 641724 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cache_text: nuke_limit and large_objects_cutoff

https://gerrit.wikimedia.org/r/641724

Mentioned in SAL (#wikimedia-operations) [2020-11-18T13:35:27Z] <bblack> cache_text: Executing "varnishadm -n frontend param.set nuke_limit 1000" - T266373

Change 641724 merged by BBlack:
[operations/puppet@production] cache_text: nuke_limit and large_objects_cutoff

https://gerrit.wikimedia.org/r/641724

Adding to what Brandon says, we do have evidence that it happens on edge DCs other than just eqiad and esams. Here's a user report from eqsin: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2020.11.12/network-error?id=AXW8XPuZBpCrxDVR4i4O&_g=h@44136fa

The overwhelming majority of recent NEL reports are from esams, though, which makes sense: https://logstash.wikimedia.org/goto/484929251991f679a21a5123bf20b188

We can validate the success of the change by watching for a decrease in http.response.invalid.content_length_mismatch reports after Brandon's change.

The proposed changes are live now. It may take a few hours to confirm that via NEL at our current sample rate. At least my own artificial reproductions seem to have gone away, for whatever that's worth!

No reports of the PDF truncations in NEL for ~8 hours now, which is a significant break from recent trends. Can anyone else still repro this in any way?

Works for me

akosiaris closed this task as Resolved.Thu, Nov 19, 9:55 AM
akosiaris claimed this task.

I am going to resolve this per the comments above. Feel free to reopen. Many thanks to @BBlack and @CDanis for figuring this out.

Zero NEL reports of http.response.invalid.content_length_mismatch on PDFs since 2020-11-18T13:28:14, so @BBlack's find was definitely the issue.

Headers and footers are still missing, and the font is still Liberation instead of Georgia.
Here are two PDFs for comparison:


(downloaded 10/30/2020)


(Downloaded today (Chrome Version 87.0.4280.66 (Official Build) (64-bit)))

This is definitely a bug, because even though the headers and footers are missing, space is still reserved for them.

@Emdosis This task relates to very low-level issues stopping the PDFs from being delivered at all. Please open a new task for the different problem you reported, thanks :)