dbtree broken (for some users?)
Closed, Resolved (Public)


Some users report that dbtree is broken for them, while other users don't see the error and it works for them.

18:00 < kaldari> mutante: When I try to go to it says "database connection to tendril on tendril-backend.eqiad.wmnet failed"

18:02 < Zppix> dbtree is working for me kaldari 
18:02 < mutante> yea.. hmm.. works for me.. but that's not an error that sounds like local

18:04 < bd808> dbtree is busted for me too. same "database connection to tendril on tendril-backend.eqiad.wmnet failed" message
18:04 < bd808> x-cache header says "cp2006 miss, cp4001 hit/2, cp4003 hit/3"
18:04 < mutante> so dbtree uses misc-varnish, the director is "noc"
18:04 < mutante> "noc" has 2 backends, terbium and wasat
18:07 < mutante> tendril-backend.eqiad = db1011
18:07 < mutante> there is no tendril-backend.codfw
18:07 < mutante> so we are not talking to different backends.. uhmm

18:07 < Zppix> my x-varnish:36791322, 6094755 6124425
18:08 < mutante> db1011 appears to be running normal afaict

18:09 < bd808> so it looks like the sf varnish is the bad one
18:10 < mutante> but why would it be a varnish problem if it is "database connection to tendril"

Event Timeline

jcrespo changed the task status from Open to Stalled. Apr 14 2017, 7:01 PM

Most likely a one-time error that got cached for some time? Tendril db tends to fail quite regularly due to large queries asking for large reports (but that is mostly ok). We can fine-tune varnish, but honestly, we have to refactor dbtree and tendril soon.

Stalling unless someone can still reproduce this a day later, in which case we can do something manually at varnish.

Accept-Ranges: bytes
Age: 10
Content-Encoding: gzip
Content-Length: 76
Content-Type: text/html; charset=UTF-8
Date: Fri, 14 Apr 2017 20:31:37 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
X-Cache: cp2006 miss, cp4001 hit/1, cp4004 miss
X-Firefox-Spdy: h2
backend-timing: D=183187 t=1492201886821931
x-analytics: WMF-Last-Access=14-Apr-2017;WMF-Last-Access-Global=14-Apr-2017;https=1
x-cache-status: hit
x-varnish: 166527815, 57923458 58897563, 7345753
jcrespo changed the task status from Stalled to Open. Apr 14 2017, 11:45 PM
jcrespo added a project: Traffic.

I assume that is a hit of an error message?

Traffic: What is the caching policy that allows this to happen? I would expect a TTL shorter than a day...

It's working for me today with this response header: X-Cache: cp1058 miss, cp1045 miss. This indicates a different route than I was getting on Thursday/Friday. I can still recreate by requesting results from cp2006.eqiad.wmnet directly:

bd808$ curl -iL --resolve
HTTP/1.1 200 OK
Date: Sat, 15 Apr 2017 14:21:41 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 67
Connection: keep-alive
Server: Apache
Backend-Timing: D=195851 t=1492263188989672
Vary: Accept-Encoding
X-Varnish: 25276639 24364969, 7708074 7707960
Via: 1.1 varnish-v4, 1.1 varnish-v4
Age: 2912
X-Cache: cp2006 hit/1, cp2006 hit/4
X-Cache-Status: hit
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Set-Cookie: WMF-Last-Access=15-Apr-2017;Path=/;HttpOnly;secure;Expires=Wed, 17 May 2017 12:00:00 GMT
Set-Cookie: WMF-Last-Access-Global=15-Apr-2017;Path=/;;HttpOnly;secure;Expires=Wed, 17 May 2017 12:00:00 GMT
X-Analytics: https=1;nocookies=1
Accept-Ranges: bytes
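
For anyone comparing header dumps like the ones in this task, here is a minimal Python sketch that splits an `X-Cache` value into per-host entries. The names `CacheHop`, `parse_x_cache` and `served_from_cache` are made up for illustration, and the parsing assumes only the `host result[/count]` shape seen in the headers pasted here; it is not an official description of the header format.

```python
import re
from typing import List, NamedTuple, Optional


class CacheHop(NamedTuple):
    """One comma-separated entry of an X-Cache header (hypothetical helper type)."""
    host: str
    result: str           # e.g. "hit" or "miss"
    hits: Optional[int]   # the number after "hit/", if any


def parse_x_cache(value: str) -> List[CacheHop]:
    """Split an X-Cache value such as 'cp2006 hit/1, cp2006 hit/4' into entries."""
    hops = []
    for entry in value.split(","):
        parts = entry.split()
        if len(parts) != 2:
            continue  # skip anything that does not look like "host result"
        host, status = parts
        match = re.fullmatch(r"([a-z]+)(?:/(\d+))?", status)
        if match is None:
            continue
        count = match.group(2)
        hops.append(CacheHop(host, match.group(1), int(count) if count else None))
    return hops


def served_from_cache(value: str) -> bool:
    """True if at least one cache in the chain reported a hit."""
    return any(hop.result == "hit" for hop in parse_x_cache(value))
```

With the values from this task, `served_from_cache("cp2006 hit/1, cp2006 hit/4")` is true, while `served_from_cache("cp1058 miss, cp1045 miss")` is false.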

database connection to tendril on tendril-backend.eqiad.wmnet failed is independent of varnish, only (that we're talking about here) goes through the standard varnish stuff (although arguably tendril should be moved there as well someday).

As for dbtree cache policy: assuming the application doesn't send any explicit cache headers, the pages will normally get 1-hour cache lifetimes on cache_misc by default, and dbtree doesn't have any special exceptions to that (e.g. forced pass-mode).
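
As a rough illustration of the policy described above, the following Python sketch computes a lifetime from response headers and falls back to one hour when the application sends no cache headers. It is a simplified model, not the actual cache_misc VCL: `effective_ttl`, `seconds_until_expiry` and `DEFAULT_TTL` are hypothetical names, and the handling of `Cache-Control` directives is an assumption about typical shared-cache behaviour.

```python
DEFAULT_TTL = 3600  # the 1-hour default mentioned above, in seconds


def effective_ttl(headers: dict, default_ttl: int = DEFAULT_TTL) -> int:
    """Return a cache lifetime in seconds for a response (simplified model)."""
    lowered = {k.lower(): v for k, v in headers.items()}
    directives = {}
    for part in lowered.get("cache-control", "").split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip()] = value.strip()
    # Assumption: these directives mean "do not serve from a shared cache".
    if any(name in directives for name in ("no-store", "no-cache", "private")):
        return 0
    # Prefer the shared-cache lifetime, then the generic one.
    for name in ("s-maxage", "max-age"):
        if directives.get(name, "").isdigit():
            return int(directives[name])
    return default_ttl  # no explicit cache headers: fall back to the default


def seconds_until_expiry(headers: dict, default_ttl: int = DEFAULT_TTL) -> int:
    """How much longer a cached object would live, given its Age header."""
    lowered = {k.lower(): v for k, v in headers.items()}
    age = int(lowered.get("age", "0"))
    return max(0, effective_ttl(headers, default_ttl) - age)
```

Under this model, the cached response shown earlier with `Age: 2912` and no `Cache-Control` header would have 688 seconds left before expiring.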

Most likely the source of the problem was the switch of the noc backend used by dbtree.wm.o to be active/active in back on Apr 6. It seems that the noc backend wasat, which is used by ulsfo and codfw, can't actually service dbtree requests.

(also, generally speaking errors aren't cached, but in this case the error would be cached, because it's returned with a 200 status code...)
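
A small Python sketch of how a monitoring check might flag this situation: a 200 response served from cache whose body is actually the error text. The function name `is_cached_soft_error` and the `ERROR_MARKER` constant are hypothetical; the marker string is taken from the error message quoted in this task, and the `X-Cache-Status: hit` test mirrors the header shown in the dumps above.

```python
# Substring of the error text reported in this task (assumed to be stable).
ERROR_MARKER = "database connection to tendril"


def is_cached_soft_error(status: int, headers: dict, body: str) -> bool:
    """Detect an error page that was returned with 200 and served from cache."""
    lowered = {k.lower(): v for k, v in headers.items()}
    from_cache = lowered.get("x-cache-status", "").strip().lower() == "hit"
    return status == 200 and from_cache and ERROR_MARKER in body
```

A check like this only works around the symptom; it relies on matching the body text precisely because the status code carries no error signal.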

Change 348456 had a related patch set uploaded (by BBlack):
[operations/puppet@production] dbtree: split backend from noc.wm.o, make eqiad-only

Change 348456 merged by BBlack:
[operations/puppet@production] dbtree: split backend from noc.wm.o, make eqiad-only

seems like we have 2 follow-ups:

  • make dbtree not use status code 200 for an error page
  • make wasat a working dbtree backend, then add it back to varnish director

I guess they should be separate sub-tasks. Would you agree @jcrespo?
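
To make the first follow-up concrete, here is a hedged Python sketch of the general idea: return a 5xx status with `Cache-Control: no-store` when the database connection fails, instead of a 200 page containing the error text. This is not dbtree's actual code; `render_dbtree`, `connect` and `render` are placeholder names invented for the example.

```python
from http import HTTPStatus
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], str]


def render_dbtree(connect: Callable[[], object],
                  render: Callable[[object], str]) -> Response:
    """Build (status, headers, body); errors get 503 and are marked uncacheable."""
    try:
        conn = connect()  # hypothetical callable that opens the DB connection
    except ConnectionError as exc:
        headers = {
            "Content-Type": "text/plain; charset=UTF-8",
            "Cache-Control": "no-store",  # ask caches not to keep the error
        }
        return int(HTTPStatus.SERVICE_UNAVAILABLE), headers, f"database connection failed: {exc}"
    headers = {"Content-Type": "text/html; charset=UTF-8"}
    return int(HTTPStatus.OK), headers, render(conn)
```

The point of the design is that the status code, rather than the body text, tells the cache layer and any monitoring that the page is an error.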

@kaldari @bd808 Does dbtree work for you again? I am wondering if this ticket can be called resolved (if we follow up with the things above, I suppose).

@BBlack thanks for the changes and explanation. Should we still manually purge the cached page?

The curl command that I was using to reproduce in T162976#3184518 works now and has X-Cache: cp1058 miss, cp2006 miss, cp2006 hit/2 in the result.

Yeah the patch I deployed above should have fixed the issue in this ticket. Both of the suggested followups would be ideal, but probably aren't pressing at this time.

Dzahn claimed this task.
Dzahn removed Dzahn as the assignee of this task.
Dzahn assigned this task to BBlack.