
dbtree broken (for some users?)
Closed, Resolved · Public

Description

some users report dbtree is broken for them, while other users don't see the error and it works for them.

18:00 < kaldari> mutante: When I try to go to https://dbtree.wikimedia.org/ it says "database connection to tendril on tendril-backend.eqiad.wmnetfailed"

18:02 < Zppix> dbtree is working for me kaldari 
18:02 < mutante> yea.. hmm.. works for me.. but that's not an error that sounds like local

18:04 < bd808> dbtree is busted for me too. same "database connection to tendril on tendril-backend.eqiad.wmnetfailed" message
18:04 < bd808> x-cache header says "cp2006 miss, cp4001 hit/2, cp4003 hit/3"
18:04 < mutante> so dbtree uses misc-varnish, the director is "noc"
18:04 < mutante> "noc" has 2 backends, terbium and wasat
18:07 < mutante> tendril-backend.eqiad = db1011
18:07 < mutante> there is no tendril-backend.codfw
18:07 < mutante> so we are not talking to different backends.. uhmm

18:07 < Zppix> my x-varnish:36791322, 6094755 6124425
18:08 < mutante> db1011 appears to be running normal afaict

18:09 < bd808> so it looks like the sf varnish is the bad one
18:10 < mutante> but why would it be a varnish problem if it is "database connection to tendril"

Event Timeline

Dzahn created this task. · Apr 14 2017, 1:32 AM
Restricted Application added a subscriber: Aklapper. · Apr 14 2017, 1:32 AM
Dzahn added subscribers: kaldari, Zppix, bd808, jcrespo.

it works for me.

Peachey88 updated the task description. · Apr 14 2017, 4:49 AM
jcrespo changed the task status from Open to Stalled. · Apr 14 2017, 7:01 PM

Most likely a one-time error that got cached for some time? Tendril db tends to fail quite regularly due to large queries asking for large reports (but that is mostly ok). We can fine-tune varnish, but honestly, we have to refactor dbtree and tendril soon.

Stalling unless someone can still reproduce this a day later, in which case we can do something manually at varnish.

bd808 added a comment. · Apr 14 2017, 8:33 PM
Accept-Ranges: bytes
Age: 10
Content-Encoding: gzip
Content-Length: 76
Content-Type: text/html; charset=UTF-8
Date: Fri, 14 Apr 2017 20:31:37 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
X-Cache: cp2006 miss, cp4001 hit/1, cp4004 miss
X-Firefox-Spdy: h2
backend-timing: D=183187 t=1492201886821931
x-analytics: WMF-Last-Access=14-Apr-2017;WMF-Last-Access-Global=14-Apr-2017;https=1
x-cache-status: hit
x-client-ip: xxx.xxx.xxx.xxx
x-varnish: 166527815, 57923458 58897563, 7345753
jcrespo changed the task status from Stalled to Open. · Apr 14 2017, 11:45 PM
jcrespo added a project: Traffic.

I assume that is a hit of an error message?

Traffic: What is tendril.wikimedia.org's caching policy so that this can happen? I would expect a smaller TTL than a day...

bd808 added a comment. · Apr 15 2017, 2:23 PM

It's working for me today with this response header: X-Cache: cp1058 miss, cp1045 miss. This indicates a different route than the one I was getting on Thursday/Friday. I can still reproduce the error by requesting results from cp2006.codfw.wmnet directly:

terbium:~
bd808$ curl -iL https://dbtree.wikimedia.org --resolve dbtree.wikimedia.org:443:10.192.0.127
HTTP/1.1 200 OK
Date: Sat, 15 Apr 2017 14:21:41 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 67
Connection: keep-alive
Server: Apache
Backend-Timing: D=195851 t=1492263188989672
Vary: Accept-Encoding
X-Varnish: 25276639 24364969, 7708074 7707960
Via: 1.1 varnish-v4, 1.1 varnish-v4
Age: 2912
X-Cache: cp2006 hit/1, cp2006 hit/4
X-Cache-Status: hit
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Set-Cookie: WMF-Last-Access=15-Apr-2017;Path=/;HttpOnly;secure;Expires=Wed, 17 May 2017 12:00:00 GMT
Set-Cookie: WMF-Last-Access-Global=15-Apr-2017;Path=/;Domain=.wikimedia.org;HttpOnly;secure;Expires=Wed, 17 May 2017 12:00:00 GMT
X-Analytics: https=1;nocookies=1
X-Client-IP: 10.64.32.13
Accept-Ranges: bytes

database connection to tendril on tendril-backend.eqiad.wmnetfailed
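
For readers following the header dumps: the X-Cache value lists one entry per cache layer, and a `hit` anywhere in that chain means the response came from a stored copy rather than from the backend. Below is a minimal Python sketch of reading it, assuming the `host state/count` layout seen in the headers above. The function names are made up for illustration, and the number after `hit/` is assumed to be a per-object hit counter.

```python
def parse_x_cache(value):
    """Split an X-Cache header value into (host, state, count) tuples."""
    hops = []
    for entry in value.split(","):
        # Each entry looks like "cp2006 hit/4" or "cp1058 miss".
        host, _, result = entry.strip().partition(" ")
        state, _, count = result.partition("/")
        hops.append((host, state, int(count) if count else 0))
    return hops


def served_from_cache(value):
    """True if any cache layer in the chain reported a hit."""
    return any(state == "hit" for _, state, _ in parse_x_cache(value))


if __name__ == "__main__":
    for header in ("cp2006 hit/1, cp2006 hit/4", "cp1058 miss, cp1045 miss"):
        print(header, "->", served_from_cache(header))
```

Applied to the two values quoted in this comment, the failing response is a hit and the working one is a miss on every layer.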
BBlack added a subscriber: BBlack. · Apr 17 2017, 1:51 PM

tendril.wikimedia.org is independent of varnish; only dbtree.wikimedia.org (which is what we're talking about here) goes through the standard varnish stuff (although arguably tendril should be moved there as well someday).

As for dbtree cache policy: assuming the application doesn't send any explicit cache headers, the pages will normally get 1-hour cache lifetimes on cache_misc by default, and dbtree doesn't have any special exceptions to that (e.g. forced pass-mode).

Most likely the source of the problem was the switch of the noc backend used by dbtree.wm.o to be active/active in https://gerrit.wikimedia.org/r/#/c/346572/ back on Apr 6. It seems like the noc backend wasat, which is used by ulsfo and codfw, can't actually service dbtree requests.

(also, generally speaking errors aren't cached, but in this case the error would be cached, because it's returned with a 200 status code...)
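
To make the interaction concrete, here is a rough Python model of the policy described above. It is not the actual cache_misc VCL. It assumes a one-hour default TTL, that explicit Cache-Control directives override the default, and that only 5xx responses are treated as uncacheable errors. `DEFAULT_TTL` and `cache_ttl` are hypothetical names.

```python
DEFAULT_TTL = 3600  # one hour, the default lifetime described above


def cache_ttl(status, headers):
    """Return how many seconds a response is cached in this simplified model."""
    directives = {}
    for part in headers.get("Cache-Control", "").lower().split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name] = value
    # Assumption: server errors and explicitly uncacheable responses are never stored.
    if status >= 500 or directives.keys() & {"no-store", "no-cache", "private"}:
        return 0
    # An explicit lifetime from the application wins over the default.
    for name in ("s-maxage", "max-age"):
        if directives.get(name, "").isdigit():
            return int(directives[name])
    return DEFAULT_TTL


if __name__ == "__main__":
    print(cache_ttl(200, {}))  # error page sent as 200 with no cache headers
    print(cache_ttl(503, {}))  # the same page sent with an error status
```

Under these assumptions, the dbtree error page sent as a 200 with no cache headers gets the full default lifetime, while the same body sent with a 5xx status would not be stored at all.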

Change 348456 had a related patch set uploaded (by BBlack):
[operations/puppet@production] dbtree: split backend from noc.wm.o, make eqiad-only

https://gerrit.wikimedia.org/r/348456

Change 348456 merged by BBlack:
[operations/puppet@production] dbtree: split backend from noc.wm.o, make eqiad-only

https://gerrit.wikimedia.org/r/348456

Dzahn added a comment. · Apr 17 2017, 7:20 PM

seems like we have 2 follow-ups:

  • make dbtree not use status code 200 for an error page
  • make wasat a working dbtree backend, then add it back to varnish director

I guess they should be separate sub-tasks. Would you agree @jcrespo?
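
As a sketch of what the first follow-up could look like on the application side: the snippet below is written in Python purely for illustration and is not dbtree's code. `render_dbtree`, `connect` and `render_tree` are hypothetical names, and the choice of a 503 status with `Cache-Control: no-store` is an assumption about a reasonable error response.

```python
def render_dbtree(connect, render_tree):
    """Return (status, headers, body) for a dbtree-like page.

    `connect` and `render_tree` are hypothetical callables standing in for
    the database connection and the page rendering.
    """
    try:
        connection = connect()
    except ConnectionError as exc:
        # Report the failure with an error status so caches do not store it.
        headers = {
            "Content-Type": "text/plain; charset=UTF-8",
            "Cache-Control": "no-store",
        }
        return 503, headers, f"database connection failed: {exc}"
    headers = {"Content-Type": "text/html; charset=UTF-8"}
    return 200, headers, render_tree(connection)


if __name__ == "__main__":
    def broken_connect():
        raise ConnectionError("tendril backend unreachable")

    status, headers, body = render_dbtree(broken_connect, lambda conn: "<html></html>")
    print(status, headers["Cache-Control"], body)
```

With an error status in place, a backend that cannot reach the database would produce a response that the cache layer passes through instead of serving to later visitors.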

Dzahn added a comment. · Edited · Apr 17 2017, 7:22 PM

@kaldari @bd808 Does dbtree work for you again? I am wondering if this ticket can be called resolved (if we follow-up with the things above i suppose).

@BBlack thanks for the changes and explanation. should we still manually purge the cached page?

bd808 added a comment. · Apr 17 2017, 7:45 PM

> @kaldari @bd808 Does dbtree work for you again? I am wondering if this ticket can be called resolved (if we follow-up with the things above i suppose).

The curl command that I was using to reproduce in T162976#3184518 works now and has X-Cache: cp1058 miss, cp2006 miss, cp2006 hit/2 in the result.

Yeah the patch I deployed above should have fixed the issue in this ticket. Both of the suggested followups would be ideal, but probably aren't pressing at this time.

Dzahn closed this task as Resolved. · Apr 17 2017, 7:50 PM
Dzahn claimed this task.
Dzahn removed Dzahn as the assignee of this task.
Dzahn assigned this task to BBlack.