
Investigate failed deploy to CODFW
Closed, Resolved · Public

Description

T163944: Timeouts on CODFW caused a major issue during today's deployment, but only the CODFW nodes failed. As best I can tell, uwsgi was not picking up the new code when it restarted, but celery was. This mismatch between the uwsgi and celery processes resulted in a bunch of errors. For some reason, every now and then a request would make it through. As far as I could tell, all of the CODFW nodes had fully updated code in the /srv/deployment/ores/deploy/ directory. I confirmed that the uwsgi processes were being restarted too. I also confirmed that there wasn't some old, weird version of ORES installed in /srv/deployment/ores/venv/.

This task is done when we figure out why the CODFW nodes did not successfully pick up the new code during the deployment.

Details

Related Gerrit Patches:
mediawiki/services/ores/deploy (master): Add scb2005 and scb2006

Event Timeline

Halfak created this task. Apr 26 2017, 9:53 PM
Restricted Application added a subscriber: Aklapper. Apr 26 2017, 9:53 PM
Halfak updated the task description. Apr 26 2017, 9:54 PM
Halfak added subscribers: Ladsgroup, akosiaris.

It looks like the problem might have resolved itself somehow.

$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242349
{
  "scores": {
    "fiwiki": {
      "goodfaith": {
        "scores": {
          "3242349": {
            "prediction": true,
            "probability": {
              "false": 0.010221384505365694,
              "true": 0.9897786154946343
            }
          }
        },
        "version": "0.3.0"
      }
    }
  }
}
real	0m1.180s
user	0m0.012s
sys	0m0.000s

O.o

Looks like scb2001 is not doing great

$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242350
{
  "scores": {
    "fiwiki": {
      "goodfaith": {
        "scores": {
          "3242350": {
            "error": {
              "message": "Timed out after 15 seconds.",
              "type": "TimeoutError"
            }
          }
        },
        "version": "0.3.0"
      }
    }
  }
}
real	0m15.498s
user	0m0.004s
sys	0m0.004s

But it's not for every request

$ time curl 0.0.0.0:8081/v2/scores/fiwiki/goodfaith/3242354
{
  "scores": {
    "fiwiki": {
      "goodfaith": {
        "scores": {
          "3242354": {
            "prediction": true,
            "probability": {
              "false": 0.027350104076077075,
              "true": 0.9726498959239229
            }
          }
        },
        "version": "0.3.0"
      }
    }
  }
}
real	0m1.161s
user	0m0.008s
sys	0m0.004s
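To put a number on "not for every request", the response bodies can be sorted into successes and timeouts. The following is only a sketch: the function names are made up for illustration, and the two sample bodies are trimmed copies of the curl output above rather than live responses.

```python
import json
from collections import Counter

# Sample bodies modelled on the curl output in this task (illustrative copies).
SAMPLE_OK = json.dumps({
    "scores": {"fiwiki": {"goodfaith": {
        "scores": {"3242354": {
            "prediction": True,
            "probability": {"false": 0.027350104076077075,
                            "true": 0.9726498959239229},
        }},
        "version": "0.3.0",
    }}}
})

SAMPLE_TIMEOUT = json.dumps({
    "scores": {"fiwiki": {"goodfaith": {
        "scores": {"3242350": {
            "error": {"message": "Timed out after 15 seconds.",
                      "type": "TimeoutError"},
        }},
        "version": "0.3.0",
    }}}
})


def classify_score_response(body, wiki, model, rev_id):
    """Return "ok", "timeout", or "other_error" for one v2 scores response body."""
    doc = json.loads(body)
    score = doc["scores"][wiki][model]["scores"][str(rev_id)]
    error = score.get("error")
    if error is None:
        return "ok"
    if error.get("type") == "TimeoutError":
        return "timeout"
    return "other_error"


def tally_outcomes(outcomes):
    """Count how many responses fell into each outcome."""
    return Counter(outcomes)


if __name__ == "__main__":
    outcomes = [
        classify_score_response(SAMPLE_OK, "fiwiki", "goodfaith", 3242354),
        classify_score_response(SAMPLE_TIMEOUT, "fiwiki", "goodfaith", 3242350),
    ]
    print(tally_outcomes(outcomes))
```

Run against a larger batch of bodies collected with the same curl command, a tally like this would show what fraction of requests time out on a given node.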

I looked at this with @Ladsgroup and we think we figured it out! There's scb2005 and scb2006! We didn't know they existed and we haven't been deploying to those. These are the source of the old-code/new-code problem.

Change 350575 had a related patch set uploaded (by Ladsgroup; owner: amir):
[mediawiki/services/ores/deploy@master] Add scb2005 and scb2006

https://gerrit.wikimedia.org/r/350575

Change 350575 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Add scb2005 and scb2006

https://gerrit.wikimedia.org/r/350575

Mentioned in SAL (#wikimedia-operations) [2017-04-27T17:15:03Z] <Amir1> ladsgroup@naos:/srv/deployment/ores/deploy$ scap deploy (T163950)

> I looked at this with @Ladsgroup and we think we figured it out! There's scb2005 and scb2006! We didn't know they existed and we haven't been deploying to those. These are the source of the old-code/new-code problem.

Sigh. Yeah, this is effectively an issue with needing a scap.cfg update for every host change (removal, rename, addition, pool, depool, and so on), which IMHO is wrong. In T159486 we updated all the other services (9 in number), but I completely forgot about ORES. Sorry about that.

FWIW, I think this is clearly error-prone, and I would argue wrong. It's not a clear separation of concerns for every single service to know those hosts. That information should be more centralized, contained in scap, and not leaked into every individual service's scap.cfg.
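Until the host list lives in one place, this kind of drift can at least be caught by comparing a service's deploy targets with the hosts that actually serve it. The sketch below assumes both lists are already available as plain host names. It does not read scap.cfg or any real pooling data, the function names are hypothetical, and scb2002 through scb2004 are assumed here only to fill out the example.

```python
def find_untargeted_hosts(serving_hosts, deploy_targets):
    """Hosts that serve traffic but are missing from the deploy target list."""
    return sorted(set(serving_hosts) - set(deploy_targets))


def find_stale_targets(serving_hosts, deploy_targets):
    """Deploy targets that no longer appear among the serving hosts."""
    return sorted(set(deploy_targets) - set(serving_hosts))


if __name__ == "__main__":
    # Hypothetical host lists; only scb2001, scb2005 and scb2006 are named in this task.
    serving = ["scb2001", "scb2002", "scb2003", "scb2004", "scb2005", "scb2006"]
    targets = ["scb2001", "scb2002", "scb2003", "scb2004"]

    print("never deployed to:", find_untargeted_hosts(serving, targets))
    print("stale targets:", find_stale_targets(serving, targets))
```

A non-empty "never deployed to" list is exactly the situation in this task: hosts answering requests with old code because the deployment did not know about them.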

Also, this was already diagnosed as a problem back in 2016. See https://gerrit.wikimedia.org/r/#/c/306431 (which needs some updates, but is still a valid approach).

Halfak closed this task as Resolved. Jun 5 2017, 5:07 PM
Halfak claimed this task.