salt on deployment-pdf02.deployment-prep.eqiad.wmflabs is wedged
Closed, Resolved · Public

Description

cscott@deployment-bastion:/srv/deployment/ocg/ocg$ git deploy sync
Repo: ocg/ocg
Tag: ocg/ocg-sync-20150623-054042

2/2 minions completed fetch
Continue? ([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): y
Repo: ocg/ocg
Tag: ocg/ocg-sync-20150623-054042

1/2 minions completed checkout
Continue? ([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): d
Repo: ocg/ocg
Tag: ocg/ocg-sync-20150623-054042

1/2 minions completed checkout

Details:

deployment-pdf02.deployment-prep.eqiad.wmflabs: 
	checkout status: 50 [started: 0 mins ago, last-return: 0 mins ago]
Continue? ([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): r
Continue? ([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): d
Repo: ocg/ocg
Tag: ocg/ocg-sync-20150623-054042

1/2 minions completed checkout

Details:

deployment-pdf02.deployment-prep.eqiad.wmflabs: 
	checkout status: 50 [started: 0 mins ago, last-return: 0 mins ago]
Continue? ([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry):

ad infinitum.

Probably some sort of permissions problem on pdf02? I don't have root on the ocg machines, so I can't fix it. :(

deployment-pdf02 is not actually in the round-robin at the moment, so the fact that it's down is harmless as far as OCG running on beta goes. However, the service periodically tries to restart and fills up Kibana with:

Cannot immediately respawn thread. Waiting 1s to avoid forkbombing.
Worker (pid 26350) has disconnected. Suicide: false. Restarting: true.

over and over again. So deployment-pdf02 should be fixed, if only to avoid spamming Kibana.

Event Timeline

cscott set the priority of this task to Medium.
cscott updated the task description.
cscott subscribed.
Restricted Application added a subscriber: Aklapper.

Oh, I forgot to add:

cscott@deployment-pdf02:/srv/deployment/ocg/ocg/mw-ocg-service$ git log
commit 2b2816081120a5e65f5bf80c294b2a85a1de38f8
Author: C. Scott Ananian <cscott@cscott.net>
Date:   Mon Nov 10 16:03:24 2014 -0500

    Double cache lifetime of successful renders, to 4 days.

That's a very old commit! The mw-ocg-service submodule should be at 2941c3dc0f25c654d7ddab1b82ecb726728551e1, with a timestamp of Mon Jun 22 23:28:07 2015 -0400. It's correct on deployment-bastion and deployment-pdf01, of course.
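
For anyone hitting the same symptom: one quick way to spot a submodule that is not at the commit its superproject records is to look at the first character of each line of `git submodule status` (`+` means the checked-out commit differs from the recorded one, `-` means the submodule is not initialized, `U` means it has merge conflicts). A minimal sketch follows; the function name `stale_submodules` is made up for illustration and is not part of git deploy or salt.

```shell
#!/bin/sh
# Hypothetical helper (not part of git deploy or salt).
# Reads `git submodule status` output on stdin and prints the path of
# every submodule whose line starts with '+', '-' or 'U', i.e. every
# submodule that is not cleanly at the commit the superproject records.
stale_submodules() {
    awk '/^[-+U]/ { print $2 }'
}

# Possible usage, run from the superproject checkout:
#   git submodule status | stale_submodules
```

An empty result would suggest that every submodule matches the superproject; any path printed is worth inspecting with `git log` as above.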

Probably some sort of permissions problem on pdf02? I don't have root on the ocg machines, so I can't fix it. :(

Aren't you a projectadmin? You should be able to sudo everywhere...

ocg-render-admins:
  gid: 721
  description: admins for pdf render (rt 6468)
  members: [cscott, ssastry, gwicke, arlolra]
  privileges: ['ALL = NOPASSWD: /usr/sbin/service ocg *',
               'ALL = (ocg) NOPASSWD: ALL']

ocg-render-admins can be ocg, but they can't be root.

Are you saying this relies on the OCG hosts in production? You're a projectadmin on deployment-prep, which should allow you to sudo as root.

Well, I'll be:

cscott@deployment-pdf02:~$ sudo -s
root@deployment-pdf02:~#

I guess I was already sudoed to ocg before, and never noticed that I had superpowers as myself. Let's see if I can figure out what's getting salt stuck here, then.

Happiness:

Repo: ocg/ocg
Tag: ocg/ocg-sync-20150623-132307

2/2 minions completed checkout

Details:

I had to manually clean up some dirty repos as root. There were local changes that salt didn't want to overwrite, but they didn't look manually created; they looked like the result of some ancient salt invocation gone awry. Anyway, I made deployment-pdf02 match deployment-pdf01, and now sync works and the service on deployment-pdf02 has been restarted. Victory.
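
The thread doesn't record which commands were used for the cleanup. As a rough sketch of the inspection step only, tracked files with local changes can be listed from `git status --porcelain` output, skipping untracked files (marked `??`). The name `dirty_paths` is hypothetical, and how to discard or keep the changes it reports is left to whoever is doing the cleanup.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not the commands used in this task).
# Reads `git status --porcelain` output on stdin. Each line is a
# two-character status code, a space, then the path. Untracked entries
# ("??") are skipped; the path part of every other entry is printed.
dirty_paths() {
    awk 'substr($0, 1, 2) != "??" && length($0) > 3 { print substr($0, 4) }'
}

# Possible usage, run inside the repo being checked:
#   git status --porcelain | dirty_paths
# To see the same raw status for every submodule as well:
#   git submodule foreach --recursive 'git status --porcelain'
```

Comparing this output between a working host and the wedged one is one way to see what differs before changing anything.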

Thanks, guys.

cscott claimed this task.