Page MenuHomePhabricator

Randomly failing puppetmaster sync to strontium
Closed, ResolvedPublic

Description

We have seen random failures of the sync from palladium to strontium of the puppet repository after running puppet-merge on palladium, through the post-merge git hook.

Given that the last time it happened I have seen the login entry in strontium for the git user (in last), it looks like the error is while running the commands in strontium through ssh.

Event Timeline

Change 275031 had a related patch set uploaded (by Volans):
Git: add logging to post-merge hook on puppetmaster

https://gerrit.wikimedia.org/r/275031

Change 275031 merged by Volans:
Git: add logging to post-merge hook on puppetmaster

https://gerrit.wikimedia.org/r/275031

Mentioned in SAL [2016-03-04T20:00:44Z] <volans> Added logging to post-merge hook on palladium T128895

Change 275044 had a related patch set uploaded (by Volans):
Git: create the log file for T128895

https://gerrit.wikimedia.org/r/275044

Change 275044 merged by Volans:
Git: create the log file for T128895

https://gerrit.wikimedia.org/r/275044

Added some additional logging to the post-merge hook so that next time it fails we should have some additional info

I saw that happened again today from #wikimedia-operations IRC channel:

Wed 04:45:15   grrrit-wm| (CR) Yuvipanda: [C: 2] tools: Add paws as a separate host [puppet] -
                          https://gerrit.wikimedia.org/r/276090 (https://phabricator.wikimedia.org/T129209) (owner:
                          Yuvipanda)
Wed 05:01:48   icinga-wm| PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged
                          change in puppet (dir /var/lib/git/operations/puppet).
Wed 05:07:24   icinga-wm| RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.

Looking at the added log I found:

remote: Counting objects: 1   ^Mremote: Counting objects: 7653, done
remote: Finding sources:  14% (1/7)   ^Mremote: Finding sources:  28% (2/7)   ^Mremote: Finding sources:  42% (3/7)   ^Mremote: Finding sources:  57% (4/7)   ^Mremote: Finding sources:  71% (5/7)
  ^Mremote: Finding sources:  85% (6/7)   ^Mremote: Finding sources: 100% (7/7)   ^Mremote: Finding sources: 100% (7/7)
remote: Getting sizes:  50% (1/2)   ^Mremote: Getting sizes: 100% (2/2)   ^Mremote: Getting sizes: 100% (2/2)
remote: Total 7 (delta 4), reused 7 (delta 4)
Unpacking objects:  14% (1/7)   ^MUnpacking objects:  28% (2/7)   ^MUnpacking objects:  42% (3/7)   ^MUnpacking objects:  57% (4/7)   ^MUnpacking objects:  71% (5/7)   ^MUnpacking objects:  85% (6
/7)   ^MUnpacking objects: 100% (7/7)   ^MUnpacking objects: 100% (7/7), done.
error: Ref refs/remotes/origin/production is at 617f626de400a53620d1b39d4a0d00f418512b77 but expected f6174ca0b365a7fb1c56473d8353ed99a5ddf22c
From https://gerrit.wikimedia.org/r/p/operations/puppet
 ! f6174ca..617f626  production -> origin/production  (unable to update local ref)
Connection to strontium.eqiad.wmnet closed.

And from git log:

commit 617f626de400a53620d1b39d4a0d00f418512b77
Author: YuviPanda <...OMISSIS...>
Date:   Tue Mar 8 19:42:39 2016 -0800

    tools: Add paws as a separate host

    Bug: T129209
    Change-Id: I97618a1f9369e57e232c243b66d6dc896717ce62

commit f6174ca0b365a7fb1c56473d8353ed99a5ddf22c
Author: YuviPanda <...OMISSIS...>
Date:   Tue Mar 8 19:32:24 2016 -0800

    tools: Add specific paging monitoring for PAWS too

    Bug: T129209
    Change-Id: I271c754a18a73653ba0bcd5af64ceff3df7a2f5f

So it looks like that while running on strontium:

cd /var/lib/git/operations/puppet && git fetch --verbose --update-head-ok && git reset --hard origin/production

The git remote was not yet up to date. The last 2 commits where 10 minutes apart though...
I don't have time right now to continue the investigation, I'll try in the next days, but if anyone want to take a look go ahead.

ariel@palladium:~$ sudo -s
root@palladium:~# puppet-merge
Fetching new commits from https://gerrit.wikimedia.org/r/p/operations/puppet
remote: Counting objects: 8469, done
remote: Finding sources: 100% (7/7)
remote: Getting sizes: 100% (1/1)
remote: Total 7 (delta 5), reused 7 (delta 5)
Unpacking objects: 100% (7/7), done.
error: Ref refs/remotes/origin/production is at 01c4e9ed049a3482aacb47e26e7407b93d2d3050 but expected be17d1d144d3582027a43d0d481b09f62ba9442b
From https://gerrit.wikimedia.org/r/p/operations/puppet
! be17d1d..01c4e9e production -> origin/production (unable to update local ref)
root@palladium:~# puppet-merge

So this is on palladium, another case of something not quite in sync, an immediate retry worked fine on palladium and the push to strontium as well.

noticed this too from time to time, also note that the ssh command in post-merge hook will not end up doing what it says because of gitpuppet's authorized_keys:

# If no key is forwarded then this will use the ready-made equivalent command
      #  on worker and ignore our command.
      ssh -t -t strontium.eqiad.wmnet 'cd /var/lib/git/operations/puppet && git fetch --verbose --update-head-ok && git reset --hard origin/production' 2>&1 | tee -a /var/log/puppet-post-merge.log

though

# cat /etc/ssh/userkeys/gitpuppet
command="/bin/sh -c 'cd /var/lib/git/operations/puppet && git pull && git submodule update --init'" ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC905HaavYYVgE8uqq7amMO0Ec0Fg46VwLravClYI5XxnVa/723AeXiw...
command="git-receive-pack /srv/private" ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC78rNzhr8jlPbmCfCxtDtn7VHpI7AOqVQf94oX/j5vMj/8t+whSoit... private repo

FWIW I just saw this happening on palladium too, _before_ displaying the diff thus while updating from gerrit, repeating puppet-merge works though, looks like some sort of race between git/puppet-merge and gerrit

$ sudo puppet-merge
Fetching new commits from https://gerrit.wikimedia.org/r/p/operations/puppet
remote: Counting objects: 6304, done
remote: Finding sources: 100% (4/4)
remote: Getting sizes: 100% (1/1)
remote: Compressing objects: 100% (839/839)
remote: Total 4 (delta 2), reused 2 (delta 2)
Unpacking objects: 100% (4/4), done.
error: Ref refs/remotes/origin/production is at 8946b52a38d773b3052ca0b74723920b0728a665 but expected c3a3901b1fc3a6409456c3079dbe08a0fb79eeae
From https://gerrit.wikimedia.org/r/p/operations/puppet
 ! c3a3901..8946b52  production -> origin/production  (unable to update local ref)

And failed again:

Merge these changes? (yes/no)? yes
Merging 4fa5a8428500739447db13d4f2db3da4f5f900c1...
git merge --ff-only 4fa5a8428500739447db13d4f2db3da4f5f900c1
Updating 12fc29d..4fa5a84
Fast-forward
 manifests/site.pp |    1 +
 1 file changed, 1 insertion(+)
error: Ref refs/remotes/origin/production is at 4fa5a8428500739447db13d4f2db3da4f5f900c1 but expected 12fc29d3030bb600f07250b6c2a636eec839cfad
From https://gerrit.wikimedia.org/r/p/operations/puppet
 ! 12fc29d..4fa5a84  production -> origin/production  (unable to update local ref)
Connection to strontium.eqiad.wmnet closed.
Running git clean to clean any untracked files.
git clean -dffx -e /private/
HEAD is now 4fa5a8428500739447db13d4f2db3da4f5f900c1.
/home/jmm
Now running conftool-merge to sync any changes to conftool data

and again, actually not while synching to strontium but as soon as puppet-merge is ran

palladium:~$ sudo puppet-merge
Fetching new commits from https://gerrit.wikimedia.org/r/p/operations/puppet
error: Ref refs/remotes/origin/production is at b10ced9695b64f11ab407f3d77cb9abd7aa57026 but expected 647b60044dae7cdb5b4c179aab2ca55deded1891
From https://gerrit.wikimedia.org/r/p/operations/puppet
 ! 647b600..b10ced9  production -> origin/production  (unable to update local ref)

found this bug report from gerrit which _might_ be related and fixed in newer git versions: https://bugs.chromium.org/p/chromium/issues/detail?id=165330#c8

and again, actually not while synching to strontium but as soon as puppet-merge is ran

palladium:~$ sudo puppet-merge
Fetching new commits from https://gerrit.wikimedia.org/r/p/operations/puppet
error: Ref refs/remotes/origin/production is at b10ced9695b64f11ab407f3d77cb9abd7aa57026 but expected 647b60044dae7cdb5b4c179aab2ca55deded1891
From https://gerrit.wikimedia.org/r/p/operations/puppet
 ! 647b600..b10ced9  production -> origin/production  (unable to update local ref)

That one is a heisenbug for me. It only happens once, in either my local git repo. The only time is creates an actionable is when it happens on strontium but not palladium (which makes it even more weird, but does rule out some cases of a race condition) at which case palladium has the newer git objects, but strontium no.

Running puppet-merge on strontium fixes it.

Anyway, if indeed the bug filippo linked to is our problem, we should have it solved this Q given it's a goal to upgrade the puppetmaster to jessie, the git version of which should have the bug fixed.

fwiw, strontium is now dead. but we have rhodium. i wonder if that changed this ticket at all.

akosiaris changed the task status from Open to Stalled.Sep 14 2016, 5:28 PM
akosiaris lowered the priority of this task from High to Low.

I am thinking this is no longer happening. I haven't yet witnessed it on strontium. Probably the gerrit migration to a newer version fixed some of the issues and the newer git version on the jessie puppetmasters the rest. Let's stall it for a while and see if it happens again and act accordingly

Change 333162 had a related patch set uploaded (by Volans):
Puppetmaster: remove temporary logging for debugging

https://gerrit.wikimedia.org/r/333162

@akosiaris it's probably time to remove this patch that I made a year ago, since it's not happening anymore. Thoughts?

Change 333162 merged by Alexandros Kosiaris:
Puppetmaster: remove temporary logging for debugging

https://gerrit.wikimedia.org/r/333162

Agreed. Patch merged. Also resolving this. We haven't witnessed this for quite a while (>3months) so I think we are done with this

akosiaris claimed this task.