
analytics/refinery deployment broken at refinery-deploy-to-hdfs
Closed, Resolved, Public

Description

When deploying analytics refinery, we first perform a scap deploy, then we run an ad-hoc utility, refinery-deploy-to-hdfs, to push the files to HDFS.

refinery-deploy-to-hdfs uses git commands to know what files to send. It crashes when running an equivalent of a git status.

In fact, running git status in /srv/deployment/analytics/refinery fails with OSError: [Errno 13] Permission denied: '.git/fat/objects/tmpxUXHUu'

Granting write permission to "other" on .git/fat solves the problem temporarily. We did this on the test cluster.

Related to: T214229

Event Timeline

I got a similar error when deploying analytics refinery.

When I run git log in /srv/deployment/analytics/refinery, it fails with this error:

fatal: detected dubious ownership in repository at '/srv/deployment/analytics/refinery-cache/revs/4e8f1ac406321dab19726dbecdd198d5e8e130d4'
To add an exception for this directory, call:

git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/4e8f1ac406321dab19726dbecdd198d5e8e130d4

To resolve this, I ran:

git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/4e8f1ac406321dab19726dbecdd198d5e8e130d4

But when I run the refinery-deploy-to-hdfs utility to push the files to HDFS, it crashes with this error:
Error: Cannot describe current version.

As a temporary solution, taken from here, I ran this command as the hdfs user:

sudo -u hdfs kerberos-run-command hdfs git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/4e8f1ac406321dab19726dbecdd198d5e8e130d4
JAllemandou triaged this task as Unbreak Now! priority. Apr 20 2023, 4:23 PM

Aha! Now I'm experiencing this error as well.

btullis@an-launcher1002:/srv/deployment/analytics/refinery$ git log -n 1
fatal: detected dubious ownership in repository at '/srv/deployment/analytics/refinery-cache/revs/1631dea27be8b8588c220cb352b1c974a5ddef28'
To add an exception for this directory, call:

	git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/1631dea27be8b8588c220cb352b1c974a5ddef28

Investigating now.

This seems to be related to a recent security upgrade of the git package: {T335354}

That ticket is currently active, so I'll add further information to that and then update here.

Thanks for recording the steps you took to implement the workaround @Snwachukwu. 👍
We can see that when you executed the workaround:

sudo -u hdfs kerberos-run-command hdfs git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/4e8f1ac406321dab19726dbecdd198d5e8e130d4

...the result was the creation of the following file:

btullis@an-launcher1002:/var/lib/hadoop-hdfs$ cat .gitconfig 
[safe]
	directory = /srv/deployment/analytics/refinery-cache/revs/4e8f1ac406321dab19726dbecdd198d5e8e130d4
	directory = /srv/deployment/analytics/refinery-cache/revs/4e8f1ac406321dab19726dbecdd198d5e8e130d4
	directory = /srv/deployment/analytics/refinery-cache/revs/1631dea27be8b8588c220cb352b1c974a5ddef28

However, the command would have to be run for each refinery deploy. I have asked in T335354#8807059 whether there is a standard mechanism for managing these global settings.
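The repeated entry in the .gitconfig above shows that running the workaround more than once for the same revision adds duplicate lines. A minimal sketch of an idempotent variant follows, assuming bash and git are available; `add_safe_directory_once` is a hypothetical helper name and is not part of refinery or scap.

```shell
#!/bin/bash
# Hypothetical helper (not part of refinery): add a path to the invoking
# user's global safe.directory list only if it is not already listed.
add_safe_directory_once() {
    local dir="$1"
    local existing
    # --get-all prints one configured value per line and exits non-zero
    # when the key is not set at all, hence the "|| true".
    existing="$(git config --global --get-all safe.directory || true)"
    # grep -x -F matches the whole line as a literal string.
    if grep -qxF -- "$dir" <<<"$existing"; then
        return 0
    fi
    git config --global --add safe.directory "$dir"
}
```

For example, `add_safe_directory_once /srv/deployment/analytics/refinery-cache/revs/1631dea27be8b8588c220cb352b1c974a5ddef28` could be called any number of times and would leave a single entry. It only affects the user who runs it, so the hdfs user would still need its own invocation.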

I can use the same workaround to complete this deploy for now.

I added the workaround, both as my own user on an-launcher1002 and as the hdfs user.

btullis@an-launcher1002:/srv/deployment/analytics/refinery$ sudo -u hdfs kerberos-run-command hdfs git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/1631dea27be8b8588c220cb352b1c974a5ddef28
btullis@an-launcher1002:/srv/deployment/analytics/refinery$ git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/1631dea27be8b8588c220cb352b1c974a5ddef28

However, when I then tried to run git status as my own user, I saw the same problem as originally reported in the description above:

btullis@an-launcher1002:/srv/deployment/analytics/refinery$ git status
<snip>
OSError: [Errno 13] Permission denied: '.git/fat/objects/tmpVxbkiq'
error: external filter 'git-fat filter-clean' failed 1

I decided to switch to the user that owns the repository and check git status.
This showed a detached HEAD.

btullis@an-launcher1002:/srv/deployment/analytics/refinery$ sudo -u analytics-deploy git status
HEAD detached at 1631dea2
nothing to commit, working tree clean

I went to pull the master branch as this user:

btullis@an-launcher1002:/srv/deployment/analytics/refinery$ sudo -u analytics-deploy -i
analytics-deploy@an-launcher1002:~$ cd /srv/deployment/analytics/refinery
analytics-deploy@an-launcher1002:/srv/deployment/analytics/refinery$ git status
HEAD detached at 1631dea2
nothing to commit, working tree clean
analytics-deploy@an-launcher1002:/srv/deployment/analytics/refinery$ git checkout master
Checking out files: 100% (786/786), done.
Previous HEAD position was 1631dea2 Remove deprecated all_settings streamconfigs param
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
analytics-deploy@an-launcher1002:/srv/deployment/analytics/refinery$ git pull
From /srv/deployment/analytics/refinery-cache/cache
 * [new tag]           scap/sync/2023-04-20/0002 -> scap/sync/2023-04-20/0002
 * [new tag]           scap/sync/2023-04-20/0003 -> scap/sync/2023-04-20/0003
Already up to date.
analytics-deploy@an-launcher1002:/srv/deployment/analytics/refinery$ git log -n 1
commit 4aba3708a41b737f577ece8e832b2a4f374eb5fa (HEAD -> master, tag: scap/sync/2020-06-25/0003, tag: scap/sync/2020-06-25/0002, tag: scap/sync/2020-06-25/0001, origin/master, origin/HEAD)
Author: Joseph Allemandou <joal@wikimedia.org>
Date:   Thu Jun 25 09:53:29 2020 +0200

    Correct pageview_actor_hourly bug
    
    The bug appears in a non-deterministic way, so it's difficicult to pinpoint.
    It seems to be https://issues.apache.org/jira/browse/HIVE-14555 so the
    fix provided is to disable map-side join for the job.
    Also bump the refinery-jar version to latest (noop).
    
    Bug: T255467
    Change-Id: Idd159280aed4045136835dc8c5924537533ce222

This showed that the latest commit on the master branch dated back to June 2020.

At this point I decided to do another full refinery deploy from the deployment server.

When this finished, I used the workaround to add the safe directory again:

btullis@an-launcher1002:/srv/deployment/analytics/refinery$ sudo -u hdfs kerberos-run-command hdfs git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/571f9558ee28d58db87237ca2e016464c5db595e
btullis@an-launcher1002:/srv/deployment/analytics/refinery$ git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/571f9558ee28d58db87237ca2e016464c5db595e

I was then able to check that the latest version of the code had been successfully pulled to an-launcher1002.

btullis@an-launcher1002:/srv/deployment/analytics/refinery$ git log -n 1
commit 571f9558ee28d58db87237ca2e016464c5db595e (HEAD, tag: scap/sync/2023-04-26/0004, tag: scap/sync/2023-04-26/0003, tag: scap/sync/2023-04-26/0002, tag: scap/sync/2023-04-26/0001)
Author: Ben Tullis <btullis@wikimedia.org>
Date:   Tue Apr 25 14:58:20 2023 +0100

    Add guw.wikinews and kbd.wiktionary to the allowlist
    
    These wikis have been added recently, so must be added to the pageview
    allowlist.
    
    Bug: T334459
    Bug: T333266
    Change-Id: I88aafd403114c0c5684fa29e2bd04644b5c5651b

Finally, I was able to run refinery-deploy-to-hdfs successfully:

btullis@an-launcher1002:/srv/deployment/analytics/refinery$ sudo -u hdfs kerberos-run-command hdfs /srv/deployment/analytics/refinery/bin/refinery-deploy-to-hdfs --verbose --no-dry-run
2023-04-26T10:55:26+00:00 Current git description: '2023-04-26T10.55.26+00.00--scap_sync_2023-04-26_0004-dirty'
2023-04-26T10:55:26+00:00
2023-04-26T10:55:26+00:00 * Preparing HDFS ...
2023-04-26T10:55:26+00:00 hdfs dfs -D fs.permissions.umask-mode=022 -mkdir -p hdfs:///wmf/refinery
2023-04-26T10:55:28+00:00
2023-04-26T10:55:28+00:00 * Copying local checkout to versioned directory in HDFS ...
2023-04-26T10:55:28+00:00 hdfs dfs -rm -r -f -skipTrash hdfs:///wmf/refinery/2023-04-26T10.55.26+00.00--scap_sync_2023-04-26_0004-dirty.tmp
2023-04-26T10:55:30+00:00 hdfs dfs -D fs.permissions.umask-mode=022 -mkdir hdfs:///wmf/refinery/2023-04-26T10.55.26+00.00--scap_sync_2023-04-26_0004-dirty.tmp
2023-04-26T10:55:32+00:00 hdfs dfs -D fs.permissions.umask-mode=022 -put -f airflow artifacts bin diagrams druid gobblin HACKING.md hive hql oozie packaged-environments python README.md setup.cfg spark static_data hdfs:///wmf/refinery/2023-04-26T10.55.26+00.00--scap_sync_2023-04-26_0004-dirty.tmp
2023-04-26T10:56:57+00:00 hdfs dfs -D fs.permissions.umask-mode=022 -put /dev/fd/63 hdfs:///wmf/refinery/2023-04-26T10.55.26+00.00--scap_sync_2023-04-26_0004-dirty.tmp/.deployment
2023-04-26T10:57:02+00:00 hdfs dfs -D fs.permissions.umask-mode=022 -mv hdfs:///wmf/refinery/2023-04-26T10.55.26+00.00--scap_sync_2023-04-26_0004-dirty.tmp hdfs:///wmf/refinery/2023-04-26T10.55.26+00.00--scap_sync_2023-04-26_0004-dirty
2023-04-26T10:57:04+00:00
2023-04-26T10:57:04+00:00 * Setting up 'current' version on cluster ...
2023-04-26T10:57:04+00:00 hdfs dfs -D fs.permissions.umask-mode=022 -cp hdfs:///wmf/refinery/2023-04-26T10.55.26+00.00--scap_sync_2023-04-26_0004-dirty hdfs:///wmf/refinery/current.tmp
2023-04-26T10:58:45+00:00 hdfs dfs -D fs.permissions.umask-mode=022 -mv hdfs:///wmf/refinery/current hdfs:///wmf/refinery/current.swap
2023-04-26T10:58:47+00:00 hdfs dfs -D fs.permissions.umask-mode=022 -mv hdfs:///wmf/refinery/current.tmp hdfs:///wmf/refinery/current
2023-04-26T10:58:49+00:00 hdfs dfs -rm -r -f -skipTrash hdfs:///wmf/refinery/current.swap
Deleted hdfs:///wmf/refinery/current.swap
2023-04-26T10:58:52+00:00 pass (Used parameters: --verbose --no-dry-run )

This is still only a workaround, so we will need to work out the best long-term fix for the situation.

JArguello-WMF lowered the priority of this task from Unbreak Now! to High. Apr 26 2023, 4:11 PM
BTullis added a subscriber: MoritzMuehlenhoff.

As discussed in #wikimedia-analytics on IRC, @MoritzMuehlenhoff is going to make a patch to set a git::systemconfig value that adds /srv/deployment/refinery to the safe.directory list for the analytics_cluster::launcher as well as the A:hadoop-coordinator roles.

This should fix the deploy process for refinery-deploy-to-hdfs

@BTullis I've made a patch for this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/912301

Could you please review it? I'll then merge it on Tuesday and we can confirm with a Refinery test deployment.

BTullis moved this task from Blocked/Paused to Done on the Data Pipelines (Sprint 12) board.

This is the step that was failing for me, but it now succeeds after the change that you deployed @MoritzMuehlenhoff - Many thanks indeed.

btullis@an-launcher1002:/srv/deployment/analytics/refinery$ git log -n 1
commit 571f9558ee28d58db87237ca2e016464c5db595e (HEAD, tag: scap/sync/2023-04-26/0004, tag: scap/sync/2023-04-26/0003, tag: scap/sync/2023-04-26/0002, tag: scap/sync/2023-04-26/0001)
Author: Ben Tullis <btullis@wikimedia.org>
Date:   Tue Apr 25 14:58:20 2023 +0100

    Add guw.wikinews and kbd.wiktionary to the allowlist
    
    These wikis have been added recently, so must be added to the pageview
    allowlist.
    
    Bug: T334459
    Bug: T333266
    Change-Id: I88aafd403114c0c5684fa29e2bd04644b5c5651b

I'll resolve this incident, but we will revisit it if there are any other errors.

We experienced this again during today's (May 16 2023) refinery deploy.

stevemunene@an-launcher1002:/srv/deployment/analytics/refinery$ git log -n 1
fatal: detected dubious ownership in repository at '/srv/deployment/analytics/refinery-cache/revs/2a0b1f20473b319ea7f94b5c1b126afb40e0bc50'
To add an exception for this directory, call:

	git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/2a0b1f20473b319ea7f94b5c1b126afb40e0bc50

Ran the add safe directory command both for my user and for the hdfs user:

stevemunene@an-launcher1002:/srv/deployment/analytics/refinery$ sudo -u hdfs kerberos-run-command hdfs git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/2a0b1f20473b319ea7f94b5c1b126afb40e0bc50
stevemunene@an-launcher1002:/srv/deployment/analytics/refinery$ git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/2a0b1f20473b319ea7f94b5c1b126afb40e0bc50

Verified that we could run the git commands as my user:

stevemunene@an-launcher1002:/srv/deployment/analytics/refinery$ git status
HEAD detached at 2a0b1f20
nothing to commit, working tree clean
stevemunene@an-launcher1002:/srv/deployment/analytics/refinery$ git log -n 1
commit 2a0b1f20473b319ea7f94b5c1b126afb40e0bc50 (HEAD, tag: scap/sync/2023-05-16/0001)
Merge: 896198c9 e70d952a
Author: Joal <joal@wikimedia.org>
Date:   Tue May 16 11:42:03 2023 +0000

    Merge "Add btm.wiktionary to pageview allowlist"

Then verified the same as the analytics-deploy user:

stevemunene@an-launcher1002:/srv/deployment/analytics/refinery$ sudo -u analytics-deploy -i

You do not have a valid Kerberos ticket in the credential cache, remember to kinit.
analytics-deploy@an-launcher1002:~$ cd /srv/deployment/analytics/refinery
analytics-deploy@an-launcher1002:/srv/deployment/analytics/refinery$ git log -n 1
commit 2a0b1f20473b319ea7f94b5c1b126afb40e0bc50 (HEAD, tag: scap/sync/2023-05-16/0001)
Merge: 896198c9 e70d952a
Author: Joal <joal@wikimedia.org>
Date:   Tue May 16 11:42:03 2023 +0000

    Merge "Add btm.wiktionary to pageview allowlist"

Change 920280 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the refinery-cache directory to the git safe list

https://gerrit.wikimedia.org/r/920280

Change 920280 merged by Btullis:

[operations/puppet@production] Add the refinery-cache directory to the git safe list

https://gerrit.wikimedia.org/r/920280

We had this error pop up again during this week's refinery deploy.

stevemunene@an-launcher1002:/srv/deployment/analytics/refinery$ git log -n 1
fatal: detected dubious ownership in repository at '/srv/deployment/analytics/refinery-cache/revs/24ff363e223bc27d84aa44f4c82b8eaffada133e'
To add an exception for this directory, call:

	git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/24ff363e223bc27d84aa44f4c82b8eaffada133e

Temporarily solved it by running

sudo -u hdfs kerberos-run-command hdfs git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/24ff363e223bc27d84aa44f4c82b8eaffada133e

This is not ideal. We should add the full directory /srv/deployment/analytics/refinery-cache/revs/, because safe.directory, as configured in 920280, is not recursive and so does not cover the whole /srv/deployment/analytics/refinery-cache/revs/ folder.

Change 922905 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add the refinery-cache/revs directory to git safe list

https://gerrit.wikimedia.org/r/922905

Change 922905 abandoned by Btullis:

[operations/puppet@production] Add the refinery-cache/revs directory to git safe list

Reason:

Not going to be effective.

https://gerrit.wikimedia.org/r/922905

@MoritzMuehlenhoff made a useful suggestion on that patch, which I'll put here so I don't lose it.

Is there a way to determine the hash to be used in /srv/deployment/analytics/refinery-cache/revs/HASH as part of the refinery-deploy-to-hdfs command?
If so, then we could also not bother with managing this in Puppet, but dynamically run
git config --global --add safe.directory /srv/deployment/analytics/refinery-cache/revs/HASH
and
git config --global --remove safe.directory /srv/deployment/analytics/refinery-cache/revs/HASH
as part of the script?
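The suggestion above could be sketched roughly as follows. This is an illustration only, not what was eventually merged. `with_safe_directory` is a hypothetical function name, and the sketch uses `--unset-all`, which is the `git config` option for removing values. How the script would find HASH is left open here. One untested possibility is resolving /srv/deployment/analytics/refinery with `readlink -f`, on the assumption that it is a symlink into refinery-cache/revs, as the error messages suggest.

```shell
#!/bin/bash
# Hypothetical helper: mark DIR as a git safe.directory only for the
# duration of one command, then remove the entry again.
# Usage: with_safe_directory DIR COMMAND [ARGS...]
with_safe_directory() {
    local dir="$1"
    shift
    local status=0
    git config --global --add safe.directory "$dir"
    # Run the wrapped command and remember its exit status.
    "$@" || status=$?
    # --unset-all takes a regular expression for the value; this assumes
    # the path contains no characters that need escaping in a regex.
    git config --global --unset-all safe.directory "^${dir}\$" || true
    return "$status"
}
```

A call such as `with_safe_directory /srv/deployment/analytics/refinery-cache/revs/HASH git describe --always --dirty` would leave the global configuration as it was before, apart from a possibly empty `[safe]` section.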

For now, engineers who are deploying refinery, usually on a weekly basis, still have to use the workaround manually.

Aklapper renamed this task from anlytics/refinery deployment broken at refinery-deploy-to-hdfs to analytics/refinery deployment broken at refinery-deploy-to-hdfs. Jul 19 2023, 1:14 PM

Change 950194 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Grant analytics-admins rights to run some git cmds as analytics-deploy

https://gerrit.wikimedia.org/r/950194

Change 950195 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/refinery@master] Use sudo with git in refinery_deploy_to_hdfs

https://gerrit.wikimedia.org/r/950195

It seems to me that the simplest solution is to use sudo to run the git commands as the analytics-deploy user.
The only git command being executed in the refinery-deploy-to-hdfs script is git describe so I think that this is fairly safe to add.

btullis@an-launcher1002:/srv/deployment/analytics/refinery$ sudo -u analytics-deploy git describe --always --dirty
scap/sync/2023-08-02/0001
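As a rough illustration of the idea (not the content of change 950195), the describe step could take the command used to invoke git as a parameter, so that the caller decides whether it goes through sudo. `describe_version` is a hypothetical name. The error text is copied from the failure reported earlier in this task.

```shell
#!/bin/bash
# Hypothetical helper: print the git description of the current checkout.
# The arguments are the command prefix used to invoke git, for example
# "sudo -u analytics-deploy /usr/bin/git" or simply "git".
describe_version() {
    local description
    if ! description="$("$@" describe --always --dirty 2>/dev/null)"; then
        echo "Error: Cannot describe current version." >&2
        return 1
    fi
    printf '%s\n' "$description"
}
```

Run from /srv/deployment/analytics/refinery, `describe_version sudo -u analytics-deploy /usr/bin/git` would correspond to the command shown above, provided the sudo rule allows it.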

We can also update the docs so that the git log -n 1 is replaced with sudo -u analytics-deploy /usr/bin/git log -n 1.

I've created a puppet patch to add the new sudo rules and a patch to refinery to implement the change in behaviour in the script.

Change 950194 merged by Btullis:

[operations/puppet@production] Grant analytics-admins rights to run some git cmds as analytics-deploy

https://gerrit.wikimedia.org/r/950194

Change 950195 merged by Btullis:

[analytics/refinery@master] Use sudo with git in refinery_deploy_to_hdfs

https://gerrit.wikimedia.org/r/950195

This is now merged, but we're waiting for the first refinery-deploy after this, to validate whether or not it works as expected.
I have added a note to the deployment instructions stating this: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Deploy/Refinery#How_to_deploy

https://wikitech.wikimedia.org/w/index.php?title=Data_Engineering/Systems/Cluster/Deploy/Refinery&diff=prev&oldid=2111316

Moving this ticket to waiting, whilst we await confirmation.

Change 958927 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/refinery@master] Update refinery-deploy-to-hdfs to use sudo

https://gerrit.wikimedia.org/r/958927

Change 958927 merged by Joal:

[analytics/refinery@master] Update refinery-deploy-to-hdfs to use sudo

https://gerrit.wikimedia.org/r/958927