
Update git lfs on stat1006/7
Closed, Resolved · Public · 3 Estimated Story Points

Description

Git LFS on stat1007 is a little broken. It doesn't add files to LFS as expected, and that causes problems. The version currently installed on stat1007 is 2.3.4; the latest available version is 2.6.1.

My current workaround is to commit large binaries by scp'ing them to my local machine and then pushing them (to git-lfs) from there. It would be nice to push files directly from stat1006/7.
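For reference, the workaround looks roughly like this (hostname, paths, and branch below are just examples, not the actual repo):

halfak@laptop:~/repo$ scp stat1007.eqiad.wmnet:~/models/example.model models/
halfak@laptop:~/repo$ git lfs track "models/*.model"         # records the pattern in .gitattributes
halfak@laptop:~/repo$ git add .gitattributes models/example.model
halfak@laptop:~/repo$ git commit -m "Add example model via git-lfs"
halfak@laptop:~/repo$ git push origin master                 # the binary is uploaded to LFS storage, not into git history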

Event Timeline

Milimetric triaged this task as Medium priority.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.
Milimetric added a project: Analytics-Kanban.

Hey Aaron,

I can see the following:

elukey@stat1007:~$ apt-cache policy git-lfs
git-lfs:
  Installed: 2.3.4-1
  Candidate: 2.3.4-1
  Version table:
     2.6.1-1~bpo9+1 100
        100 http://mirrors.wikimedia.org/debian stretch-backports/main amd64 Packages
 *** 2.3.4-1 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status


elukey@stat1006:~$ apt-cache policy git-lfs
git-lfs:
  Installed: (none)
  Candidate: 2.3.4-1
  Version table:
     2.6.1-1~bpo9+1 100
        100 http://mirrors.wikimedia.org/debian stretch-backports/main amd64 Packages
     2.3.4-1 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages

So on stat1007 the package is deployed, but the version that you'd like, 2.6.1, is in stretch's backports (which is why 2.3.4 is installed); meanwhile on stat1006 it is not installed at all. I believe that the package is deployed due to a scap puppet class (that requires the package), not as part of the set of statistics packages.
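Just as a side note, if we wanted to test 2.6.1 by hand before a proper puppet change, the pin on stretch-wikimedia (1001) outranks both backports (100) and the 990 that -t stretch-backports would give it, so the backports version would need to be requested explicitly, e.g.:

elukey@stat1007:~$ sudo apt-get install git-lfs=2.6.1-1~bpo9+1    # untested; the real fix should go through puppet anyway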

I wasn't aware of git-lfs; is it preferred over git fat (which is what we "support" on the stat boxes)? Second question: are the issues that you found in 2.3.4 resolved by 2.6.1?

I assumed that stat1006/7 would have the same basic puppet config. I wonder why there is a difference.

The problems are resolved by git-lfs 2.6.1. I can't say whether any intermediate versions also resolve the issues. Essentially, LFS is just plain broken in 2.3.4 (it won't add new files to LFS), so we are unable to use it on stat1007 to push new files. Git-lfs is part of the ORES development and deployment process (supported by Release-Engineering-Team). There are many reasons why git-lfs is desirable over alternatives like git fat, but I'm not sure this is the right place to debate that.
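The symptom is easy to reproduce, by the way; roughly (file names are just examples):

halfak@stat1007:~/repo$ git lfs track "*.model"
halfak@stat1007:~/repo$ git add .gitattributes example.model
halfak@stat1007:~/repo$ git lfs status       # the staged file should show up as an LFS object here
halfak@stat1007:~/repo$ git commit -m "Add example model"
halfak@stat1007:~/repo$ git lfs ls-files     # on a working install this lists example.model; on 2.3.4 it comes back empty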

Thanks for the feedback!

I assumed that stat1006/7 would have the same basic puppet config. I wonder why there is a difference.

I think that since stat1007 is a scap target (for refinery) it gets the package installed as a side effect, which doesn't happen on stat1006.

The problems are resolved by git-lfs 2.6.1. I can't say whether any intermediate versions also resolve the issues. Essentially, LFS is just plain broken in 2.3.4 (it won't add new files to LFS), so we are unable to use it on stat1007 to push new files. Git-lfs is part of the ORES development and deployment process (supported by Release-Engineering-Team). There are many reasons why git-lfs is desirable over alternatives like git fat, but I'm not sure this is the right place to debate that.

No intention to debate, I was only trying to understand the use case to see whether other tools could have done the job. It sounds like something needed on the stat boxes; I'll file a puppet code change to deploy the 2.6.1 version asap.

Great! Thank you for your help with this :)

I assumed that stat1006/7 would have the same basic puppet config

Just FYI: They don't. They share some common things, but they apply different puppet roles and have different access patterns (e.g. private data is accessible from stat1007, but not 1006).

Change 485852 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::packages::statistics: deploy git-lfs

https://gerrit.wikimedia.org/r/485852

Change 485852 merged by Elukey:
[operations/puppet@production] profile::analytics::packages::statistics: deploy git-lfs

https://gerrit.wikimedia.org/r/485852

@Halfak should be done! Can you check and confirm?
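Something along these lines should confirm it:

elukey@stat1007:~$ apt-cache policy git-lfs    # "Installed:" should now show a 2.6.x build rather than 2.3.4-1
elukey@stat1007:~$ git lfs version             # prints the client version string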

elukey set the point value for this task to 3.

Looks like we are using git fat here as a means to deploy from the stats machines to prod, and we need a better way to do that. The stats machines should be used for research and one-off computations; they should not be in the path of deployments to production infrastructure. Let's follow up on this once we know whether we are going to use swift.

Looks like we are using git fat here as a means to deploy from the stats machines to prod, and we need a better way to do that. The stats machines should be used for research and one-off computations; they should not be in the path of deployments to production infrastructure. Let's follow up on this once we know whether we are going to use swift.

That's a very good point. Your mentioning this reminded me that we can use docker jobs for that. I'll spec this out more in a phab ticket, but it's out of scope for this ticket and won't happen soon.

@Ladsgroup sounds good, just mention the other ticket in this one so we can follow that work.

Thank you! I've confirmed that this works. Sorry for the delay @elukey. The last couple weeks have been a bit unusual.

Regarding the point that @Nuria made about where we're building models, I think that these machines might not be that problematic for the role they play. It's important to understand that model building is a research/development activity. We're not including these machines as part of our deployment process any more than we are using our individual laptops as part of our deployment process. The models that we build are committed to our repos via Git LFS. Then there's an independent deployment process that pulls the models from the same repos.
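For what it's worth, the deploy side only needs a normal clone plus an LFS fetch to materialize the binaries, and none of that has to run on the stat machines (repo URL and host below are placeholders):

deploy-host$ git clone <model-repo-url> repo && cd repo
deploy-host$ git lfs pull        # fetches any LFS objects not already checked out (e.g. if the clone skipped the smudge filter)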

But that said, there are some other concerns worth considering. My biggest concern is that we've worked to match our production environment to the environment of the stat machines. This allows us to ensure that all of the inner workings of the models stay consistent between the research/development work we do on the stat machines and the prod environment. But these things can fall out of sync, and that should be OK: research machines should be able to drift from our prod cluster and vice versa. So, in order to get around this issue, we'd need to have a relatively capable build environment outside of the stat machines. We need lots of CPU and memory for the builds we are doing, so such an environment could be expensive, and we might under-utilize it if we follow the same pattern of setting up the environment on bare hardware. @Ladsgroup suggests that maybe we can use docker jobs to build the models. That could improve utilization, but it might also be disruptive to the development workflow, since experimentation and research are key components of modeling work.

Models used in production should be built in hadoop and pushed directly from there to wherever they go. For this there is some work that our team needs to do on hadoop/swift/GPUs and, in general, on making it easier to run models on the cluster. We will be working on a program around machine learning infrastructure for next year and will also consider ORES use cases there.

My biggest concern is that we've worked to match our production environment to the environment of the stat machines

My point here is that the stats machines are not tier-1 production infrastructure; they serve a different goal. The deployment pipeline for models that are consumed in prod needs to work even if the stats machines are turned off.

So, in order to get around this issue, we'd need to have a relatively capable build environment outside of the stat machines.

We do, on the hadoop cluster, which is built for this purpose.

With that, I think we can resolve this task.