Maniphest T211981

Improve article-recommender scripts
Closed, Resolved (Public)

Assigned To

Authored By

	• bmansurov
	Dec 14 2018, 2:54 PM

Description

The current implementation of ranking missing Wikipedia articles (article-recommender) is exploratory in nature. It is a set of scripts that are run one after another. Improve the codebase so that the repository is ready for production use.

A/C

Parameterize the scripts.
Make sure the script can be run by a system user and that data is saved into Hive instead of Parquet/CSV files.
Make the scripts ready to be run by Oozie.
...
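
To illustrate the first criterion, here is a minimal sketch of what parameterizing a script could look like with Python's standard argparse module. The option names (--source-language, --target-language, --output-table) and the default table name are assumptions made for this example; they are not taken from the article-recommender repository.

```python
import argparse


def build_parser():
    """Return a parser for a recommender run; all option names are hypothetical."""
    parser = argparse.ArgumentParser(
        description="Rank missing articles for one language pair (illustrative)."
    )
    parser.add_argument("--source-language", required=True,
                        help="wiki language code to recommend from, e.g. 'en'")
    parser.add_argument("--target-language", required=True,
                        help="wiki language code to recommend for, e.g. 'uz'")
    parser.add_argument("--output-table", default="article_recommendations",
                        help="hypothetical Hive table that receives the results")
    return parser


def parse_args(argv=None):
    """Parse argv (defaults to sys.argv[1:]) so a scheduler can pass parameters."""
    return build_parser().parse_args(argv)


# Example invocation with explicit arguments instead of hard-coded values.
args = parse_args(["--source-language", "en", "--target-language", "uz"])
print(args.source_language, args.target_language, args.output_table)
```

Taking every input from the command line is what would let a scheduler such as Oozie supply different values per run without editing the script.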

Details

Subject	Repo	Branch	Lines +/-
Parametrize the script	research/article-recommender	master	+332 -406
Virtual environment: use system packages by default	research/article-recommender/deploy	master	+11 -1
Add scripts for preparing requirements	research/article-recommender/deploy	master	+182 -1
Add packaging instructions	research/article-recommender	master	+27 -0
Get scripts ready for use by Oozie	research/article-recommender	master	+666 -1 K

Related Objects

Status	Assigned	Task
Open	None	T214074 Productionize article recommender systems
Resolved	• bmansurov	T210844 Generate article recommendations in Hadoop for use in production
Resolved	• bmansurov	T211981 Improve article-recommender scripts

Event Timeline

• bmansurov created this task. Dec 14 2018, 2:54 PM

• bmansurov updated the task description.

• bmansurov added a project: Article-Recommendation. Dec 14 2018, 3:42 PM

• bmansurov triaged this task as High priority. Jan 16 2019, 5:43 PM

• bmansurov moved this task from Backlog to In Progress on the Research board.

• bmansurov updated the task description. Jan 18 2019, 9:09 PM

Change 485257 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[research/article-recommender@master] WIP: Get scripts ready for use by Oozie

https://gerrit.wikimedia.org/r/485257

gerritbot added a project: Patch-For-Review. Jan 18 2019, 9:09 PM

• bmansurov claimed this task. Jan 18 2019, 9:09 PM

• bmansurov mentioned this in rRARCdfaf492c0ca4: WIP: Get scripts ready for use by Oozie. Jan 18 2019, 9:41 PM

• bmansurov mentioned this in T213566: Transferring data from Hadoop to production MySQL database. Jan 22 2019, 5:04 PM

@Ottomata @JAllemandou say I have a Python script that I want to run with Oozie. What's the standard way of logging application data? I want to throw things like logger.debug('abc') and be able to read logs easily.

• bmansurov mentioned this in rRARCf48cfdfe26ec: WIP: Get scripts ready for use by Oozie. Jan 23 2019, 10:27 PM

• bmansurov mentioned this in rRARCb3d48417bb75: WIP: Get scripts ready for use by Oozie. Feb 12 2019, 11:41 PM

@bmansurov logging on a distributed platform happens on as many nodes as the code gets executed on, so the logs are not available until the job has finished. You can probably get the logging context via the PySpark Spark context.
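
As a rough illustration of the logger.debug('abc') usage asked about above, here is a minimal sketch using only Python's standard logging module. It assumes that whatever the script writes to stderr ends up in the job's container logs, which is typical on YARN but not confirmed in this thread; the logger name is made up for the example.

```python
import logging
import sys


def get_logger(name="article_recommender", level=logging.DEBUG):
    """Return a logger that writes to stderr; the logger name is hypothetical."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Attach the handler only once, so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = get_logger()
logger.debug("abc")
```

Because each node runs its own copy of the code, each node would produce its own stream of these messages, which matches the point that the logs are easiest to read once the job has finished and they have been collected.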

Ya logging will work like normal, except it will be run distributed on many nodes at once as Nuria says. You can sort of sometimes get logs while the job is running, but it is much easier to wait until it is finished.

https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Logs

Specifically you want:

yarn logs -applicationId <app_id>

Where <app_id> is something like application_1547732626747_92892
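
For scripted use, the command above can be assembled from Python. This is only a sketch: the helper name yarn_logs_command is hypothetical, and the ID pattern is an assumption based on the example ID given above rather than on a formal specification.

```python
import re

# Assumed shape of an application ID, inferred from the example in this thread.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def yarn_logs_command(app_id):
    """Build the argument list for `yarn logs -applicationId <app_id>`."""
    if not APP_ID_PATTERN.fullmatch(app_id):
        raise ValueError("unexpected application ID: %r" % app_id)
    return ["yarn", "logs", "-applicationId", app_id]


command = yarn_logs_command("application_1547732626747_92892")
print(" ".join(command))
```

On a host where the yarn CLI is installed, the returned list could be passed to subprocess.run(command, check=True) to fetch the logs.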

• bmansurov mentioned this in rRARCa5cdc43e1cc1: Get scripts ready for use by Oozie. Feb 13 2019, 6:24 PM

Change 485257 merged by Bmansurov:
[research/article-recommender@master] Get scripts ready for use by Oozie

https://gerrit.wikimedia.org/r/485257

• bmansurov removed a project: Patch-For-Review. Feb 13 2019, 6:27 PM

• bmansurov updated the task description.

Change 492703 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[research/article-recommender@master] Add packaging instructions

https://gerrit.wikimedia.org/r/492703

gerritbot added a project: Patch-For-Review. Feb 25 2019, 3:53 PM

Change 492703 merged by Bmansurov:
[research/article-recommender@master] Add packaging instructions

https://gerrit.wikimedia.org/r/492703

• bmansurov mentioned this in rRARCe06cf0abcbe8: Add packaging instructions. Feb 25 2019, 4:33 PM

• bmansurov mentioned this in rRARC99e116fe928c: Add packaging instructions.

Change 494545 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[research/article-recommender/deploy@master] Add scripts for preparing requirements

https://gerrit.wikimedia.org/r/494545

Change 494545 merged by Bmansurov:
[research/article-recommender/deploy@master] Add scripts for preparing requirements

https://gerrit.wikimedia.org/r/494545

• bmansurov mentioned this in rRARDa76fe0f14fa9: Add scripts for preparing requirements. Mar 5 2019, 8:16 PM

• bmansurov mentioned this in rRARDfbb5f2b74d2b: Add scripts for preparing requirements.

@Nuria having thought about your comment, I think I misunderstood you the first time. I think you mean I should explore whether the way Discovery is doing this is applicable to the research's use case. Please let me know if I got it wrong this time too.

@Ottomata I'm trying to follow a similar setup to what Discovery is doing. Similar to Discovery, I need to ship some python dependency to HDFS so that I can run an Oozie job. That's why I didn't want to re-invent the wheel, but reuse what already is working. Do you think this way of doing things has any issues? If yes, what improvements would you suggest? Thanks!

Let's see, in your case, when you were building the code on the stats machines, did you do it on the machine itself, or were you spinning up a virtualenv or similar to install your custom Python deps?

Discovery ships (or rather "downloads") dependencies as part of the job because they do not use the dependencies deployed on the nodes as they are; they need a specific set of versions. Oozie already has access to the dependencies deployed on the Hadoop nodes, which should be the same ones we have on the stats machines.

I was relying on a local virtual environment on stat1007. I have refactored the code, created a package, and uploaded it to PyPI so that I can make it a dependency of the Oozie script. This way I can have a simple entry point for Oozie that depends on this external package. If we want to add more recommendation types, or improve the article recommendation code, then all I have to do is update the package and point Oozie to the new version.

In fact, Discovery's only requirement is the 'kafka' package. You could argue that this package should live on HDFS by default and there's no need for Discovery to create their own packaging mechanism. In my case, unlike kafka, the package has just been created and is not used by anyone else. This is a perfect case for uploading my package to Archiva, rather than requesting that it be installed in all of HDFS or creating a Debian package out of it.

Another advantage of going this way is that I can be flexible and not be slowed down or blocked by other teams (because I have to ask them to install the new version of the package in HDFS, etc.) when iterating and delivering a working solution is important. Similar to Discovery, I also need access to a specific version of the package, that won't be readily available in Hadoop given that I'll be adding new features to the package regularly.

I think the way Baho is doing this is fine. It isn't that different from how we package up dependencies and artifacts for Java in e.g. refinery, for Python in e.g. superset, or for Node in e.g. EventGate and change-prop. In the superset and change-prop cases, we use a separate 'deploy' git repository for the dependency artifacts. In the refinery case, we use Archiva + git-fat to avoid keeping the dependencies in git. This is also similar to how the ORES folks are packaging, except they use git-lfs somehow.

The fact that Baho is running Hadoop probably makes using Archiva even easier for his use case, because he can depend directly on an artifact in a maven repository.

This general idea sounds like a pretty solid one. I doubt that using Archiva is going to be our final solution for T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production, but something that can be used the way Baho is using it probably will be. For now I think this is an ok workaround.

The main downside is that there's no good automated way to push the model artifacts from Hadoop to Archiva. Baho is going to have to download them and then upload them manually, just like Discovery does. But I think this will be fine for now.

> In fact, Discovery's only requirement is the 'kafka' package. You could argue that this package should live on HDFS by default and there's no need for discovery to create their package mechanism. In my case, unlike kafka, the package has just been created and not used by anyone else. This is a perfect case for uploading my package to Archiva, rather than requesting this to be installed in all of HDFS or creating a Debian package out of it.

Indeed, sounds good.

@Ottomata I need a little help. So here's the situation. The code to generate python whl files is in research/article-recommender/deploy. I can generate those files and upload them to Archiva no problem; see this for example.

For my Oozie job I need to submit a zip file that is a virtual environment for the job, that will have the above whl file installed. Similar to Discovery, I was hoping to create a checks file that generates this zip file and puts it in the artifacts folder of analytics/refinery. However, the dependency whl file lives in the artifacts folder of the research/article-recommender/deploy repo, and not in the analytics/refinery repo.

So my question is: how do I get those whl files into the artifacts folder of analytics/refinery? Is there a way to do it automatically? Should I go back to the previous approach where I was generating the whl file as part of analytics/refinery? I could then run scap deploy on a deployment server, which would generate the whl file and create the virtual environment zip file. What do you think? Thanks!
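
To make the zip step concrete, here is a minimal sketch of packing a virtual environment directory into a zip file inside an artifacts folder, using Python's standard shutil.make_archive. The function name, the directory names and the "venv" archive name are all hypothetical; installing the whl file into the environment is not shown, and a temporary directory stands in for a real virtual environment.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_virtualenv(venv_dir, artifacts_dir, name="venv"):
    """Zip the contents of venv_dir into artifacts_dir/<name>.zip and return its path."""
    artifacts_dir = Path(artifacts_dir)
    artifacts_dir.mkdir(parents=True, exist_ok=True)
    # make_archive appends ".zip" to the base name and returns the archive path.
    archive = shutil.make_archive(
        str(artifacts_dir / name), "zip", root_dir=str(venv_dir)
    )
    return Path(archive)


# Demonstration with a stand-in directory instead of a real virtual environment.
with tempfile.TemporaryDirectory() as tmp:
    fake_venv = Path(tmp) / "venv"
    (fake_venv / "lib").mkdir(parents=True)
    (fake_venv / "lib" / "placeholder.txt").write_text("stand-in for installed packages\n")
    archive = zip_virtualenv(fake_venv, Path(tmp) / "artifacts")
    archive_name = archive.name
    with zipfile.ZipFile(archive) as zf:
        archived_files = [n for n in zf.namelist() if not n.endswith("/")]
    print(archive_name, archived_files)
```

Keeping the artifacts folder outside the directory being zipped avoids the archive ending up inside itself.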

Hm, why do you need the artifact in analytics/refinery? Can you just use scap+git fat to deploy research/article-recommender/deploy to e.g. stat1007 and put the zip file wherever it needs to go (HDFS?)?

Or, if you want to have your artifact deployed in analytics/refinery/artifacts, you can manually git add it there (as long as your local analytics/refinery checkout has git fat initialized.)

In T211981#5017363, @Ottomata wrote:

Hm, why do you need the artifact in analytics/refinery? Can you just use scap+git fat to deploy research/article-recommender/deploy to e.g. stat1007 and put the zip file wherever it needs to go (HDFS?)?

Because when I run my Oozie job, I want to easily find the zip file relative to the job file location. I could deploy research/article-recommender/deploy to stat1007, but that part is already taken care of in analytics/refinery and I want to re-use what's available already without creating a separate similar setup.

In T211981#5017366, @Ottomata wrote:

Or, if you want to have your artifact deployed in analytics/refinery/artifacts, you can manually git add it there (as long as your local analytics/refinery checkout has git fat initialized.)

I see, I guess I could do this. Thanks!

jijiki mentioned this in rRARC305fe132226f: Get scripts ready for use by Oozie. Mar 18 2019, 4:41 PM

Change 498498 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[research/article-recommender@master] WIP: Parametrize the script

https://gerrit.wikimedia.org/r/498498

Change 498499 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[research/article-recommender/deploy@master] Virtual environment: use system packages by default

https://gerrit.wikimedia.org/r/498499

Change 498499 merged by Bmansurov:
[research/article-recommender/deploy@master] Virtual environment: use system packages by default

https://gerrit.wikimedia.org/r/498499

• bmansurov mentioned this in rRARC0c60bf1b3b08: WIP: Parametrize the script. Mar 22 2019, 9:28 PM

• bmansurov mentioned this in rRARD37d9f93f39ac: Virtual environment: use system packages by default. Mar 22 2019, 9:57 PM

Change 498498 merged by Bmansurov:
[research/article-recommender@master] Parametrize the script

https://gerrit.wikimedia.org/r/498498

• bmansurov mentioned this in rRARC71e4097c2c1c: Parametrize the script. Mar 25 2019, 3:17 PM

• bmansurov closed this task as Resolved. Mar 29 2019, 5:09 PM

• bmansurov updated the task description.

• bmansurov moved this task from In Progress to Done (current quarter) on the Research board.