Improve article-recommender scripts
Closed, Resolved (Public)

Description

The current implementation of ranking missing Wikipedia articles (article-recommender) is exploratory in nature. It's a set of scripts that are run one after another. Improve the codebase so that the repository is ready for production use.

A/C

  • parameterize the scripts
  • make sure the scripts can be run by a system user and that data is saved into Hive instead of Parquet/CSV
  • make the scripts ready to be run by Oozie
  • ...
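
As a rough illustration of the "parameterize" criterion, a script entry point could take its inputs from the command line instead of hard-coded values. This is only a sketch: the option names (--source-language, --target-language, --output-table) and the default table name are hypothetical and are not taken from the article-recommender repository.

```python
import argparse


def build_parser():
    """Build a command-line parser. All option names are hypothetical."""
    parser = argparse.ArgumentParser(
        description="Rank missing Wikipedia articles (illustrative sketch)."
    )
    parser.add_argument(
        "--source-language", required=True,
        help="Language code of the wiki to recommend articles from.",
    )
    parser.add_argument(
        "--target-language", required=True,
        help="Language code of the wiki that is missing the articles.",
    )
    parser.add_argument(
        "--output-table", default="article_recommender.scores",
        help="Hive table to write results to (hypothetical default).",
    )
    return parser


def parse_args(argv=None):
    """Parse argv (defaults to sys.argv[1:] when argv is None)."""
    return build_parser().parse_args(argv)


# Example invocation with an explicit argument list.
args = parse_args(["--source-language", "en", "--target-language", "uz"])
print(args.source_language, args.target_language, args.output_table)
```

Passing the values in this way would let a scheduler such as Oozie supply them per run, without editing the script.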

Details

Related Gerrit Patches:
research/article-recommender (master): Parametrize the script
research/article-recommender/deploy (master): Virtual environment: use system packages by default
research/article-recommender/deploy (master): Add scripts for preparing requirements
research/article-recommender (master): Add packaging instructions
research/article-recommender (master): Get scripts ready for use by Oozie

Event Timeline

bmansurov updated the task description.
bmansurov triaged this task as High priority. Jan 16 2019, 5:43 PM
bmansurov moved this task from Staged to In Progress on the Research board.
bmansurov updated the task description. Jan 18 2019, 9:09 PM

Change 485257 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[research/article-recommender@master] WIP: Get scripts ready for use by Oozie

https://gerrit.wikimedia.org/r/485257

@Ottomata @JAllemandou Say I have a Python script that I want to run with Oozie. What's the standard way of logging application data? I want to write things like logger.debug('abc') and be able to read the logs easily.

Nuria added a subscriber: Nuria. Feb 13 2019, 12:28 AM

@bmansurov Logging on a distributed platform writes logs on as many nodes as the code gets executed on, so the logs are not available until the job has finished. You can probably get the logging context via the PySpark Spark context.

Ya logging will work like normal, except it will be run distributed on many nodes at once as Nuria says. You can sort of sometimes get logs while the job is running, but it is much easier to wait until it is finished.

https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Logs

Specifically you want:

yarn logs -applicationId <app_id>

Where <app_id> is something like application_1547732626747_92892
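
A minimal sketch of what this could look like from the script's side, assuming the cluster collects each container's stderr into the logs that yarn logs later returns. The logger name "article_recommender" and both helper functions are made up for illustration. The example writes to an in-memory buffer so that it can be run anywhere; in a real job the default stream (stderr) would be used.

```python
import io
import logging
import re
import sys

# YARN application IDs look like application_<cluster timestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"^application_\d+_\d+$")


def configure_logging(level=logging.DEBUG, stream=sys.stderr):
    """Return a logger that writes formatted records to the given stream."""
    logger = logging.getLogger("article_recommender")  # hypothetical name
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.handlers = [handler]  # replace handlers to avoid duplicate lines
    logger.propagate = False
    return logger


def yarn_logs_command(app_id):
    """Build the argument list for `yarn logs -applicationId <app_id>`."""
    if not APP_ID_PATTERN.match(app_id):
        raise ValueError("not a YARN application ID: %r" % (app_id,))
    return ["yarn", "logs", "-applicationId", app_id]


# Demonstration: log into a buffer instead of stderr.
buffer = io.StringIO()
logger = configure_logging(stream=buffer)
logger.debug("abc")
command = yarn_logs_command("application_1547732626747_92892")
print(" ".join(command))
```

The returned list could be passed to subprocess.run on a machine where the yarn CLI is available, once the job has finished.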

Change 485257 merged by Bmansurov:
[research/article-recommender@master] Get scripts ready for use by Oozie

https://gerrit.wikimedia.org/r/485257

bmansurov updated the task description.

Change 492703 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[research/article-recommender@master] Add packaging instructions

https://gerrit.wikimedia.org/r/492703

Change 492703 merged by Bmansurov:
[research/article-recommender@master] Add packaging instructions

https://gerrit.wikimedia.org/r/492703

Change 494545 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[research/article-recommender/deploy@master] Add scripts for preparing requirements

https://gerrit.wikimedia.org/r/494545

Change 494545 merged by Bmansurov:
[research/article-recommender/deploy@master] Add scripts for preparing requirements

https://gerrit.wikimedia.org/r/494545

bmansurov added a comment. Edited Mar 6 2019, 6:27 PM

@Nuria having thought about your comment, I think I misunderstood you the first time. I think you mean I should explore whether the way Discovery is doing this is applicable to the research's use case. Please let me know if I got it wrong this time too.

@Ottomata I'm trying to follow a similar setup to what Discovery is doing. Similar to Discovery, I need to ship some Python dependencies to HDFS so that I can run an Oozie job. That's why I didn't want to re-invent the wheel, but reuse what is already working. Do you think this way of doing things has any issues? If yes, what improvements would you suggest? Thanks!

Nuria added a comment. Mar 7 2019, 12:04 AM

Let's see, in your case, when you were building the code on the stats machines, did you do it on the machine itself, or were you spinning up a virtualenv or similar to install your custom Python deps?

Discovery ships (or rather "downloads") dependencies as part of the job because they do not use the dependencies deployed on the nodes as they are; they need a specific set of versions. Oozie already has access to the dependencies that are deployed on the Hadoop nodes, which should be the same ones we have on the stats machines.

I was relying on a local virtual environment on stat1007. I have refactored the code and created a package and uploaded it to PyPI so that I can make it a dependency of the Oozie script. This way I can have a simple entry point for Oozie that depends on this external package. If we want to add more recommendation types, or improve the article recommendation code, then all I have to do is update the package and point Oozie to use the new version.

In fact, Discovery's only requirement is the 'kafka' package. You could argue that this package should live on HDFS by default and there's no need for Discovery to create their own packaging mechanism. In my case, unlike kafka, the package has just been created and not used by anyone else. This is a perfect case for uploading my package to Archiva, rather than requesting this to be installed in all of HDFS or creating a Debian package out of it.

Another advantage of going this way is that I can be flexible and not be slowed down or blocked by other teams (because I have to ask them to install the new version of the package in HDFS, etc.) when iterating and delivering a working solution is important. Similar to Discovery, I also need access to a specific version of the package, that won't be readily available in Hadoop given that I'll be adding new features to the package regularly.

I think the way Baho is doing this is fine. It isn't that different from how we package up dependencies and artifacts for Java in e.g. refinery, or for Python in e.g. superset, or for Node in e.g. EventGate and change-prop. In the superset and change-prop cases, we use a separate 'deploy' git repository for the dependency artifacts. In the refinery case, we use Archiva + git-fat to avoid keeping the dependencies in git. This is also similar to how the ORES folks are packaging, except they use git-lfs somehow.

The fact that Baho is running Hadoop probably makes using Archiva even easier for his use case, because he can depend directly on an artifact in a maven repository.

This general idea sounds like a pretty solid one. I doubt that using Archiva is going to be our final solution for T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production, but something that can be used like Baho is using it probably will be. For now I think this is an ok workaround.

The main downside is that there's no good automated way to push the model artifacts from Hadoop to Archiva; Baho is going to have to download them and then upload them manually, just like Discovery does. But I think this will be fine for now.

Nuria added a comment. Mar 7 2019, 7:02 PM

> In fact, Discovery's only requirement is the 'kafka' package. You could argue that this package should live on HDFS by default and there's no need for Discovery to create their own packaging mechanism. In my case, unlike kafka, the package has just been created and not used by anyone else. This is a perfect case for uploading my package to Archiva, rather than requesting this to be installed in all of HDFS or creating a Debian package out of it.

Indeed, sounds good.

@Ottomata I need a little help. So here's the situation. The code to generate python whl files is in research/article-recommender/deploy. I can generate those files and upload them to Archiva no problem; see this for example.

For my Oozie job I need to submit a zip file that is a virtual environment for the job, that will have the above whl file installed. Similar to Discovery, I was hoping to create a checks file that generates this zip file and puts it in the artifacts folder of analytics/refinery. However, the dependency whl file lives in the artifacts folder of the research/article-recommender/deploy repo, and not in the analytics/refinery repo.

So my question is how do I get those whl files into the artifacts folder of analytics/refinery? Is there a way to do it automatically? Should I go back to the previous approach where I was generating the whl file as part of analytics/refinery? I could then run scap deploy on a deployment server, which will generate the whl file and create the virtual environment zip file. What do you think? Thanks!
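
For reference, the zipping step described above could be sketched roughly as follows. This is not the actual build script from either repository: the function name, the directory layout, and the placeholder file are invented for illustration, and a real virtual environment may need extra care to work after being unpacked in a different location.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_environment(env_dir, output_base):
    """Zip the contents of env_dir into '<output_base>.zip' and return its path."""
    env_dir = Path(env_dir)
    if not env_dir.is_dir():
        raise FileNotFoundError("no such directory: %s" % env_dir)
    output_base = Path(output_base)
    output_base.parent.mkdir(parents=True, exist_ok=True)
    # root_dir makes archive paths relative to the environment directory.
    archive = shutil.make_archive(str(output_base), "zip", root_dir=str(env_dir))
    return Path(archive)


# Demonstration with a stand-in directory instead of a real virtual environment.
with tempfile.TemporaryDirectory() as workdir:
    env = Path(workdir) / "venv"
    (env / "lib").mkdir(parents=True)
    (env / "lib" / "placeholder.txt").write_text("stand-in for installed packages\n")

    # Write the archive outside env so that it does not include itself.
    archive_path = zip_environment(env, Path(workdir) / "artifacts" / "venv")
    archive_name = archive_path.name
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()

print(archive_name, names)
```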

Hm, why do you need the artifact in analytics/refinery? Can you just use scap+git fat to deploy research/article-recommender/deploy to e.g. stat1007 and put the zip file wherever it needs to go (HDFS?).

Or, if you want to have your artifact deployed in analytics/refinery/artifacts, you can manually git add it there (as long as your local analytics/refinery checkout has git fat initialized.)

> Hm, why do you need the artifact in analytics/refinery? Can you just use scap+git fat to deploy research/article-recommender/deploy to e.g. stat1007 and put the zip file wherever it needs to go (HDFS?).

Because when I run my Oozie job, I want to easily find the zip file relative to the job file location. I could deploy research/article-recommender/deploy to stat1007, but that part is already taken care of in analytics/refinery and I want to re-use what's available already without creating a separate similar setup.

> Or, if you want to have your artifact deployed in analytics/refinery/artifacts, you can manually git add it there (as long as your local analytics/refinery checkout has git fat initialized.)

I see, I guess I could do this. Thanks!

Change 498498 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[research/article-recommender@master] WIP: Parametrize the script

https://gerrit.wikimedia.org/r/498498

Change 498499 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[research/article-recommender/deploy@master] Virtual environment: use system packages by default

https://gerrit.wikimedia.org/r/498499

Change 498499 merged by Bmansurov:
[research/article-recommender/deploy@master] Virtual environment: use system packages by default

https://gerrit.wikimedia.org/r/498499

Change 498498 merged by Bmansurov:
[research/article-recommender@master] Parametrize the script

https://gerrit.wikimedia.org/r/498498

bmansurov closed this task as Resolved. Mar 29 2019, 5:09 PM
bmansurov updated the task description.
bmansurov moved this task from In Progress to Done (current quarter) on the Research board.