
Generate article recommendations in Hadoop for use in production
Closed, Resolved · Public

Description

We want to create an Oozie job that generates normalized article scores in Hadoop.

A/C

  • Take the PySpark scripts that work on stat1007 and turn them into an Oozie job.
  • Set up the job to generate new recommendations quarterly, or later when new Wikidata dumps are available. Waiting on T209655: Copy Wikidata dumps to HDFS + parquet.

Event Timeline

@bmansurov Let's take a step back and document how the data will be created and loaded into MySQL; deploying a service in production is quite a rigorous process. Having an overall description of the data workflow will help you coordinate with the different teams. The work to be done here will include Hadoop work and Puppet work, but also coordination with SRE teams. A 2-4 page design document will probably suffice. This might already exist and we might just not have seen it.

Examples of questions the design document needs to answer:

  • How is the data, once created, going to make it to your MySQL storage? You will need to coordinate with SREs/DBAs here: MySQL storage is in production, and Hadoop is in the analytics VLAN. This needs to be solved before you start working on data generation, so I would tackle this problem first (you might already have).
  • How is the data produced? (It sounds like this is PySpark right now; I think if we want to run this on the cluster it probably needs to be migrated to Scala Spark, but I'm not sure.)
  • How is the data updated? Is it overwritten every time recommendations are recreated?
  • How is the data retrieved by the final client? Is it an API built with RESTBase?

How is the data produced? (It sounds like this is PySpark right now; I think if we want to run this on the cluster it probably needs to be migrated to Scala Spark, but I'm not sure.)

PySpark runs on YARN just as Scala Spark does.

@Nuria hopefully the new task description makes it clear what we're trying to achieve. Please let me know if anything else needs clarification. Thanks!

Thanks for the description. It helps. I think a design doc for this will help you clarify what you expect from each team. It does not have to be very detailed, but it should communicate how this service is going to function. For example, a question that comes to mind:

With every new version of recommendations data, we'll create a separate set of MySQL tables and keep the old versions while we transition the API to use the new version of the data

Was this the recommendation of the DBAs? It seems a bit odd to create new tables to host the data rather than add data to existing tables and have a process that removes data no longer used a couple of weeks later. It seems you are going to have to write a bunch of code to manage data ingestion on the MySQL end. Keep in mind that the output of the Hadoop cluster is likely going to be TSV files that "somehow" will appear on the MySQL hosts. You can take it from there, but automating new table creation from those data files, plus API updates to pull from the new tables (which will require a config push to the API endpoint), seems less efficient than loading new data into existing tables and letting SQL take care of surfacing the newest data. Using the same tables gives you seamless deploys and much less of a maintenance burden.
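
As a rough sketch of that approach (the table, column, and path names below are made up for illustration, not the actual pipeline), the Hadoop side could tag each export with a snapshot and keep writing TSVs destined for the same MySQL table:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("article-recommender-export").getOrCreate()

SNAPSHOT = "2019-04-01"  # e.g. the date of the Wikidata dump this run is based on

# Tag the computed scores with the snapshot they belong to.
recommendations = (
    spark.read.parquet("/tmp/article_recommender/scores")  # placeholder input path
    .withColumn("snapshot", F.lit(SNAPSHOT))
)

# Write TSV files for the MySQL import into the one existing table.
(
    recommendations.write
    .mode("overwrite")
    .option("sep", "\t")
    .csv("/tmp/article_recommender/tsv_export")  # placeholder export path
)

# The API would then always read the newest snapshot from the same table, e.g.
#   SELECT ... FROM article_recommendations
#   WHERE snapshot = (SELECT MAX(snapshot) FROM article_recommendations);
# while a cleanup job drops snapshots older than a couple of weeks.
```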

@Nuria can you share any Oozie job patches in Gerrit where I can see how PySpark jobs are being run? The documentation on Wikitech is sparse. Thanks.

@Nuria thanks. I've gone through the Oozie documentation, but it doesn't mention the workflow needed for setting up an Oozie job in production: things like which repository the job needs to go in, how repositories are deployed, etc. Is there a canonical document for that?

@bmansurov There isn't a canonical repository for Oozie job definitions; they can live anywhere. It looks like Discovery has its own repository for them.

I think if I were you I'd keep them in the analytics/refinery repository with the rest of our Oozie jobs for now. That way you can use our sub workflows, etc. more easily.

Change 493762 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[analytics/refinery@master] WIP: Oozie: Add article recomender job

https://gerrit.wikimedia.org/r/493762

Change 496885 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[analytics/refinery@master] WIP: Add workflow for article-recommender

https://gerrit.wikimedia.org/r/496885

Change 493762 abandoned by Bmansurov:
WIP: Oozie: Add article recomender job

Reason:
not needed for now

https://gerrit.wikimedia.org/r/493762

Change 496885 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[analytics/refinery@master] WIP: Oozie: add article recommender

https://gerrit.wikimedia.org/r/496885

Change 496885 merged by Ottomata:
[analytics/refinery@master] Oozie: add article recommender

https://gerrit.wikimedia.org/r/496885

Change 503393 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[analytics/refinery@master] Oozie: wait for new Wikidata dumps before generating article recommendations

https://gerrit.wikimedia.org/r/503393

Hi @bmansurov,
I've been monitoring the current run of the recommender (https://yarn.wikimedia.org/proxy/application_1553764233554_69057/), and I think the approach in terms of cluster usage is not sustainable. In particular, generating the pageview datasets for the top 50 languages should be done in a single pass and written to partitioned folders (using partitionBy). Currently there are 50 runs, each reading the whole needed pageview data multiple times; this represents reading 50 × 2 × 1 TB of data. Doing it in one pass will save a lot of IO as well as a lot of time :)
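
Something along these lines, as a minimal sketch only (the source table, columns, language list, and output path below are assumptions for illustration, not the actual job):

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("article-recommender-pageviews").getOrCreate()

TOP_LANGUAGES = ["en", "de", "fr"]  # placeholder for the actual top-50 list

# Read the pageview data once, filter to the needed languages, and aggregate.
pageviews = (
    spark.read.table("wmf.pageview_hourly")       # assumed source table
    .where(F.col("project").isin(TOP_LANGUAGES))  # one filter instead of 50 separate reads
    .groupBy("project", "page_title")
    .agg(F.sum("view_count").alias("pageviews"))
)

# Single write, with one output folder per language.
(
    pageviews.write
    .mode("overwrite")
    .partitionBy("project")
    .parquet("/tmp/article_recommender/pageviews")  # placeholder output path
)
```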

@JAllemandou, that makes sense. I'm currently measuring the time spent on each operation as part of T220520: Improve article-recommender script. The process is almost done; it's been running for two days, but I think it should be over by the end of the day.
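
For what it's worth, the per-operation timing can be as simple as a small helper like this (a hypothetical sketch; the step names are placeholders):

```
import time
from contextlib import contextmanager

@contextmanager
def timed(step_name):
    """Print how long the wrapped block of the recommender script took."""
    start = time.time()
    try:
        yield
    finally:
        print("{} took {:.1f}s".format(step_name, time.time() - start))

# Example usage inside the script:
# with timed("aggregate pageviews"):
#     pageviews.write.parquet(output_path)
```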

Edit: I see the job got killed. I hope it wasn't because I was using too many resources.

Actually, it won't finish - we just killed it as we need to restart the cluster (planned maintenance - see https://lists.wikimedia.org/pipermail/engineering/2019-April/000695.html). Sorry about that :( Hopefully you'll still have enough logs.

Change 506218 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[analytics/refinery@master] Oozie article recommender: use version 0.0.2

https://gerrit.wikimedia.org/r/506218

Change 503393 merged by Ottomata:
[analytics/refinery@master] Oozie: wait for new Wikidata dumps before generating article recommendations

https://gerrit.wikimedia.org/r/503393

Change 506218 merged by Ottomata:
[analytics/refinery@master] Oozie article recommender: use version 0.0.2

https://gerrit.wikimedia.org/r/506218

@bmansurov, I've merged your changes, but haven't deployed them. They should go out with the next deploy. The Oozie jobs will then need to be restarted.

@Ottomata et al: All related patches in Gerrit have been merged or abandoned. Is there more to do in this task, or can the task status be set to resolved?

@bmansurov: Could you please answer the last comment? Thanks in advance!

I'll call it resolved and if someone disagrees they can reopen it.