
Generate article recommendations in Hadoop for use in production
Closed, Resolved · Public

Description

We want to create an Oozie job that generates normalized article scores in Hadoop.

A/C

  • Take the PySpark scripts that work on stat1007 and turn them into an Oozie job.
  • Set up the job to generate new recommendations quarterly, or later when new Wikidata dumps are available. Waiting on T209655: Copy Wikidata dumps to HDFS + parquet.

Event Timeline

@bmansurov Let's take a step back and document how the data will be created and loaded into MySQL; deploying a service in production is quite a rigorous process. Having an overall description of the data workflow will help you coordinate with the different teams. The work to be done here will include Hadoop work and Puppet work, but also coordination with SRE teams. A 2-4 page design document will probably suffice. This might already exist and we might just not have seen it.

Examples of questions the design document needs to answer:

  • How is the data, once created, going to make it to your MySQL storage? You will need to coordinate with SREs/DBAs here: MySQL storage is in production, and Hadoop is in the analytics VLAN. This needs to be solved before you start working on data generation, so I would tackle this problem first (you might already have).
  • How is the data produced? (It sounds like this is PySpark right now; I think if we want to run this on the cluster it probably needs to be migrated to Scala Spark, but I'm not sure.)
  • How is the data updated? Is it overwritten every time recommendations are recreated?
  • How is the data retrieved by the final client? Is it an API built with RESTBase?

How is the data produced? (It sounds like this is PySpark right now; I think if we want to run this on the cluster it probably needs to be migrated to Scala Spark, but I'm not sure.)

PySpark runs on YARN just as Scala Spark does.

@Nuria hopefully the new task description makes it clear what we're trying to achieve. Please let me know if anything else needs clarification. Thanks!

Thanks for the description. It helps. I think a design doc for this will help you clarify what you expect from each team. It does not have to be very detailed, but it should communicate how this service is going to function. For example, a question that comes to mind:

With every new version of recommendations data, we'll create a separate set of MySQL tables and keep the old versions while we transition the API to use the new version of the data

Was this the recommendation of the DBAs? It seems a bit odd to create new tables to host the data rather than add data to existing tables and have a process that removes data no longer used a couple of weeks later. It seems you are going to have to write a bunch of code to manage data ingestion on the MySQL end. Keep in mind that the output of the Hadoop cluster is likely going to be TSV files that "somehow" will appear on the MySQL hosts. You can take it from there, but automating new table creation from those data files, plus API updates to pull from the new tables (which will require a config push to the API endpoint), seems less efficient than loading new data into existing tables and letting SQL take care of surfacing the newest data. Using the same tables gives you seamless deploys and much less of a maintenance burden.
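
As a rough sketch of that approach (the table, column, and path names below are made up for illustration, not the actual pipeline), the Hadoop side could tag each export with a snapshot and keep writing TSVs destined for the same MySQL table:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("article-recommender-export").getOrCreate()

SNAPSHOT = "2019-04-01"  # e.g. the date of the Wikidata dump this run is based on

# Tag the computed scores with the snapshot they belong to.
recommendations = (
    spark.read.parquet("/tmp/article_recommender/scores")  # placeholder input path
    .withColumn("snapshot", F.lit(SNAPSHOT))
)

# Write TSV files for the MySQL import into the one existing table.
(
    recommendations.write
    .mode("overwrite")
    .option("sep", "\t")
    .csv("/tmp/article_recommender/tsv_export")  # placeholder export path
)

# The API would then always read the newest snapshot from the same table, e.g.
#   SELECT ... FROM article_recommendations
#   WHERE snapshot = (SELECT MAX(snapshot) FROM article_recommendations);
# while a cleanup job drops snapshots older than a couple of weeks.
```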

@Nuria can you share any Oozie job patches in Gerrit where I can see how PySpark jobs are being run? The documentation on Wikitech is sparse. Thanks.

@Nuria thanks. I've gone through the Oozie documentation, but it doesn't mention the workflow needed for setting up an Oozie job in production: things like which repository the job needs to go in, how repositories are deployed, etc. Is there a canonical document for that?

@bmansurov There isn't a canonical repository for Oozie job definitions; they can live anywhere. It looks like Discovery has its own repository for them.

I think if I were you I'd keep them in the analytics/refinery repository with the rest of our Oozie jobs for now. That way you can use our sub workflows, etc. more easily.

Change 493762 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[analytics/refinery@master] WIP: Oozie: Add article recomender job

https://gerrit.wikimedia.org/r/493762

Change 496885 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[analytics/refinery@master] WIP: Add workflow for article-recommender

https://gerrit.wikimedia.org/r/496885

Change 493762 abandoned by Bmansurov:
WIP: Oozie: Add article recomender job

Reason:
not needed for now

https://gerrit.wikimedia.org/r/493762

Change 496885 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[analytics/refinery@master] WIP: Oozie: add article recommender

https://gerrit.wikimedia.org/r/496885

Change 496885 merged by Ottomata:
[analytics/refinery@master] Oozie: add article recommender

https://gerrit.wikimedia.org/r/496885

Change 503393 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[analytics/refinery@master] Oozie: wait for new Wikidata dumps before generating article recommendations

https://gerrit.wikimedia.org/r/503393

Hi @bmansurov,
I've been monitoring the current run of the recommender (https://yarn.wikimedia.org/proxy/application_1553764233554_69057/), and I think the approach in terms of cluster usage is not sustainable. In particular, generating the pageview datasets for the top 50 languages should be done in a single pass and written to partitioned folders (using partitionBy). Currently there are 50 runs, each reading the whole needed pageview data multiple times; this represents reading 50 × 2 × 1 TB of data. Doing it in one pass will save a lot of IO as well as a lot of time :)
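
Something along these lines, as a minimal sketch only (the source table, columns, language list, and output path below are assumptions for illustration, not the actual job):

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("article-recommender-pageviews").getOrCreate()

TOP_LANGUAGES = ["en", "de", "fr"]  # placeholder for the actual top-50 list

# Read the pageview data once, filter to the needed languages, and aggregate.
pageviews = (
    spark.read.table("wmf.pageview_hourly")       # assumed source table
    .where(F.col("project").isin(TOP_LANGUAGES))  # one filter instead of 50 separate reads
    .groupBy("project", "page_title")
    .agg(F.sum("view_count").alias("pageviews"))
)

# Single write, with one output folder per language.
(
    pageviews.write
    .mode("overwrite")
    .partitionBy("project")
    .parquet("/tmp/article_recommender/pageviews")  # placeholder output path
)
```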

@JAllemandou, that makes sense. I'm currently measuring the time spent on each operation as part of T220520: Improve article-recommender script. The process is almost done; it's been running for two days, but I think it should be over by the end of the day.
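
For what it's worth, the per-operation timing can be as simple as a small helper like this (a hypothetical sketch; the step names are placeholders):

```
import time
from contextlib import contextmanager

@contextmanager
def timed(step_name):
    """Print how long the wrapped block of the recommender script took."""
    start = time.time()
    try:
        yield
    finally:
        print("{} took {:.1f}s".format(step_name, time.time() - start))

# Example usage inside the script:
# with timed("aggregate pageviews"):
#     pageviews.write.parquet(output_path)
```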

Edit: I see the job got killed. I hope it wasn't because I was using too many resources.

Actually, it won't finish - we just killed it as we need to restart the cluster (planned maintenance - see https://lists.wikimedia.org/pipermail/engineering/2019-April/000695.html). Sorry about that :( Hopefully you'll still have enough logs.

Change 506218 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[analytics/refinery@master] Oozie article recommender: use version 0.0.2

https://gerrit.wikimedia.org/r/506218

Change 503393 merged by Ottomata:
[analytics/refinery@master] Oozie: wait for new Wikidata dumps before generating article recommendations

https://gerrit.wikimedia.org/r/503393

Change 506218 merged by Ottomata:
[analytics/refinery@master] Oozie article recommender: use version 0.0.2

https://gerrit.wikimedia.org/r/506218

@bmansurov, I've merged your changes, but haven't deployed them. They should go out with the next deploy. The Oozie jobs will then need to be restarted.

@Ottomata et al: All related patches in Gerrit have been merged or abandoned. Is there more to do in this task, or can the task status be set to resolved?

@bmansurov: Could you please answer the last comment? Thanks in advance!

I'll call it resolved and if someone disagrees they can reopen it.