@Marostegui thanks, your last suggestion is captured in T211980: 'morelike' recommendation API: Bulk import data to MySQL in chunks.
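For reference, chunked bulk import along the lines of that task can be sketched as below. This is a minimal illustration, not the merged patch: the table name, column names, and chunk size are hypothetical, and the DB driver is assumed to be a DB-API-compatible one such as pymysql.

```python
from itertools import islice

def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def bulk_import(conn, rows, chunk_size=1000):
    """Insert rows into MySQL in chunks, committing after each batch so a
    failure only loses the current chunk. Table/column names are illustrative.
    """
    with conn.cursor() as cur:
        for batch in chunks(rows, chunk_size):
            cur.executemany(
                "INSERT INTO article_recommendation (source_id, target_id, score) "
                "VALUES (%s, %s, %s)",
                batch,
            )
            conn.commit()
```

Committing per chunk keeps transactions small, which matters for replication lag on large imports.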
Update: the patches have been merged yesterday. We just need to close the open conversation about other remaining items.
How do you handle deleting data in your storage when you have reached capacity, or when a dataset is bad? There must be a daemon that takes care of that work, right?
There are many ways: a systemd service, a cron job, (maybe) Oozie, or something else. Without knowing how these scripts will be executed, I cannot tell you which. It will all become apparent once we agree on an approach in this task.
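Whatever scheduler we pick, the cleanup logic itself is small. A hedged sketch, assuming datasets are kept as ordered version identifiers and we retain the newest few (the function name and retention policy are assumptions, not something decided in this task):

```python
def versions_to_delete(versions, keep=2):
    """Return the dataset versions a scheduled cleanup job should drop,
    keeping only the newest `keep`. `versions` sort ascending (oldest first).
    """
    ordered = sorted(versions)
    return ordered[:-keep] if keep else ordered

# Example crontab entry for a nightly run (illustrative path):
# 0 3 * * * /usr/bin/python3 /srv/recommendation/cleanup.py
```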
Rollback is already taken care of at the script level. We'll have different versions of the data in MySQL and can roll back any time we want. There's no need to involve Hadoop, mounted partitions, or Oozie. Please refer to the Pipeline documentation; if you have concerns about it I can work on improving it, but this concern has already been addressed there.
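To make the versioning idea concrete: one common pattern is to keep each version in its own table and repoint a view at the version you want live, so rollback is a single DDL statement. This is a sketch of that pattern under assumed names, not the actual Pipeline implementation:

```python
def rollback_sql(version):
    """Build the SQL that repoints the serving view at an older snapshot
    table, e.g. article_recommendation_v2. All names are illustrative.
    """
    return (
        "CREATE OR REPLACE VIEW article_recommendation AS "
        f"SELECT * FROM article_recommendation_{version}"
    )
```

Because the old snapshot tables stay in place until cleanup, switching the view back is instantaneous and needs no re-import.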
@Ottomata thanks! I've updated the task description and pinged the groups you mentioned.
@DarTar The patches are up for review (I pinged @EBernhardson too). I may need your help to expedite this. Once they are reviewed, we probably need to deploy to the beta cluster to check that we don't have any regressions (this step can be skipped if we don't have time — I can check data locally, but it won't be as comprehensive). We are also discussing some issues (in the Google document and in comments here) from the previous round. Hopefully they will be resolved soon too.
@RyanSteinberg re: *citation_in_text_refs*,
@RyanSteinberg, regarding *freely_accessible*, I've submitted a patch to fix the issue. Apparently template styles have changed, so I had to adapt the code. As for identifying the total number of freely available resources, I'm not sure what the best approach is. One approach is to parse Wikipedia dumps and look for this information.
Wed, Jan 16
Adding Analytics to give them a heads-up that EventLogging will get busy when we deploy this.
Added a link to EventLogging. Didn't add a section because the content would become obsolete as EventLogging changes, unless we tried to duplicate the content on the onboarding page.
@Miriam any other issues we need to tackle before collecting more data?
@Nuria, Oozie task is happening in parallel here: T210844: Generate article recommendations in Hadoop for use in production.
Fri, Jan 11
@hashar can you please review https://gerrit.wikimedia.org/r/c/integration/config/+/483225
Thu, Jan 10
\o/, thanks @phuedx!
@Jdlrobson thanks for spotting this. I wonder what a proper fix would be. There's got to be a way of removing that config. Hopefully the fix reduces the errors.