10:07 AM this probably sounds evil, but is it possible to run PHP code in a hadoop cluster?
10:07 AM addshore: I'll fake not having seen that message :D
10:07 AM xD
10:07 AM hear me out! :P https://phabricator.wikimedia.org/T94019
10:08 AM I definitely hear you - this would make a hell of a lot of sense
10:08 AM I was generally thinking that if hadoop had up to date external JSON representations of Wikidata entities, 1) we could generate the JSON dumps from there and 2) we could use the same json -> rdf mapping code to make the rdf dumps
10:08 AM while keeping the mapping code in Wikibase in PHP
10:09 AM I'm not sure how we can get "up to date external JSON representations" - I assume this means getting the wikidata content, right?
10:09 AM yeah
10:09 AM I was thinking, some event post edit -> kafka -> hadoop? but not sure about that part either?
10:09 AM ok - we're not yet ready on that front, but we have a plan (though no resources to put it into practice)
10:10 AM okay
10:10 AM the plan would be: edit --> kafka --> streaming job --> WMF API (get content) --> store on HDFS
10:11 AM aaah okay, rather than having the content in the stream
10:11 AM And there is one issue: we have incomplete edits in kafka currently, and a task force issue (we don't have the bandwidth)
10:11 AM aah, this is the "reliable events" ticket, right?
10:11 AM correct sir
10:11 AM it always comes back to that one! :P
10:12 AM So https://phabricator.wikimedia.org/T120242 ?
10:12 AM The WDQS updater relies on the edits stream, so we could do it (they already do almost exactly what I describe), but IMO it's worth waiting for the solution to reliable events (or more reliable, should I say)
10:12 AM okay *goes to write a comment*
10:13 AM this is the ticket, yes - I think we could even skip what the ticket advertises as possible solutions, as long as the rate of missing events in the streams is low enough (currently at ~1%, far too high)
10:15 AM And then about running PHP on hadoop - I have not yet seen it done, but I don't see why it wouldn't be feasible - the java (or python) hadoop worker starts an external PHP process, feeds it the content, gets the result back, then finalizes its process
10:15 AM not great but doable
10:15 AM It'll also require having the PHP libs etc. on the hadoop workers (not the case now)
10:15 AM but also with the text diagram you mentioned above, if the content comes from the MW api anyway, then the conversion happens in the app, rather than in hadoop land
10:16 AM which #worksforme
10:16 AM hm, which app?
10:16 AM well, mediawiki / wikibase
10:17 AM Ah - we can ask the wikibase API to give us json, is that right?
10:17 AM Maybe even RDF?
10:17 AM json, rdf, ttl, jsonld, etc
10:17 AM right
10:18 AM So a dedicated wikibase job extracting info in streaming could do it
10:18 AM I think that's what the WDQS updater does
10:18 AM indeed, we could even provide multiple formats via a single call for optimization's sake
10:18 AM yes, the streaming updater goes to wikibase and gets the ttl afaik
10:18 AM yup, call once, save multiple
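
A minimal sketch of the pipeline discussed above (edit --> kafka --> streaming job --> WMF API (get content) --> store on HDFS). The topic name, broker list, event field names, HDFS endpoint and output path are all assumptions for illustration, not the actual production setup; Special:EntityData is used here as the API that returns an entity's external JSON.

```python
# Hypothetical sketch: consume edit events from Kafka, fetch the entity JSON
# from the WMF API, and store it on HDFS. Topic, schema and paths are assumed.
import json

import requests
from kafka import KafkaConsumer  # pip install kafka-python
from hdfs import InsecureClient  # pip install hdfs

KAFKA_BROKERS = ["kafka1001:9092"]                 # assumed broker list
TOPIC = "eqiad.mediawiki.revision-create"          # assumed topic name
ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
HDFS_URL = "http://namenode:9870"                  # assumed WebHDFS endpoint

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=KAFKA_BROKERS,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
hdfs = InsecureClient(HDFS_URL, user="analytics")

for message in consumer:
    event = message.value
    # Field names (database, page_title, rev_id) are assumptions about the
    # event schema; only Wikidata item edits are interesting here.
    if event.get("database") != "wikidatawiki":
        continue
    qid = event["page_title"]
    rev_id = event["rev_id"]

    # Fetch the current external JSON representation of the entity.
    resp = requests.get(ENTITY_DATA.format(qid=qid), timeout=30)
    resp.raise_for_status()

    # Store one JSON file per (entity, revision); a real job would batch
    # these into larger files or a table format instead.
    path = f"/wmf/data/wikidata/entities/{qid}/{rev_id}.json"
    hdfs.write(path, data=resp.text, encoding="utf-8", overwrite=True)
```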
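
For the "hadoop worker starts an external PHP process" idea, a sketch along these lines would work; the PHP script name (json_to_rdf.php) and its stdin/stdout contract are hypothetical stand-ins for the Wikibase json -> rdf mapping code, which would have to be present on the worker along with PHP itself.

```python
# Sketch: hand one entity's JSON to an external PHP process and read back RDF.
import subprocess


def entity_json_to_rdf(entity_json: str) -> str:
    """Feed entity JSON to a PHP process on stdin and return its stdout."""
    result = subprocess.run(
        ["php", "json_to_rdf.php"],  # needs PHP + the Wikibase libs on the worker
        input=entity_json,
        capture_output=True,
        text=True,
        check=True,                  # raise if the PHP process exits non-zero
    )
    return result.stdout


# In a Hadoop streaming or PySpark job this function would be applied to each
# record; here it is called once for illustration.
if __name__ == "__main__":
    with open("Q42.json", encoding="utf-8") as f:
        print(entity_json_to_rdf(f.read()))
```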
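
On asking Wikibase for several serializations (json, rdf, ttl, jsonld): the sketch below simply requests each format separately via Special:EntityData. The "call once, save multiple" optimization mentioned above is not something this endpoint offers out of the box; it would be the dedicated job's own addition.

```python
# Sketch: fetch one entity in several serializations from Special:EntityData.
import requests

FORMATS = ["json", "ttl", "rdf", "jsonld"]
ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.{fmt}"


def fetch_entity(qid: str) -> dict[str, str]:
    """Return a mapping of format name -> serialized entity for one item."""
    out = {}
    for fmt in FORMATS:
        resp = requests.get(ENTITY_DATA.format(qid=qid, fmt=fmt), timeout=30)
        resp.raise_for_status()
        out[fmt] = resp.text
    return out


if __name__ == "__main__":
    for fmt, body in fetch_entity("Q42").items():
        print(fmt, len(body), "bytes")
```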