Paste P16834

(An Untitled Masterwork)

Authored by Addshore on Tue, Jul 20, 9:26 AM.
10:07 AM <addshore> this probably sounds evil, but is it possible to run PHP code in the hadoop cluster?
10:07 AM <joal> addshore: I'll fake not having seen that message :D
10:07 AM <addshore> xD
10:07 AM <addshore> hear me out! :P https://phabricator.wikimedia.org/T94019
10:08 AM <joal> I definitely hear you - this would make a hell of a lot of sense
10:08 AM <addshore> I was generally thinking that if hadoop had up to date external JSON representations of Wikidata entities, 1) we could generate JSON dumps from there and 2) we could use the same json -> rdf mapping code to make the rdf dumps
10:08 AM <addshore> while keeping the mapping code in Wikibase in PHP
10:09 AM <joal> I'm not sure about how we can get "up to date external JSON representations" - I assume this means getting the wikidata content, right?
10:09 AM <addshore> yeah
10:09 AM <addshore> I was thinking, some event post edit -> kafka -> hadoop? but not sure about that part either?
10:09 AM <joal> ok - we're not yet ready on that front, but we have a plan (but no resources to put it into practice)
10:10 AM <addshore> okay
10:10 AM <joal> the plan would be: edit--> kafka --> streaming job --> WMF-API(get content) --> store on HDFS
10:11 AM <addshore> aaah okay, rather than having the content in the stream
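The plan joal lays out above (edit --> kafka --> streaming job --> WMF-API --> HDFS) could be sketched roughly as follows. This is only an illustration: the event fields `entity_id` and `rev_id`, the path layout, and the `fetch_content` / `store` callables are all assumptions made up for the example, standing in for a real Kafka consumer, API client, and HDFS writer.

```python
def handle_edit_event(event, fetch_content, store):
    """Process one edit event: fetch the revision content, then store it."""
    # Hypothetical event fields; the real edit events may be shaped differently.
    entity_id = event["entity_id"]
    rev_id = event["rev_id"]
    # The event only signals that an edit happened; content is fetched separately.
    content = fetch_content(entity_id, rev_id)
    # Hypothetical path layout, chosen only for this example.
    path = f"{entity_id}/{rev_id}.json"
    store(path, content)
    return path


def run_stream(events, fetch_content, store):
    """Apply handle_edit_event to every event, in order."""
    return [handle_edit_event(e, fetch_content, store) for e in events]


# In-memory stand-ins for the API and for HDFS, so the sketch runs anywhere.
fake_hdfs = {}


def fake_fetch(entity_id, rev_id):
    return '{"id": "%s", "rev": %d}' % (entity_id, rev_id)


events = [{"entity_id": "Q42", "rev_id": 1}, {"entity_id": "Q42", "rev_id": 2}]
paths = run_stream(events, fake_fetch, fake_hdfs.__setitem__)
```

Keeping the content out of the stream means a missed event leads to a missing or stale stored revision, which is why the event reliability discussed next matters for this design.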
10:11 AM <joal> And there is one issue: we have incomplete edits in kafka currently, and a task force issue (we don't have the bandwidth)
10:11 AM <addshore> aah, this is the "reliable events" ticket right?
10:11 AM <joal> correct sir
10:11 AM <addshore> it always comes back to that one! :P
10:12 AM <addshore> So https://phabricator.wikimedia.org/T120242 ?
10:12 AM <joal> The WDQS-updater relies on the edits stream, so we could do it (they already do almost exactly what I describe), but IMO it's worth waiting for the solution to reliable events (or more reliable should I say)
10:12 AM <addshore> okay *goes to write a comment*
10:13 AM <joal> this is the ticket yes - I think we could even not do what the ticket advertises as possible solutions as long as the rate of missing events in streams is low enough (currently at ~1%, far too high)
10:15 AM <joal> And then about running PHP on hadoop - I have not yet seen it done, but I don't see why it wouldn't be feasible - The java (or python) hadoop worker starts an external PHP process, feeds it the content and gets the result back, then finalizes its process
10:15 AM <joal> not great but doable
10:15 AM <joal> It'll also require having the PHP libs etc on hadoop workers (not the case now)
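A minimal sketch of the "worker starts an external process, feeds it the content and gets the result back" idea, in Python. Everything here is an assumption for illustration: the function name is made up, and a tiny Python one-liner stands in for the converter so the example runs without PHP installed. On a real worker the command would point at whatever PHP script wraps the Wikibase mapping code.

```python
import subprocess
import sys


def convert_with_external_process(command, entity_json):
    """Feed one entity's JSON to an external process via stdin, return its stdout."""
    result = subprocess.run(
        command,
        input=entity_json,    # content goes in on stdin
        capture_output=True,  # result comes back on stdout
        text=True,
        check=True,           # raise CalledProcessError if the converter exits non-zero
    )
    return result.stdout


# Stand-in converter (hypothetical): upper-cases its input. A real setup would
# use something like ["php", "<path to a mapping script>"] instead.
STAND_IN_CONVERTER = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

converted = convert_with_external_process(STAND_IN_CONVERTER, '{"id": "q42"}')
```

Spawning one process per entity is the simplest form of this and likely the slowest; a long-lived converter process reading many entities from stdin would be the obvious refinement.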
10:15 AM <addshore> but also with the text diagram you mentioned above, if the content comes from the MW api anyway, then the conversion happens in the app, rather than in hadoop land
10:16 AM <addshore> with #worksforme
10:16 AM <addshore> *which
10:16 AM <joal> hm, which app?
10:16 AM <addshore> well, mediawiki / wikibase
10:17 AM <joal> Ah - We can ask the wikibase-API to give us json, is that right?
10:17 AM <joal> Maybe even RDF?
10:17 AM <addshore> json, rdf, ttl, jsonld, etc
10:17 AM <joal> right
10:18 AM <joal> So a dedicated wikibase job extracting info in streaming could do
10:18 AM <joal> I think that's what the WDQS-updater does
10:18 AM <addshore> indeed, we could even provide multiple formats via a single call for optimization's sake
10:18 AM <addshore> yes, the streaming updater goes to wikibase and gets the TLL afaik
10:18 AM <joal> yup, call once, save multiple
10:18 AM <addshore> *ttl
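The "call once, save multiple" idea could look roughly like the sketch below. Note that a single call returning several formats is only proposed in the conversation above, so `fetch_all_formats` is a hypothetical callable returning a `{format: body}` mapping, and the file naming is likewise an assumption. A local temporary directory stands in for HDFS.

```python
import tempfile
from pathlib import Path


def save_formats(entity_id, fetch_all_formats, out_dir):
    """Fetch every format for an entity in one call, then write one file per format."""
    bodies = fetch_all_formats(entity_id)  # the single (hypothetical) call
    written = []
    for fmt, body in sorted(bodies.items()):
        target = Path(out_dir) / f"{entity_id}.{fmt}"
        target.write_text(body, encoding="utf-8")
        written.append(target.name)
    return written


# Fake multi-format response, standing in for the proposed API call.
def fake_fetch_all(entity_id):
    return {
        "json": '{"id": "%s"}' % entity_id,
        "ttl": "# turtle for %s" % entity_id,
    }


with tempfile.TemporaryDirectory() as tmp:
    written = save_formats("Q42", fake_fetch_all, tmp)
    saved_ttl = (Path(tmp) / "Q42.ttl").read_text(encoding="utf-8")
```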