
Consider moving WDQS "munging" of RDF into Wikibase RDF output code
Open, Needs Triage, Public

Description

Right now this is just an idea to be discussed; we should collect reasons for and against it.
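
For concreteness, a toy sketch of the kind of rewriting "munging" refers to: stripping the page-level metadata that Special:EntityData wraps around the entity triples. This is not the actual WDQS Munger (which lives in the Java wikidata-query-rdf codebase); the predicate list below is illustrative only.

```python
# Toy munge-style cleanup pass using rdflib - an illustration only,
# not the real Munger logic from wikidata-query-rdf.
from rdflib import Graph, Namespace

SCHEMA = Namespace("http://schema.org/")

def munge(raw_ttl: str) -> Graph:
    g = Graph()
    g.parse(data=raw_ttl, format="turtle")
    # Drop document-level wrapper triples (schema:about, schema:version,
    # schema:dateModified) so only the entity's own data remains.
    for triple in list(g):
        if triple[1] in (SCHEMA.about, SCHEMA.version, SCHEMA.dateModified):
            g.remove(triple)
    return g
```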

Event Timeline

Addshore renamed this task from Moved WDQS "munging" of RDF into WIkibase RDF output code to Move WDQS "munging" of RDF into WIkibase RDF output code.Jul 23 2021, 7:55 AM

First little chat on the topic

8:58 AM <addshore> Hi all, I just wrote https://phabricator.wikimedia.org/T287231 for moving WDQS munging into wikibase itself, and general RDF output flavours.
8:58 AM <addshore> let me know if you have thoughts on this
9:04 AM <+zpapierski> not a fan, honestly :)
9:04 AM <addshore> Would love to collect reasons :)
9:05 AM <+zpapierski> I have two main reasons for that
9:05 AM <+zpapierski> 1) Munging is a process tied to WDQS itself - basically it produces the "canonical" model for Blazegraph
9:06 AM <+zpapierski> "canonical" is quoted, because of course it's hard to talk about actual model when it comes to RDF :)
9:07 AM <+zpapierski> which means that pushing it into wikibase makes it harder to modify, at a time when we'll be actively pursuing new solutions for the query engine itself
9:07 AM <addshore> Is it also useful for other triple stores, for the same reason that it is useful for blazegraph?
9:09 AM <addshore> I can imagine a world in which wdqs itself continues to keep "munging" code in its own code base for further rapid changing of the wdqs specific flavour
9:09 AM <+zpapierski> 2) It pushes wikibase more toward a monolithic architecture. Munging is a thing that needs to happen on the input data anyway, whether inside wikibase or outside of it, but pushing it into wikibase automatically makes anything dependent on munging more coupled to it
9:09 AM <addshore> but also then the "stable" version of that flavour is upstreamed to use by others
9:09 AM <+zpapierski> I don't disagree with your argument
9:09 AM <+zpapierski> but I think we can achieve the same thing without pushing this into wikibase
9:10 AM <+zpapierski> if you think about it that way, sure
9:10 AM <addshore> ooooh, interesting, I'd love to hear more about that and how you think that could work well!
9:10 AM <+zpapierski> I'm not sure how many use cases there would be outside self-deployments of wdqs though
9:10 AM <addshore> the only way I can think of is a microservice & also a separate script for munging dumps etc
9:10 AM <addshore> right now in the wild I think there are about as many wdqs instances as there are in production
9:11 AM <addshore> and the number of people trying the same but with different backends continues to increase
9:11 AM <+zpapierski> but you're already hearing about it all the time - streaming updater will produce a public stream of events, anybody can subscribe to munged triples
9:11 AM <+zpapierski> that last one is interesting - you know more about it?
9:11 AM <+zpapierski> maybe someone already fixed some of our issues :) ?
9:11 AM <addshore> In theory, how easy will the streaming updater be for non-WMF folks to run?
9:12 AM <addshore> Yeah, unfortunately I don't have many details of success, just normally folks running into issues :P
9:12 AM <+zpapierski> tough question, it requires some elementary knowledge of Flink
9:13 AM <+zpapierski> but in most cases they won't need to run it - the stream will be public
9:13 AM <+zpapierski> (at least that's the idea)
9:13 AM <addshore> well, not for non-Wikidata Wikibases - we still need to cover the updating case there
9:13 AM <addshore> and the question is, is a whole streaming updater the right solution for a wikibase that gets 10 edits a day
9:14 AM <+zpapierski> in any case, there needs to be an easy solution - WDQS will basically be coupled to the updater - that's a good point
9:14 AM <+zpapierski> sure - but if somebody picks up WDQS for their small wikibase, how will you guarantee the compatibility of wikibase munging vs streaming updater processing?
9:15 AM <+zpapierski> like I said, imho we still need to make streaming updater easily deployable
9:15 AM <addshore> that's a good question, but putting the munging in wikibase avoids having it in 2 places I figure
9:15 AM <addshore> If that is the path that we have to take and streaming updater is not ideal for small wikibases
9:15 AM <+zpapierski> I see where you're coming from, I really do
9:16 AM <+zpapierski> it just feels like strong coupling at a time when it will hurt us with the blazegraph transition
9:16 AM <addshore> I'm going to copy this chat log into a comment in the ticket :) I think this was already a good chat, there is no real rush on this at all, just keeping it in thought etc!
9:16 AM <+zpapierski> ok :)
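
(A rough sketch of what subscribing to the public stream of munged triples mentioned above might look like. The Kafka topic name, broker address, and message fields here are all hypothetical - the streaming updater's public interface was still being designed at the time of this chat.)

```python
# Hypothetical consumer for a public stream of munged triples; the topic
# name, broker, and message schema are assumptions for illustration, not
# the streaming updater's real interface.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "rdf-streaming-updater.mutation",    # assumed topic name
    bootstrap_servers="localhost:9092",  # assumed broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Assume each event identifies an entity and carries RDF to add/remove.
    print(event.get("entity"), "added:", len(event.get("rdf_added_data", "")))
```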

Addshore renamed this task from Move WDQS "munging" of RDF into WIkibase RDF output code to Consider moving WDQS "munging" of RDF into WIkibase RDF output code.Jul 23 2021, 8:32 AM
TJones renamed this task from Consider moving WDQS "munging" of RDF into WIkibase RDF output code to Consider moving WDQS "munging" of RDF into Wikibase RDF output code.Jul 26 2021, 3:25 PM

Cross-posting from T245542: Make rdfDump.php export in WDQS-ready format

This is probably strongly related to T287231: Consider moving WDQS "munging" of RDF into Wikibase RDF output code
If the desired output for WDQS were held within Wikibase or a PHP extension, then:

  1. dumpRdf could dump data in the format expected by WDQS, removing the need for munging
  2. Special:EntityData could serve the right format, avoiding the need for munging during updates (see the sketch after this list)
  3. The same RDF could more easily be pushed into other stores (not Blazegraph)
  4. Other update mechanisms (such as the use of MediaWiki jobs) become more realistic for third-party Wikibases
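
A minimal sketch of point 2, assuming the standard Special:EntityData endpoint: the flavor query parameter already exists (flavor=dump is what the WDQS updater requests today), and the idea would be to serve a fully wdqs-ready flavor the same way. The helper function here is hypothetical.

```python
# Sketch of requesting a specific RDF flavor from Special:EntityData.
# flavor=dump exists today; a dedicated wdqs-ready flavor is the idea
# under discussion, not something that exists.
import requests

def fetch_entity_ttl(base_url: str, entity_id: str, flavor: str = "dump") -> str:
    url = f"{base_url}/wiki/Special:EntityData/{entity_id}.ttl"
    resp = requests.get(url, params={"flavor": flavor})
    resp.raise_for_status()
    return resp.text

print(fetch_entity_ttl("https://www.wikidata.org", "Q42")[:200])
```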

I would like to understand the reasons why there are two versions of Wikibase RDF: one that is, for example, returned by requesting it for an item such as https://artbase.rhizome.org/wiki/Special:EntityData/Q2585.ttl, and one that is first munged and then handed over to the triple store. Given the architecture of Wikibase, the RDF exists solely to enable SPARQL querying and federation; it is not the actual source of truth or canonical storage format, but a derivative of the JSON blobs stored in MySQL. How would exporting the desired RDF format directly increase coupling, rather than removing an unnecessary step?
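
One rough way to see the difference in practice is to parse the Special:EntityData output cited above and count the document-level metadata triples that munging would strip. The predicates checked are illustrative, matching the toy sketch in the description, not the real Munger's full rule set.

```python
# Rough probe of how much page-level metadata the plain Special:EntityData
# output carries; the predicate list is illustrative only.
import requests
from rdflib import Graph, Namespace

SCHEMA = Namespace("http://schema.org/")
url = "https://artbase.rhizome.org/wiki/Special:EntityData/Q2585.ttl"
g = Graph()
g.parse(data=requests.get(url).text, format="turtle")
meta = [t for t in g if t[1] in (SCHEMA.about, SCHEMA.version, SCHEMA.dateModified)]
print(f"{len(meta)} page-metadata triples out of {len(g)} total")
```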