The categories graph is currently served through the WDQS Blazegraph cluster via a [[https://www.mediawiki.org/wiki/Wikidata_Query_Service/Categories|SPARQL endpoint]].
As we are trying to find a replacement for Blazegraph, we should also re-evaluate how this endpoint is fed and what backend powers it.
= Data pipeline =
This is about getting the data out of MediaWiki. The current data model is quite simple (GraphQL schema):
```lang=graphql-schema
type Category {
  """ ID for the category page. (The RDF model conflates the page_url and the ID; should we do the same here?) """
  id: ID!
  """ Name of the page """
  name: String!
  """ URL of the category page """
  page_url: String!
  """ Categories this category belongs to """
  parentCategories: [Category!]!
  """ Number of pages belonging to this category (direct relationships) """
  numberOfPages: Int!
  """ Number of categories belonging to this category (direct relationships) """
  numberOfCategories: Int!
}
```
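For illustration, a client call against such a schema could look like the sketch below. The endpoint URL and the `category` root query field are assumptions for illustration, not an existing API.
```lang=python
import json

import requests

# Hypothetical query against the schema above; the `category` root field
# and the endpoint URL are assumptions, not an existing service.
QUERY = """
query {
  category(name: "Physics") {
    name
    page_url
    numberOfPages
    numberOfCategories
    parentCategories {
      name
    }
  }
}
"""

response = requests.post(
    "https://example.org/categories/graphql",  # assumed endpoint
    json={"query": QUERY},
    timeout=30,
)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))
```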
== Dumps/batch approach ([[https://wikitech.wikimedia.org/wiki/Dumps/CategoriesRDF|current approach]]) ==
- Take the RDF dumps generated by MediaWiki weekly
- Apply daily SPARQL updates
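A rough sketch of the daily-update half of a rewritten pipeline, assuming the current dump layout under dumps.wikimedia.org/other/categoriesrdf/ and a SPARQL 1.1 Update endpoint on the serving side (file naming, dates and the endpoint URL are assumptions to verify):
```lang=python
import gzip

import requests

DUMP_URL = "https://dumps.wikimedia.org/other/categoriesrdf"  # assumed layout
SPARQL_UPDATE_ENDPOINT = "http://localhost:3030/categories/update"  # assumed serving endpoint


def apply_daily_update(wiki: str, date: str) -> None:
    """Fetch one daily SPARQL update file and apply it to the store."""
    # File naming is an assumption based on the current dump directory structure.
    url = f"{DUMP_URL}/daily/{date}/{wiki}-daily.sparql.gz"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    update = gzip.decompress(resp.content).decode("utf-8")
    # SPARQL 1.1 Update over HTTP: POST the update body to the endpoint.
    store_resp = requests.post(
        SPARQL_UPDATE_ENDPOINT,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        timeout=300,
    )
    store_resp.raise_for_status()


if __name__ == "__main__":
    apply_daily_update("enwiki", "20240101")
```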
Pros:
- MW code already written
- Can use a stateless & queryable RDF format such as HDT ([[https://www.rdfhdt.org/]]) as a transferable blob to the serving layer
Cons:
- data-pipeline has to be rewritten (it's currently a set of shell scripts)
- best update rate: daily
== Real-time approach ==
Get the data out of MediaWiki and write a stream processor to transform this data into something a graph backend can ingest.
Data needed:
- category -> category link/unlink (to construct the category graph)
- article -> category link/unlink (to get the article counts of direct memberships)
[TODO: add a link to a phab ticket related to the links-tables dse-hackathon project, current slack channel [[https://wikimedia.slack.com/archives/C02GA59R99U|#dse-hackathon-links-tables]].]
A real-time approach requires a stateful service to hold the graph and apply updates as they come out of the stream processor.
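A minimal sketch of such a stream processor, assuming a Kafka topic carrying link/unlink events (the topic name, event shape and choice of client are assumptions; the real events would come out of the links-tables work referenced above):
```lang=python
import json

from kafka import KafkaConsumer  # kafka-python client; the choice of client is an assumption

# Assumed event shape:
#   {"action": "link" | "unlink", "source": "<page or category>", "target": "<category>"}
consumer = KafkaConsumer(
    "mediawiki.categorylinks-change",  # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event["action"] == "link":
        # Add an edge source -> target in the backing graph store.
        print(f"ADD EDGE {event['source']} -> {event['target']}")
    elif event["action"] == "unlink":
        # Remove the edge and recompute the direct-membership counters.
        print(f"REMOVE EDGE {event['source']} -> {event['target']}")
```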
= Serving layer =
Depending on the approach taken above, the data size, and the kind of services we would like to offer, different approaches can be taken:
== Immutable RDF store running in k8s ==
If the data size is not large (<10 GB, so that it can be loaded into the pods' temp space),
[[http://jena.apache.org/documentation/fuseki2/|Fuseki]] on top of [[https://www.rdfhdt.org/manual-of-hdt-integration-with-jena/|Jena+HDT]] can be used as the engine.
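Querying such an instance stays plain SPARQL over HTTP. A sketch assuming the dataset is mounted at `/categories` and reusing the `mediawiki:isInCategory` predicate from the current RDF model (dataset name and exact predicate IRI are assumptions to verify against the dumps):
```lang=python
import requests

FUSEKI_QUERY_ENDPOINT = "http://localhost:3030/categories/sparql"  # assumed dataset name

# Direct parent categories of one category.
QUERY = """
PREFIX mediawiki: <https://www.mediawiki.org/ontology#>
SELECT ?parent WHERE {
  <https://en.wikipedia.org/wiki/Category:Physics> mediawiki:isInCategory ?parent .
} LIMIT 100
"""

resp = requests.post(
    FUSEKI_QUERY_ENDPOINT,
    data={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
resp.raise_for_status()
for binding in resp.json()["results"]["bindings"]:
    print(binding["parent"]["value"])
```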
Pros:
- Still in the RDF space
- Stateless
Cons:
- Limited to the RDF space & clients
- Might not work if the data size is large
- Cannot be real-time
== Mutable store with a property graph ==
Evaluate engines such as Neo4j or Dgraph to serve the data.
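As an illustration of what feeding such an engine could look like, a sketch using the official Neo4j Python driver (connection details, node labels and the relationship name are assumptions):
```lang=python
from neo4j import GraphDatabase  # official Neo4j Python driver

# Connection details are assumptions for illustration.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))


def link_categories(tx, child: str, parent: str) -> None:
    """Upsert both category nodes and the subcategory relationship."""
    tx.run(
        "MERGE (c:Category {name: $child}) "
        "MERGE (p:Category {name: $parent}) "
        "MERGE (c)-[:IN_CATEGORY]->(p)",
        child=child,
        parent=parent,
    )


with driver.session() as session:
    # Hypothetical edge, e.g. taken from the stream processor described above.
    session.execute_write(link_categories, "Quantum mechanics", "Physics")

driver.close()
```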
Pros:
- Might allow using graphql
- Might be better suited for other use cases
- Can be realtime
Cons:
- Out of the RDF space and will require a compatibility layer to avoid breaking existing clients
- Possibly difficult deployments: no support for stateful services on our k8s setup
TODO: URL for a repo to experiment with these ideas