The categories graph is currently served through the WDQS Blazegraph cluster via a [[https://www.mediawiki.org/wiki/Wikidata_Query_Service/Categories|SPARQL endpoint]].
As we are trying to find a replacement for Blazegraph, we should also re-evaluate how this endpoint is fed and what backend powers it.
= Data pipeline =
This is about getting the data out of MediaWiki. The current data model is quite simple (GraphQL schema):
```lang=graphql-schema
type Category {
  """ ID for the category page. (The RDF model conflates the page_url and the ID; should we do the same here?) """
  id: ID!
  """ Name of the page """
  name: String!
  """ URL of the category page """
  page_url: String!
  """ Categories this category belongs to """
  parentCategories: [Category!]!
  """ Number of pages belonging to this category (direct relationships) """
  numberOfPages: Int!
  """ Number of categories belonging to this category (direct relationships) """
  numberOfCategories: Int!
}
```
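For illustration, a client call against such a schema could look like the sketch below. The endpoint URL and the `category` root query field are assumptions for illustration, not an existing API.
```lang=python
import json

import requests

# Hypothetical query against the schema above; the `category` root field
# and the endpoint URL are assumptions, not an existing service.
QUERY = """
query {
  category(name: "Physics") {
    name
    page_url
    numberOfPages
    numberOfCategories
    parentCategories {
      name
    }
  }
}
"""

response = requests.post(
    "https://example.org/categories/graphql",  # assumed endpoint
    json={"query": QUERY},
    timeout=30,
)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))
```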
== Dumps/batch approach ([[https://wikitech.wikimedia.org/wiki/Dumps/CategoriesRDF|current approach]]) ==
- Take the RDF dumps generated by MediaWiki weekly
- Apply daily SPARQL updates
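A rough sketch of the daily-update half of a rewritten pipeline, assuming the current dump layout under dumps.wikimedia.org/other/categoriesrdf/ and a SPARQL 1.1 Update endpoint on the serving side (file naming, dates and the endpoint URL are assumptions to verify):
```lang=python
import gzip

import requests

DUMP_URL = "https://dumps.wikimedia.org/other/categoriesrdf"  # assumed layout
SPARQL_UPDATE_ENDPOINT = "http://localhost:3030/categories/update"  # assumed serving endpoint


def apply_daily_update(wiki: str, date: str) -> None:
    """Fetch one daily SPARQL update file and apply it to the store."""
    # File naming is an assumption based on the current dump directory structure.
    url = f"{DUMP_URL}/daily/{date}/{wiki}-daily.sparql.gz"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    update = gzip.decompress(resp.content).decode("utf-8")
    # SPARQL 1.1 Update over HTTP: POST the update body to the endpoint.
    store_resp = requests.post(
        SPARQL_UPDATE_ENDPOINT,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        timeout=300,
    )
    store_resp.raise_for_status()


if __name__ == "__main__":
    apply_daily_update("enwiki", "20240101")
```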
Pros:
- MW code already written
- Can use a stateless & queryable RDF format such as HDT ([[https://www.rdfhdt.org/]]) as a transferable blob to the serving layer
Cons:
- data-pipeline has to be rewritten (it's currently a set of shell scripts)
- best update rate: daily
== Real-time approach ==
Get the data out of MediaWiki and write a stream processor to transform this data into something a graph backend can ingest.
Data needed:
- category -> category link/unlink (to construct the category graph)
- article -> category link/unlink (to get the article counts of direct memberships)
[TODO: add a link to a phab ticket related to the links-tables dse-hackathon project, current slack channel [[https://wikimedia.slack.com/archives/C02GA59R99U|#dse-hackathon-links-tables]].]
A real-time approach requires a stateful service to hold the graph and apply updates as they come out of the stream processor.
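A minimal sketch of such a stream processor, assuming a Kafka topic carrying link/unlink events (the topic name, event shape and choice of client are assumptions; the real events would come out of the links-tables work referenced above):
```lang=python
import json

from kafka import KafkaConsumer  # kafka-python client; the choice of client is an assumption

# Assumed event shape:
#   {"action": "link" | "unlink", "source": "<page or category>", "target": "<category>"}
consumer = KafkaConsumer(
    "mediawiki.categorylinks-change",  # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event["action"] == "link":
        # Add an edge source -> target in the backing graph store.
        print(f"ADD EDGE {event['source']} -> {event['target']}")
    elif event["action"] == "unlink":
        # Remove the edge and recompute the direct-membership counters.
        print(f"REMOVE EDGE {event['source']} -> {event['target']}")
```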
= Serving layer =
Depending on the approach taken above, the data size, and the kind of services we would like to offer, different approaches can be taken:
== Immutable RDF store running in k8s ==
If the data size is not large (<10 GB, so that it can be loaded into the pods' temp space),
[[http://jena.apache.org/documentation/fuseki2/|Fuseki]] on top of [[https://www.rdfhdt.org/manual-of-hdt-integration-with-jena/|Jena+HDT]] can be used as the engine.
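Querying such an instance stays plain SPARQL over HTTP. A sketch assuming the dataset is mounted at `/categories` and reusing the `mediawiki:isInCategory` predicate from the current RDF model (dataset name and exact predicate IRI are assumptions to verify against the dumps):
```lang=python
import requests

FUSEKI_QUERY_ENDPOINT = "http://localhost:3030/categories/sparql"  # assumed dataset name

# Direct parent categories of one category.
QUERY = """
PREFIX mediawiki: <https://www.mediawiki.org/ontology#>
SELECT ?parent WHERE {
  <https://en.wikipedia.org/wiki/Category:Physics> mediawiki:isInCategory ?parent .
} LIMIT 100
"""

resp = requests.post(
    FUSEKI_QUERY_ENDPOINT,
    data={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
resp.raise_for_status()
for binding in resp.json()["results"]["bindings"]:
    print(binding["parent"]["value"])
```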
Pros:
- Still in the RDF space
- Stateless
Cons:
- Limited to the RDF space & clients
- Might not work if the data size is large
- Cannot be real-time
== Mutable store with a property graph ==
Evaluate engines such as Neo4j or Dgraph to serve the data.
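As an illustration of what feeding such an engine could look like, a sketch using the official Neo4j Python driver (connection details, node labels and the relationship name are assumptions):
```lang=python
from neo4j import GraphDatabase  # official Neo4j Python driver

# Connection details are assumptions for illustration.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))


def link_categories(tx, child: str, parent: str) -> None:
    """Upsert both category nodes and the subcategory relationship."""
    tx.run(
        "MERGE (c:Category {name: $child}) "
        "MERGE (p:Category {name: $parent}) "
        "MERGE (c)-[:IN_CATEGORY]->(p)",
        child=child,
        parent=parent,
    )


with driver.session() as session:
    # Hypothetical edge, e.g. taken from the stream processor described above.
    session.execute_write(link_categories, "Quantum mechanics", "Physics")

driver.close()
```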
Pros:
- Might allow using graphql
- Might be better suited for other use cases
- Can be realtime
Cons:
- Out of the RDF space and will require a compatibility layer to avoid breaking existing clients
- Possibly difficult deployments: no support for stateful services on our k8s setup
TODO: URL for a repo to experiment with these ideas