[DSE Hackathon 2021] Move the categories graph out of blazegraph
Open, Low, Public

Description

This is a DSE Hackathon project.

The categories graph is currently served through the WDQS Blazegraph cluster via a SPARQL endpoint.
As we are looking for a replacement for Blazegraph, we should also re-evaluate how this endpoint is fed and what backend powers it.

Data pipeline

This is about getting the data out of MediaWiki. The current data model is quite simple (GraphQL schema):

type Category {
    """ ID for the category page. (The RDF model conflates the page_url and the ID should we do the same here?) """
    id: ID!

    """ Name of the page """
    name: String!

    """ URL of the category page """
    page_url: String!

    """ Categories this category belongs to """
    parentCategories: [Category!]!

    """ Number of pages belonging to this category (direct relationships) """
    numberOfPages: Int!

    """ Number of categories belonging to this category (direct relationships) """
    numberOfCategories: Int!
}
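
For reference, the equivalent data can already be pulled from the current SPARQL endpoint. Below is a minimal sketch in Python; the endpoint URL and the mediawiki: predicates are assumptions based on the WDQS categories namespace and the MediaWiki ontology, so double-check them before relying on this.

import requests

# Assumed current serving endpoint (WDQS categories namespace).
ENDPOINT = "https://query.wikidata.org/bigdata/namespace/categories/sparql"

QUERY = """
PREFIX mediawiki: <https://www.mediawiki.org/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?cat ?name ?pages ?subcats ?parent WHERE {
  ?cat a mediawiki:Category ;
       rdfs:label ?name ;
       mediawiki:pages ?pages ;
       mediawiki:subcategories ?subcats .
  OPTIONAL { ?cat mediawiki:isInCategory ?parent }
}
LIMIT 10
"""

resp = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"})
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    # The page URL doubles as the ID here, the exact conflation noted
    # in the schema comment above.
    print(row["cat"]["value"], row["name"]["value"], row["pages"]["value"])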

Dumps/batch approach (current approach)

  • Take the RDF dumps generated weekly by MediaWiki
  • Apply the daily SPARQL updates on top (both steps are sketched below, after the pros/cons)

Pros:

  • MW code is already written
  • Can use a stateless, queryable RDF format such as HDT (https://www.rdfhdt.org/) as a transferable blob to the serving layer

Cons:

  • The data pipeline has to be rewritten (it is currently a set of shell scripts)
  • Best achievable update rate: daily
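
As a rough sketch of the fetch side of the two batch steps above: the dump root follows https://dumps.wikimedia.org/other/categoriesrdf/, but the exact file naming below is an assumption to be verified against the real layout.

import requests

DUMPS = "https://dumps.wikimedia.org/other/categoriesrdf"

def fetch_weekly_dump(wiki: str, date: str) -> bytes:
    # Full weekly RDF dump for one wiki (Turtle, gzip-compressed).
    url = f"{DUMPS}/{date}/{wiki}-{date}-categories.ttl.gz"  # assumed naming
    resp = requests.get(url, timeout=300)
    resp.raise_for_status()
    return resp.content

def fetch_daily_diff(wiki: str, date: str) -> bytes:
    # Daily SPARQL-update file to replay on top of the weekly dump; the
    # result could then be compiled into an HDT blob for the serving layer.
    url = f"{DUMPS}/daily/{date}/{wiki}-{date}-daily.sparql.gz"  # assumed naming
    resp = requests.get(url, timeout=300)
    resp.raise_for_status()
    return resp.content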

Real-time approach

Get the data out of MediaWiki and write a stream processor to transform it into something a graph backend can ingest.
Data needed:

  • category -> category link/unlink (to construct the category graph)
  • article -> category link/unlink (to get the article counts of direct memberships)

[TODO: add a link to a phab ticket related to the links-tables dse-hackathon project, current slack channel #dse-hackathon-links-tables.]

A real-time approach requires a stateful service to hold the graph and apply updates as they come out of the stream processor.
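
Until a dedicated link/unlink stream exists (see the TODO above), one possible input is the public EventStreams recentchange feed, whose "categorize" events signal category membership changes. Below is a minimal consumer sketch in Python; note that the add/remove direction is only encoded in the free-text comment field, which this sketch does not fully parse.

import json
import requests

STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

with requests.get(STREAM, stream=True,
                  headers={"Accept": "text/event-stream"}) as resp:
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue
        event = json.loads(line[len(b"data: "):])
        if event.get("type") != "categorize":
            continue
        # event["title"] is the category page; the affected member page and
        # the add/remove direction are encoded in the comment.
        print(event["wiki"], event["title"], event.get("comment", ""))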

Serving layer

Depending on the approach taken above, the size of the data, and the kind of services we would like to offer, different approaches can be taken:

Immutable RDF store running in k8s

If the data size is not large (<10 GB, so that it can be loaded into a pod's temp space), Fuseki on top of Jena + HDT can be used as the engine (see the sketch after the pros/cons below).

Pros:

  • Still in the RDF space
  • Stateless

Cons:

  • Limited to the RDF space & clients
  • Might not work if the data size is large
  • Cannot be real-time
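
To illustrate the stateless pattern (Fuseki/Jena would be the actual serving engine), here is a minimal Python sketch that answers SPARQL queries straight from an immutable HDT blob using the rdflib-hdt package; the file name is hypothetical.

from rdflib import Graph
from rdflib_hdt import HDTStore

# categories.hdt is the immutable blob shipped to the pod's temp space.
graph = Graph(store=HDTStore("categories.hdt"))

# Any read-only SPARQL works; there is no write path at all.
results = graph.query("SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }")
for row in results:
    print(row.triples)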

Mutable store with a property graph

Evaluate engines such as Neo4j or Dgraph to serve the data (a minimal ingest sketch follows the pros/cons below).

Pros:

  • Might allow using graphql
  • Might be better suited to other use cases
  • Can be realtime

Cons:

  • Out of the RDF space and will require a compatibility layer to avoid breaking existing clients
  • Possibly difficult deployments: no support for stateful services on our k8s setup
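
As a minimal sketch of the property-graph variant, using the official Neo4j Python driver (5.x); the connection details, node label, and relationship type are illustrative assumptions, not a settled model.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def link_category(tx, child_url: str, parent_url: str):
    # Upsert both category nodes and the membership edge; MERGE makes this
    # idempotent, so the stream processor can safely replay events.
    tx.run(
        "MERGE (c:Category {page_url: $child}) "
        "MERGE (p:Category {page_url: $parent}) "
        "MERGE (c)-[:IN_CATEGORY]->(p)",
        child=child_url, parent=parent_url)

with driver.session() as session:
    session.execute_write(link_category,
                          "https://en.wikipedia.org/wiki/Category:Physics",
                          "https://en.wikipedia.org/wiki/Category:Science")
driver.close()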

Current code base: https://github.com/nomoa/category-graph

Event Timeline

Is this related to T289517?

Not directly related, as these are two different datasets.
This ticket (depending on the direction it takes) could benefit the DCAT-AP work through the idea of an immutable RDF store running in k8s.

MPhamWMF renamed this task from "Move the categories graph out of blazegraph" to "[DSE Hackathon 2021] Move the categories graph out of blazegraph". Oct 6 2021, 8:46 PM
MPhamWMF updated the task description.
MPhamWMF moved this task from Incoming to WDQS-icebox on the Wikidata-Query-Service board.