[DSE Hackathon 2021] Move the categories graph out of blazegraph
Open, Low, Public

Description

This is a DSE Hackathon project.

The categories graph is currently served through the WDQS Blazegraph cluster via a SPARQL endpoint.
As we are looking for a replacement for Blazegraph, we should also re-evaluate how this endpoint is fed and what backend powers it.

Data pipeline

This is about getting the data out of MediaWiki. The current data model is quite simple (GraphQL schema):

type Category {
    """ ID for the category page. (The RDF model conflates the page_url and the ID should we do the same here?) """
    id: ID!

    """ Name of the page """
    name: String!

    """ URL of the category page """
    page_url: String!

    """ Categories this category belongs to """
    parentCategories: [Category!]!

    """ Number of pages belonging to this category (direct relationships) """
    numberOfPages: Int!

    """ Number of categories belonging to this category (direct relationships) """
    numberOfCategories: Int!
}
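
For reference, the equivalent data can already be pulled from the current SPARQL endpoint. Below is a minimal sketch in Python; the endpoint URL and the mediawiki: predicates are assumptions based on the WDQS categories namespace and the MediaWiki ontology, so double-check them before relying on this.

import requests

# Assumed current serving endpoint (WDQS categories namespace).
ENDPOINT = "https://query.wikidata.org/bigdata/namespace/categories/sparql"

QUERY = """
PREFIX mediawiki: <https://www.mediawiki.org/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?cat ?name ?pages ?subcats ?parent WHERE {
  ?cat a mediawiki:Category ;
       rdfs:label ?name ;
       mediawiki:pages ?pages ;
       mediawiki:subcategories ?subcats .
  OPTIONAL { ?cat mediawiki:isInCategory ?parent }
}
LIMIT 10
"""

resp = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"})
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    # The page URL doubles as the ID here, the exact conflation noted
    # in the schema comment above.
    print(row["cat"]["value"], row["name"]["value"], row["pages"]["value"])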

Dumps/batch approach (current approach)

  • Take the RDF dumps generated weekly by MediaWiki
  • Apply the daily SPARQL updates on top (both steps are sketched below, after the pros/cons)

Pros:

  • MW code is already written
  • Can use a stateless, queryable RDF format such as HDT (https://www.rdfhdt.org/) as a transferable blob to the serving layer

Cons:

  • The data pipeline has to be rewritten (it is currently a set of shell scripts)
  • Best achievable update rate: daily
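
As a rough sketch of the fetch side of the two batch steps above: the dump root follows https://dumps.wikimedia.org/other/categoriesrdf/, but the exact file naming below is an assumption to be verified against the real layout.

import requests

DUMPS = "https://dumps.wikimedia.org/other/categoriesrdf"

def fetch_weekly_dump(wiki: str, date: str) -> bytes:
    # Full weekly RDF dump for one wiki (Turtle, gzip-compressed).
    url = f"{DUMPS}/{date}/{wiki}-{date}-categories.ttl.gz"  # assumed naming
    resp = requests.get(url, timeout=300)
    resp.raise_for_status()
    return resp.content

def fetch_daily_diff(wiki: str, date: str) -> bytes:
    # Daily SPARQL-update file to replay on top of the weekly dump; the
    # result could then be compiled into an HDT blob for the serving layer.
    url = f"{DUMPS}/daily/{date}/{wiki}-{date}-daily.sparql.gz"  # assumed naming
    resp = requests.get(url, timeout=300)
    resp.raise_for_status()
    return resp.content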

Real-time approach

Get the data out of MediaWiki and write a stream processor to transform it into something a graph backend can ingest.
Data needed:

  • category -> category link/unlink (to construct the category graph)
  • article -> category link/unlink (to get the article counts of direct memberships)

[TODO: add a link to a phab ticket related to the links-tables dse-hackathon project, current slack channel #dse-hackathon-links-tables.]

A real-time approach requires a stateful service to hold the graph and apply updates as they come out of the stream processor.
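
Until a dedicated link/unlink stream exists (see the TODO above), one possible input is the public EventStreams recentchange feed, whose "categorize" events signal category membership changes. Below is a minimal consumer sketch in Python; note that the add/remove direction is only encoded in the free-text comment field, which this sketch does not fully parse.

import json
import requests

STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

with requests.get(STREAM, stream=True,
                  headers={"Accept": "text/event-stream"}) as resp:
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue
        event = json.loads(line[len(b"data: "):])
        if event.get("type") != "categorize":
            continue
        # event["title"] is the category page; the affected member page and
        # the add/remove direction are encoded in the comment.
        print(event["wiki"], event["title"], event.get("comment", ""))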

Serving layer

Depending on the approach taken above, the size of the data, and the kind of services we would like to offer, different approaches can be taken:

Immutable RDF store running in k8s

If the data size is not large (<10 GB, so that it can be loaded into a pod's temp space), Fuseki on top of Jena + HDT can be used as the engine (see the sketch after the pros/cons below).

Pros:

  • Still in the RDF space
  • Stateless

Cons:

  • Limited to the RDF space & clients
  • Might not work if the data size is large
  • Cannot be real-time
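
To illustrate the stateless pattern (Fuseki/Jena would be the actual serving engine), here is a minimal Python sketch that answers SPARQL queries straight from an immutable HDT blob using the rdflib-hdt package; the file name is hypothetical.

from rdflib import Graph
from rdflib_hdt import HDTStore

# categories.hdt is the immutable blob shipped to the pod's temp space.
graph = Graph(store=HDTStore("categories.hdt"))

# Any read-only SPARQL works; there is no write path at all.
results = graph.query("SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }")
for row in results:
    print(row.triples)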

Mutable store with a property graph

Evaluate engines such as Neo4j or Dgraph to serve the data (a minimal ingest sketch follows the pros/cons below).

Pros:

  • Might allow using graphql
  • Might be better suited to other use cases
  • Can be realtime

Cons:

  • Out of the RDF space and will require a compatibility layer to avoid breaking existing clients
  • Possibly difficult deployments: no support for stateful services on our k8s setup
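
As a minimal sketch of the property-graph variant, using the official Neo4j Python driver (5.x); the connection details, node label, and relationship type are illustrative assumptions, not a settled model.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def link_category(tx, child_url: str, parent_url: str):
    # Upsert both category nodes and the membership edge; MERGE makes this
    # idempotent, so the stream processor can safely replay events.
    tx.run(
        "MERGE (c:Category {page_url: $child}) "
        "MERGE (p:Category {page_url: $parent}) "
        "MERGE (c)-[:IN_CATEGORY]->(p)",
        child=child_url, parent=parent_url)

with driver.session() as session:
    session.execute_write(link_category,
                          "https://en.wikipedia.org/wiki/Category:Physics",
                          "https://en.wikipedia.org/wiki/Category:Science")
driver.close()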

Current code base: https://github.com/nomoa/category-graph

Event Timeline

Is this related to T289517?

Not directly related, as these are two different datasets.
This ticket (depending on the direction it takes) could benefit the DCAT-AP work through the idea of an immutable RDF store running in k8s.

MPhamWMF renamed this task from "Move the categories graph out of blazegraph" to "[DSE Hackathon 2021] Move the categories graph out of blazegraph". Oct 6 2021, 8:46 PM
MPhamWMF updated the task description.
MPhamWMF moved this task from Incoming to WDQS-icebox on the Wikidata-Query-Service board.