Page MenuHomePhabricator

Support external tabular datasets in WDQS
Open, MediumPublic


What would be the best way to integrate WDQS with the tabular data, as well as other CSV sources? For example, if a large dataset provider publishes CSV or TSV files, and WDQS wants to federate with it, we could do something like this (this example should be modified as we flash out the exact interface)

  # This is a well known public data source for stock quotes - Tesla daily, first lines:
  # Date	Open	High	Low	Close	Volume	Ex-Dividend	Split Ratio	Adj. Open	Adj. High	Adj. Low	Adj. Close	Adj. Volume
  # 2017-11-24	313.79	316.41	311	315.55	3242220	0	1	313.79	316.41	311	315.55	3242220
  SERVICE wikibase:tabular {
    # Data location
    bd:serviceParam wikibase:url <> .

    # CSVFormat constant, e.g. EXCEL, MYSQL, RFC4180, TDF. Default = 'DEFAULT'
    bd:serviceParam wikibase:csvFormat 'DEFAULT' .
    # If true, treat the first row as header. Default - depends on csvFormat
    bd:serviceParam wikibase:firstRowIsHeader true .
    # If true, use tabular:<column_name>, otherwise use tabular:<column_number> (1-based)
    # By default, this value is the same as firstRowIsHeader
    bd:serviceParam wikibase:csvColumnByName true .

    # Parse columns into variables by their name
    ?date    tabular:Date  'date:yyyy-mm-dd' .  # parse as date
    ?dateStr tabular:Close 'string' .           # unparsed date value
    ?close   tabular:Close 'double' .           # parse as double

  # Extract a single date/close value
  FILTER ( ?date = "2017-11-24T00:00:00Z"^^xsd:dateTime )

Expected result:


Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Yurik updated the task description. (Show Details)
Yurik updated the task description. (Show Details)

I don’t think we should allow integration with datasets in arbitrary URLs, for the same reasons that we don’t allow federation to arbitrary SPARQL endpoints (DoSing a website from WMF’s servers, license issues, …). Let’s restrict this to tabular data on Commons. (That also resolves the need for supporting different CSV formats.)

@Lucas_Werkmeister_WMDE I agree - I am planning to implement this feature for both WDQS and Sophox QS. For WDQS, it should only support tabular datasets, or possibly other respected sources.

The first version of this feature has been implemented in Sophox -- see docs. At this point, it supports any GET request that returns CSV-style data (parsable by Java's CSVParser, with many parameters).

If @Smalyshev has any spare time to review the code at , I will try to port it to support .tab pages as well, with a slightly different set of input parameters.

Some thoughts on the implementation:

Parameter matching seems to be backwards:

?url tabular:url 'uri' .
?type tabular:type 'string' .

If I understand it right, 'uri', 'string', 'integer' etc. are types. As such, there's a limited set of them and they should be predicates. While the column names - which are user-defined - should be strings. I.e.:

?url wikibase:csvUri "url" .
?type wikibase:csvString "type" .


Not sure about using wikibase: prefix - this data is not exactly related to wikibase. Maybe some other prefix is better.

About the code:

  • Not sure why it uses MultiSearchIterator - there doesn't seem to be any usage of external bindings? Unless it is allowed to use variable in binding URL? Is that necessary? I would imagine most usage would be with constant URL, am I wrong?
  • Same goes for all query parameters - they are all accept bindings. Do we really need to make them all variable? Is there a use case for it?
  • otQuery probably shouldn't be in generic WDQS code, so if we merge this into WDQS code, we need to see how it can be supported.
  • I think the URL should be checked against whitelist.txt. We can have another whitelist but I think it'd be just complicating the matters.
  • Would be nice to get it as a patch to WDQS in gerrit (sans OSM-specific code)

@Smalyshev the reason I made type as a string is to allow additional parsing parameters, e.g. ?start tabular:startDate 'date:yyyy-mm-dd'

  • MultiSearchIterator and binding params

Correct, most usages would be static, but in theory it might be possible to supply URL or other parsing params depending on some dynamic calculation and other data, right? Its not a must have requirement, but if there are no performance or other major disadvantages, I think it's better to support both?

  • I am not sure what would be the best way to split WDQS and OSM code, yet package it together. Any suggestions?
  • whitelist - agree, if whitelist exists, i think it should be used.

Its not a must have requirement, but if there are no performance or other major disadvantages,

Complexity is one. I'm a big fan of YAGNI :) I'm not 100% against supporting it, but doubtful whether it's actually needed.

I am not sure what would be the best way to split WDQS and OSM code

Since context controls the link between service URI and class, we could have two classes, one implementing base API and one extending it, and have WDQS part only have the base class and OSM part have extended class and put it as URI implementation instead of the base one. Not sure how easy this is to do but it looks possible (without looking deeply into the code).

Smalyshev changed the task status from Open to Stalled.Dec 13 2018, 6:21 PM
Smalyshev triaged this task as Medium priority.

Just a bit of info on why I've nominated this for the 2020 hackthon:
There is currently a major issue with storing statistical data in Wikidata, which would be solved if we could upload the data to Commons as Tabular Data files.
It's proving unsustainable to use qualifiers to break down statistical data by date, let alone using any further breakdown (e.g. gender or age group). Items which need to store such data (e.g. countries) end up far too cluttered with statements so we simply have to choose not to keep historical data or any further breakdown.
We can currently store the tabular data on Commons, but without being able to query it alongside data in Wikidata it has a tiny fraction of the value. Using UNESCO and World Bank data for example, we would be able to generate charts like (Percentage of Primary age girls out of school in Algeria since 1970). Obviously that is just one example in one domain, the possibilities are truly endless!!

@NavinoEvans I agree - feel free to take my implementation (which was already working for any CSV-style inputs), and extend/adapt it. Ideally, it should be merged upstream to the Blazegraph, so it should support any kind of CSVs. It may make sense to have either some sort of a wrapper for the tabular datasets as an extension to Blazegraph, or alternatively to extend the jsonconfig's API to be able to get CSV directly (which might be a better solution, as it would allow other, non-blazegraph usages)

Aklapper changed the task status from Stalled to Open.Nov 8 2020, 2:10 PM

The previous comments don't explain who or what (task?) exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status, as tasks should not be stalled (and then potentially forgotten) for years for unclear reasons.

(Smallprint, as general orientation for task management:
If you wanted to express that nobody is currently working on this task, then the assignee should be removed and/or priority could be lowered instead.
If work on this task is blocked by another task, then that other task should be added via Edit Related Tasks...Edit Subtasks.
If this task is stalled on an upstream project, then the Upstream tag should be added.
If this task requires info from the task reporter, then there should be instructions which info is needed.
If this task needs retesting, then the TestMe tag should be added.
If this task is out of scope and nobody should ever work on this, or nobody else managed to reproduce the situation described here, then it should have the "Declined" status.
If the task is valid but should not appear on some team's workboard, then the team project tag should be removed while the task has another active project tag.)

So9q added a subscriber: So9q.