Page MenuHomePhabricator

Support external tabular datasets in WDQS
Open, Stalled, MediumPublic

Description

What would be the best way to integrate WDQS with the tabular data, as well as other CSV sources? For example, if a large dataset provider publishes CSV or TSV files, and WDQS wants to federate with it, we could do something like this (this example should be modified as we flash out the exact interface)

SELECT * WHERE {
  # This is a well known public data source for stock quotes - Tesla daily, first lines:
  # Date	Open	High	Low	Close	Volume	Ex-Dividend	Split Ratio	Adj. Open	Adj. High	Adj. Low	Adj. Close	Adj. Volume
  # 2017-11-24	313.79	316.41	311	315.55	3242220	0	1	313.79	316.41	311	315.55	3242220
  SERVICE wikibase:tabular {
    # Data location
    bd:serviceParam wikibase:url <https://www.quandl.com/api/v3/datasets/WIKI/TSLA.csv> .

    # CSVFormat constant, e.g. EXCEL, MYSQL, RFC4180, TDF. Default = 'DEFAULT'
    bd:serviceParam wikibase:csvFormat 'DEFAULT' .
    # If true, treat the first row as header. Default - depends on csvFormat
    bd:serviceParam wikibase:firstRowIsHeader true .
    # If true, use tabular:<column_name>, otherwise use tabular:<column_number> (1-based)
    # By default, this value is the same as firstRowIsHeader
    bd:serviceParam wikibase:csvColumnByName true .

    # Parse columns into variables by their name
    ?date    tabular:Date  'date:yyyy-mm-dd' .  # parse as date
    ?dateStr tabular:Close 'string' .           # unparsed date value
    ?close   tabular:Close 'double' .           # parse as double
  }

  # Extract a single date/close value
  FILTER ( ?date = "2017-11-24T00:00:00Z"^^xsd:dateTime )
}

Expected result:

?date?dateStr?close
2017-11-242017-11-24315.55

Event Timeline

Yurik created this task.Nov 25 2017, 12:18 AM
Restricted Application added projects: Wikidata, Discovery. · View Herald TranscriptNov 25 2017, 12:18 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Yurik updated the task description. (Show Details)Nov 25 2017, 12:20 AM
Yurik updated the task description. (Show Details)
Yurik updated the task description. (Show Details)Nov 28 2017, 2:10 AM
Yurik updated the task description. (Show Details)
Yurik updated the task description. (Show Details)Dec 18 2017, 3:08 AM

I don’t think we should allow integration with datasets in arbitrary URLs, for the same reasons that we don’t allow federation to arbitrary SPARQL endpoints (DoSing a website from WMF’s servers, license issues, …). Let’s restrict this to tabular data on Commons. (That also resolves the need for supporting different CSV formats.)

@Lucas_Werkmeister_WMDE I agree - I am planning to implement this feature for both WDQS and Sophox QS. For WDQS, it should only support tabular datasets, or possibly other respected sources.

Yurik added a comment.Dec 26 2017, 4:10 AM

The first version of this feature has been implemented in Sophox -- see docs. At this point, it supports any GET request that returns CSV-style data (parsable by Java's CSVParser, with many parameters).

If @Smalyshev has any spare time to review the code at https://github.com/nyurik/wikidata-query-rdf/tree/tabular , I will try to port it to support .tab pages as well, with a slightly different set of input parameters.

Smalyshev moved this task from Backlog to Next on the User-Smalyshev board.Jan 30 2018, 11:12 PM

Some thoughts on the implementation:

Parameter matching seems to be backwards:

?url tabular:url 'uri' .
?type tabular:type 'string' .

If I understand it right, 'uri', 'string', 'integer' etc. are types. As such, there's a limited set of them and they should be predicates. While the column names - which are user-defined - should be strings. I.e.:

?url wikibase:csvUri "url" .
?type wikibase:csvString "type" .

etc.

Not sure about using wikibase: prefix - this data is not exactly related to wikibase. Maybe some other prefix is better.

About the code:

  • Not sure why it uses MultiSearchIterator - there doesn't seem to be any usage of external bindings? Unless it is allowed to use variable in binding URL? Is that necessary? I would imagine most usage would be with constant URL, am I wrong?
  • Same goes for all query parameters - they are all accept bindings. Do we really need to make them all variable? Is there a use case for it?
  • otQuery probably shouldn't be in generic WDQS code, so if we merge this into WDQS code, we need to see how it can be supported.
  • I think the URL should be checked against whitelist.txt. We can have another whitelist but I think it'd be just complicating the matters.
  • Would be nice to get it as a patch to WDQS in gerrit (sans OSM-specific code)
Yurik added a comment.Feb 28 2018, 9:40 PM

@Smalyshev the reason I made type as a string is to allow additional parsing parameters, e.g. ?start tabular:startDate 'date:yyyy-mm-dd'

  • MultiSearchIterator and binding params

Correct, most usages would be static, but in theory it might be possible to supply URL or other parsing params depending on some dynamic calculation and other data, right? Its not a must have requirement, but if there are no performance or other major disadvantages, I think it's better to support both?

  • I am not sure what would be the best way to split WDQS and OSM code, yet package it together. Any suggestions?
  • whitelist - agree, if whitelist exists, i think it should be used.

Its not a must have requirement, but if there are no performance or other major disadvantages,

Complexity is one. I'm a big fan of YAGNI :) I'm not 100% against supporting it, but doubtful whether it's actually needed.

I am not sure what would be the best way to split WDQS and OSM code

Since context controls the link between service URI and class, we could have two classes, one implementing base API and one extending it, and have WDQS part only have the base class and OSM part have extended class and put it as URI implementation instead of the base one. Not sure how easy this is to do but it looks possible (without looking deeply into the code).

Smalyshev changed the task status from Open to Stalled.Dec 13 2018, 6:21 PM
Smalyshev triaged this task as Medium priority.