
Support external tabular datasets in WDQS
Open, Medium, Public

Description

What would be the best way to integrate WDQS with the tabular data on Commons, as well as other CSV sources? For example, if a large dataset provider publishes CSV or TSV files and WDQS wants to federate with it, we could do something like this (the example should be adjusted as we flesh out the exact interface):

SELECT * WHERE {
  # This is a well known public data source for stock quotes - Tesla daily, first lines:
  # Date	Open	High	Low	Close	Volume	Ex-Dividend	Split Ratio	Adj. Open	Adj. High	Adj. Low	Adj. Close	Adj. Volume
  # 2017-11-24	313.79	316.41	311	315.55	3242220	0	1	313.79	316.41	311	315.55	3242220
  SERVICE wikibase:tabular {
    # Data location
    bd:serviceParam wikibase:url <https://www.quandl.com/api/v3/datasets/WIKI/TSLA.csv> .

    # CSVFormat constant, e.g. EXCEL, MYSQL, RFC4180, TDF. Default = 'DEFAULT'
    bd:serviceParam wikibase:csvFormat 'DEFAULT' .
    # If true, treat the first row as header. Default - depends on csvFormat
    bd:serviceParam wikibase:firstRowIsHeader true .
    # If true, use tabular:<column_name>, otherwise use tabular:<column_number> (1-based)
    # By default, this value is the same as firstRowIsHeader
    bd:serviceParam wikibase:csvColumnByName true .

    # Parse columns into variables by their name
    ?date    tabular:Date  'date:yyyy-mm-dd' .  # parse as date
    ?dateStr tabular:Date  'string' .           # unparsed date value
    ?close   tabular:Close 'double' .           # parse as double
  }

  # Extract a single date/close value
  FILTER ( ?date = "2017-11-24T00:00:00Z"^^xsd:dateTime )
}

Expected result:

?date        ?dateStr     ?close
2017-11-24   2017-11-24   315.55

Event Timeline

Yurik updated the task description.

I don’t think we should allow integration with datasets in arbitrary URLs, for the same reasons that we don’t allow federation to arbitrary SPARQL endpoints (DoSing a website from WMF’s servers, license issues, …). Let’s restrict this to tabular data on Commons. (That also resolves the need for supporting different CSV formats.)

@Lucas_Werkmeister_WMDE I agree - I am planning to implement this feature for both WDQS and Sophox. For WDQS, it should only support Commons tabular datasets, or possibly other trusted sources.

The first version of this feature has been implemented in Sophox -- see docs. At this point, it supports any GET request that returns CSV-style data (parsable by Java's CSVParser, with many parameters).

If @Smalyshev has any spare time to review the code at https://github.com/nyurik/wikidata-query-rdf/tree/tabular , I will try to port it to support .tab pages as well, with a slightly different set of input parameters.

Some thoughts on the implementation:

Parameter matching seems to be backwards:

?url tabular:url 'uri' .
?type tabular:type 'string' .

If I understand it right, 'uri', 'string', 'integer' etc. are types. As such, there's a limited set of them and they should be predicates, while the column names - which are user-defined - should be strings. I.e.:

?url wikibase:csvUri "url" .
?type wikibase:csvString "type" .

etc.

Not sure about using wikibase: prefix - this data is not exactly related to wikibase. Maybe some other prefix is better.

About the code:

  • Not sure why it uses MultiSearchIterator - there doesn't seem to be any usage of external bindings? Unless it is allowed to use a variable in the URL binding? Is that necessary? I would imagine most usage would be with a constant URL, am I wrong?
  • Same goes for all the query parameters - they all accept bindings. Do we really need to make them all variable? Is there a use case for it?
  • otQuery probably shouldn't be in generic WDQS code, so if we merge this into WDQS code, we need to see how it can be supported.
  • I think the URL should be checked against whitelist.txt. We could have another whitelist, but I think that would just complicate matters.
  • Would be nice to get it as a patch to WDQS in gerrit (sans OSM-specific code)

@Smalyshev the reason I made the type a string is to allow additional parsing parameters, e.g. ?start tabular:startDate 'date:yyyy-mm-dd'.
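For illustration, a hypothetical binding set under that string-typed scheme might look like the following - the URL, column names, and format strings here are invented, and the exact syntax is whatever the service ends up accepting:

SERVICE wikibase:tabular {
  bd:serviceParam wikibase:url <https://example.org/observations.csv> .
  # the object string carries both the type and optional parsing parameters
  ?start tabular:startDate 'date:yyyy-mm-dd' .  # date, with an explicit format
  ?place tabular:station   'string' .           # raw string value
  ?temp  tabular:avgTemp   'double' .           # numeric parsing
}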

  • MultiSearchIterator and binding params

Correct, most usages would be static, but in theory it might be possible to supply the URL or other parsing params depending on some dynamic calculation and other data, right? It's not a must-have requirement, but if there are no performance or other major disadvantages, I think it's better to support both? (A sketch of what that would enable follows after this list.)

  • I am not sure what would be the best way to split WDQS and OSM code, yet package it together. Any suggestions?
  • Whitelist - agreed; if a whitelist exists, I think it should be used.
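To make the variable-parameter question concrete, here is a sketch of what a dynamically computed URL binding could enable, assuming (this is not implemented as written) that the service accepted a bound variable for wikibase:url:

SELECT ?ticker ?date ?close WHERE {
  VALUES ?ticker { "TSLA" "AAPL" }
  # compute the source URL from another binding
  BIND(URI(CONCAT("https://www.quandl.com/api/v3/datasets/WIKI/", ?ticker, ".csv")) AS ?src)
  SERVICE wikibase:tabular {
    bd:serviceParam wikibase:url ?src .
    ?date  tabular:Date  'date:yyyy-mm-dd' .
    ?close tabular:Close 'double' .
  }
}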

It's not a must-have requirement, but if there are no performance or other major disadvantages,

Complexity is one. I'm a big fan of YAGNI :) I'm not 100% against supporting it, but doubtful whether it's actually needed.

I am not sure what would be the best way to split WDQS and OSM code

Since context controls the link between service URI and class, we could have two classes, one implementing the base API and one extending it, and have the WDQS part contain only the base class while the OSM part has the extended class registered as the URI implementation instead of the base one. Not sure how easy this is to do, but it looks possible (without looking deeply into the code).

Smalyshev changed the task status from Open to Stalled. Dec 13 2018, 6:21 PM
Smalyshev triaged this task as Medium priority.

Just a bit of info on why I've nominated this for the 2020 hackathon:
There is currently a major issue with storing statistical data in Wikidata, which would be solved if we could upload the data to Commons as Tabular Data files.
It's proving unsustainable to use qualifiers to break down statistical data by date, let alone by any further dimension (e.g. gender or age group). Items which need to store such data (e.g. countries) end up far too cluttered with statements, so we simply have to choose not to keep historical data or any further breakdown.
We can currently store the tabular data on Commons, but without being able to query it alongside data in Wikidata it has a tiny fraction of its potential value. Using UNESCO and World Bank data, for example, we would be able to generate charts like "Percentage of primary-age girls out of school in Algeria since 1970". Obviously that is just one example in one domain; the possibilities are truly endless!
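To make that concrete, the query this task would enable could look roughly like the sketch below. The Data: page name, the column names, and the tabular service syntax are all hypothetical; only wd:Q262 (Algeria) and wdt:P298 (ISO 3166-1 alpha-3 code) are real Wikidata identifiers:

SELECT ?year ?value WHERE {
  SERVICE wikibase:tabular {
    # hypothetical Commons dataset page
    bd:serviceParam wikibase:url <https://commons.wikimedia.org/wiki/Data:UNESCO/OutOfSchoolGirlsPrimary.tab> .
    ?country tabular:countryCode 'string' .
    ?year    tabular:year        'integer' .
    ?value   tabular:percentage  'double' .
  }
  # join against Wikidata on the country code
  wd:Q262 wdt:P298 ?country .
  FILTER(?year >= 1970)
}
ORDER BY ?year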

@NavinoEvans I agree - feel free to take my implementation (which was already working for any CSV-style input) and extend/adapt it. Ideally, it should be merged upstream into Blazegraph, so it should support any kind of CSV. It may make sense either to have some sort of wrapper for the tabular datasets as an extension to Blazegraph, or to extend jsonconfig's API to be able to serve CSV directly (which might be the better solution, as it would allow other, non-Blazegraph usages).

Aklapper changed the task status from Stalled to Open. Nov 8 2020, 2:10 PM

The previous comments don't explain who or what (task?) exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status, as tasks should not be stalled (and then potentially forgotten) for years for unclear reasons.

(Smallprint, as general orientation for task management:
If you wanted to express that nobody is currently working on this task, then the assignee should be removed and/or priority could be lowered instead.
If work on this task is blocked by another task, then that other task should be added via Edit Related Tasks... → Edit Subtasks.
If this task is stalled on an upstream project, then the Upstream tag should be added.
If this task requires info from the task reporter, then there should be instructions which info is needed.
If this task needs retesting, then the TestMe tag should be added.
If this task is out of scope and nobody should ever work on this, or nobody else managed to reproduce the situation described here, then it should have the "Declined" status.
If the task is valid but should not appear on some team's workboard, then the team project tag should be removed while the task has another active project tag.)


I don’t think we should allow integration with datasets in arbitrary URLs

Since it is probably safer to have another service (one that individuals can deploy for their own needs) access the tabular data and then federate with WDQS, here is an example of just that:

https://github.com/justin2004/weblog/tree/master/blend_google_sheet_with_wikidata
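A minimal sketch of that federation pattern follows - the endpoint URL and the ex: vocabulary are invented for illustration, and on WDQS the endpoint would additionally have to be on the federation whitelist:

PREFIX ex: <https://my-tabular-bridge.example/ns#>
SELECT ?item ?itemLabel ?value WHERE {
  # external service that exposes the spreadsheet/CSV rows as triples
  SERVICE <https://my-tabular-bridge.example/sparql> {
    ?row ex:countryCode ?code ;
         ex:value       ?value .
  }
  # join back to Wikidata via the ISO 3166-1 alpha-3 code (P298)
  ?item wdt:P298 ?code .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}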

More and more people are working on data visualizations, and lots of data scientists would benefit from this. Currently, we can upload a relatively large .tab data file to Commons, and we can query for the items that produce this data, but we can't merge the two.

This project by Wikipedia Canada is a good example: https://meta.wikimedia.org/wiki/Projet_ECCC,_100_ans_de_donn%C3%A9es_m%C3%A9t%C3%A9orologiques_en_acc%C3%A8s_libre/en. They have uploaded all the weather data from all the meteorological stations, and we can already search for things like weather stations that are open and less than 10 km from an airport, but we can't add the data from the .tab files to such a query (a sketch of what that combined query could look like follows below).

Global warming, demographic changes and pandemic-related issues would all be better served with this feature.
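As a hedged sketch of the ECCC example: the Wikidata part below uses real vocabulary (wd:Q1248784 = airport, wdt:P31 = instance of, wdt:P625 = coordinate location, geof:distance), while the weather-station class QID, the Data: page name, and everything inside the SERVICE block are placeholders:

SELECT ?station ?date ?temp WHERE {
  # weather stations within 10 km of an airport (pure Wikidata part)
  ?airport wdt:P31 wd:Q1248784 ;
           wdt:P625 ?airportLoc .
  ?station wdt:P31 wd:Q190107 ;      # weather station (class QID indicative only)
           wdt:P625 ?stationLoc .
  FILTER(geof:distance(?stationLoc, ?airportLoc) < 10)  # km
  # hypothetical: pull one station's measurements from its Commons .tab page
  SERVICE wikibase:tabular {
    bd:serviceParam wikibase:url <https://commons.wikimedia.org/wiki/Data:ECCC/SomeStation.tab> .
    ?date tabular:date     'date:yyyy-mm-dd' .
    ?temp tabular:meanTemp 'double' .
  }
}

Note that wiring each station item to its own .tab page would need exactly the kind of variable URL parameter discussed earlier in this task.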

What is now blocking the previously cited solutions (@Justin0x2004 has one) from going into WDQS? Thanks!

Note from Wikidata Data Re-use Days: @Mike_Peel mentioned that some Items are huge (e.g. 4.3MB for Q87483673). This is problematic! @Mahir256 noted that this task might be a solution.

Note from Wikidata Data Re-use Days: @Mike_Peel mentioned that some Items are huge (e.g. 4.3MB for Q87483673). This is problematic! @Mahir256 noted that this task might be a solution.

Q87483673 is a great example of COVID-19 case data that currently isn’t a good fit for Wikidata statements. By comparison, these nine tables have been updating daily since the start of the pandemic. It’s far easier to update these tables than the corresponding Wikidata items. Wikidata prefers keeping outdated statements, but backdated numbers routinely get revised (months into the past), so handling this kind of data properly requires more than adding a statement every day. Unfortunately, the lack of querying functionality has limited the usefulness of these tables – basically, they’re only used for visualizations on Wikipedia that focus on a single table at a time.

The wikibase:tabular service mentioned in https://phabricator.wikimedia.org/T181319#3860648 sounds like it would be a great starting point. It was live on Sophox for a while but got disabled in 2018 because of divergence from the main codebase. Having access to both CSV and tabular-data JSON in SPARQL queries would be exciting.

Just to comment that the example came from https://www.wikidata.org/wiki/Special:LongPages - you can see the first page of these is either COVID or journal-article items. It may be a case of changing the approach of an editor or two on-wiki to avoid these situations arising, more than requiring software changes, though.

Just to comment that the example came from https://www.wikidata.org/wiki/Special:LongPages - you can see the first page of these is either COVID or journal-article items. It may be a case of changing the approach of an editor or two on-wiki to avoid these situations arising, more than requiring software changes, though.

That feels intuitive, but it's not a single-digit number of editors that want to add large amounts of tabular-style data to items. This is a recurring issue raised by many editors in the community, all trying to solve some real use case (sometimes only to be able to query the data, sometimes to be able to display it in some fashion on Wikipedia).

For COVID items, a workaround is splitting them into yearly items.