
Create KNIME nodes to interact with Wikidata
Open, Lowest, Public

Description

KNIME is a free but top-tier data analytics, reporting and integration platform that "allows assembly of nodes blending different data sources, including preprocessing (ETL: Extraction, Transformation, Loading), for modeling, data analysis and visualization without, or with only minimal, programming".

It would be great to have some read/write KNIME nodes to interact with Wikidata so that average users can use/analyze Wikidata and populate/edit it using visual ETL processes.

Instructions for developers are available here.

Event Timeline

abian created this task. Jun 1 2018, 8:30 PM
Restricted Application added a subscriber: Aklapper. Jun 1 2018, 8:30 PM

Thanks for filing the ticket. I have a hard time understanding this still. What are people doing with KNIME? How would they use Wikidata in it?

abian added a comment. Jun 9 2018, 4:28 PM

Sorry for the lack of clarity. :-/

With KNIME you can design ETL processes that retrieve data from one or more sources (social media, SQL databases, CSV files, XML files, JSON objects, etc.), transform the data (using traditional SQL operators, arithmetical operators, regexes, sortings, filters, type conversions, permutations, custom parsers, AI, statistics, graphs, etc.) and save the resulting data (to databases, text files, image files, etc.). These workflows are created through the GUI by just dragging and dropping nodes from the list of available KNIME nodes, interconnecting them and configuring each node's individual behavior (how it transforms its inputs).

Once you've prepared a workflow, you can execute it by clicking the run button (or pressing Shift+F7) whenever you want. This means that, unless the data sources drastically change their schemas, you'll always be able to update your results by simply executing the workflow again. Two main uses can be expected from having KNIME nodes for Wikidata:

  • using or analyzing Wikidata and storing the results (locally or in Wikidata) and
  • keeping Wikidata data updated from external sources by executing a workflow that you, or another person, only have to design once.

There's nothing KNIME can do that you couldn't also achieve with a general-purpose programming language and enough free time. But KNIME is data-oriented and much easier to use than any programming language (and my favorite ETL software). :-)

I leave some snapshots of example workflows below. All of them come with KNIME.

Thanks for the clarification!

@abian @Lydia_Pintscher KNIME integrates nicely with R and Java. Given that we already use R extensively to analyse Wikidata, it could be possible to build a set of R-developed Wikidata nodes for KNIME, I guess. If it is ETL only that you need, orchestrating SPARQL and API calls from within R and then integrating with KNIME seems feasible (note: it only seems feasible at this point).
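To make the "orchestrating SPARQL and API calls" idea concrete, here is a minimal sketch of one of the two call types such a node would have to wrap: a MediaWiki Action API request (wbgetentities, fetching an English label). It is written in Java (the language KNIME nodes are natively developed in) rather than R; the class name and User-Agent string are illustrative, while the API module and its parameters are documented Wikidata features.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Sketch: fetch an item's English label via the Wikidata Action API. */
public class WikidataApiCall {
    public static void main(String[] args) throws Exception {
        // wbgetentities with these parameters is a documented API module;
        // Q42 (Douglas Adams) is a common example item.
        String url = "https://www.wikidata.org/w/api.php"
                + "?action=wbgetentities&ids=Q42&props=labels&languages=en&format=json";
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "knime-wikidata-sketch/0.1 (example)")
                .GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The label sits at entities.Q42.labels.en.value in the JSON response.
        System.out.println(response.body());
    }
}
```

A SPARQL read would follow the same pattern against https://query.wikidata.org/sparql; whichever host language ends up wrapping these calls (R or Java), the node itself only has to map the response onto a KNIME table.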

However, and given that our resources are pretty constrained:

  • we need to have a clear motivation to do this; how many people would use Wikidata nodes in KNIME; how would the community benefit from this;
  • we need to understand exactly and precisely what functionality we would like to have integrated via KNIME nodes to work with Wikidata.

@abian If you can provide further clarification, specify the objectives, and explain how this would benefit the community, we might consider pushing this further into the workflow. Thank you!

abian added a comment (edited). Jun 11 2018, 6:02 PM

Sadly, I can't guess how many people would end up using these features; that would mainly depend on how much we promoted KNIME and the Wikidata nodes after their possible development. I'm sure I would use these nodes very often if they were available :-) but I'm a regular user of KNIME, and that's not the general rule.

On Wikidata it often happens that people dump data once and don't worry about keeping those data updated. KNIME workflows could help improve this situation, but only if editors are willing to install and use KNIME for that purpose and to share their workflows under a free license to ensure the continuity of their tasks. The ideal situation would be a complete, ad hoc web application for designing and sharing Wikidata workflows, but that would be too resource-consuming, so I see KNIME as the nicest alternative.

About the R integration, I wonder if users would have to install R separately as a requirement to use the Wikidata nodes. If so, this could make the installation tedious and could reduce the impact of this development.

As a first step, I think it would be positive to try contacting KNIME directly and telling them how cool Wikidata is and how useful it would be to integrate KNIME with Wikidata in case they can take care of the development (see also https://www.knime.com/nodeguide/visualization/geolocation/visualization-of-the-world-cities-using-open-street-map-osm). Some contacts that I've just found are Rosaria Silipo (Twitter DMR_Rosaria, email address rosaria dot silipo at knime dot com) and general addresses info at knime dot com and contact at knime dot com.

  • we need to understand exactly and precisely what functionality we would like to have integrated via KNIME nodes to work with Wikidata.

My proposal, feel free to change it if convenient:

  • To read:
    • Submit an arbitrary SPARQL query and return the results as a KNIME table (see the sketch after this list).
    • Given a KNIME table with a column of Wikidata IDs as an input, add one or more new columns with the requested values: the label or the description in a certain language or the truthy values for the requested properties.
  • To write:
    • Given a KNIME table as an input, add the statements in it to Wikidata. Each row is a different statement, and each column is a different part of the statement (the item ID, a property/predicate and a value should be mandatory; an arbitrary number of property–value pairs acting as qualifiers or references should be optional). This format could be very similar to QuickStatements'. If the edits would produce one or more violations of mandatory constraints, KNIME should throw an error so that the node can't be executed until there are no such violations; ideally, the remaining violations should be reported as KNIME warnings.
    • Extra: The same, but for Lexemes and Forms.
  • To log in: Being logged in should be a requirement for editing Wikidata from KNIME. Credentials shouldn't be permanently stored in the workflow or exported.
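To make the first read node concrete, below is a minimal sketch of how the "arbitrary SPARQL query to KNIME table" node could look against the KNIME SDK. This is an assumption-laden illustration, not a reviewed implementation: the class and node names are hypothetical, the CSV handling is naive (it ignores quoted commas inside values), and it assumes Java 11+ with the org.knime.core classes on the classpath. A write node would follow the same NodeModel pattern but post to the wbeditentity API module instead.

```java
import java.io.File;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

import org.knime.core.data.DataCell;
import org.knime.core.data.DataColumnSpec;
import org.knime.core.data.DataColumnSpecCreator;
import org.knime.core.data.DataTableSpec;
import org.knime.core.data.RowKey;
import org.knime.core.data.def.DefaultRow;
import org.knime.core.data.def.StringCell;
import org.knime.core.node.BufferedDataContainer;
import org.knime.core.node.BufferedDataTable;
import org.knime.core.node.ExecutionContext;
import org.knime.core.node.ExecutionMonitor;
import org.knime.core.node.InvalidSettingsException;
import org.knime.core.node.NodeModel;
import org.knime.core.node.NodeSettingsRO;
import org.knime.core.node.NodeSettingsWO;
import org.knime.core.node.defaultnodesettings.SettingsModelString;

/** Hypothetical "Wikidata SPARQL Reader" node: no input port, one output table. */
public class WikidataSparqlReaderNodeModel extends NodeModel {

    private final SettingsModelString m_query = new SettingsModelString(
            "sparqlQuery", "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 10");

    protected WikidataSparqlReaderNodeModel() {
        super(0, 1); // no input table, one output table
    }

    @Override
    protected BufferedDataTable[] execute(final BufferedDataTable[] inData,
            final ExecutionContext exec) throws Exception {
        // Ask WDQS for CSV output so the sketch needs no JSON library.
        String form = "query="
                + URLEncoder.encode(m_query.getStringValue(), StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest
                .newBuilder(URI.create("https://query.wikidata.org/sparql"))
                .header("Accept", "text/csv")
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("User-Agent", "knime-wikidata-sketch/0.1 (example)")
                .POST(HttpRequest.BodyPublishers.ofString(form)).build();
        String csv = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString()).body();

        // The first CSV line holds the SPARQL variable names; they become columns.
        String[] lines = csv.split("\\r?\\n");
        String[] header = lines[0].split(",");
        DataColumnSpec[] colSpecs = new DataColumnSpec[header.length];
        for (int c = 0; c < header.length; c++) {
            colSpecs[c] = new DataColumnSpecCreator(header[c], StringCell.TYPE).createSpec();
        }
        BufferedDataContainer container = exec.createDataContainer(new DataTableSpec(colSpecs));
        for (int r = 1; r < lines.length; r++) {
            String[] fields = lines[r].split(",", -1); // naive: ignores quoted commas
            DataCell[] cells = new DataCell[header.length];
            for (int c = 0; c < header.length; c++) {
                cells[c] = new StringCell(c < fields.length ? fields[c] : "");
            }
            container.addRowToTable(new DefaultRow(RowKey.createRowKey((long) (r - 1)), cells));
            exec.checkCanceled(); // let the user abort long-running queries
        }
        container.close();
        return new BufferedDataTable[]{container.getTable()};
    }

    @Override
    protected DataTableSpec[] configure(final DataTableSpec[] inSpecs)
            throws InvalidSettingsException {
        return new DataTableSpec[]{null}; // columns are only known once the query runs
    }

    // Remaining NodeModel boilerplate: persist the query setting, no internals.
    @Override protected void saveSettingsTo(final NodeSettingsWO s) { m_query.saveSettingsTo(s); }
    @Override protected void validateSettings(final NodeSettingsRO s)
            throws InvalidSettingsException { m_query.validateSettings(s); }
    @Override protected void loadValidatedSettingsFrom(final NodeSettingsRO s)
            throws InvalidSettingsException { m_query.loadSettingsFrom(s); }
    @Override protected void reset() { }
    @Override protected void loadInternals(final File d, final ExecutionMonitor m) { }
    @Override protected void saveInternals(final File d, final ExecutionMonitor m) { }
}
```

Requesting text/csv keeps the sketch dependency-free; a real node would parse the application/sparql-results+json format instead and map datatypes (IRIs, dates, quantities) onto proper KNIME column types rather than plain strings.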
Vvjjkkii renamed this task from Create KNIME nodes to interact with Wikidata to itbaaaaaaa. Jul 1 2018, 1:06 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from itbaaaaaaa to Create KNIME nodes to interact with Wikidata. Jul 2 2018, 1:30 AM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.
abian triaged this task as Lowest priority. Sep 17 2019, 12:31 PM

Hopefully these needs will soon be met by https://etl.linkedpipes.com/; OpenRefine has also added some features and improved its integration with Wikidata.