Export a dataset of licenses of Toolforge tools (Toolforge Licenses Catalogue)
Open, Needs Triage, Public

Description

It would be nice to easily answer these questions:

  • which tools are under the license "WTFPL"
  • which tools are under CC Zero
  • etc.

So maybe we can create a tool to answer these questions :)

Context / Why

A dataset of the licenses adopted by Toolforge tools would be useful for several reasons, including external research and internal research to evaluate pathways for the parent task T152581: Expand the Toolforge definition of "free license" to include FSF-approved and DFSG-compatible licenses (created in December 2016). In particular, it would show whether more tools use a «technically incompatible» license, like these licenses: https://phabricator.wikimedia.org/T152581#2859128

Again, some licenses are widely considered Open Source / Free Software licenses, yet are «technically incompatible» with the Wikimedia Cloud ToS (section 4.3). So we are just trying to understand the situation.

The goal is to have data, understand "potential ToS problems", and evaluate next steps. For example, if "many" tools have a «technically incompatible» license, we may want to eventually amend the WMCS ToS to also include FSF-approved licenses (why not, anyway), instead of contacting every individual tool to ask for a dual license (which may be very complicated if the tool is not a one-person band).

Not just a one-shot export

This is probably not a one-shot dataset request. There is room to create a tool that generates such an export of tools and their licenses.

So this could be a small tool to access or generate the dataset, so that a year from now somebody can easily get fresh data.

That said, a one-shot export would also be something to start from.

Proposed Dataset: CSV

A simple CSV (comma-separated values) generator with at least these columns would make my day:

  • tool identifier name (mandatory)
  • tool name (for humans) (mandatory)
  • tool repository URL (mandatory)
  • license name (for humans) (mandatory)
  • license identifier (SPDX) (optional)
  • last metadata update date (optional)

A tool may span multiple rows if we know it has multiple licenses, dual licensing, etc.

The CSV format is good because it is easy to use even if you don't know how to develop software. For example, a CSV can be imported into LibreOffice with a few clicks to aggregate data and make decisions, which would be more difficult with JSON, YAML, etc.
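As a rough sketch of the proposed layout, the Python below writes one CSV row per (tool, license) pair, so a dual-licensed tool appears twice. The column names, the shape of the input dicts (`id`, `name`, `repository`, `licenses`, `updated`) and the helper names are all hypothetical, chosen only for illustration; the task itself only lists the fields in prose.

```python
import csv
import io

# Illustrative column names for the six fields listed in the task.
COLUMNS = [
    "tool_id",
    "tool_name",
    "repository_url",
    "license_name",
    "license_spdx",
    "last_metadata_update",
]


def rows_for_tool(tool):
    """Yield one row per license of a tool (hypothetical input layout)."""
    # A tool with no known license still gets one row with empty license cells,
    # so it stays visible in the export.
    licenses = tool.get("licenses") or [{"name": "", "spdx": ""}]
    for lic in licenses:
        yield {
            "tool_id": tool["id"],
            "tool_name": tool["name"],
            "repository_url": tool["repository"],
            "license_name": lic.get("name", ""),
            "license_spdx": lic.get("spdx", ""),
            "last_metadata_update": tool.get("updated", ""),
        }


def write_catalogue(tools, out):
    """Write the header and all rows to an open text file object."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for tool in tools:
        writer.writerows(rows_for_tool(tool))


def catalogue_as_text(tools):
    """Return the whole CSV as a string (handy for a download endpoint)."""
    buf = io.StringIO(newline="")
    write_catalogue(tools, buf)
    return buf.getvalue()
```

Keeping tools without a known license in the export is a design choice of this sketch, not something the task requires; dropping them would hide exactly the records that may need follow-up.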

We can then add other columns in the future.

Proposed Tool: "Toolforge Licenses Catalogue"

I propose to have a tool with an understandable name.

Minimal viable product:

  • Access to the dataset (download the latest version as CSV)

Good to have:

  • HTML table showing the same results
  • minimal filters (at least, filter by license)
  • ability to share your filters with others (so the filter is in the query string, e.g. ?licenses=WTFPL,CC0-1.0)

This seems good for a hackathon :)
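The shareable filter could be handled roughly as follows: read the comma-separated `licenses` parameter from the URL and keep only matching rows. This is a sketch under assumptions: rows are taken to be dicts with a hypothetical `license_spdx` key, the function names are made up, and comparing identifiers case-insensitively is a convenience chosen here.

```python
from urllib.parse import parse_qs, urlsplit


def licenses_from_url(url):
    """Return the set of license identifiers in ?licenses=A,B (may be empty)."""
    query = parse_qs(urlsplit(url).query)
    wanted = set()
    # parse_qs returns a list per parameter, so repeated parameters also work.
    for value in query.get("licenses", []):
        wanted.update(part.strip() for part in value.split(",") if part.strip())
    return wanted


def filter_rows(rows, wanted):
    """Keep rows whose license identifier is in `wanted`; no filter keeps all."""
    if not wanted:
        return list(rows)
    wanted_lower = {w.lower() for w in wanted}
    return [r for r in rows if r.get("license_spdx", "").lower() in wanted_lower]
```

For example, `licenses_from_url("https://example.org/?licenses=WTFPL,CC0-1.0")` yields the two identifiers from the query string, and the same URL can be pasted to someone else to reproduce the filtered view.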

Code Implementation

Some ideas:

The Toolhub search API is probably our best source for this kind of information. It is not going to be anywhere near perfect, but the toolinfo records from Striker end up there along with more self-reported and community sourced data. https://toolhub.wikimedia.org/api-docs#get-/api/search/tools/

So this could be used to fetch the data "once", publish the complete dataset, and refresh it later.
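A minimal sketch of the "fetch once, then refresh" idea, using only the Python standard library. The endpoint path comes from the API docs link above; everything else is an assumption to verify against those docs: that responses are paginated with `results` and `next` keys, and that tool records expose `name`, `title`, `repository`, `license` and `modified_date`. The function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.request

SEARCH_URL = "https://toolhub.wikimedia.org/api/search/tools/"


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    req = urllib.request.Request(
        url,
        # Hypothetical User-Agent; replace with one identifying your tool.
        headers={"User-Agent": "toolforge-licenses-catalogue-sketch/0.1"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def iter_tools(start_url=SEARCH_URL, fetch=fetch_json):
    """Yield tool records page by page, following `next` links (assumed)."""
    url = start_url
    while url:
        page = fetch(url)
        yield from page.get("results", [])
        url = page.get("next")


def to_row(record):
    """Map one record to the proposed columns (field names are assumptions)."""
    return {
        "tool_id": record.get("name") or "",
        "tool_name": record.get("title") or "",
        "repository_url": record.get("repository") or "",
        "license_spdx": record.get("license") or "",
        "last_metadata_update": record.get("modified_date") or "",
    }
```

Passing the fetcher as a parameter keeps the pagination logic testable offline; a scheduled job could call `iter_tools()` with the default fetcher and overwrite the published CSV on each run.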

Event Timeline

Also refer to https://www.wikidata.org/wiki/Wikidata:List_of_Wikimedia_tools_with_Wikidata_item
Licences, etc., are all structured data that can be easily integrated into the tools' Wikidata items.

Thanks - and I'm now curious about:

  • how many tools are not in Wikidata
  • and if Toolforge tools are in scope anyway 🤔 - maybe a question for the Wikidata bar

I mean, blindly reading this, a Toolforge tool is not automatically in scope (?)

https://www.wikidata.org/wiki/Wikidata:Notability

If you ask that question at the Wikidata bar, thanks for linking it, since I'm curious whether a mass import would be appreciated. Just to open another task :P lol

  • and if Toolforge tools are in scope anyway 🤔 - maybe a question for the Wikidata bar

Harej asked Wikidata about making it the backend for Toolhub and was told the data was not a good fit. That's what led to my custom Django backend. ¯\_(ツ)_/¯