It would be nice to easily answer these questions:
- which tools are under the "WTFPL" license
- which tools are under CC Zero (CC0)
- etc.
So maybe we can create a tool to answer these questions :)
Context / Why
Having a dataset of the licenses adopted by Toolforge tools would be useful for multiple reasons, including external research and internal research to evaluate pathways for the parent task T152581: Expand the Toolforge definition of "free license" to include FSF-approved and DFSG-compatible licenses (created in December 2016). In particular, it would help discover whether more tools use a «technically incompatible» license, like these licenses: https://phabricator.wikimedia.org/T152581#2859128
Again, some licenses are widely considered Open Source / Free Software licenses, yet are «technically incompatible» with the Wikimedia Cloud ToS (section 4.3). So we are just trying to understand the situation.
The goal is to have data, to understand "potential ToS problems", and to evaluate next steps. For example, if we have "many" tools with a «technically incompatible» license, we may want to consider updating the WMCS ToS to also include FSF-approved licenses, instead of contacting the maintainers of every individual tool to ask them to consider a dual license (which may be very complicated if the tool has more than one author).
Not just a one-shot export
This is probably not a one-shot dataset request. There is room to create a tool that generates such an export of tools and their licenses.
So, this may be a small tool to access/generate the dataset, so that a year from now somebody can easily get fresh data in a simple way.
That said, a one-shot export would also be a good starting point.
Proposed Dataset: CSV
A simple CSV (comma-separated values) generator with at least these columns would make our day:
- tool identifier name (mandatory)
- tool name (for humans) (mandatory)
- tool repository URL (mandatory)
- license name (for humans) (mandatory)
- license identifier (SPDX) (optional)
- last metadata update date (optional)
A tool may have multiple rows if we know it has multiple licenses / dual licenses, etc.
The CSV format is good because it is easy to use even if you don't know how to develop software. For example, a CSV can be imported into LibreOffice with a few clicks, to aggregate data and make decisions, which would be more difficult with JSON or YAML etc.
We can then add other columns in the future.
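As a rough illustration of the proposed layout, here is a minimal Python sketch that writes one CSV row per tool/license pair. The column names (`tool_id`, `license_spdx`, etc.), the shape of the input dictionaries, and the sample tools and URLs are all invented for this example; they are not an existing Toolforge schema.

```python
import csv
import io

# Hypothetical column names mirroring the list above (not an existing schema).
COLUMNS = [
    "tool_id",           # tool identifier name (mandatory)
    "tool_name",         # tool name for humans (mandatory)
    "repository_url",    # tool repository URL (mandatory)
    "license_name",      # license name for humans (mandatory)
    "license_spdx",      # SPDX license identifier (optional)
    "metadata_updated",  # last metadata update date (optional)
]


def expand_rows(tools):
    """Yield one row per (tool, license) pair, so a dual-licensed tool gets two rows."""
    for tool in tools:
        for lic in tool["licenses"]:
            yield {
                "tool_id": tool["id"],
                "tool_name": tool["name"],
                "repository_url": tool["repository_url"],
                "license_name": lic["name"],
                "license_spdx": lic.get("spdx", ""),
                "metadata_updated": tool.get("metadata_updated", ""),
            }


def to_csv(tools):
    """Return the whole dataset as CSV text, header included."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(expand_rows(tools))
    return buf.getvalue()


# Invented sample data, only to show the output shape.
SAMPLE_TOOLS = [
    {
        "id": "example-tool",
        "name": "Example Tool",
        "repository_url": "https://example.org/example-tool.git",
        "metadata_updated": "2024-01-01",
        "licenses": [
            {"name": "MIT License", "spdx": "MIT"},
            {"name": "Creative Commons Zero v1.0 Universal", "spdx": "CC0-1.0"},
        ],
    },
    {
        "id": "other-tool",
        "name": "Other Tool",
        "repository_url": "https://example.org/other-tool.git",
        "licenses": [{"name": "Do What The F*ck You Want To Public License"}],
    },
]

if __name__ == "__main__":
    print(to_csv(SAMPLE_TOOLS), end="")
```

Optional columns are simply left empty when the value is unknown, so the file stays rectangular and imports cleanly into a spreadsheet.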
Proposed Tool: "Toolforge Licenses Catalogue"
I propose to have a tool with an understandable name.
Minimal viable product:
- Access to the dataset (download the latest version as CSV)
Good to have:
- HTML table to see the same results
- minimal filters (at least, filter by some licenses)
- ability to share your filters with others (so the filter is in the query string, e.g. ?licenses=WTFPL,CC0-1.0)
This seems good for a hackathon :)
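The shareable filter idea could be sketched roughly like this in Python, using only the standard library. The function names, the `license_spdx` row key, and the example.org URL are assumptions for illustration; only the `?licenses=WTFPL,CC0-1.0` query string comes from the proposal above.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def parse_license_filter(url):
    """Extract the set of license identifiers from a ?licenses=A,B query string."""
    query = parse_qs(urlparse(url).query)
    wanted = set()
    for value in query.get("licenses", []):
        wanted.update(part.strip() for part in value.split(",") if part.strip())
    return wanted


def filter_rows(rows, wanted):
    """Keep rows whose license identifier is in `wanted`; no filter means keep all."""
    if not wanted:
        return list(rows)
    wanted_lower = {w.lower() for w in wanted}
    return [r for r in rows if r["license_spdx"].lower() in wanted_lower]


def share_url(base_url, wanted):
    """Build a URL that reproduces the current filter when opened by someone else."""
    # safe="," keeps the comma readable instead of encoding it as %2C.
    return base_url + "?" + urlencode({"licenses": ",".join(sorted(wanted))}, safe=",")


if __name__ == "__main__":
    # Invented rows, using the hypothetical `license_spdx` key.
    rows = [
        {"tool_id": "a", "license_spdx": "WTFPL"},
        {"tool_id": "b", "license_spdx": "MIT"},
        {"tool_id": "c", "license_spdx": "CC0-1.0"},
    ]
    wanted = parse_license_filter("https://example.org/?licenses=WTFPL,CC0-1.0")
    print(filter_rows(rows, wanted))
    print(share_url("https://example.org/", wanted))
```

Matching is case-insensitive here so that a hand-typed `?licenses=wtfpl` still works; that is a design choice of this sketch, not a requirement of the proposal.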
Code Implementation
Some ideas:
So this could be used to fetch the data "once", publish the complete dataset, and refresh the data later, etc.
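The "fetch once, store, refresh later" idea might look something like the following Python sketch. It deliberately does not assume any particular data source: `fetch_rows` is a hypothetical callable supplied by the caller that returns the rows, and the file path, column list, and maximum age are likewise placeholders.

```python
import csv
import os
import time


def refresh_if_stale(path, fetch_rows, columns, max_age_seconds):
    """Rewrite the CSV at `path` only if it is missing or older than `max_age_seconds`.

    `fetch_rows` is a hypothetical callable returning an iterable of dicts
    keyed by `columns`; where the data actually comes from is left open.
    Returns True if the file was regenerated, False if the stored copy was kept.
    """
    if os.path.exists(path):
        age = time.time() - os.path.getmtime(path)
        if age < max_age_seconds:
            return False

    # Write to a temporary file first, then swap it in, so that readers
    # never see a half-written dataset.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(fetch_rows())
    os.replace(tmp_path, path)
    return True
```

A periodic job could call this with, say, `max_age_seconds=24 * 3600`, so the published file is regenerated at most once a day while downloads always hit the stored copy.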