Toolhub has duplicate identifiers for GPL family of licenses
Closed, Resolved | Public | Bug Report

Description

Context

Screenshot 2021-11-21 at 17-22-30 Add or remove tools Toolhub.png (1×3 px, 258 KB)

Toolhub has both the deprecated GPL-3.0 and GPL-3.0+ identifiers plus the new, preferred identifiers of GPL-3.0-only and GPL-3.0-or-later. Preferably it would accept both as valid identifiers, but canonicalize/map the old ones to the new ones.

User Story

As a Toolhub user adding or editing a tool entry,
I want to view only active licenses
so that I don't select deprecated licenses

Acceptance Criteria
  • Deprecated licenses should not appear in the "License" drop-down list

Event Timeline

Toolhub uses https://pypi.org/project/spdx-license-list/ to validate SPDX identifiers. That data collection does have an isDeprecatedLicenseId attribute for each identifier to describe whether it is considered deprecated. It does not seem to have a pointer to a replacement identifier, however:

>>> spdx_license_list.LICENSES["GPL-3.0+"]
{'isDeprecatedLicenseId': True, 'isFsfLibre': True, 'isOsiApproved': True, 'licenseId': 'GPL-3.0+', 'name': 'GNU General Public License v3.0 or later', 'referenceNumber': 149}

From looking at the upstream https://spdx.org/licenses/ page I'm guessing the mapping may be possible based on the name attribute of the data for each license:

>>> pprint.pprint(list(filter(lambda x: x["name"] == "GNU General Public License v3.0 or later", spdx_license_list.LICENSES.values())))
[{'isDeprecatedLicenseId': True,
  'isFsfLibre': True,
  'isOsiApproved': True,
  'licenseId': 'GPL-3.0+',
  'name': 'GNU General Public License v3.0 or later',
  'referenceNumber': 149},
 {'isDeprecatedLicenseId': False,
  'isFsfLibre': True,
  'isOsiApproved': True,
  'licenseId': 'GPL-3.0-or-later',
  'name': 'GNU General Public License v3.0 or later',
  'referenceNumber': 415}]
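The name-matching guess above could be sketched roughly like this. It is only an illustration: build_replacement_map is a hypothetical helper, not something Toolhub or spdx-license-list provides, and SAMPLE is a trimmed copy of the two entries from the REPL output rather than the full data set. It assumes a deprecated identifier and its replacement share exactly the same name, and it skips any case where that match is missing or ambiguous.

```python
from collections import defaultdict

# Trimmed copies of the two entries shown in the REPL output above. The real
# input would be spdx_license_list.LICENSES (a mapping keyed by licenseId).
SAMPLE = {
    "GPL-3.0+": {
        "isDeprecatedLicenseId": True,
        "licenseId": "GPL-3.0+",
        "name": "GNU General Public License v3.0 or later",
    },
    "GPL-3.0-or-later": {
        "isDeprecatedLicenseId": False,
        "licenseId": "GPL-3.0-or-later",
        "name": "GNU General Public License v3.0 or later",
    },
}


def build_replacement_map(licenses):
    """Map deprecated ids to the single active id sharing the same name.

    Hypothetical helper: assumes name equality implies "replacement".
    """
    # Index the non-deprecated identifiers by their human-readable name.
    active_by_name = defaultdict(list)
    for info in licenses.values():
        if not info["isDeprecatedLicenseId"]:
            active_by_name[info["name"]].append(info["licenseId"])

    replacements = {}
    for info in licenses.values():
        if info["isDeprecatedLicenseId"]:
            candidates = active_by_name.get(info["name"], [])
            # Only map when the match is unambiguous.
            if len(candidates) == 1:
                replacements[info["licenseId"]] = candidates[0]
    return replacements


print(build_replacement_map(SAMPLE))
# prints {'GPL-3.0+': 'GPL-3.0-or-later'}
```

Whether name equality holds for every deprecated identifier in the upstream list is not verified here, so any mapping built this way would need a manual review.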

One thing that we can do pretty easily is filter the deprecated identifiers out of the select lists built by the UI. That should be as simple as adding ?deprecated=false to the call to our /api/spdx/ endpoint which is made from the getSpdxLicenses method of our vue/src/store/tools.js controller.
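For illustration only, here is a small Python sketch of the two pieces involved. The actual change would be in the Vue store, which is JavaScript and is not reproduced here. The /api/spdx/ path and the deprecated=false parameter come from the comment above; spdx_list_url and active_license_ids are hypothetical helper names, and the server-side behaviour shown is an assumption about what the filter amounts to, not Toolhub's real implementation.

```python
from urllib.parse import urlencode


def spdx_list_url(base="/api/spdx/", include_deprecated=False):
    """Build the request URL, adding ?deprecated=false unless told otherwise."""
    if include_deprecated:
        return base
    return f"{base}?{urlencode({'deprecated': 'false'})}"


def active_license_ids(licenses):
    """Return the sorted ids that are not flagged as deprecated.

    Assumed equivalent of what the deprecated=false filter does server-side.
    """
    return sorted(
        license_id
        for license_id, info in licenses.items()
        if not info.get("isDeprecatedLicenseId", False)
    )


# Minimal stand-in for spdx_license_list.LICENSES.
SAMPLE = {
    "GPL-3.0+": {"isDeprecatedLicenseId": True},
    "GPL-3.0-or-later": {"isDeprecatedLicenseId": False},
}

print(spdx_list_url())
print(active_license_ids(SAMPLE))
```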

@bd808 thinking about moving this to Groomed/Medium priority as it sounds like there's a path fwd. Thoughts?

There are paths forward, but not clear scoping. I'd be +1 on doing the exclusion of the deprecated options from the pick list.

I'm not sure about the value in attempting to canonicalize the stored values to replace deprecated identifiers with non-deprecated identifiers. Mostly I'm not sure where we would do that transformation to make it consistent over time. Doing it on submission/edit would not be overly difficult, but that would also not handle any future deprecations: if SPDX deprecates another identifier next week or in 2 years, an input transform would require edits to all records using the now-deprecated identifier. If we do it on the output side we could avoid the "what about the future" question, but we would also be wasting a lot of compute cycles scanning everything on the way out from the API with a low hit rate for actual changes.
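To make the input-side trade-off concrete, here is a rough sketch under stated assumptions. canonicalize and stale_records are hypothetical helpers, REPLACEMENTS is a hand-written one-entry map, and the records are invented stand-ins for stored tool entries with a "license" field. None of it reflects Toolhub's actual storage code.

```python
def canonicalize(license_id, replacements):
    """Input transform: swap a deprecated id for its replacement if one is known."""
    return replacements.get(license_id, license_id)


def stale_records(records, replacements):
    """Return stored records whose license now appears in the replacement map.

    These are the records an input-only transform would miss, because they
    were saved before the identifier was deprecated.
    """
    return [record for record in records if record["license"] in replacements]


# Hypothetical data for illustration only.
REPLACEMENTS = {"GPL-3.0+": "GPL-3.0-or-later"}
RECORDS = [
    {"name": "tool-a", "license": "GPL-3.0+"},
    {"name": "tool-b", "license": "MIT"},
]

print(canonicalize("GPL-3.0+", REPLACEMENTS))
print([record["name"] for record in stale_records(RECORDS, REPLACEMENTS)])
```

The second helper is the part that would have to be rerun against every record each time SPDX deprecates something, which is the maintenance cost described above.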

I think I understood most of this haha. Is there any way for us to see the history of how often licenses are deprecated? If it's every 2 years, I could see relying on a manual update being fine, but if it's more frequent, I'd consider whether there's another source we could listen to for these types of changes.

I'd be +1 on doing the exclusion of the deprecated options from the pick list.

+1 too

Agreed, I think skipping canonicalizing is fine.

bd808 triaged this task as Medium priority. (Dec 2 2021, 11:03 PM)
bd808 moved this task from Backlog to Groomed/Ready on the Toolhub board.
bd808 changed the subtype of this task from "Task" to "Bug Report".

@bd808 I'm wondering if we can have a solution that involves both the UI and the backend, in the form of a background task. We can go with the ?deprecated=false approach and, in addition, create a background task that runs every couple of months to check for newly deprecated identifiers. Because of the low hit rate, I think the most performant way to solve this is an asynchronous solution (in addition to making sure that users can't select a license identifier after it has been deprecated).

We also might be able to run checks when the crawler is fetching new records and updating old ones.

How about we keep things simple and just do the exclusion of deprecated identifiers from the input?

With our current production restrictions and issues related to cronjobs (T292861) a task that actually runs every month would still consume a Kubernetes Pod with at least 1 inner container which would never terminate. I can very much see the value in removing the deprecated values from the pick list we present to users in the UI, but I don't think that deprecated values cause any harm at all if found in the data.

In that case just the UI solution should work. Can I assign this to myself? Would love to work on it.

@bd808 If the task description looks good to you, we can kick this off to @Raymond_Ndibe

Change 753525 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[wikimedia/toolhub@main] ui: Remove deprecated licenses

https://gerrit.wikimedia.org/r/753525

Raymond_Ndibe moved this task from Radar to Review on the Toolhub board.

Change 753525 merged by jenkins-bot:

[wikimedia/toolhub@main] ui: Remove deprecated licenses

https://gerrit.wikimedia.org/r/753525

Change 770638 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-002555-production

https://gerrit.wikimedia.org/r/770638

Change 770638 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-002555-production

https://gerrit.wikimedia.org/r/770638