Figure out how to deal with duplicate toolinfo records
Open, Needs Triage, Public

Description

Duplicate toolinfo records are being noticed more in Toolhub than they were in Hay's Directory. A few examples are:

Duplicates can appear for various reasons. Some are caused by Toolforge publishing a toolinfo record for a tool that has also created and published a toolinfo.json file itself (both crawler managed). Others pair a crawled toolinfo.json record (likely from Toolforge or from a tool that scrapes an on-wiki listing) with a toolinfo record created directly in Toolhub. It is also entirely possible for both duplicates to have been created directly in Toolhub.

Having multiple records describing the same tool is not ideal. The duplicates are highly unlikely to have identical content, so it will be difficult for users to determine which record is more correct.

Event Timeline

It would be nice if a tool author could somehow take ownership of the tool and be able to clean up and edit its details.

I think the procedure could look something like this:

  1. Sign in to Toolhub.
  2. Go to https://toolhub.wikimedia.org/tools/dna.
  3. Click "Take ownership". This would generate a key such as sha1(random())-${userId}, which would be added to some userKeys table, e.g. key: 5fab-49ea3-ebf0c-123. The user is then instructed to paste the key into "https://dna.toolforge.org/toolhub-owner-5fab-49ea3-ebf0c-123.html".
  4. The user does that and clicks a "Validate ownership" button. A bot then visits the page to validate ownership.
  5. The user gets "Edit" and "Merge" buttons.
  6. "Merge" opens a form where the user can enter another tool.
  7. Submitting the form creates a redirect to the other tool entry.

Redirect entries should probably be hidden from search results.
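
Steps 3 and 4 of the procedure above could be sketched roughly as follows. This is only an illustration under assumptions: the function names (generate_owner_key, proof_url, validate_ownership) are hypothetical, the fetch callable stands in for whatever HTTP client the bot would use, and the check that the page body contains the key is one possible reading of "paste the key into" the page.

```python
import hashlib
import secrets
from typing import Callable, Optional


def generate_owner_key(user_id: int) -> str:
    """Build a key of the form sha1(random)-userId, as suggested in step 3."""
    digest = hashlib.sha1(secrets.token_bytes(32)).hexdigest()
    return f"{digest}-{user_id}"


def proof_url(tool_base_url: str, key: str) -> str:
    """URL of the page the tool author is asked to publish (hypothetical layout)."""
    return f"{tool_base_url.rstrip('/')}/toolhub-owner-{key}.html"


def validate_ownership(
    tool_base_url: str,
    key: str,
    fetch: Callable[[str], Optional[str]],
) -> bool:
    """Step 4: the bot fetches the page and checks that it contains the key.

    fetch is an assumed helper that returns the page body,
    or None if the page could not be retrieved.
    """
    body = fetch(proof_url(tool_base_url, key))
    return body is not None and key in body
```

In a real deployment fetch would wrap an HTTP GET; in tests it can be a stub that returns a canned body, which keeps the validation logic independent of the network.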

Related: T191955: [Investigation] Ability for users to "claim" Toolhub entries

Magic URL validation could work for web applications, but would not be a complete solution to the general problem. Other use cases to consider include on-wiki tools (user scripts, gadgets, lua modules, templates), locally installed applications (desktop/laptop, mobile), and bots.

A few thoughts on solving the duplicate tool problem (or at least making progress):
Since the url field is mandatory for all tools and is a direct link from Toolhub to the tool being described, one way to solve this (or at least a subset of it) is to create a RepoURL model with a one-to-many relationship to the toolinfo model. Then whenever a duplicate tool (one that points to the same tool repo) is added, the RepoURL model instance will have two tools pointing to it. In this situation we can do interesting things like:

  • Displaying an "is possible duplicate of" flag on the tool.
  • Adding the tools to a duplicate tools list (or some variation).

And if a "duplicate" tool is not pointing to the same authoritative repository as the other duplicate, are they really duplicates?
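
The grouping idea above can be sketched without committing to a particular model design. The snippet below uses a plain dictionary in place of the proposed RepoURL model; the normalize_url rule (lowercased host plus path, ignoring scheme and trailing slash) and the tool names in the usage example are assumptions made only for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to host + path so trivially different spellings match.

    Assumption: scheme, trailing slashes, and host case are not significant.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/")
    return f"{host}{path}"


def possible_duplicates(tools):
    """Group (name, url) pairs by normalized URL; keep groups with 2+ tools."""
    by_url = defaultdict(list)
    for name, url in tools:
        by_url[normalize_url(url)].append(name)
    return {url: names for url, names in by_url.items() if len(names) > 1}
```

For example, with hypothetical records:

```python
tools = [
    ("dna", "https://dna.toolforge.org/"),
    ("toolforge-dna", "https://DNA.toolforge.org"),
    ("other", "https://other.toolforge.org/"),
]
print(possible_duplicates(tools))
# {'dna.toolforge.org': ['dna', 'toolforge-dna']}
```

Each group with more than one entry is a candidate for the "is possible duplicate of" flag; tools whose URLs differ are simply not grouped, which matches the question of whether they are really duplicates.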

(Maybe this task would be better titled "better discovery for duplicate tools".)