Controlled vocabulary and taxonomy for Toolhub
Closed, Resolved | Public | Goal

Description

The goal of the proposed taxonomy is to enable faceted browsing of tools in the Toolhub UI. The semantic framework of the taxonomy should be logical and consistent enough to support extension of attributes without refactoring or restructuring metadata as the catalog of tools grows. The scope of the proposed taxonomy is limited to attributes that require human curation. These are attributes for which there exists no clear set of appropriate subclasses nor attribute values.

The goal of the controlled vocabulary is to enable recall of all relevant tools for a given attribute, regardless of the varied language humans may use to describe or search for the tool. This is especially important for standardizing input as Toolhub allows and encourages community-provided data for tool annotations. The scope of the proposed controlled vocabulary is limited to terms that require disambiguation, have multiple spellings, or are acronyms.

Full details are in this Google doc and will be summarized more succinctly on-wiki soon.

Event Timeline

TBurmeister changed the task status from Open to In Progress. (May 10 2022, 2:42 PM)
TBurmeister claimed this task.
TBurmeister triaged this task as Medium priority.
TBurmeister created this task.
TBurmeister moved this task from Backlog to In Progress on the Toolhub board.

Only one remaining TODO for this, but it's not blocking the community outreach and feedback phases which I believe are being owned by @komla:

Follow up on comments in the Toolhub team chat about a "language" or "platform" attribute that we might still want to control or standardize. Comments from the team Slack conversation with @Raymond_Ndibe and @bd808:

  • From @bd808: Before and after toolinfo data from Hackathon 2022: https://toolhub.wikimedia.org/tools/mm_mixnmatch/history/revision/12666/diff/24396
  • Raymond_Ndibe: It's also starting to seem like making "language" (as in the language the tool is built on), to be only editable by the tool owner wasn't such a great idea. It's such a low hanging fruit, yet never provided.
  • bd808: Yeah, I think the hardest part of technology is really that we don't have a controlled grammar defined for it. That makes the data really a mess usually as people will use terms like "js", "node", "Node.js", "JavaScript", etc to describe the same thing. We could make it an annotation field pretty easily, but I think we should find a way to make it more standard (maybe wikidata QIDs?) before we try to get lots of folks to fill it out.
  • TBurmeister: I have data/ideas for this from my taxonomy analysis journey if we want to make it a controlled attribute
  • Raymond_Ndibe: yes this seems like a great candidate for controlled attributes
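
The variant-spelling problem described in the chat above can be sketched as a lookup from free-text input to one preferred term. This is only a minimal illustration: the `SYNONYMS` table and the `normalize_term` helper are hypothetical names, and the entries are taken from the examples in the chat, not from a curated vocabulary.

```python
from typing import Optional

# Hypothetical synonym table: each key is a lowercase variant and each value
# is the preferred term. Entries are illustrative, not a curated list.
SYNONYMS = {
    "js": "JavaScript",
    "javascript": "JavaScript",
    "node": "JavaScript",
    "node.js": "JavaScript",
}


def normalize_term(raw: str) -> Optional[str]:
    """Return the preferred term for a free-text value, or None if unknown."""
    key = raw.strip().lower()
    return SYNONYMS.get(key)
```

Unknown values return `None` rather than being passed through, so a caller could route them to a human curator instead of silently adding new spellings.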

Based on the above, my action item is to review the values I had extracted for this possible attribute, and also review the tool annotations that people added during the Hackathon, and add a "programming language" or similar attribute to the proposed taxonomy.

The most common languages that showed up in the lists / data models I analyzed were:

I investigated whether we could use Wikidata properties to provide or validate a list of values for "programming languages" instead of manually curating the list. Specifically, I looked at whether we could allow a tool attribute whose values are the labels of items that are instances of programming language (Q9143).

Where this gets interesting is that these are not all "technically" programming languages, so to use Wikidata items in this way, we would potentially need to include all items that are instances of the "Programming language" class, but also all those that are instances of all its subclasses. This is because, for example, Lua (Q207316) is only an instance of "Programming language" through being an instance of one of its subclasses. I suppose we could add a statement to make Lua a direct instance of "Programming language", but that is beyond the scope of this issue (and probably not graph modeling best practice).

There are additional complications: I'm not sure if people would expect to see JSON in this list of values, so maybe we could remove it. If we wanted to include it, using the Wikidata item "programming language" wouldn't even get us there because none of the "instance of" statements for JSON (Q2063) connect to subclasses of "programming language". This is also true for MySQL and SPARQL, though SQL is an instance of programming language. :-/

Conclusion: if we want to use QIDs for a "programming language" attribute in Toolhub, we should probably just curate the list of them manually rather than trying to use properties from the Wikidata graph to automatically provide / validate a larger set of values.

I *think* the next step for this is to analyze the contents of the technology_used field and annotations that people added since the Hackathon, and see how many of them are programming languages, or values that would make sense in a controlled "Programming language" attribute. I already did some of that when I was creating my first draft of the v2 taxonomy, but I want to see how much overlap there is between that field and annotations people have added.

> Conclusion: if we want to use QIDs for a "programming language" attribute in Toolhub, we should probably just curate the list of them manually rather than trying to use properties from the Wikidata graph to automatically provide / validate a larger set of values.

The subclass thing in Wikidata isn't especially surprising to me. It also seems reasonable to curate the list. The main thing that is compelling about Wikidata QIDs as the user input is that we then get "free" translations for the UI by fetching the locale-specific labels from Wikidata at render time. QIDs are probably not ideal if we expect users to search directly on the values rather than interacting with some kind of pick list, but I think that's ok for this use case?
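
As a rough sketch of the "free translations" idea, labels for a curated list of QIDs could be requested from the Wikidata `wbgetentities` API and then picked by locale. The helper names `label_request_url` and `extract_labels`, and the fallback to English, are assumptions made for illustration; this is not how Toolhub is known to implement it.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def label_request_url(qids, locale):
    """Build a wbgetentities URL asking for labels in `locale` plus English."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(qids),
        "props": "labels",
        "languages": f"{locale}|en",
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def extract_labels(response, locale, fallback="en"):
    """Map each QID in a parsed wbgetentities response to a display label.

    Falls back to the `fallback` language, then to the bare QID, when no
    label exists for the requested locale.
    """
    labels = {}
    for qid, entity in response.get("entities", {}).items():
        by_lang = entity.get("labels", {})
        chosen = by_lang.get(locale) or by_lang.get(fallback)
        labels[qid] = chosen["value"] if chosen else qid
    return labels
```

Storing only the QID and resolving the label at render time keeps the stored value stable even if a label is later corrected on Wikidata.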

SPARQL query for all instances of a "programming language" or sub-type: https://w.wiki/5P3U
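
The query behind the short link is not reproduced here. A query of this general kind can be sketched as follows, assuming the standard Wikidata properties "instance of" (P31) and "subclass of" (P279) and the public query service endpoint; the function names are hypothetical.

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def programming_language_query(root_qid="Q9143"):
    """Return SPARQL selecting instances of root_qid or of any of its subclasses."""
    # wdt:P31/wdt:P279* means: "instance of" something that reaches root_qid
    # through zero or more "subclass of" links.
    return f"""
SELECT DISTINCT ?item ?itemLabel WHERE {{
  ?item wdt:P31/wdt:P279* wd:{root_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


def query_url(query):
    """Build a GET URL for the query service that asks for JSON results."""
    params = urlencode({"query": query, "format": "json"})
    return SPARQL_ENDPOINT + "?" + params
```

The `P279*` path is what picks up items such as Lua (Q207316) that are instances of a subclass only. It would still miss items like JSON (Q2063), which is one reason a manually curated list was preferred.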

> I *think* the next step for this is to analyze the contents of the technology_used field and annotations that people added since the Hackathon, and see how many of them are programming languages, or values that would make sense in a controlled "Programming language" attribute. I already did some of that when I was creating my first draft of the v2 taxonomy, but I want to see how much overlap there is between that field and annotations people have added.

We have not yet exposed technology_used as an annotation, so I doubt that we have enough usage of that field to make an analysis of existing data highly informative.

I think my work on this is done and it's ready for the next phase of the project (which is not owned by me), so bouncing it over to bd808.

bd808 changed the task status from In Progress to Open. (Jul 20 2022, 9:12 PM)
bd808 removed bd808 as the assignee of this task.
bd808 raised the priority of this task from Medium to High.
bd808 changed the subtype of this task from "Task" to "Goal".
bd808 changed the task status from Open to In Progress. (Nov 7 2022, 10:03 PM)
bd808 claimed this task.
bd808 moved this task from To-Do to In Progress on the Developer-Advocacy (Oct-Dec 2022) board.

Change 861960 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] ui(edit): Reorganize edit toolinfo fields

https://gerrit.wikimedia.org/r/861960