In T236713: Improve drafttopic training data pipeline, @Halfak, @Isaac, and @MMiller_WMF worked on the ontology for the new articletopic model. The ontology is shown below; it has 64 leaf nodes in a tree up to four levels deep. To use these topics in the UI for newcomer tasks, we have to do two things:
- Mapping: map the topics in the ontology to the buttons that users will actually be able to click.
- Thresholds: decide how to set score thresholds for whether an article "counts" as belonging to a topic.
Mapping
Given that the new ontology has 64 leaf nodes, we will likely want to roll that up to something like 30 or fewer, because we believe that any more will be overwhelming to the user. This means that we may combine some, like "Biology", "Chemistry", "Physics", and "Space" all into a label we call "Science".
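As a rough illustration of the roll-up, here is a minimal Python sketch. Only the "Science" grouping comes from the paragraph above; the `LEAF_TO_LABEL` table, the `roll_up` helper, and the choice to take the maximum leaf score as the label's score are assumptions for discussion, not decisions.

```python
# Hypothetical roll-up from ontology leaf topics to UI button labels.
# Only the "Science" grouping is taken from the task text; the full mapping is undecided.
LEAF_TO_LABEL = {
    "STEM.Biology": "Science",
    "STEM.Chemistry": "Science",
    "STEM.Physics": "Science",
    "STEM.Space": "Science",
}


def roll_up(leaf_scores):
    """Collapse per-leaf scores into per-label scores.

    Assumption: a label's score is the maximum score among its leaves.
    Leaves with no entry in LEAF_TO_LABEL are skipped.
    """
    label_scores = {}
    for leaf, score in leaf_scores.items():
        label = LEAF_TO_LABEL.get(leaf)
        if label is None:
            continue
        label_scores[label] = max(score, label_scores.get(label, 0.0))
    return label_scores


# Made-up scores for a single article.
example = roll_up({"STEM.Biology": 0.12, "STEM.Physics": 0.81, "Culture.Sports": 0.40})
print(example)
```

Taking the maximum means an article counts toward "Science" if any one of the combined leaves scores highly; averaging or summing would behave differently and may be worth comparing.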
In the interface, we may want to put topics under hard-coded headings, like putting "Science", "Technology", "Engineering" under a "Math and Science" header (as shown in this mockup). We would need to decide what those headings are and which topics go under each.
One other consideration is whether local-language geographies should be exposed as a "special topic". In other words, we plan to expose "Geography" as a topic in all languages, but perhaps it would also be nice to expose "Eastern European geography" for Czech Wikipedia or "Southeast Asian geography" for Vietnamese Wikipedia, etc.
Thresholds
We have to decide when an article "counts" as being part of a topic.
At one point, we attempted to count an article as part of a topic only if that topic had the article's highest score. This will likely not work going forward: the ORES models can score Geography very accurately, so most local-language articles would end up with Geography as their highest-scoring topic.
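The difference between "highest score wins" and independent per-topic thresholds can be sketched as follows. The scores, the `THRESHOLD` value, and the helper names are all made up for illustration; they are not output from the real model.

```python
# Made-up scores for one article (illustrative only, not real model output).
scores = {
    "Geography.Regions.Europe.Eastern Europe": 0.93,
    "Culture.Media.Music": 0.88,
    "Culture.Biography.Biography*": 0.75,
    "STEM.Physics": 0.02,
}

THRESHOLD = 0.5  # hypothetical cutoff, chosen only for this example


def top_topic(topic_scores):
    """Old approach: the article belongs only to its highest-scoring topic."""
    return max(topic_scores, key=topic_scores.get)


def topics_above(topic_scores, threshold):
    """Alternative: the article belongs to every topic scoring at or above the threshold."""
    return sorted(t for t, s in topic_scores.items() if s >= threshold)


top = top_topic(scores)
matched = topics_above(scores, THRESHOLD)
print(top)
print(matched)
```

With the highest-score rule, this article would only ever appear under a geography topic, even though it also scores well for music and biography.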
It may be simplest to not worry about thresholds and just return the articles sorted by score. The downside, though, is that all newcomers would then receive the same articles. To give each newcomer their own set, we need to order articles randomly among those above some score cutoff, which still means choosing a threshold.
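One way to combine a cutoff with per-newcomer variety is to filter by threshold and then shuffle with a per-user seed. This is a sketch under assumptions: `pick_for_newcomer`, the article titles, the threshold, and the idea of seeding by user ID are all hypothetical.

```python
import random


def pick_for_newcomer(article_scores, threshold, count, seed):
    """Return up to `count` articles scoring at or above `threshold`, in random order.

    `seed` is assumed to be something stable per newcomer (e.g. a user ID),
    so the same newcomer sees a consistent set while different newcomers
    will usually see different ones.
    """
    eligible = sorted(
        title for title, score in article_scores.items() if score >= threshold
    )
    rng = random.Random(seed)
    rng.shuffle(eligible)
    return eligible[:count]


# Made-up scores for a single topic.
article_scores = {
    "Article A": 0.95,
    "Article B": 0.91,
    "Article C": 0.72,
    "Article D": 0.64,
    "Article E": 0.31,
}

picked = pick_for_newcomer(article_scores, threshold=0.6, count=2, seed=12345)
print(picked)
```

Sorting before shuffling keeps the result reproducible for a given seed regardless of the input order.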
Ontology
Below is the articletopic ontology. Here's how it works:
- Each row is a "leaf node", and every article gets a separate independent score for each one.
- The leaves range from level 2 to level 4 in the ontology.
- The asterisked topics, e.g. Culture.Media.Media*, are "catch-all" topics. The best way to think about these is that they would be leaf nodes, except there are a couple sub-topics we wanted to break out specifically, which may not actually cover the full breadth of the asterisked topic.
Culture.Biography.Biography*
Culture.Biography.Women
Culture.Food and drink
Culture.Internet culture
Culture.Linguistics
Culture.Literature
Culture.Media.Books
Culture.Media.Entertainment
Culture.Media.Films
Culture.Media.Media*
Culture.Media.Music
Culture.Media.Radio
Culture.Media.Software
Culture.Media.Television
Culture.Media.Video games
Culture.Performing arts
Culture.Philosophy and religion
Culture.Sports
Culture.Visual arts.Architecture
Culture.Visual arts.Comics and Anime
Culture.Visual arts.Fashion
Culture.Visual arts.Visual arts*
Geography.Geographical
Geography.Regions.Africa.Africa*
Geography.Regions.Africa.Central Africa
Geography.Regions.Africa.Eastern Africa
Geography.Regions.Africa.Northern Africa
Geography.Regions.Africa.Southern Africa
Geography.Regions.Africa.Western Africa
Geography.Regions.Americas.Central America
Geography.Regions.Americas.North America
Geography.Regions.Americas.South America
Geography.Regions.Asia.Asia*
Geography.Regions.Asia.Central Asia
Geography.Regions.Asia.East Asia
Geography.Regions.Asia.North Asia
Geography.Regions.Asia.South Asia
Geography.Regions.Asia.Southeast Asia
Geography.Regions.Asia.West Asia
Geography.Regions.Europe.Eastern Europe
Geography.Regions.Europe.Europe*
Geography.Regions.Europe.Northern Europe
Geography.Regions.Europe.Southern Europe
Geography.Regions.Europe.Western Europe
Geography.Regions.Oceania
History and Society.Business and economics
History and Society.Education
History and Society.History
History and Society.Military and warfare
History and Society.Politics and government
History and Society.Society
History and Society.Transportation
STEM.Biology
STEM.Chemistry
STEM.Computing
STEM.Earth and environment
STEM.Engineering
STEM.Libraries & Information
STEM.Mathematics
STEM.Medicine & Health
STEM.Physics
STEM.STEM*
STEM.Space
STEM.Technology
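For anyone working with these names programmatically, the dotted path and trailing asterisk can be read as in this minimal sketch. The `describe_leaf` helper and its output keys are hypothetical; it only assumes that levels are separated by "." and that a trailing "*" marks a catch-all topic, as described above.

```python
def describe_leaf(name):
    """Split a leaf topic name into its ontology levels.

    Assumes "." separates levels and a trailing "*" marks a catch-all topic.
    """
    is_catch_all = name.endswith("*")
    parts = name.rstrip("*").split(".")
    return {
        "top_level": parts[0],
        "depth": len(parts),
        "catch_all": is_catch_all,
    }


# A few leaves copied from the ontology above.
sample = [
    "Culture.Sports",
    "Culture.Media.Media*",
    "Geography.Regions.Africa.Central Africa",
]
described = {name: describe_leaf(name) for name in sample}
for name, info in described.items():
    print(name, info)
```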