Experiment with Topic modeling in KubeFlow
Closed, Declined (Public)

Description

The goal of this task is to get a working, queryable KubeFlow instance that replicates the functionality of our article topic models.

For example, if you provide a revision ID to ORES, it returns a topic prediction. The URL https://en.wikipedia.org/w/index.php?title=Ann_Bishop_(biologist)&oldid=937084701 links to a specific revision of the article about Ann Bishop. We can pass that revision ID (937084701) to ORES and request a topic prediction with this query: https://ores.wikimedia.org/v3/scores/enwiki/937084701/articletopic

We get a result that looks like this:

[...]
"prediction": [
              "Culture.Biography.Biography*",
              "Culture.Biography.Women",
              "History and Society.History",
              "STEM.Medicine & Health",
              "STEM.STEM*"
            ],
            "probability": {
              "Culture.Biography.Biography*": 0.9897817848322134,
              "Culture.Biography.Women": 0.9723014590702798,
              "Culture.Food and drink": 0.00035026153227330815,
              "Culture.Internet culture": 0.00018265332725578013,
              "Culture.Linguistics": 0.000538261749894609,
              "Culture.Literature": 0.03619774139697521,
              "Culture.Media.Books": 0.002291345684133623,
              "Culture.Media.Entertainment": 0.0006008800036119385,
              "Culture.Media.Films": 0.000301603482794946,
              "Culture.Media.Media*": 0.006322510257768613,
              "Culture.Media.Music": 0.0003659734842755226,
              "Culture.Media.Radio": 9.197231374768e-05,
              "Culture.Media.Software": 0.00015020779276450603,
              "Culture.Media.Television": 0.00030685931095892125,
              "Culture.Media.Video games": 3.5082704979932084e-05,
              "Culture.Performing arts": 0.0010131663084430853,
              "Culture.Philosophy and religion": 0.07665653521788206,
              "Culture.Sports": 0.0047852147462091885,
              "Culture.Visual arts.Architecture": 0.0016610955992546084,
[...]
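
As a rough illustration of the query above, the following Python sketch builds the same URL and pulls the prediction out of a response. The URL pattern is taken from the example query. The nesting of the full response (wiki, then "scores", then revision ID, then model, then "score") is an assumption, since the excerpt above elides the outer layers. The helper names (score_url, extract_score, fetch_score, ranked_topics) are hypothetical and not part of ORES.

```python
import json
from urllib.request import urlopen

ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def score_url(wiki, rev_id, model="articletopic"):
    """Build the scoring URL in the same shape as the example query."""
    return f"{ORES_BASE}/{wiki}/{rev_id}/{model}"


def extract_score(response, wiki, rev_id, model="articletopic"):
    """Return the dict holding "prediction" and "probability".

    Assumption: the response nests as
    wiki -> "scores" -> revision ID -> model -> "score".
    """
    return response[wiki]["scores"][str(rev_id)][model]["score"]


def fetch_score(wiki, rev_id, model="articletopic"):
    """Query the service over HTTP and return the parsed score."""
    with urlopen(score_url(wiki, rev_id, model)) as resp:
        return extract_score(json.load(resp), wiki, rev_id, model)


def ranked_topics(score, limit=5):
    """Sort the "probability" mapping from most to least likely topic."""
    probs = score["probability"]
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:limit]


if __name__ == "__main__":
    # Requires network access and a reachable service.
    score = fetch_score("enwiki", 937084701)
    for topic, probability in ranked_topics(score):
        print(f"{probability:.4f}  {topic}")
```

A KubeFlow-hosted replacement would need to accept the same input (a revision ID) and return the same two fields, so a parser like extract_score could serve as a simple compatibility check between the two services.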

Our pipeline looks roughly like this, with each artifact listed above the inputs it is built from:

  • Model
    • Extracted features and labels
      • Text and labels
        • Balanced labels
          • Labeled Wikidata items with Sitelinks - file
            • Wikidata items with Sitelinks and WikiProjects - file
            • Taxonomy - file
      • Truncated text embeddings files
        • Text embeddings
          • Text chunks with labels
            • Labeled Wikidata items with Sitelinks - file
            • XML dump - files ("pages-articles")
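
One way to read the outline above is as a dependency graph, which is also how a pipeline framework would need the steps declared. The sketch below encodes it as a plain Python mapping and derives a valid build order with the standard library. It assumes that nesting means "is built from", which the outline suggests but does not state. The names PIPELINE, LABELED, and build_order are hypothetical, and nothing here uses a KubeFlow API.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

LABELED = "Labeled Wikidata items with Sitelinks"

# Assumption: each artifact maps to the inputs nested directly under it.
PIPELINE = {
    "Model": ["Extracted features and labels"],
    "Extracted features and labels": [
        "Text and labels",
        "Truncated text embeddings files",
    ],
    "Text and labels": ["Balanced labels"],
    "Balanced labels": [LABELED],
    LABELED: [
        "Wikidata items with Sitelinks and WikiProjects",
        "Taxonomy",
    ],
    "Truncated text embeddings files": ["Text embeddings"],
    "Text embeddings": ["Text chunks with labels"],
    "Text chunks with labels": [LABELED, 'XML dump ("pages-articles")'],
}


def build_order(graph):
    """Return the artifacts in an order where inputs precede their consumers."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for step in build_order(PIPELINE):
        print(step)
```

Note that the labeled Wikidata items appear twice in the outline but only once in the graph, so that step would run once and feed both the label branch and the text-embedding branch.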

Event Timeline

@ClaytonLLemons just checking in on this and seeing if you need anything. I've got a bit more bandwidth now, so happy to help out or answer any questions.

Declining for now. We're doing a more fundamental exploration of model management frameworks and we might come back to this at some point.