Figure out the topic of articles translated automatically by external translation service
Closed, Resolved (Public)

Description

Goal

The goal of this analysis is to figure out which topics readers are interested in but cannot read well in their local language, either because the articles are not available or because their quality is poor.

Potential Usage

We can recommend those popular topics to editors in local communities.

Data

We will use pages viewed and translated by Google in March 2019. Since the vast majority of these pages are translated from English into other languages, the first exploration will only check the topics of articles on English Wikipedia.

We break down the translated pageviews by two types:

  • Pageviews translated by Toledo (Google integrates automatically translated pages into search results). These articles indicate that 1) Google considers the content in the local language to be of lower quality than the translated page AND 2) users are interested in these articles and click through from the search results.
  • User-initiated translation. Users paste the article link into Google Translate, or click the "Translate this page" link in their search results. In this case, users know they are reading a translated article and are willing to put in more effort to do so, which indicates a stronger interest in the article. We break down this analysis by translation target language.

Tools

We use the ORES draft topic model to get the topics of articles. The topics it outputs are the mid-level categories of the WikiProject directory (see the hierarchy).
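As a rough illustration of how draft topic scores could be requested and read, here is a minimal sketch. The endpoint, query parameters, and response layout are assumptions based on how the ORES v3 scoring API was commonly used; none of it is taken from the report, and the helper names `build_score_url` and `extract_predictions` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed ORES v3 scoring endpoint; adjust if the service differs.
ORES_BASE = 'https://ores.wikimedia.org/v3/scores'


def build_score_url(wiki, rev_ids, model='drafttopic'):
    """Build a scoring URL for a batch of revision IDs (hypothetical helper)."""
    query = urlencode({
        'models': model,
        'revids': '|'.join(str(rev_id) for rev_id in rev_ids),
    })
    return f'{ORES_BASE}/{wiki}/?{query}'


def extract_predictions(response, wiki, model='drafttopic'):
    """Map each revision ID to its list of predicted topics.

    Assumes the response looks like
    {wiki: {'scores': {rev_id: {model: {'score': {'prediction': [...]}}}}}}.
    Revisions without a score (for example, errors) map to None.
    """
    predictions = {}
    scores = response.get(wiki, {}).get('scores', {})
    for rev_id, models in scores.items():
        score = models.get(model, {}).get('score')
        predictions[rev_id] = score['prediction'] if score else None
    return predictions
```

The URL builder only constructs the request; fetching it and handling rate limits or errors is left out of this sketch.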

Event Timeline

chelsyx triaged this task as Low priority. Mar 29 2019, 6:00 PM
chelsyx created this task.
chelsyx moved this task from Triage to Doing on the Product-Analytics board.
chelsyx renamed this task from "Model the topic of articles translated automatically by external translation service" to "Figure out the topic of articles translated automatically by external translation service". Apr 1 2019, 5:06 PM
chelsyx raised the priority of this task from Low to Medium. Apr 5 2019, 7:10 PM
chelsyx updated the task description. Apr 25 2019, 7:39 PM
chelsyx updated the task description. Apr 25 2019, 8:41 PM
chelsyx updated the task description. Apr 25 2019, 9:03 PM

The first exploratory analysis is done: https://analytics.wikimedia.org/datasets/external-automatic-translation/Topics%20of%20articles%20translated%20by%20Google.html

I looked at the pageviews served by Toledo, and at user-initiated translations from Indonesian and Hindi users. The key takeaways are:

  • 56.3% of pageviews served by Toledo are STEM (science, tech and engineering) related, which seems to align with the information from Google.
  • Among STEM, Medicine is the most popular, followed by Biology.
  • Indonesian and Hindi users like to translate and read articles about Countries on English Wikipedia.
  • Comparing articles served by Toledo (the majority of whose readers are Indonesian) with translations initiated by Indonesian users, it seems that Indonesian readers' demand for Culture-related content (43.2% of translations initiated by Indonesian users) hasn't been fully met by Toledo yet.

@Pginer-WMF @atgo @Arrbee Please let me know if you have any questions. Also I would really appreciate it if you can provide any suggestions about the next step.

@Isaac I'm wondering whether this result is aligned with what you see in your research on language switching between wikis. :)

Isaac added a comment. May 2 2019, 2:49 PM

This is awesome @chelsyx !

One thing I'd suggest for your analysis: the ORES API prediction for draft topic is a list of the topics whose probability exceeds some threshold. That list is sorted not by probability, though, but by the order of the labels here: https://github.com/wikimedia/drafttopic/blob/7361fd9a6dc12079cac135273a02e602ccf6c2c0/model_info/enwiki.drafttopic.md
In my analysis, taking the first prediction overemphasized geography substantially, because about half the articles I looked at had more than one predicted topic. What I ended up doing was taking the highest-probability topic (code below). I also tested taking a random topic from the predictions, or all of them, and got about the same results. I'd suggest updating your analysis and then we can compare.

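def get_pred_topic_best(input_json):
    """Return the highest-probability draft topic, or None if unavailable."""
    try:
        topics = input_json['score']['drafttopic']['score']['probability']
        # Pick the label with the highest probability rather than the first
        # prediction, which follows the fixed label order.
        best = max(topics, key=topics.get)
    except (KeyError, ValueError):
        # KeyError: a field is missing; ValueError: no topics were scored.
        best = None
    return best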
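To show why the choice matters, here is a small sketch comparing the three selection strategies on a single made-up score. The labels and probabilities are invented for illustration, and the 'prediction' key (the above-threshold labels in label order) is an assumption layered on the key path used above.

```python
import random

# Hypothetical score for one article; values are made up for illustration.
sample = {
    'score': {'drafttopic': {'score': {
        # Assumed: above-threshold labels, listed in label order.
        'prediction': ['Geography.Countries', 'STEM.Medicine'],
        'probability': {
            'Geography.Countries': 0.55,
            'STEM.Medicine': 0.91,
            'Culture.Sports': 0.02,
        },
    }}}
}


def first_prediction(input_json):
    """Take the first listed prediction (biased by label order)."""
    preds = input_json['score']['drafttopic']['score']['prediction']
    return preds[0] if preds else None


def highest_probability(input_json):
    """Take the label with the highest probability."""
    probs = input_json['score']['drafttopic']['score']['probability']
    return max(probs, key=probs.get) if probs else None


def random_prediction(input_json, rng=random):
    """Take a random label from the above-threshold predictions."""
    preds = input_json['score']['drafttopic']['score']['prediction']
    return rng.choice(preds) if preds else None


print(first_prediction(sample))     # Geography.Countries
print(highest_probability(sample))  # STEM.Medicine
```

In this toy case the first prediction reports Geography.Countries even though STEM.Medicine has the higher probability.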

My analysis

Overview: my analysis uses a dataset of instances where a given reader switches from reading an article in one language to another (more info here). The results then reflect the topics where a reader is viewing an article in Hindi or Indonesian and then chooses to read that article in another language.

A few notes:

  • I'm using a 10% sample of reader sessions from March 1-7 (inclusive), so much less data than you.
  • You can see the underlying query that gathers the page views in the README here: https://github.com/geohci/language-switching
    • That repo also has the code I used to run this replication
    • Most notably, I'm also tossing out reader sessions w/ more than 1000 page views in a given day as likely undetected bots.
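The bot filter in the last note can be sketched roughly as follows. This is not the code from the repo: the tuple layout and the function name `drop_heavy_sessions` are hypothetical, and it assumes that only the offending session-day is dropped rather than the whole session.

```python
from collections import Counter


def drop_heavy_sessions(page_views, max_views_per_day=1000):
    """Drop page views from sessions that look like undetected bots.

    page_views: iterable of (session_id, day, page_title) tuples
    (a hypothetical layout). A session-day is dropped when it has
    more than max_views_per_day page views.
    """
    page_views = list(page_views)
    # Count page views per (session, day) pair.
    views_per_session_day = Counter(
        (session_id, day) for session_id, day, _ in page_views
    )
    return [
        view for view in page_views
        if views_per_session_day[(view[0], view[1])] <= max_views_per_day
    ]
```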

What is a count in my results?

Indonesian

              topic  count  proportion
            Culture  12698    0.476044
          Geography   7244    0.271575
               STEM   3375    0.126528
History_And_Society   2961    0.111007
               None    372    0.013946
         Assistance     24    0.000900

                                      topic  count  proportion
                        Geography.Countries   6418    0.240609
            Culture.Language and literature   2703    0.101335
                             Culture.Sports   2692    0.100922
                      Culture.Entertainment   2574    0.096498
                    Culture.Performing arts   1320    0.049486
            Culture.Philosophy and religion   1228    0.046037
                               STEM.Biology    977    0.036627
         History_And_Society.Transportation    777    0.029129
                            STEM.Technology    734    0.027517
                           Geography.Europe    682    0.025568
 History_And_Society.Business and economics    665    0.024931
   History_And_Society.Military and warfare    598    0.022419
                        Culture.Visual arts    571    0.021407
    History_And_Society.History and society    556    0.020844
                     Culture.Food and drink    551    0.020657
                       Culture.Broadcasting    503    0.018857
                              STEM.Medicine    492    0.018445
                   Culture.Internet culture    426    0.015971
                             STEM.Chemistry    384    0.014396
                                       None    372    0.013946
History_And_Society.Politics and government    279    0.010460
                               STEM.Physics    184    0.006898
                                 STEM.Space    155    0.005811
                           STEM.Mathematics    104    0.003899
                                               ...

Hindi

              topic  count  proportion
          Geography  11035    0.499005
            Culture   4738    0.214253
               STEM   3366    0.152211
History_And_Society   2864    0.129511
               None     81    0.003663
         Assistance     30    0.001357

                                      topic  count  proportion
                        Geography.Countries  10523    0.475852
            Culture.Language and literature   1845    0.083431
            Culture.Philosophy and religion   1063    0.048069
                             Culture.Sports    844    0.038166
                               STEM.Biology    812    0.036719
    History_And_Society.History and society    670    0.030298
   History_And_Society.Military and warfare    638    0.028851
History_And_Society.Politics and government    608    0.027494
 History_And_Society.Business and economics    576    0.026047
                              STEM.Medicine    558    0.025233
                            STEM.Technology    428    0.019354
                             STEM.Chemistry    405    0.018314
                               STEM.Physics    383    0.017319
                      Culture.Entertainment    303    0.013702
                                 STEM.Space    287    0.012978
                     Culture.Food and drink    287    0.012978
         History_And_Society.Transportation    259    0.011712
                           Geography.Europe    230    0.010401
                  Geography.Bodies of water    194    0.008773
                           STEM.Geosciences    120    0.005426
              History_And_Society.Education    113    0.005110
                           STEM.Engineering    108    0.004884
                       Culture.Broadcasting    101    0.004567
                                               ...
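For reference, the proportion column in the tables above appears to be each count divided by the total across all topics, and each top-level topic appears to be the part of the mid-level label before the first dot. A minimal sketch under those assumptions (the helper names are hypothetical), checked against the Indonesian top-level counts:

```python
from collections import Counter


def top_level(label):
    """Return the top-level topic of a mid-level label, e.g. 'STEM.Biology' -> 'STEM'."""
    return label.split('.', 1)[0]


def roll_up(mid_level_counts):
    """Sum mid-level topic counts into top-level topic counts."""
    totals = Counter()
    for label, count in mid_level_counts.items():
        totals[top_level(label)] += count
    return totals


def topic_proportions(counts):
    """Divide each count by the total across all topics."""
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}


# Top-level counts copied from the Indonesian table above.
indonesian = {
    'Culture': 12698,
    'Geography': 7244,
    'STEM': 3375,
    'History_And_Society': 2961,
    'None': 372,
    'Assistance': 24,
}

for topic, share in topic_proportions(indonesian).items():
    print(f'{topic:>20} {share:.6f}')
```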

@Isaac Thanks for pointing out my mistake! I've updated my report and T219660#5139254.

Comparing with your analysis, my take-aways are:

  • Users really like to read articles about Countries in their own languages.
  • There seems to be less STEM-related language switching on wiki. My guess is that those articles are not available in the local languages.

Isaac added a comment. May 3 2019, 4:52 PM

"There seems to be less STEM related language switch on wiki. My guess is that those articles are not available in the local languages."

Yeah, I'd agree, and I'd also expect that this partly reflects Google's bias in which signals they use to choose articles to translate.

One of the questions I'm hoping to get some insight into is how rapidly these different topic areas (e.g., STEM, Geography, Culture) evolve, and whether some categories are better or worse candidates for translation. For instance, with Geography.Countries, I expect that if you translate the content over, it'll be quite useful even if not updated for quite a while, and that major changes to the content in the source article are pretty likely to show up in the translated article too. However, for more rapidly evolving topics like Entertainment (or some categories of STEM), the editor community in smaller languages may have more trouble keeping up with changes, and therefore these topics might be better candidates for on-the-fly translation than for a one-time translation that potentially goes stale.

kzimmerman closed this task as Resolved. Jun 4 2019, 9:41 PM
kzimmerman added subscribers: MNovotny_WMF, kzimmerman.

@MNovotny_WMF @Pginer-WMF this may help guide considerations of what kind of topics we should prioritize for translation, and also to consider in communication with the community about Toledo's focus. Also, as Chelsy mentioned, it looks like Toledo hasn't met demand for Culture-related content.

chelsyx added a comment. (Edited) Jun 4 2019, 10:48 PM

Caveat of this analysis: The topic assignment method we used for this project, the ORES draft topic model, is not perfect for summarizing the content of articles. For example, we have seen many articles assigned to Geography.Countries even though, to a human reader, they fit other topics better. We want to try other topic modeling methods in the future to get a better picture of our translated content.