Produce flow diagrams illustrating translation imbalances
Closed, ResolvedPublic

Description

We have a rough diagram of the imbalances in translation flow,
https://commons.wikimedia.org/wiki/File:WikipediaTranslationAssistant_translation-directions_22-07.png

Using these resources:

  • public, Content Translation data source
  • the example R code and rendered Sankey diagram
  • an open source programming language like R or Python (and the relevant libraries you need for the task)

Please do this:

  • refine and extend the illustration to bring out additional details. For example, leave out English to see the relationships between the remaining languages. Create a diagram of a smaller subset of languages that show intriguing imbalances or balances.
  • Experiment with other types of diagram, for example a scatterplot of languages with x=translations from and y=translations to; diagrams using the ratio of translations from and translations to, etc.
  • Harder task: Try making a scatterplot of language translation ratio against the wiki article count. Attention: There might not be a convenient data source for this.
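For anyone starting on the first bullet, the pairwise translation counts can be turned into the node/link arrays that Sankey libraries (for example plotly's go.Sankey) expect. A minimal Python sketch, using hypothetical placeholder counts rather than real Content Translation figures:

```python
# Sketch: build the labels/sources/targets/values arrays a Sankey library
# needs from pairwise translation counts. The counts are hypothetical.

def sankey_arrays(counts):
    """counts: dict mapping (source_lang, target_lang) -> number of translations."""
    labels = sorted({lang for pair in counts for lang in pair})
    index = {lang: i for i, lang in enumerate(labels)}
    sources = [index[s] for s, t in counts]   # link start, as node indices
    targets = [index[t] for s, t in counts]   # link end, as node indices
    values = list(counts.values())            # link width = translation count
    return labels, sources, targets, values

counts = {("en", "es"): 120, ("en", "fr"): 90, ("fr", "es"): 15}
labels, sources, targets, values = sankey_arrays(counts)
# Filtering the `counts` dict (e.g. dropping every pair involving "en")
# before calling sankey_arrays() gives the "leave out English" variant.
```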

Event Timeline

awight changed the visibility from "Public (No Login Required)" to "acl*outreachy-mentors (Project)".
awight renamed this task from task 4 to Produce flow diagrams illustrating translation imbalances.Mar 4 2023, 7:20 PM
awight updated the task description. (Show Details)
srishakatux changed the visibility from "acl*outreachy-mentors (Project)" to "Public (No Login Required)".Mar 6 2023, 8:32 PM

Hello @awight, I'm Jude, an Outreachy applicant. I'm really excited to contribute to this project.
Could you please let me know whether we are only allowed to contribute to this microtask using R, or can I use Python?

Greetings! Please use any language you wish (except for proprietary software), or a mix of languages. Feel free to work in a new, public repository on GitLab, GitHub, etc.

Hello, @awight I have been working on this project and wanted to clarify some details.

Regarding the creation of the Sankey diagram, I would like to focus on a smaller subset of languages to highlight intriguing imbalances or balances. I plan to filter out translations involving other languages and create a diagram or scatterplot that displays the translation flows between the selected languages. Please let me know if this approach aligns with your expectations.

Additionally, I am planning to create a scatterplot diagram that uses the ratio of translations from and translations to. However, I am unsure about creating a scatterplot that displays the translation ratio against the wiki article count, as there may not be a convenient data source available. I am willing to research this topic further but would like to confirm whether this is a requirement for the project. I appreciate your input and look forward to your response.

Hello @awight! I was working on this microtask and studying chord diagrams and scatterplots in R, since I am new to this. I was able to create one for demo and understanding purposes. While looking at the JSON data, and since the Sankey diagram available to us isn't interactive, I was wondering what the value should be for plotting the arc between the source and target. I'm a bit lost here.
As my next step I will make one with English excluded, to see the relationships between the remaining languages.
Thanks in advance!

Note: Please do not claim tasks. Several people can work on the same task, but the system only allows one person to "claim" it.

Hello @Simulo @awight, I am trying to write Python code to create a scatterplot of languages, and I think there may be an issue with the API response format. I am getting an error indicating that the translations list might contain strings instead of dictionaries. Please could you provide guidance on how to resolve this? Thank you in advance.

@Simulo @awight. I would like to contribute to this Project. Can I use Power BI for this task?

@Maryam_Gbemisola: You can use Power BI (or any other tool) to explore the data; for the final results we would prefer that they be created with open-source software (Python, R, calc, Orange3…).

Hello @Simulo, I have tried to create a Sankey diagram. Using the public Content Translation data source as a reference, I wrote Python code to generate a Sankey diagram that displays the translation relationships between several languages, such as Spanish, Italian, French, Japanese, Korean, Vietnamese, Chinese, Portuguese, and Arabic. I refined and extended the illustration to highlight additional details and intriguing imbalances or balances in translation. This is the link to the original live Sankey diagram: http://127.0.0.1:5500/sankey_diagram.html

Also, I tried to create a scatter plot for the ratio of translations from one language to another. I wrote Python code to retrieve translation data from the Wikipedia API and create scatter plots visualizing the number of translations between different languages. Please have a look at the code at https://github.com/anshikabhatt/flow-diagrams-illustrating-translation-imbalances and give me your review. I am open to any kind of feedback; as this is my first time creating a diagram, if I made any mistakes along the way please correct me.

Also, I attempted to create a scatterplot of the translation ratio against the Wikipedia article count for different languages. However, I was unable to find a convenient source of data for this purpose. As a result, I estimated the translation ratios based on my own research. It is important to note that these are rough estimates, and the actual ratios may differ. Additionally, there is no way to verify the accuracy of these estimates, as there is no official data available. I appreciate your input and look forward to your response. @awight

Thanks in advance!

Good day @Simulo and @awight, may I use Python to explore the data, please?

Please, is there anyone who can assist me? I have been on this task for days. I am trying to create the Sankey diagram with the JSON data provided. I have been able to import the data from the API and parse it into a dictionary so I can manipulate and visualize it. Right now I am stuck processing the data: it keeps giving me error messages showing that it cannot find the correct path. I have tried to debug and find an alternative, but it is not working.
@Simulo @awight, or anyone who can, please assist.

@Manuellabubakar Hi, please ask in a support forum, and include exact steps to reproduce and error messages in that support forum (without paraphrasing). Thanks!

can I make use of python to explore the data please??

Yes, python is no problem!

Hello @awight and @Simulo, this is my submission for this task: https://github.com/anshikabhatt/flow-diagrams-illustrating-translation-imbalances. Please have a look and give me your feedback. Using the public Content Translation data source as a reference, I wrote Python code to generate a Sankey diagram that displays the translation relationships between several languages, such as Spanish, Italian, French, Japanese, Korean, Vietnamese, Chinese, Portuguese, and Arabic. I refined and extended the illustration to highlight additional details and intriguing imbalances or balances in translation. This is the link to the original Sankey diagram: {F36918157}
Also, I tried to create a scatter plot for the ratio of translations from one language to another. I wrote Python code to retrieve translation data from the Wikipedia API and create scatter plots visualizing the number of translations between different languages.

Also, I attempted to create a scatterplot of the translation ratio against the Wikipedia article count for different languages. However, I was unable to find a convenient source of data for this purpose. As a result, I estimated the translation ratios based on my own research. It is important to note that these are rough estimates, and the actual ratios may differ. Additionally, there is no way to verify the accuracy of these estimates, as there is no official data available. I appreciate your input and look forward to your response.

@awight @Simulo Here is my submission for this task. Actually, I am not treating it as completed, because there are still many more things that can be tried with the data, but I wanted to record the findings I have come across so far. To accomplish the task, I first converted the data from Here into a CSV file and loaded it into R, as was mentioned in the example. Then I tried to make diagrams from that data, to the best of my understanding, to find new observations. For my first Sankey diagram I used all of the languages present in the data, which gave me a result similar to the one in the example. Then I removed English from both source and target languages to see which other languages had the highest numbers of translations. Observing that diagram, I found three more such languages (fr, ru, es). Removing them from the target and source languages as well, I observed that there was less imbalance than in the previous two cases, but the translation data was still highly imbalanced.
I have recorded all of my code and the resulting diagrams in a GitHub repository. I would be very thankful to have your views and suggestions on it.
Github Repository Link
Diagrams I got as results:

sankey_diagram_excluding_english.png (1×1 px, 345 KB)

sankey_diagram_excluding_en_es_ru_fr.png (1×1 px, 398 KB)

sankey_diagram_selected_languages.png (1×1 px, 272 KB)

chord_diagram_all_included.png (1×1 px, 328 KB)

Please, is there anyone who can assist me? I have been on this task for days. I am trying to create the Sankey diagram with the JSON data provided. I have been able to import the data from the API and parse it into a dictionary so I can manipulate and visualize it. Right now I am stuck processing the data: it keeps giving me error messages showing that it cannot find the correct path. I have tried to debug and find an alternative, but it is not working.

Hi, it sounds like you're on the right track! Please feel free to read through other participants' scripts, for example @Abhishek02bhardwaj's https://github.com/Abhishek02bhardwaj/Flow-Diagrams-Illustrating-Translation-Imbalances/blob/main/sankey%20diagram%20all%20included.R which shows how the data can be wired through R. Is that the language you're using?

Also please reach out in Zulip if you're still stuck, and we can have more of a real-time chat there.

Pasting the actual errors will be helpful, as was mentioned already.

Hello @awight! I was working on this microtask and studying chord diagrams and scatterplots in R, since I am new to this. I was able to create one for demo and understanding purposes. While looking at the JSON data, and since the Sankey diagram available to us isn't interactive, I was wondering what the value should be for plotting the arc between the source and target. I'm a bit lost here.

+1 that an interactive graph would be more fun, and in later stages of this project we may end up building a Jupyter notebook or other publicly-accessible resources where readers can explore the data on their own.

I don't fully understand the question about what the value is—do you mean, what's the motivation behind creating this graph? That would simply be to represent the data in a form where we can better discover patterns, balances and imbalances. Or do you mean what is the numeric value we should use when creating the arcs? So far, we've been using the number of translations from language A to language B as the value. But it would also be interesting to plot the number of translators who have published a translation between A and B, this is also available in the public statistics.
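Both candidate arc values mentioned above (translations published from A to B, and distinct translators who published between A and B) can be computed in one pass over a flat export of published translations. A sketch with a hypothetical record layout; the field names are assumptions, not the real schema:

```python
from collections import defaultdict

# `rows` is a hypothetical flat export: one record per published translation.
rows = [
    {"source": "en", "target": "es", "translator": "u1"},
    {"source": "en", "target": "es", "translator": "u1"},
    {"source": "en", "target": "es", "translator": "u2"},
    {"source": "fr", "target": "es", "translator": "u3"},
]

translations = defaultdict(int)   # (A, B) -> number of published translations
translators = defaultdict(set)   # (A, B) -> set of translators who published

for r in rows:
    pair = (r["source"], r["target"])
    translations[pair] += 1
    translators[pair].add(r["translator"])

# Second candidate arc value: distinct translators per language pair.
translator_counts = {pair: len(users) for pair, users in translators.items()}
```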

Hello @Simulo, I have tried to create a Sankey diagram. Using the public Content Translation data source as a reference, I wrote Python code to generate a Sankey diagram that displays the translation relationships between several languages, such as Spanish, Italian, French, Japanese, Korean, Vietnamese, Chinese, Portuguese, and Arabic. I refined and extended the illustration to highlight additional details and intriguing imbalances or balances in translation.

This is the link to the original live Sankey diagram: http://127.0.0.1:5500/sankey_diagram.html

Is there something in the repo that I should run, which will serve on this port? Or maybe you're using a local Flask command? If you wish, the commands to run each script and serve the pages could be documented in the README. Also, if you want to experiment with Markdown syntax there is a way to render images in the repository inlined in the README (docs) so you could use that file to showcase and explain your work.

You might also be interested in building a Jupyter notebook which offers a nice environment for exploring visualizations using python and can be installed locally. You could import and run your existing scripts from a notebook.

Also, I attempted to create a scatterplot of the translation ratio against the Wikipedia article count for different languages. However, I was unable to find a convenient source of data for this purpose. As a result, I estimated the translation ratios based on my own research. It is important to note that these are rough estimates, and the actual ratios may differ. Additionally, there is no way to verify the accuracy of these estimates, as there is no official data available. I appreciate your input and look forward to your response. @awight

I don't know of a convenient data source for article count, either. It might exist but I haven't found it yet. My workaround would be to run two APIs, first the sitematrix listing of all wikis, then filter to just Wikipedias, and then run an additional siteinfo API call on each of those sites. It's kind of a pain so I've posted a CSV you can use, in this repo: https://gitlab.com/wmde/technical-wishes/wiki-article-counter
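The two-step workaround above (action=sitematrix to list all wikis, then a per-wiki action=query&meta=siteinfo&siprop=statistics call for article counts) is mostly JSON filtering. A sketch of the parsing side, using abbreviated stand-ins for the real API payloads; the network calls themselves are omitted:

```python
# Parse a sitematrix-style response to pick out Wikipedia sites, and a
# siteinfo-style statistics response for the article count. The sample
# dicts are abbreviated stand-ins for the real JSON payloads.

def wikipedia_api_urls(sitematrix):
    """Return the URLs of Wikipedia sites from a sitematrix response."""
    urls = []
    for key, group in sitematrix["sitematrix"].items():
        if not key.isdigit():        # skip the "count" and "specials" entries
            continue
        for site in group.get("site", []):
            if site.get("code") == "wiki":   # "wiki" marks Wikipedia
                urls.append(site["url"])
    return urls

def article_count(siteinfo):
    return siteinfo["query"]["statistics"]["articles"]

sample_matrix = {"sitematrix": {"count": 2, "0": {"code": "en", "site": [
    {"url": "https://en.wikipedia.org", "code": "wiki"},
    {"url": "https://en.wiktionary.org", "code": "wiktionary"}]}}}
sample_siteinfo = {"query": {"statistics": {"articles": 6600000}}}
```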

@awight To view the Sankey diagram, you will need to run the sankey_diagram.py file. The diagram is currently not accessible via a link due to some technical issues, but running the script locally should allow you to view it and see the translation flow between languages. This is the link to the Jupyter notebook to which I exported the existing files: Sankey diagram.

{F36927858} Just to clarify, the code does not use any local Flask command or run a Flask application. Thanks for the suggestions, I will definitely consider adding the commands to run each script and serve the pages in the README file. Also, the idea of using a Jupyter notebook sounds great as it offers a nice environment for exploring visualizations using Python. I will try to create a notebook and import and run my existing scripts from there. Thank you for your input!

@awight Thanks for the information. I understand that finding a convenient data source for article counts can be challenging; thank you for providing a workaround. I will take a look at the CSV file in the repository you shared and see how it can be used for our purposes.

@awight @Aklapper Thank you very much for your offer to help. I will give it another go and then report back.

These diagrams look very promising! If you increase the width and height, maybe more details will become visible? I see that the diagrams are hard to make sense of when languages appear in a random order, do you want to play with sorting the inputs somehow, for example sorting by number of translations published in the language?

Exciting to see the alternative Sankey view also—I think the chord diagram communicates some of the important points, but there's definitely space for improving the visualization, especially in this direction of a different layout which might capture some of the cascading and hierarchical artifacts.

Hi @awight,
Yes, I think I should try that. Actually, I plotted these diagrams in R using RStudio, which only has a full-screen zoom option, but I think there must be something to change the height and width; I will look into it. Also, I thought that instead of including all the languages, it would be a nice idea to remove the languages with large numbers of translations, so that we can see whether the translation imbalances are prominent there too. I will also check what difference sorting the inputs makes.

Yes! I also think this entirely horizontal view is not the best one. I am trying to figure out a way to make this diagram better for visualisation purposes, such as separating the languages into two vertical columns. I think that might help give a better understanding of the data.

The cycles and backwards-pointing flows are hard, I think.

I was imagining there might be another, simpler way to get an overview. A scatterplot with "number of translations in" on one axis and "number of translations out" on the other axis should show both the relative volume and the overall translation "ratio" (ie. out vs in) for each language.

Perhaps locating the nodes according to these axes would also lead to a more rational Sankey diagram?
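The per-language totals for such a scatterplot fall out of the same pair counts used for the Sankey links: sum each pair's count into its source language's "out" total and its target language's "in" total. A sketch with hypothetical counts:

```python
from collections import defaultdict

# Hypothetical pairwise counts: (source, target) -> translations published.
pair_counts = {("en", "es"): 120, ("en", "fr"): 90, ("fr", "es"): 15}

translations_out = defaultdict(int)   # language as translation source
translations_in = defaultdict(int)    # language as translation target

for (src, tgt), n in pair_counts.items():
    translations_out[src] += n
    translations_in[tgt] += n

# Each language becomes one scatterplot point: x = out, y = in.
# The out/in "ratio" is then visible as distance from the diagonal y = x.
points = {lang: (translations_out[lang], translations_in[lang])
          for lang in set(translations_out) | set(translations_in)}
```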

@awight, so I tried to plot a scatter plot with translations in on one axis and translations out on the other. The attached image doesn't look readable, but if you run the Python script "Generate_CSV_for_scatter_plot.py" that I have also uploaded to the GitHub repository, you will be able to zoom in and get a better view, since I made a function to display the language names as a comma-separated list. From this image I think we get an interesting picture of the imbalances.
There are broadly four clusters of translation availability.
One cluster includes the widely spoken, well-known languages that are used quite a lot, like en, fr, es. One interesting thing about this cluster is that it has very high numbers of both translations in and translations out, but more translations out than in, which roughly means that although these languages are a favoured choice for both source and target, they are more often preferred as a source language; I think this quite subtly supports the informational-magnetism point.
The second cluster also has high numbers of translations in and out, but a balanced ratio between them. This cluster translates well both to and from.
Then there is the third cluster, which is not that well translatable, and finally the fourth cluster, which has really scarce translation options. I think we can focus more on developing a better translation environment for the second and third clusters.

One more thing I wanted to ask: the total number of languages I am getting is 159 (with two languages having no name; I am not sure why that is happening), while the languages.yaml file lists 322 languages in total. Does that mean we are yet to provide translations for the other 163 languages, or is there something wrong with my script? It would be a great help if you could take a look and guide me through it.
Here are the results:

  1. Figure_2.png (612×1 px, 47 KB)
  2. Link to the Repository

The scatter plot is generated by the script "Genereate_CSV_for_scatterplot.py", and the CSV that saves the counts of translations in and translations out is "language_counts.csv".

Also, I made a scatter plot of wikis and their article counts. Again, it looks incomprehensible in the image, but if you run the script you can zoom in and have a better look.

Figure_1.png (612×1 px, 108 KB)

The script that generates this scatter plot is "Scatter_plot_wiki_wiki_vs_Article_count".

Hi @awight! Another update on the flow diagrams I am working on. Actually, I made a really silly mistake in my previous work. First, I would like to explain what I am trying to do, because I feel I am exploring a slightly different aspect of this task. In this task we were supposed to use the existing Content Translation data source, and in my previous Sankey diagrams I used that data and it showed promising results. But soon I was intrigued by a "what if": instead of using data about the number of articles already translated, what if I used the configuration data to find the possible options available for translation? I will try to elaborate to the best of my ability. We know that the current system does not provide a translation service from every language to every language. I am trying to find out more about the languages for which translation availability is very limited. The scatter plot above shows the same thing.
What did I not think about previously? Earlier, to find out about the availability of translation options to and from a language, I was using the configuration-scraper data. In that data I counted how many times each unique language appeared in the target-language or source-language column. This gave me values higher than 159 (the number of unique languages in the configuration-scraper output). So even though the scatter plot I was getting divided the languages into groups, the numbers were not correct.
What have I done now? To make sure all the unique languages are in the list, I first created a CSV containing all the unique languages. Then, for each language, I counted how many distinct languages appear in the target-language column when that language is the source, and saved this as "translations out" (the outward flow of translation from that language). For "translations in", I counted how many distinct languages appear in the source-language column when that language is the target (the inward flow of translation to that language).
From this I got a CSV that gives the translation feasibility of each language. While researching imbalances in translation, specifically machine translation, I think the most basic thing we need to know is whether tools are available to translate a given language at all. For instance, 15 languages have 5 or fewer target-language options; 22 languages have fewer than 100; 73 languages have 134 or fewer, out of the 159 languages covered by machine translation in total. The language with the maximum number of available target-language options is es, with 145, which means it can't be translated into 14 languages. And so on.
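The counting described above can be sketched as follows; `pairs` is a hypothetical stand-in for the configuration-scraper (source, target) rows, and collecting into sets ensures each language is counted only once per direction:

```python
from collections import defaultdict

# Hypothetical (source, target) rows from the configuration scraper;
# duplicates are possible and must not inflate the counts.
pairs = [("en", "es"), ("en", "fr"), ("fr", "es"), ("en", "es")]

out_options = defaultdict(set)   # language -> distinct targets it can go to
in_options = defaultdict(set)    # language -> distinct sources it can come from

for src, tgt in pairs:
    out_options[src].add(tgt)
    in_options[tgt].add(src)

# "Translations out" / "translations in" as counts of distinct options.
translation_out = {lang: len(t) for lang, t in out_options.items()}
translation_in = {lang: len(s) for lang, s in in_options.items()}
```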

Oh! I forgot to add the figure and repository link in the previous comment. Here they are:

translation_availability_scatter_plot.png (612×1 px, 38 KB)

Repository Link

@awight I plotted another scatter plot, but this time from the public Content Translation data source. For this scatter plot I first prepared a CSV with the total counts of translated_from and translated_to for each of the 326 languages. From this CSV we can see that 183 languages have 15 or fewer articles translated from them (more than 56%). Also, the most articles are translated from English, and the most articles are translated to es.

translation_ratio_on_the_basis_of_article_count_all_languages.png (612×1 px, 35 KB)

In this scatter plot we can see that the points are quite clustered. If we remove the top 10 languages from the list, the plot becomes a little more spread out, since the article-count gap between the languages above the 10th position is very big.

translation_ratio_on_the_basis_of_article_count_top_10_lang_removed.png (612×1 px, 49 KB)

The diagram looks incomprehensible here in the image, but I have made a function for labelling the points which prevents the language names from overlapping, so if you run the script and zoom in, the diagram is readable.
Repository Link

@Abhishek02bhardwaj Thanks--just to mention, the official contribution period is over, but I'm still interested in chatting about these diagrams if you wish!

The scatterplots are looking great! Just removing the top 10 languages is a valid approach and it seems to have worked. You might also try a log-log scale which (I think?) preserves the ratio but should give more detail on the low end of the scale.

If possible, I would imagine the X and Y axes should use the same scale. That might mean forcing a square aspect ratio for the diagram, rather than letting the plotting package pick the maximum extent based on the data. It makes sense that the missing 10 languages are heavy "translate from" sources: summed over all languages, total translations from must equal total translations to, yet the diagram shows mostly translations "to".
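The two suggestions above (log-log scale, same scale on both axes with a square aspect) can be sketched in matplotlib as follows; the data values are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

x = [5, 40, 300, 8013, 64000]   # translations from (made-up values)
y = [12, 90, 150, 64571, 9000]  # translations to (made-up values)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x, y)
ax.set_xscale("log")
ax.set_yscale("log")
# Same limits on both axes so the y = x diagonal is meaningful, and a
# square aspect so equal to/from ratios look equal in both directions.
lo, hi = 1, 1e5
ax.set_xlim(lo, hi)
ax.set_ylim(lo, hi)
ax.set_aspect("equal")
ax.plot([lo, hi], [lo, hi], linestyle="--", linewidth=1)  # balance line y = x
fig.savefig("loglog_demo.png", dpi=100)
```

Points above the dashed line are net "translated to" languages, points below are net "translated from" sources, and distance from the line corresponds to the ratio on a log scale.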

Hello @awight, I tried to create a scatterplot diagram using the CSV file you provided to visualize the relationship between the language translation ratio and the number of Wikipedia articles.

translation ratio vs wiki article count.png (663×1 px, 94 KB)

> @Abhishek02bhardwaj Thanks--just to mention, the official contribution period is over, but I'm still interested in chatting about these diagrams if you wish!
>
> The scatterplots are looking great! Just removing the top 10 languages is a valid approach and it seems to have worked. You might also try a log-log scale which (I think?) preserves the ratio but should give more detail on the low end of the scale.
>
> If possible, I would imagine the X and Y axes should use the same scale. That might mean forcing a square aspect ratio for the diagram, rather than letting the plotting package pick the maximum extent based on the data. It makes sense that the missing 10 languages are heavy "translate from" sources: summed over all languages, total translations from must equal total translations to, yet the diagram shows mostly translations "to".

Hi @awight, yeah! I am aware of the contribution period deadline, but I found the task really interesting, which is why I wanted to keep working on it during the review period. I hope you won't mind small discussions.
Yes, you are right, I should have chosen a square graph in the first place. I think it won't be that hard to implement.

Screenshot (205).png (1×1 px, 262 KB)

So the above image is a screenshot from the CSV that has the article counts. In the beginning I thought English would be the most preferred option in both the translated-to and translated-from categories, but that is not the case; for translations into it, English actually stands at quite a low number. If you look at the list, "tr", which has only 8013 translations from it, has 64571 translations into it, almost double the figure for English. So as we move from the translated-from counts to the translated-to counts the picture changes a little. Also, about the languages with very low translation counts, I think we might have a clue to the reason. The cx-server repository contains the available pairs of translation combinations. These available pairs contain 159 unique languages, but the translation data (and also languages.csv) contains 326 languages, which directly implies that a big chunk of languages do not have automated translation available, and that might be an important factor in their low number of translated articles. I am still trying to think of a possible hypothesis for why some languages have much higher translated-to counts than translated-from counts (like "tr").
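Combining the two observations above, one could compute a to/from ratio per language and flag whether each language appears in cx-server's pair list. This is a minimal sketch with made-up totals and a made-up supported-language set ("xx" is a hypothetical unsupported language code).

```python
import pandas as pd

# Hypothetical per-language totals; real values come from the CX dump.
counts = pd.DataFrame(
    {"translated_from": [8013, 30000, 5], "translated_to": [64571, 33000, 2]},
    index=["tr", "en", "xx"],
)
# Languages appearing in cx-server's available-pairs list (assumed set).
mt_supported = {"tr", "en"}

counts["to_from_ratio"] = counts["translated_to"] / counts["translated_from"]
counts["has_mt"] = counts.index.isin(mt_supported)
print(counts.sort_values("to_from_ratio", ascending=False))
```

Sorting by the ratio surfaces the "tr"-like outliers, and the `has_mt` column lets you check whether missing machine translation lines up with low counts.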

{F36952940}

Also take a look at this. This scatter plot shows, for each language, the number of available options for translating in versus translating out. The languages fall into essentially four groups, which suggests that even the languages that do have translation options often have only a few of them, and that may discourage people from going and translating a whole article manually.

> Hello @awight, I tried to create a scatterplot diagram using the CSV file you provided to visualize the relationship between the language translation ratio and the number of Wikipedia articles.
>
> translation ratio vs wiki article count.png (663×1 px, 94 KB)

It looks like the Y axis is accidentally "wiki" rather than the language translation out vs. in ratio, so essentially random. This might account for why no patterns are emerging.

Exciting to see the wiki size showing up on a scatterplot! The log scale looks good, too.

> So the above image is a screenshot from the CSV that has the article counts. In the beginning I thought English would be the most preferred option in both the translated-to and translated-from categories, but that is not the case; for translations into it, English actually stands at quite a low number. If you look at the list, "tr", which has only 8013 translations from it, has 64571 translations into it, almost double the figure for English. So as we move from the translated-from counts to the translated-to counts the picture changes a little. Also, about the languages with very low translation counts, I think we might have a clue to the reason. The cx-server repository contains the available pairs of translation combinations. These available pairs contain 159 unique languages, but the translation data (and also languages.csv) contains 326 languages, which directly implies that a big chunk of languages do not have automated translation available, and that might be an important factor in their low number of translated articles. I am still trying to think of a possible hypothesis for why some languages have much higher translated-to counts than translated-from counts (like "tr").

You've uncovered strong evidence already, that integrated machine translation is driving a lot of the flow between languages. My thoughts are that analyzing the step changes when MT is enabled or disabled would show a conclusive effect. As an example, for a brief period MT was allowed when translating into English. There are also some papers recommended in T331200 which discuss the impact of machine translation on workflows and may help with modeling.
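The step-change analysis suggested above could be sketched as a before/after comparison of daily translation counts around a known MT cutover date. The series and the date here are made up; the real dates would come from CX release announcements.

```python
import pandas as pd

# Hypothetical daily counts of translations into one language, with MT
# enabled partway through the window (cutover date is an assumption).
daily = pd.Series(
    [10, 12, 11, 9, 30, 34, 31, 29],
    index=pd.date_range("2021-01-01", periods=8, freq="D"),
)
mt_enabled = pd.Timestamp("2021-01-05")

before = daily[daily.index < mt_enabled].mean()
after = daily[daily.index >= mt_enabled].mean()
print(f"mean before: {before:.1f}, after: {after:.1f}, step: {after - before:.1f}")
```

A real analysis would want longer windows on each side of the cutover, and ideally a comparison language with no MT change over the same period to control for seasonal effects.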

> {F36952940}

A cliffhanger! This image never came through, it seems to have a non-public access setting: https://phabricator.wikimedia.org/F36952940 (you can check by going to a private browser tab). I'm curious to see the groups you described.

Hi! Please consider resolving this task and moving any pending items to a new task, as GSoC/Outreachy rounds are now over, and this workboard will soon be archived.

As Outreachy Round 26 has concluded, closing this microtask. Feel free to reopen it for any pending matters.