Extract cxserver configuration and export to CSV
Closed, Resolved · Public

Assigned To
None
Authored By
awight
Mar 4 2023, 6:16 PM
Referenced Files
Restricted File
Mar 19 2023, 5:01 AM
F36917931: supported_pairs.csv
Mar 18 2023, 9:04 PM
F36907464: supported_pairs.csv
Mar 12 2023, 9:42 AM
F36906942: supported_pairs.csv
Mar 11 2023, 11:52 PM
F36905478: supported_pairs.csv
Mar 10 2023, 6:42 PM
F36899664: supported_pairs.csv
Mar 9 2023, 4:22 PM

Description

Machine translation only exists for certain language pairs, and the Content Translation service only supports some of those. A set of YAML files under "config/" in the cxserver repository determines which languages are supported by the service.

Write a parser for these files and create a single flat, in-memory structure with all of the supported pairs. Export this data as a CSV (using a library such as Python's built-in csv module) of all pairs, with at least the following columns:

source language | target language | translation engine | is preferred engine?
de              | en              | DeepL              | true

The configuration files have several different file structures. Most have the source as the top-level key, and target languages as a list of values under that key. Watch out for the "handler" key, which indicates a non-standard interpretation for the file. Some YAML files should be ignored, currently: MWPageLoader.yaml, languages.yaml, JsonDict.yaml, Dictd.yaml and mt-defaults.wikimedia.yaml. Here is the configuration showing how the various YAML files are wired into the application—as you can see, it's safe to assume that the config file base name is the same as the translation engine name. You can filter the filenames either with an allowlist, a blocklist, or by parsing the main configuration to find an exact list of valid files.

One possible approach would be to adapt the existing cxserver source, reusing its built-in config import, and then transforming the data once it's already loaded into memory in a more consistent structure. Another approach is to pick your favorite programming language, find a YAML library, and write the parser from scratch. This latter approach is probably going to be the simpler option.

Please consider the mt-defaults.wikimedia.yaml file and what its effect might be on the supported translation pairs and default translation engine for each pair.
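
For illustration, here is a minimal sketch of the from-scratch approach in Python, using PyYAML and the built-in csv module. The flat "source: [targets]" layout and the "source-target" key format of mt-defaults.wikimedia.yaml are assumptions to verify against the actual files:

import csv
import os
import yaml

CONFIG_DIR = "config"  # a checkout of the cxserver repository
IGNORED = {"MWPageLoader.yaml", "languages.yaml", "JsonDict.yaml",
           "Dictd.yaml", "mt-defaults.wikimedia.yaml"}

# Assumed format: "source-target" keys mapping to a default engine name.
with open(os.path.join(CONFIG_DIR, "mt-defaults.wikimedia.yaml")) as f:
    defaults = yaml.safe_load(f)

rows = []
for filename in sorted(os.listdir(CONFIG_DIR)):
    engine, ext = os.path.splitext(filename)  # base name doubles as engine name
    if ext != ".yaml" or filename in IGNORED:
        continue
    with open(os.path.join(CONFIG_DIR, filename)) as f:
        config = yaml.safe_load(f)
    if "handler" in config:
        continue  # non-standard file (e.g. transform.js); handle separately
    # Note: PyYAML parses a bare "no" (Norwegian) as the boolean False;
    # real code should quote it or use yaml.BaseLoader.
    for source, targets in config.items():
        for target in targets or []:
            preferred = defaults.get(f"{source}-{target}") == engine
            rows.append([source, target, engine, preferred])

with open("supported_pairs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["source language", "target language",
                     "translation engine", "is preferred engine?"])
    writer.writerows(rows)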

Event Timeline

Removing assignee, as several people can work on this in parallel.

@awight @Simulo @srishakatux I have tried to write some code in Python to make a parser for these files, create a single flat, in-memory structure with all of the supported pairs, and export this data as a CSV of all pairs with at least the required columns. I am attaching the resulting CSV file and also a link to the GitHub repository in which I have added the Python code (keeping it private to preserve the code's privacy, if that is fine).
Link to the GitHub repository - https://github.com/Abhishek02bhardwaj/Extract-cxserver-configuration-and-export-to-CSV
The CSV file that I got as a result -

> a link to the GitHub repository in which I have added the Python code (keeping it private to preserve the code's privacy, if that is fine).

Hi @Abhishek02bhardwaj, nice to see this contribution! I would encourage you to make the repository public as we will be working in an open-source style for this project, and also because your code might be helpful for other participants and collaborators. However, it's okay if you prefer to keep this private for now--please add GitHub users adamwight and jdittrich to the project so that we can review.

From the CSV, I can see that only one translation engine (Apertium) appears. Maybe this value is being accidentally hardcoded, or maybe the parser still needs to be extended to process all files in the directory?

@Abhishek02bhardwaj there are also some non-language values peppered throughout the CSV, such as "removableSections" and "languages", so perhaps the parser needs to become aware of the various file formats which appear in that directory, and which files should be included or excluded from processing?

@awight Thanks for taking the time to review my contribution. I have made the repository public, so now it is easier for anyone to access and may help others improve their work too.

> @Abhishek02bhardwaj there are also some non-language values peppered throughout the CSV, such as "removableSections" and "languages", so perhaps the parser needs to become aware of the various file formats which appear in that directory, and which files should be included or excluded from processing?

Actually, while going through the YAML files I found that some files were not in the same format as the rest. In two of the YAML files the source language was missing, which is why the parser adds the file name to the source-language column. I was focused on getting the parser working, so I set that issue aside to address later.

> From the CSV, I can see that only one translation engine (Apertium) appears. Maybe this value is being accidentally hardcoded, or maybe the parser still needs to be extended to process all files in the directory?

Yes, the CSV has only one translation engine, Apertium. Actually, in the YAML files there is no key named "engine". I think this is because the file names are the engine names (this is my guess), but I had no idea how to use the file name as the engine name, so I used Apertium as the engine for all translations and marked the value under "is preferred engine?" as false.

@awight Regarding the mt-defaults.wikimedia.yaml file, it sets the default translation engines to be used for each language pair if no other engine is specified in the configuration files. This file does not define the language pairs themselves, so it does not affect the supported translation pairs. I wanted to use it in the parser but again I felt that first I should address the engine name issue.

Thanks, this brings up some excellent questions. I've tried to summarize a bit more of what we're learning about the various file formats, in the task description. More comments are attached to the commit in GitHub, thank you for publishing!

@awight I am not sure if JsonDict.yaml should also be ignored because it looks fine to me.

I think you're right—it seems to be important, although to a slightly different service than bulk machine translation. The source code calls the concept a "dictionary" and I think it's for translating one word at a time. Here it's first loaded into config: https://github.com/wikimedia/mediawiki-services-cxserver/blob/master/lib/Config.js#L38 . Ultimately this seems to be published under the /dictionary route: https://github.com/wikimedia/mediawiki-services-cxserver/blob/master/lib/routes/v2.js#L28 , available on request from a user's browser for a dictionary tool. Documentation is here: https://cxserver-beta.wmcloud.org/v2?doc#!/Dictionary/get_v1_dictionary_word_from_to_provider . Words can be requested like this: https://cxserver-beta.wmcloud.org/v2/dictionary/cocer/es/ca/JsonDict . My outsider's reading of the code suggests that dictionaries are only used along with "section translation". The testing site, when given ca -> es (https://test.m.wikipedia.org/w/index.php?title=Special:ContentTranslation&from=ca&to=es and https://test.m.wikipedia.org/w/index.php?title=Special:ContentTranslation&from=ca&to=es&page=Hist%C3%B2ria%20colonial%20d%27Am%C3%A8rica%20del%20Nord&sx=true#/sx/sentence-selector), should expose the dictionary somehow, but I don't see where this happens yet.

Nevertheless, I think the outcome is that JsonDict and Dictd dictionaries *might* be important or might become important, but we should consider these separately from bulk machine translation which is called "mt" in the source code.

Okay. I will treat these files separately.

I also understand that using CSV files can make it easier to organize and analyze data related to the project, and can also make it easier to collaborate with other contributors who may be working on different aspects of the project. However, it's important to keep in mind that CSV files can be complex to work with, particularly for contributors who are not familiar with data analysis or programming. Therefore, it may be important to provide guidance and support to contributors who are working with CSV files and to ensure that data is stored and exchanged in a standardized format that can be easily understood and used by everyone involved in the project.

@awight I have updated the GitHub repo and made some changes in the parser to accommodate the changes you suggested. The following changes can be seen in the CSV file:

  1. The engine name is now the name of the file, just as it was supposed to be.
  2. I have removed the unwanted handlers that I added while testing.
  3. I have excluded the files that were supposed to be ignored.

The following are the issues that I am yet to address:

  1. The parser takes into consideration only the first source language of the file (since I hardcoded that while testing). We need all the source languages and their respective target languages. To accomplish this, I will wrap the code snippet that accesses the target languages for a source language in a loop and use try/except to handle the error that might arise at the end of the list. I am aware that this explanation might not be sufficient to explain what I am trying to do, but I just wanted to keep it here since it might help someone else too.
  2. I am yet to understand how to use the transform.js handler. @awight, I need a bit of help regarding that. I am not really sure how I can use the transform.js file to get the source and target language pairs from Google.yaml and Yandex.yaml. I would really appreciate it if you could guide me on that.

Link to the repo (just to make it easier to access) - https://github.com/Abhishek02bhardwaj/Extract-cxserver-configuration-and-export-to-CSV
The updated CSV file -

@awight @srishakatux @Simulo The first of the two issues I listed in my previous comment has been addressed in my most recent commit to the repository. All the language pairs for the respective engines are now in the CSV file. Just to make it more accessible, I am adding the CSV file here along with the GitHub repository link. The only issue left to address in this task is how to use the transform.js handler to access the source and target languages. I am still trying to figure that out and would really appreciate any kind of help.
GitHub repository link - https://github.com/Abhishek02bhardwaj/Extract-cxserver-configuration-and-export-to-CSV
Updated CSV file -

Hey @Abhishek02bhardwaj, can you tell me about the issue in detail? Where exactly are you facing the issue?

@Abhishek02bhardwaj I think that to use the transform.js handler to access the source and target languages, you would need to modify the code in the supported_pairs.py file to include logic to parse the transform.js file. Please correct me if I am wrong.

> Hey @Abhishek02bhardwaj, can you tell me about the issue in detail? Where exactly are you facing the issue?

@Anshika_bhatt_20 Hi, Anshika. Actually the parser is almost done; the only thing left is to deal with the two YAML files which do not use the standard configuration and instead use a handler file, "transform.js".
"transform.js" is a JavaScript file that exports a class called TransformLanguages using module.exports.
The first line of the file is "use strict";, which enables strict mode in JavaScript, providing better error handling and preventing certain types of mistakes.
The TransformLanguages class takes a configuration object as a parameter in its constructor. The configuration object has two properties: languages and notAsTarget.
The languages property is an array of language codes that will be used to create a matrix of languages. The notAsTarget property is an optional array of language codes that should not be included as target languages in the matrix.
The class has a getter called languages which creates and returns a matrix of languages. The matrix is an object whose keys are the language codes from the languages property and whose values are arrays of language codes that are not the same as the key and are not included in the notAsTarget property. The englishVariants array is used to exclude variants of English from being included as target languages for each other.
Finally, the TransformLanguages class is exported so that it can be used in other modules of a JavaScript application.
This is all of my understanding of "transform.js". I am trying to find a way to use this handler in the parser to get the target and source languages for the Google.yaml and Yandex.yaml files.
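
For anyone following along, here is a rough Python sketch of that logic as I understand it (the names and the English-variants list are my own reconstruction, not the actual cxserver code):

# Rough reconstruction of the transform.js cross-product logic described
# above; not the actual cxserver code. The variant list is illustrative.
ENGLISH_VARIANTS = {"en", "simple"}

def transform_languages(languages, not_as_target=()):
    """Return a {source: [targets]} matrix from a flat language list."""
    matrix = {}
    for source in languages:
        targets = [
            target for target in languages
            if target != source
            and target not in not_as_target
            # English variants should not be targets for each other.
            and not (source in ENGLISH_VARIANTS and target in ENGLISH_VARIANTS)
        ]
        matrix[source] = targets
    return matrix

# Example: transform_languages(["en", "simple", "de"])
# -> {'en': ['de'], 'simple': ['de'], 'de': ['en', 'simple']}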

> @Abhishek02bhardwaj I think that to use the transform.js handler to access the source and target languages, you would need to modify the code in the supported_pairs.py file to include logic to parse the transform.js file. Please correct me if I am wrong.

@Anshika_bhatt_20 You mean the "first working test.py"? Yes, in the parser code I have handled the two types of files separately:

  1. Files which use the standard configuration (these have been dealt with and are included in the CSV file)
  2. Files which do not use the standard configuration and instead rely on the handler file "transform.js" to deal with the format.

I am trying to figure out the logic to use the handler file.

> This is all of my understanding of "transform.js". I am trying to find a way to use this handler in the parser to get the target and source languages for the Google.yaml and Yandex.yaml files.

Yes, that's correct. The code already has some handling for transform.js files, but it might need to be modified or extended to handle the specific cases in those files. One way would be to extend the existing "handler" block with additional logic for those cases.

> This is all of my understanding of "transform.js". I am trying to find a way to use this handler in the parser to get the target and source languages for the Google.yaml and Yandex.yaml files.

To handle the Google.yaml and Yandex.yaml files, which use the Transform.js handler, you can use the PyYAML library to parse YAML files and extract the relevant information.

> I am trying to figure out the logic to use the handler file.

To access the source and target languages in files that use the transform.js handler, you will need to parse the file using the js2py library. I hope this resolves your issue. Let me know if it helps.

@Anshika_bhatt_20
Okay, thank you for the advice. I'll definitely try it. Right now I have converted the transform.js handler into a Python file (that I wrote myself); I think it should work.

@awight I have updated the GitHub repository with the updated code and CSV file. To address the last remaining issue I have used a slightly different approach. Instead of parsing the "transform.js" file from the config folder and using it to generate the source and target language pairs, I have applied the logic of "transform.js" directly in my code. This gives us two benefits:

  1. It reduces execution time (though very slightly), since I didn't have to import another library to run the JavaScript file.
  2. It keeps the code simple to understand (or at least I hope so).

I would be really grateful if you could take a look at the repository and share your feedback. Thank you.
Github Repository Link - https://github.com/Abhishek02bhardwaj/Extract-cxserver-configuration-and-export-to-CSV
Updated CSV File -

@Anshika_bhatt_20 Hey Anshika, do you mind taking a look at the repository and the CSV file and sharing your views? I'd be really thankful.

@Abhishek02bhardwaj Overall, the code looks good and should work as expected, but it would always be a good idea to ask @awight and @Simulo for advice on this. They have more experience and can provide more guidance. Good luck!

> I also understand that using CSV files can make it easier to organize and analyze data related to the project, and can also make it easier to collaborate with other contributors who may be working on different aspects of the project. However, it's important to keep in mind that CSV files can be complex to work with, particularly for contributors who are not familiar with data analysis or programming. Therefore, it may be important to provide guidance and support to contributors who are working with CSV files and to ensure that data is stored and exchanged in a standardized format that can be easily understood and used by everyone involved in the project.

Interesting point. I agree that CSV is not a great format and has many non-standard quirks, but I think its shortcomings can mostly be ignored for this task, since CSV can be written using a library such as Python's built-in csv module, which hides the details. Also, we're not reading the data yet so 100% compatibility is unnecessary, and the text values will be very simple, so we don't have to deal with issues such as how to store a literal comma (usually with quotes, like: a,"1,2",b).
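
For example, the csv module quotes such a value automatically:

import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(["a", "1,2", "b"])
print(buf.getvalue())  # prints: a,"1,2",b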

Your comment is a helpful reminder that I should add "use a library" to the task description, thank you!

Change 897802 had a related patch set uploaded (by Awight; author: Awight):

[mediawiki/services/cxserver@master] Write test for transform.js

https://gerrit.wikimedia.org/r/897802

> I have converted the transform.js handler into a Python file (that I wrote myself)

This sounds like a good approach, to reverse-engineer the JS logic and reimplement it. The resulting code will be readable, flexible, reusable, and testable. Best of all, in order to port the logic we must understand how it works.

Which brings up the only risk: that we might have misunderstood some detail of the original logic. Ideally the cxserver repo would have included some configuration tests and fixtures which illustrate usage and edge-case handling. I've written a small patch to do that. Please see this file, which exercises the logic: https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/897802/1/test/mt/transform.test.js .

I think you've ported the logic correctly, and "notAsTarget" doesn't matter for this task because it's currently not used in any of the config files.

@awight Since "notAsTarget" hasn't been used in the files, Is it okay ignore it for now.
I wanted to ask one more thing. Is it okay now if I record my submission of this task on the Add a Contribution page of Outreachy and should I mark it accepted/merged or not.

> Is it okay now if I record my submission of this task on the Add a Contribution page of Outreachy, and should I mark it accepted/merged or not?

Yes, please do record your contribution on outreachy.org. I'll have to get back to you about the accepted/merged question; we haven't done this part of the process for any contributions yet, and I want to be sure that we apply the same criteria to everyone. Great work making it through this task and helping us adapt to complications such as unused config files.

@awight Thank you very much. It was really interesting working on this task and thanks to your valuable mentorship I learned a lot of new things in the process.

Hello, @awight and @Simulo.

I just updated the README file of my solution for this task and would appreciate your feedback, if possible. Thanks in advance for your time.

Repository: https://github.com/ahn-nath/wikimedia-csv-parser

@awight, it would be nice to keep Language-Team informed about the initiative here. CX-cxserver is an active project by our team.
cc @Pginer-WMF

I am not completely sure about the problem we are trying to solve here, but https://cxserver.wikimedia.org/v2/list/mt exposes cxserver's MT capabilities via an API with JSON output. This output is the true source for production, as the config files are amended at deployment time by production configuration. https://cxserver.wikimedia.org/v2?doc is the API spec for cxserver.

Hi @santosh, thanks for the note! I'd love to chat in Slack or on a call whenever you have the time. The context of this work is a small research project; see the parent task here and on metawiki. We're trying to understand what's driving the choice of language pairs for translation, whether it's an organic choice or something related to the software or suggestion algorithm. I'm also hoping to learn about machine translation, which is the subgoal of this task and T331202: Configuration evolution over time: to reverse-engineer the historical configuration of MT availability in the software and see if there might be a "natural experiment" we can learn from, for example any step changes in translation behavior around the time that MT becomes available for a language pair.

To be clear, none of the work for this project is intended to change cxserver directly. Some of the potential outcomes of the project might be recommendations for software change or an experimental design for an intervention that shifts the relative number of translations between languages, but +1 of course your team would make the final decisions about how to proceed.

> I am not completely sure about the problem we are trying to solve here, but https://cxserver.wikimedia.org/v2/list/mt exposes cxserver's MT capabilities via an API with JSON output. This output is the true source for production, as the config files are amended at deployment time by production configuration. https://cxserver.wikimedia.org/v2?doc is the API spec for cxserver.

Great, thank you for these pointers! I'll start looking at how production overrides are made.
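
For reference, a quick way to pull that list (the response structure should be checked against the API spec rather than assumed):

import requests

# Fetch cxserver's production MT capabilities; inspect the structure
# before relying on it.
response = requests.get("https://cxserver.wikimedia.org/v2/list/mt", timeout=30)
response.raise_for_status()
mt_capabilities = response.json()
print(list(mt_capabilities)[:5])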

Hello @awight and @Simulo. Here is my submission for the task: GitHub repository. Thanks!

Hello @awight and @Simulo, this is my submission for this task: https://github.com/anshikabhatt/Extract-cxserver-configuration-and-export-to-CSV. Please have a look and give me your valuable feedback. Thanks in advance.

@awight It took me a while, but I was finally able to use the functionality of mt-defaults.wikimedia.yaml in the parser. Now the value of "is preferred engine?" is no longer false by default for every row, and I have one less file to ignore. Please have a look at the repository and share your views. Thanks in advance.
Github Repository Link
Updated CSV File -

I learned a lot from reading through the different implementations, thank you everyone who has worked on the problem so far! I would suggest reading through each other's contributions if you haven't already.

For those who haven't included tests, the question I like to ask about my own code is how confident I am that it does what I expect. If the logic is trivial, maybe a test is unnecessary. But as soon as there's even one edge case or slightly tricky question, I don't have much faith in the code unless there are tests. Writing tests also offers some nice benefits such as making it easy to clean up the code (you can check that your rewrite didn't break anything, without having to review in detail), and helping with structural issues (eg. hardcoded production values, long functions that are hard to test so must be broken into more logical units). There's no "credit" involved, I'm mostly just offering a chance to play more with the code if you find it interesting, and get review.

One issue that jumped out is that I should have been more explicit about what mt-defaults.wikimedia.yaml is doing here. My understanding is that the default is only necessary if multiple translation engines exist for a language pair, and even when multiple engines are available, there isn't necessarily a default. I think the table as specified will be good enough: "is preferred engine" can be answered by whether the engine is listed as a default. In the future we can do an additional pass over the list which also calculates answers for "is this the *only* engine" and "are there several engines, but none are marked as default", but that's out of scope for this task.
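
Concretely, assuming mt-defaults.wikimedia.yaml maps "source-target" keys to an engine name, the check could be as small as:

import yaml

with open("config/mt-defaults.wikimedia.yaml") as f:
    defaults = yaml.safe_load(f)

def is_preferred_engine(source, target, engine):
    # True only when this engine is listed as the default for the pair.
    return defaults.get(f"{source}-{target}") == engine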

Thank you so much for your guidance and your valuable feedback @awight

@awight, @Simulo

Hello,

Thank you for the valuable feedback I received on my repository. I will be working on all suggestions so that I can finish my Outreachy contribution. I have made the code private until all of my work is done on both repositories (for two different tasks) and added you (both) as contributors.

@awight Thank you so much for your time and guidance. I wanted to let you know that I made the changes you suggested in the code. I have also added the necessary tests to ensure that the code is working as expected. Could you please take a look and let me know if there are any further changes I should make? Thank you for your feedback, it was very helpful.

Hello @Simulo and @awight, I tried this task to record my contribution, with the approach of reading all the .yaml files in the config directory of the cxserver repository except for the files listed in the description. For each file, it reads the YAML data, extracts the supported language pairs, and appends them to the supported_pairs list, then stores them in a CSV file.
Here is my contribution link: https://github.com/akanshajais/Extract-cxserver-configuration-and-export-to-CSV. Looking forward to your feedback and suggestions.
I am also attaching my resulting CSV file.
{F36918181}

Hello @awight and @Simulo, thank you for your guidance. As you mentioned in your earlier comment, I have included tests for my program in the repository. I would be really thankful if you could take some time to review them. I am attaching the GitHub repository link for your kind reference.
Github Repository Link
There is one more thing I wanted to ask for help with. I want to discuss the prospective Outreachy internship project timeline with the mentors, but I don't know how I should do it. I thought maybe I could make a private GitHub repository and add our mentors as contributors so that they can review it and give their valuable suggestions and guidance, if that is okay. Also, is there any specific format or example that the mentors would like us to follow? I am fairly new to writing proposals and prospective timelines and would really appreciate it if the mentors could guide me on how to start.
Also, are there any community-specific questions that the mentors would like us to answer?
Thank you in advance.

> Here is my contribution link: https://github.com/akanshajais/Extract-cxserver-configuration-and-export-to-CSV. Looking forward to your feedback and suggestions.

Hello @JaisAkansha, while going through your repository I found something that I dealt with while doing the task, so I thought I could highlight it for you. The files that the parser has to parse include two (Google.yaml and Yandex.yaml) which do not follow the standard configuration; instead they use a handler named "transform.js", as you have handled in your code. In these two files the lists of source and target languages are not given the way they are in the other files; instead there is only a single list of languages and a handler. For these files the source and target language pairs are generated by a simple cross product. The handler makes sure of three things:

  1. English is not translated into another variant of English (e.g. "simple").
  2. The simple cross product of languages is generated correctly.
  3. "notAsTarget" languages do not appear in the target language list.

To achieve this I reverse-engineered the logic of transform.js and applied it directly in my code. You can check that in my GitHub repository if you want to. You might also want to try a different approach, which I would really look forward to learning from.
Besides this, there is the mt-defaults.wikimedia.yaml file that contains the preferred engine values for different source and target language pairs, which you might want to use to update the "is preferred engine?" column of your supported_language_pairs.csv file.
You might also want to add a test to your code.
Hope this helps :)

The mt-defaults.wikimedia.yaml file contains default settings for machine translation engines, such as the default engine to use for a given language pair when no specific engine is specified. It does not directly affect the list of supported translation pairs, but it may indirectly affect behavior by changing the default engine for certain pairs. For example, if the default engine for the English-French pair changes from Google Translate to Microsoft Translator, and no specific engine is specified for a given translation request, the translation will use Microsoft Translator instead of Google Translate.

@awight and @Simulo, a review and feedback would really be appreciated, as I wanted to make sure I participate in this particular task. This is a link to my repo:

https://github.com/Kachiiee/outreachy_cxserver_config_extraction.git

After struggling quite a bit to find my way around this particular task, I finally came up with something.

I would appreciate feedback on my attempt, @awight and @Simulo.

> This is a link to my repo: https://github.com/Kachiiee/outreachy_cxserver_config_extraction.git

@Kachiiee Hi! That's really awesome work. I wonder why I didn't think of creating different CSV files for different engines; I think I should try that now. One thing I wanted to point out: there is a miscellaneous case in the YAML files. When the source or target language is "no", the reader takes it as the boolean False and writes False in place of the language name. I checked the CSV file you uploaded in your repository and I think the same problem is occurring there, so you might want to check that. It would also be a nice idea to include a test.
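
For reference, this is the YAML 1.1 boolean quirk: PyYAML resolves a bare no/yes/on/off to a boolean. A quick demonstration, plus one possible workaround:

import yaml

print(yaml.safe_load("no: [da, sv]"))
# {False: ['da', 'sv']} -- "no" (Norwegian) is parsed as the boolean False

# One workaround: BaseLoader leaves every scalar as a plain string.
print(yaml.load("no: [da, sv]", Loader=yaml.BaseLoader))
# {'no': ['da', 'sv']}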

Thanks for the feedback. I'll make some adjustments to that.

Please, how do we get a review of our timelines, as the deadline is fast approaching?

@Kachiiee

  • I love the verbose readme!
  • You could add a tiny note about which libraries are required, I think it's just pip3 install PyYAML ?
  • Very nice that modules are split into separate files, this helps prevent accidental leakage between concepts and demands some attention to interfaces between modules.
  • On that note, I would suggest structuring fileparser.py more like the other modules, in other words putting all the logic into functions so that loading the source code file doesn't cause anything to execute. This makes the code testable in the future. A normal Python idiom is to end the file like so:
if __name__ == '__main__':
    main(sys.argv)

That way, running "python3 fileparser.py" will run the code as before, but test code can load the module without causing side-effects, can run the main module with various arguments, etc.

  • I could be wrong, but I think the loop in transform.py could be slightly simplified to for lang in self.langs: .
  • Splitting the file on "." could be made a bit more robust (eg. against a file with two dots like "foo.engine.yaml") by using os.path.basename and os.path.splitext.
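
For instance:

import os.path

path = "config/foo.engine.yaml"
base, ext = os.path.splitext(os.path.basename(path))
print(base, ext)  # prints: foo.engine .yaml -- only the final extension is split off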

Please don't worry about making changes unless it would be fun, it's not a requirement for the program...

> Here is my contribution link: https://github.com/akanshajais/Extract-cxserver-configuration-and-export-to-CSV. Looking forward to your feedback and suggestions.

  • I like that most of the magic constants are pulled up into global variables.
  • The file path "C:/Users/..." should also be extracted into a constant, or even better come from the command line.
  • filename operations could be supported by os.path, eg. basename and splitext.
  • I can't tell what the "_to_" logic is doing, but I think it's supposed to emulate the "config/transform.js" script? In case it's helpful, here's a test case that shows what this file does: https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/897802/1/test/mt/transform.test.js , this is applied to any of the configs with handler: transform.js.
  • mt-defaults.wikimedia.yaml should be considered

The explanation of the algorithm in the readme is interesting, it's a neat idea although a bit difficult to connect to the code. The trend has been towards so-called self-documenting code, which doesn't necessarily mean adding comments but means that the code reads as much as possible like what it does, for example: supported_language_pairs = parse_all_yaml_config_files(config_directory) .

@awight, thanks for the feedback. Even if it's not a requirement of the program, I will make all the adjustments needed, as it will help me learn more about how to structure my work well.

Change 897802 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Write test for transform.js

https://gerrit.wikimedia.org/r/897802

Change 923291 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2023-05-25-093623-production

https://gerrit.wikimedia.org/r/923291

Change 923291 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2023-05-25-093623-production

https://gerrit.wikimedia.org/r/923291

Mentioned in SAL (#wikimedia-operations) [2023-05-25T10:00:09Z] <kart_> Updated cxserver to 2023-05-25-093623-production (config: language pairs transform fix + T331201)

Hi! Please consider resolving this task and moving any pending items to a new task, as GSoC/Outreachy rounds are now over, and this workboard will soon be archived.

As Outreachy Round 26 has concluded, closing this microtask. Feel free to reopen it for any pending matters.