
Compare config scraper output with config API
Closed, ResolvedPublic

Description

In T331201#8690623, @santhosh shared documentation for a public API to read configuration,

https://cxserver.wikimedia.org/v2/list/mt exposes cxserver's MT capabilities via an API with JSON output. This output is the authoritative source for production, as the config files are amended during deployment by production configuration. https://cxserver.wikimedia.org/v2?doc is the API spec for cxserver.

We want to see whether information is lost or changed by the config scraper, and one way to do that is to compare the API result with the scraper output. The data needs to be transformed into the same shape, in one direction or the other, to be compared. This is a one-time operation involving only a small amount of reusable logic, so we don't care which direction the transformation goes.

  • Read and parse JSON from the cxserver mt endpoint.
  • Select one of the CSV output files included in contributions for T331201: Extract cxserver configuration and export to CSV and download it to your machine, either by cloning the repository or from the web using GitHub's "raw" mode.
  • Transform the data so it has the same shape (see the sketch after this list). Note that sort order may also affect comparability.
  • Compare the configuration structures.
    • We don't need a detailed list of the differences if any, just an overview of what you see.
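
A minimal sketch of these steps, assuming the endpoint returns the shape {engine: {source: [targets, ...]}} plus a top-level "defaults" key, and that the chosen CSV uses the hypothetical column names source_language, target_language, translation_engine:

```
import json
import urllib.request

import pandas as pd

# Read and parse JSON from the cxserver MT endpoint.
with urllib.request.urlopen("https://cxserver.wikimedia.org/v2/list/mt") as resp:
    mt = json.load(resp)

# Flatten the nested JSON into one row per supported pair, skipping the
# "defaults" key because it names preferred engines, not supported pairs.
rows = [
    (source, target, engine)
    for engine, sources in mt.items() if engine != "defaults"
    for source, targets in sources.items()
    for target in targets
]
columns = ["source_language", "target_language", "translation_engine"]
api_df = pd.DataFrame(rows, columns=columns)

# Load one of the scraper CSVs and compare as sets of tuples, so that
# row order does not count as a difference.
scraper_df = pd.read_csv("supported_pairs.csv")  # hypothetical filename
api_pairs = set(api_df.itertuples(index=False, name=None))
scraper_pairs = set(scraper_df[columns].itertuples(index=False, name=None))
print(len(api_pairs - scraper_pairs), "pairs only in the API result")
print(len(scraper_pairs - api_pairs), "pairs only in the scraper output")
```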

Nice to haves:
If there are differences, can they be explained by something in cxserver source code? By a quirk of the scraper?

Event Timeline

Hello, @awight. I'm working on the task to compare data from the two sources in our project, but I've encountered a couple of issues that I'm not sure how to handle.

Firstly, one of the DataFrames is missing a column called 'is_preferred_engine'. I'm not sure if I should remove this column from the other DataFrame or consider it as a difference between the two sources. Could you please clarify what would be the best approach in this case?

Secondly, the column names in the two DataFrames are slightly different. One DataFrame has column names with underscores (e.g. 'source_language', 'target_language', 'translation_engine') while the other has column names without underscores (e.g. 'source language', 'target language', 'translation engine'). I'm not sure if I should rename the columns, add or remove anything to change the structure, or if these differences should be considered as errors. Could you please advise me on what would be the best way to handle this situation?

Also, I am new to programming, so I'm having some trouble understanding how to proceed with these issues. Any guidance or suggestions you could provide would be greatly appreciated. Thank you for your help!

Hello, @awight, @Simulo.

I used two approaches to observe differences between the files. I will explain both and give an overview of the differences:

Approach with human/superficial observation
In order to compare both files (the API result and the scraper output from my contribution):

  • I started by inspecting the file superficially with a viewer: https://jsonviewer.stack.hu/. I then transformed the JSON file to an equivalent CSV file with a function.
  • To adapt the shape, I added the preferred engine column.

Overview of the differences with human/superficial observation:

  • False positive result: The “defaults” engine and its respective pairs were included in the API result file. If you do not change the shape of the parsed file and the corresponding CSV column for the preferred engine, this may count as a difference.

    The explanation for this difference: the “defaults” key corresponds to the configuration file https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/cxserver/+/refs/heads/master/config/mt-defaults.wikimedia.yaml. The “defaults” engine alone has 188 pairs. This key should be ignored because it does not describe supported pairs but the preferred engines for those pairs (see the snippet below).
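
For instance, a one-line guard (assuming the parsed JSON is held in a dict named mt, as in the sketch in the task description) avoids the false positive:

```
# Drop "defaults" before flattening; it maps pairs to preferred
# engines rather than listing supported pairs.
mt_engines = {engine: pairs for engine, pairs in mt.items() if engine != "defaults"}
```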

Approach with machine/testable observation
Since observing differences beyond that was error-prone and difficult, I wrote a function that compares the two files line by line, and I also used the csv-diff library on the command line to compare both results. I used the following functions (an illustrative sketch follows this list):

  • I have one function that reads two given files, the transformed API result, and a given scraper output file, and checks if the lines in the second file are present and equal to the first file. It returns a count of the different lines, and it also records the different lines.
  • For a general comparison, I have another function that reads from a files directory and prints the count of different lines and the closeness or accuracy of the given scraper output files in relation to the API result.
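
The actual functions are in utils.py (linked below); this is only an illustrative sketch of the first one, with hypothetical names:

```
def count_line_differences(api_csv_path, scraper_csv_path):
    """Count lines of scraper_csv_path that are absent from api_csv_path."""
    with open(api_csv_path, encoding="utf-8") as f:
        api_lines = {line.rstrip("\n") for line in f}
    different = []
    total = 0
    with open(scraper_csv_path, encoding="utf-8") as f:
        for line in f:
            total += 1
            if line.rstrip("\n") not in api_lines:
                different.append(line.rstrip("\n"))
    accuracy = 1 - len(different) / total if total else 0.0
    return len(different), different, accuracy
```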

Overview of the differences with machine/testable observation:
Individual case:

  1. @Ahn-nath

I used the same function that compares the parsed API result file (CSV version) against a given CSV file to compare the scraper output from other contributions, and to test how biased my function could be. These were the results:

General comparison:

  1. @Emile-Daisy
  2. @Abhishek02bhardwaj
  3. @JaisAkansha
  4. @Anshika_bhatt_20
  5. @LeilaKaltouma

You may check the repository for the utils functions and the output results if interested:
Output results: https://github.com/ahn-nath/wikimedia-cxserver-config-parser/blob/main/compare_files/output_results.csv
Functions to parse and compare files: https://github.com/ahn-nath/wikimedia-cxserver-config-parser/blob/main/utils.py

Hello, @Anshika_bhatt_20. I hope to offer some guidance based on my understanding of the task, so you can at least move forward before the mentors respond.

Firstly, one of the DataFrames is missing a column called 'is_preferred_engine'. I'm not sure if I should remove this column from the other DataFrame or consider it as a difference between the two sources. Could you please clarify what would be the best approach in this case?

It depends. The instructions indicate that you need to adjust the data shape so that both datasets can be compared. This means that, depending on the direction you take, dropping or adding columns may be necessary so that the comparison happens on common ground. For example, if you transform from the JSON file, you would need to derive the preferred engine column for the transformed CSV from the “defaults” key so that it matches the scraper output (the “defaults” key is the equivalent of the preferred engine column). If you transform from the CSV file, you would need to drop that column or fold it back into a “defaults”-style key so that it matches the JSON structure. A hypothetical sketch of the first direction follows.
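
A hypothetical sketch, assuming a DataFrame api_df of flattened API rows and the parsed JSON mt; the “source-target” key format for “defaults” is my assumption, not something I have verified:

```
import pandas as pd

def add_preferred_engine(api_df: pd.DataFrame, mt: dict) -> pd.DataFrame:
    # Assumption: "defaults" maps "source-target" strings to an engine name.
    defaults = mt.get("defaults", {})
    api_df = api_df.copy()
    # Mark a pair as preferred when "defaults" names this engine for it.
    api_df["is_preferred_engine"] = [
        defaults.get(f"{src}-{tgt}") == engine
        for src, tgt, engine in zip(
            api_df["source_language"],
            api_df["target_language"],
            api_df["translation_engine"],
        )
    ]
    return api_df
```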

Secondly, the column names in the two DataFrames are slightly different. One DataFrame has column names with underscores (e.g. 'source_language', 'target_language', 'translation_engine') while the other has column names without underscores (e.g. 'source language', 'target language', 'translation engine'). I'm not sure if I should rename the columns, add or remove anything to change the structure, or if these differences should be considered errors. Could you please advise me on what would be the best way to handle this situation?

The task specifies that we “want to see whether information is lost or changed by the config scraper.” Information conveys meaning, so I would take meaning as the focus of the task.

The header row is not as relevant for the data comparison because, semantically, the names mean the same thing. So you can either normalize it so that it does not affect the comparison (rename the columns so they match) or skip it entirely if you know the header row is not relevant. Another example: some people may write “False”, some “false”, or even “0” as the value of the preferred engine column. You may count that as a difference and mention it, but beyond that representation, the rest of the line can mean the same thing: "nl,en,Google,False" is semantically equivalent to "nl,en,Google,false" but not literally equal, which is worth noting. A small sketch of this normalization follows.
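
A small sketch of that normalization, assuming pandas DataFrames and the column/value spellings discussed above:

```
import pandas as pd

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    # "source language" -> "source_language", etc.
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Map "False"/"false"/"0" and their true counterparts to real booleans.
    if "is_preferred_engine" in df.columns:
        df["is_preferred_engine"] = (
            df["is_preferred_engine"].astype(str).str.strip().str.lower()
            .isin(["true", "1"])
        )
    return df
```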

Also, I am new to programming, so I'm having some trouble understanding how to proceed with these issues. Any guidance or suggestions you could provide would be greatly appreciated. Thank you for your help!

I would download the https://cxserver.wikimedia.org/v2/list/mt file and use a JSON viewer to explore the data structure, then focus on adjusting that structure to match the CSV scraper output from my contribution or someone else's. That way you can design or use a parser that gives the JSON file the same structure and format as the CSV file. Once you see that they have the same structure/shape and column and row order, you can start observing differences and asking questions. For example: is data missing, do they have the same number of rows, and do the preferred engines specified in the “defaults” key match the preferred engine assignments of the language pairs? You can also do it the other way around, starting from the CSV file.

Hello @awight and @Simulo, here is my code repo for this task. As @Ahn-nath mentioned in her analysis/comparison, the output CSV file I was getting was not 100% accurate: in some language pairs where either the source or the target language code was "no", the value was read as False. I have corrected that error, and the CSV file I get now matches with 100% accuracy.
Here is my github repository that includes the code as well as the final output.
Github Repository.
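
One plausible cause, though this is my assumption rather than something confirmed in the scraper code: PyYAML follows YAML 1.1, where an unquoted no parses as the boolean False, which would swallow the Norwegian language code:

```
import yaml

print(yaml.safe_load("source: no"))  # {'source': False} under YAML 1.1 rules

# A post-hoc fix is to map the boolean back to the language code:
def restore_no(value):
    return "no" if value is False else value
```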

Out of curiosity, I did not stop at checking the accuracy/match percentage of my own output CSV file; I also checked some other files contributed to task T331201. I am listing my observations below.

  1. @Abhishek02bhardwaj:

File compared: supported_pairs.csv
Accuracy/Match %: 100%
The earlier version of this file, which @Ahn-nath used for her comparison, had a small glitch. I have fixed it, and it is now exactly the same as the API result.

  2. @Ahn-nath:

File compared: cx_server_parsed.csv
Accuracy/Match %: 100%

  3. @JaisAkansha:

File compared: supported_language_pairs.csv
Accuracy/Match %: 0.85%
I think the reason for the dissimilar results is that @JaisAkansha has yet to handle the handler files in her code, due to which the pairs from Google.yaml and Yandex.yaml are not matching (even the ones that are included, since their source language is not correct).

  4. @Anshika_bhatt_20

File compared: supported_pairs.csv
Accuracy/Match %: 99.04%
The reason for the small gap in match % is that @Anshika_bhatt_20 has yet to address the problem where the scraper reads "no" in the source or target language as "False".

  5. @Emile-Daisy

File compared: supported_language_pairs.csv
Accuracy/Match %: 0.85%
The reason for this is the same as mentioned for @JaisAkansha's CSV file.

  6. @LeilaKaltouma

File compared: langs.csv
Accuracy/Match %: 100%

Thank you for your response, @Ahn-nath. Your guidance is helpful and gives me a better understanding of how to approach the issues.

Hello @awight and @Simulo, here is my submission for this task: Github repository. I have checked the accuracy of all the other CSV files mentioned in task T331201. As @Abhishek02bhardwaj and @Ahn-nath mentioned, my CSV file was not getting 100% accuracy, so I went back and fixed the code. Here are my observations for the following CSV files:

  1. @Anshika_bhatt_20

CSV file compared: supported_pairs.csv
Accuracy percentage: 100%

  2. @Ahn-nath

CSV file compared: cx_server_parsed.csv
Accuracy percentage: 100%

  3. @JaisAkansha

CSV file compared: supported_language_pairs.csv
Accuracy percentage: 0.85%
Reason: Upon analyzing the results, it seems the disparity in accuracy is due to the handler files not being handled by @JaisAkansha's code. This results in mismatches for pairs coming from Google.yaml and Yandex.yaml, even for the ones where the source language is incorrect. I think it's important to address this issue in order to achieve consistent accuracy across all files.

  4. @Emile-Daisy

CSV file compared: supported_language_pairs.csv
Accuracy percentage: 0.85%
Reason: After reviewing the code, I observed that it doesn't seem to consider the preferred engines or the mt-defaults.wikimedia.yaml file. Additionally, it appears that non-standard YAML files, like Google and Yandex, are being ignored. This could be the reason for the significant difference in total lines between the scraper output and the handler files.

  5. @Abhishek02bhardwaj

CSV file compared: supported_pairs.csv
Accuracy percentage: 100%

  6. @LeilaKaltouma

CSV file compared: langs.csv
Accuracy percentage: 100%

Here is the accuracy_results.csv to check the accuracy of all the CSV files mentioned.

awight claimed this task.

Very nice to see that our scraping approach generally results in a 100% match with the API; I think this validates the methodology. Thank you @Ahn-nath, @Abhishek02bhardwaj, and @Anshika_bhatt_20 for this thorough treatment!

I'm going to treat this task differently than the others and close it now, since it's very conclusively finished. Perhaps the scraper task should be closed as well, now that we can more or less prove that we have several sets of correct results. If anyone is still working and has more to post here, please feel free to continue commenting.

@awight
Hello,
I am an Outreachy applicant but I am stuck on how to contribute to this project on GitHub.
Kindly guide me on how to go about it.

Thank you. Regards

Hello Keepandie, please reach out on Zulip. This task is closed since the goals were met, but feel free to go ahead with it if you think it would be fun. The basic idea is that you'll find data in two different formats (API result structured as JSON, and flattened CSV). The data must be compared between these formats.

Thank you for the feedback, I appreciate it

@Keepandie please don't edit the task description--or perhaps explain your intention and post a draft in a comment before editing the description.

@Aklapper

Sure, I made that contribution to the description; sorry for the confusion.

@awight, just to be sure, does this mean this particular task is closed?

@Kachiiee: Please see the line below the task title in the upper left corner of this page.