
Configuration evolution over time
Closed, ResolvedPublic

Assigned To
None
Authored By
awight
Mar 4 2023, 6:39 PM
Referenced Files
Restricted File
Mar 16 2023, 4:46 PM
F36914345: image.png
Mar 16 2023, 1:04 PM
Restricted File
Mar 15 2023, 2:11 AM
F36911432: data_history.csv
Mar 14 2023, 7:04 PM
F36908389: cities.csv
Mar 13 2023, 12:22 AM

Description

Language support for machine translations (and other software features) has changed over time. This task will develop a tool to track that evolution.

As a simplified substitute for the full configuration, create a new git repository containing a JSON or CSV file with a simple data structure, for example something like this:

city,temperature
Grand Forks,-41
Berlin,4.1
Oodnadatta,41

Make a few git commits with changes to this data, including adding and removing rows. In a separate repository, write a parser in your preferred programming language, which reads the data into a native structure in memory. Now build a "time machine" that plays back the data repository's git history and parses the data at each commit, storing the entire sequence in memory along with the timestamp of the git commit.

Export the sequence as a flat structure, in the same format as the original CSV but with an additional column for the commit timestamp and all data for that commit repeated.
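The playback and export steps described above can be sketched in Python. This is only a sketch of one possible approach, not a reference solution: the repository path and file name are placeholders, and `git show <commit>:<file>` is used to read each historical version without checking anything out.

```python
# Sketch of the "time machine": replay a data repository's git history,
# parse the CSV at each commit, and export a flat CSV with timestamps.
# REPO_PATH and DATA_FILE are placeholders, not part of the task.
import csv
import io
import subprocess

REPO_PATH = "."           # path to the data repository (assumption)
DATA_FILE = "cities.csv"  # tracked CSV file (assumption)

def git(*args):
    """Run a git command in the data repository and return its stdout."""
    return subprocess.run(
        ["git", "-C", REPO_PATH, *args],
        capture_output=True, text=True, check=True,
    ).stdout

def play_back_history():
    """Yield (timestamp, rows) for every commit, oldest first."""
    # %H = commit hash, %cI = committer date in strict ISO 8601
    log = git("log", "--reverse", "--format=%H %cI")
    for line in log.splitlines():
        commit, timestamp = line.split(" ", 1)
        # Read the file as it existed at that commit, without a checkout.
        content = git("show", f"{commit}:{DATA_FILE}")
        rows = list(csv.DictReader(io.StringIO(content)))
        yield timestamp, rows

def export_flat(out_file="data_history.csv"):
    """Write one row per (commit, data row), adding a timestamp column."""
    with open(out_file, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["timestamp", "city", "temperature"])
        writer.writeheader()
        for timestamp, rows in play_back_history():
            for row in rows:
                writer.writerow({"timestamp": timestamp, **row})
```

Reading historical file versions with `git show` avoids mutating the working tree, which a `git checkout`-based loop would do.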

Event Timeline

awight changed the visibility from "Public (No Login Required)" to "acl*outreachy-mentors (Project)".
awight renamed this task from task 4 to Configuration evolution over time.Mar 4 2023, 6:57 PM
awight updated the task description.
srishakatux changed the visibility from "acl*outreachy-mentors (Project)" to "Public (No Login Required)".Mar 6 2023, 8:32 PM

Hello @awight, @Aklapper, @srishakatux, and @Simulo, I wrote JavaScript code that walks through the history of a JSON file containing a few city names and their temperatures. You can access the code here. The script produces a CSV file with the added timestamps. Could you please take a look and tell me if I am on the right path? Thanks!

@awight @Simulo For this task, I have tried to create a tool to track the evolution of language support for machine translations (and other software features) over time. To do this, I created a new Git repository containing a CSV file with a simple data structure (city and temperature). I made several Git commits with changes to this data, including adding and removing rows.

In a separate repository, I wrote a parser in Ruby that reads the data into a structure in memory. I also built a "time machine" that plays back the data repository's Git history and parses the data at each commit, storing the entire sequence in memory along with the timestamp of each Git commit.

Finally, I exported the sequence as a flat structure in the same format as the original CSV, but with an additional column for the commit timestamp and all data for that commit repeated, as the task description specifies. {F36914689}

I have uploaded my project files to a new GitHub repository, which can be accessed at the following link: https://github.com/anshikabhatt/Time-Machine-for-Software-Configurations. It is possible that I made some mistakes along the way, so I am open to feedback and eager to learn how I can improve. Please have a look and let me know if I made any mistakes. Thank you!

@awight and @Simulo I re-read my code and realised that there were some structural glitches: the timestamps appended to the CSV file were not the actual commit times but something else. So I rewrote the code from scratch using a slightly different approach.
I am leaving a small summary of my work. This code is designed to track changes to a CSV file over time using Git. It includes four functions: parse_csv(), run_git_command(), get_commits(), and export_csv(). The parse_csv() function reads in a CSV file, converts it to a dictionary, and returns it. The run_git_command() function runs a given Git command and returns the output. The get_commits() function retrieves a list of all Git commits and, for each commit, retrieves the timestamp and CSV data and appends them to a list of dictionaries. Finally, the export_csv() function writes the accumulated data for each commit into a new CSV file that includes an additional column for the commit timestamp.
The main program calls these functions and exports the data history to a CSV file named 'data_history.csv'. The data is arranged in a flat structure, with each row representing a single commit, and includes the city, temperature, and timestamp for that commit.
Link to the GitHub repository: https://github.com/Abhishek02bhardwaj/Evolution-Tracker
The Updated CSV file -


I will be very grateful if you could take some time to review my submission. I have also updated the README file. Since I am new to writing open-source code, I do not have much expertise in code documentation, so I would be very thankful for any suggestions on improving my README.

I hear that it wasn't obvious to everyone that I've commented in GitHub. Please see the comments linked from the commit history of your repositories, like in this screenshot:

image.png (350×492 px, 18 KB)

Hello, @awight Thank you for letting me know about the comments on my GitHub repository. I apologize for not seeing them earlier. I just saw your comment today and I have made the changes according to your feedback. Thank you for taking the time to review my work and provide your valuable feedback. https://github.com/anshikabhatt/Time-Machine-for-Software-Configurations

@awight I think I might have forgotten to mention in this comment that I had to change my repository, which is why I removed the comment prior to it. I have tried to accommodate the suggestions you made on the earlier repository. I would be really grateful to have your views on it.
Repo link - https://github.com/Abhishek02bhardwaj/Evolution-Tracker

@awight , @Simulo

I have a few changes left to complete my solution for this task and would appreciate it if any of you could review the repository if the review period is not over yet.

Repository: https://github.com/ahn-nath/configuration-evolution-over-time.time-machine

Thanks.

With tests! Very cool.

It looks like you're aware of the issues, but testing against a live API is a heavily-debated topic... The benefit is that you'll find out right away if the upstream interface has changed. The drawbacks are many, such as requiring actual credentials to be available in continuous integration environments, fragility and flapping due to service disruption, slowness, etc. Usually we choose a compromise which depends on the capabilities of the programming environment, but comes down to picking a boundary between layers and mocking everything upstream of that. For example, if the HTTP layer can be mocked then you expect an outgoing request and return a constant response (example). If we trust "octokit" to remain stable, then we can replace that whole library with a mock by using dependency injection, etc. (example). It's never that fun, and of course it's just nice to have any tests at all in the repo here. I'm mentioning this bigger debate in case it's of interest.
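The dependency-injection variant mentioned above can be illustrated in Python (the project under review is JavaScript, so this is only a language-neutral sketch; `fetch_commits` and the URL are made up for illustration):

```python
# Boundary mocking: the HTTP layer is an injected function, so a test can
# replace it with a constant response instead of hitting a live API.
from unittest import mock

def fetch_commits(http_get):
    """Fetch commit metadata via an injected HTTP function that returns
    parsed JSON (here, a list of commit dicts)."""
    response = http_get("https://api.github.com/repos/example/repo/commits")
    return [c["sha"] for c in response]

def test_fetch_commits():
    # The test exercises our parsing logic with no credentials or network.
    fake_get = mock.Mock(return_value=[{"sha": "abc123"}, {"sha": "def456"}])
    assert fetch_commits(fake_get) == ["abc123", "def456"]
    fake_get.assert_called_once()
```

The trade-off is exactly the one described: the mock keeps CI fast and credential-free, at the cost of not noticing upstream interface changes.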

Also very impressive to see the "unhappy" test cases on broken input.

You can remove the moment dependency; I see you're directly manipulating the date strings, which, funnily enough, is the official recommendation of the moment library itself: https://momentjs.com/docs/

Probably unnecessary to try/catch the entire body of generateCSVFilebyLatestCommitsTracked, it's usually safer to let the exception continue bubbling up. Also, calling that function at the root level of app.js could be a bit problematic because it makes the file untestable.

I'm interested in last_date, this seems to be checkpointing so that the processing can be stopped and resumed, picking up at the place it left off. Or run again in the future to pick up new changes. Very sophisticated! It might be possible to extract that logic a bit further so that it can be used as a decorator, it comes to mind because I was also doing something similar recently: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/9/diffs?commit_id=24659edb05268aed826e71c19149ffb730c0855f
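A rough sketch of how that checkpointing logic could be pulled out into a reusable Python decorator (the file name, `last_date` key, and call shape are assumptions for illustration, not the reviewed code):

```python
# Checkpointing as a decorator: persist the last processed position so a
# run can be stopped and resumed, or re-run later to pick up new changes.
import functools
import json
import os

def checkpointed(state_file):
    """Wrap a function so it only processes items newer than the last run."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last = None
            if os.path.exists(state_file):
                with open(state_file) as f:
                    last = json.load(f)["last_date"]
            # The wrapped function receives the checkpoint and returns
            # (result, new_checkpoint).
            result, new_last = func(*args, since=last, **kwargs)
            with open(state_file, "w") as f:
                json.dump({"last_date": new_last}, f)
            return result
        return wrapper
    return decorator

@checkpointed("checkpoint.json")
def process_commits(commits, since=None):
    """Process only commits newer than the checkpoint."""
    fresh = [c for c in commits if since is None or c["date"] > since]
    last = fresh[-1]["date"] if fresh else since
    return fresh, last
```

Run twice over the same history, the second call processes nothing; run over a grown history, it picks up only the new commits.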

Hello @awight @Simulo, for this task I wrote Python code which uses the csv and subprocess modules to read data from a CSV file, initialize a Git repository, and then parse the data at each commit in the repository's history. The parsed data is stored in a list of dictionaries, with each dictionary representing a row in the CSV file. The parse_csv function reads the contents of a CSV file using csv.DictReader, which returns an iterator of dictionaries representing each row in the file, and appends each row to the data list. The subprocess module is used to initialize a new Git repository with git init, add the initial CSV file to the repository with git add, and commit the file with git commit. The code then loops over all commits in the Git history using git log and checks out each commit with git checkout. The timestamp of each commit is obtained using git show with the --format option and parsed using datetime.strptime. Finally, the code exports the data list as a flat CSV file with an additional column for the commit timestamp using csv.DictWriter.
Here is my contribution link for this task: https://github.com/akanshajais/Configuration_evolution_over_time. I am looking forward to your feedback and guidance.

It looks like you're aware of the issues, but testing against a live API is a heavily-debated topic... The benefit is that you'll find out right away if the upstream interface has changed. The drawbacks are many, such as requiring actual credentials to be available in continuous integration environments, fragility and flapping due to service disruption, slowness, etc. Usually we choose a compromise which depends on the capabilities of the programming environment, but comes down to picking a boundary between layers and mocking everything upstream of that. For example, if the HTTP layer can be mocked then you expect an outgoing request and return a constant response (example). If we trust "octokit" to remain stable, then we can replace that whole library with a mock by using dependency injection, etc. (example). It's never that fun, and of course it's just nice to have any tests at all in the repo here. I'm mentioning this bigger debate in case it's of interest.

I know that it is an industry standard to mock API tests when the API is expected to be stable. Nevertheless, I wanted the first test to call the API because, when testing the tool, it is hard to know whether the API is actually available for the credentials we use without making a real call; finding out in production, or while generating actual output, feels riskier to me in this case. I have also dealt with GitHub Actions for CI/CD, so making the credentials available, especially for GitHub and its API, was the only technical debt I considered serious enough to stop me at the time. Of course, I am open to discussing this particular line further and getting more feedback on how to improve, as I do have a tendency to always keep one test of this kind to make sure the API can receive calls.

You can remove the moment dependency; I see you're directly manipulating the date strings, which, funnily enough, is the official recommendation of the moment library itself: https://momentjs.com/docs/

Implemented!

Probably unnecessary to try/catch the entire body of generateCSVFilebyLatestCommitsTracked, it's usually safer to let the exception continue bubbling up.

I agree. I added the try/catch block as a placeholder for more specific exceptions that we should catch, as I could not find any specific case to catch at the moment.

Also, calling that function at the root level of app.js could be a bit problematic because it makes the file untestable.

This makes sense. I was testing the function and should have removed it. I have implemented this.

I'm interested in last_date, this seems to be checkpointing so that the processing can be stopped and resumed, picking up at the place it left off. Or run again in the future to pick up new changes. Very sophisticated! It might be possible to extract that logic a bit further so that it can be used as a decorator, it comes to mind because I was also doing something similar recently: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/9/diffs?commit_id=24659edb05268aed826e71c19149ffb730c0855f

Interesting. Thanks for sharing the repository. I noticed that some repositories, especially the one of interest for the extension of this task, can have an arbitrary number of commits, so having to reprocess everything on each run can be a very expensive operation.

Thanks for the feedback, @awight.

Hi! Please consider resolving this task and moving any pending items to a new task, as GSoC/Outreachy rounds are now over, and this workboard will soon be archived.

As Outreachy Round 26 has concluded, closing this microtask. Feel free to reopen it for any pending matters.