Reference: T246435
==Improve the framework to transfer files over the LAN==
===Profile Information===
Name: Ajumal P A
IRC nickname on Freenode: batm
Web Profile: https://github.com/ajupazhamayil
Location (country): India
Typical working hours: 05:00 PM to 01:00 AM - GMT +05:30
===Synopsis===
The transfer method in transfer.py copies data from one host to another and is heavily used in database management. It currently runs serially and relies mainly on the cumin library and the nc command. It can open the firewall and compress and encrypt the transferred data. My project aims to improve this framework.
The project consists of the following changes:
* Automatic port detection:
The framework currently requires the port to be passed as an argument. This can be automated by selecting a free port on the destination machine: starting from a default port number, we can use the ss utility to find the first port that is not in use. We also need an error handler for the race condition in which another process grabs the port after the check but before nc starts listening on it.
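The port selection and the race-condition handling described above could be sketched as follows. This is a minimal illustration, not the framework's code: the function names, the default port range 4400-4500 and the `start_listener` callback (which would start nc on the destination and report whether it managed to bind) are all assumptions made for the example. It assumes the output of `ss -ltn` has already been fetched from the destination host.

```python
def listening_ports(ss_output):
    """Extract the local TCP ports found in the output of `ss -ltn`."""
    ports = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Data rows look like: LISTEN 0 128 0.0.0.0:22 0.0.0.0:*
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue  # skips the header row and anything unexpected
        port = fields[3].rsplit(":", 1)[-1]  # also handles [::]:22
        if port.isdigit():
            ports.add(int(port))
    return ports


def first_free_port(ss_output, start=4400, end=4500):
    """Return the first port in [start, end] that nothing listens on."""
    used = listening_ports(ss_output)
    for port in range(start, end + 1):
        if port not in used:
            return port
    raise RuntimeError("no free port between %d and %d" % (start, end))


def listen_on_free_port(get_ss_output, start_listener,
                        start=4400, end=4500, attempts=3):
    """Pick a port and start the listener, retrying if the port was taken.

    get_ss_output and start_listener are hypothetical callbacks:
    the first returns fresh `ss -ltn` output, the second tries to start
    nc on the given port and returns True on success.
    """
    candidate = start
    for _ in range(attempts):
        port = first_free_port(get_ss_output(), candidate, end)
        if start_listener(port):
            return port
        candidate = port + 1  # another process won the race; move on
    raise RuntimeError("could not bind a port after %d attempts" % attempts)
```

Re-reading the ss output on every attempt keeps the check close to the moment nc is started, which narrows the race window but cannot close it; the retry loop is what actually handles the failure.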
* Parallel checksum and sanity checking:
The framework performs sanity checks such as has_available_disk_space and file_exists for all destination hosts serially, and then calls calculate_checksum for the source host. The first two checks can be parallelized across multiple hosts using cumin's asynchronous multi-host execution, since running the same independent commands on many hosts is a good fit for it. We can also move calculate_checksum from the sanity_checks function to the transfer function so that it runs in parallel with the actual transmission. The same approach can be applied to after_transfer_checks. How to structure this needs further discussion and investigation.
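To illustrate the two ideas above, here is a small sketch that uses Python's standard `concurrent.futures` thread pool as a stand-in for cumin's asynchronous execution; it is not how cumin is driven. The helper names and the convention that each check is a callable returning True on success are assumptions for the example only.

```python
from concurrent.futures import ThreadPoolExecutor


def run_checks_in_parallel(hosts, checks, max_workers=8):
    """Run every check against every host concurrently.

    checks maps a check name to a callable(host) -> bool, where True
    means the check passed. Returns the (host, name) pairs that failed.
    """
    jobs = [(host, name, fn) for host in hosts for name, fn in checks.items()]
    if not jobs:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of jobs in its results.
        results = list(pool.map(lambda job: job[2](job[0]), jobs))
    return [(host, name)
            for (host, name, _), ok in zip(jobs, results) if not ok]


def transfer_with_checksum(transfer, checksum):
    """Run the source checksum while the transfer is in progress.

    Both arguments are zero-argument callables (hypothetical wrappers
    around the real transfer and checksum steps).
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        checksum_future = pool.submit(checksum)
        transfer_future = pool.submit(transfer)
        return transfer_future.result(), checksum_future.result()
```

Threads are sufficient here because each check spends its time waiting on a remote command rather than computing locally.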
* Run the copy_to function with multiprocessing when there are multiple destinations:
The framework can currently transfer only one file or directory from a source to a destination at a time. In non-listening (client) mode, nc is assigned a local port automatically, so several copy_to calls can run as subprocesses at once. The number of subprocesses can depend on the load of the machine running the program: the load can be read with the uptime utility, and the one-minute load average can determine how many copy_to instances to run.
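One possible way to turn the one-minute load average into a worker count is sketched below. The heuristic (roughly one worker per idle core, never fewer than one) and the helper names are assumptions for illustration, not a decided design. `os.getloadavg()` is available on Unix and returns the same three figures that uptime prints.

```python
import multiprocessing
import os


def workers_for_load(load_1min, cpu_count, pending, floor=1):
    """Choose how many copy_to processes to run.

    Assumed heuristic: one worker per core that is not already busy,
    capped by the number of pending transfers, with a minimum of `floor`.
    """
    idle = int(cpu_count - load_1min)  # rough count of idle cores
    return max(floor, min(pending, idle))


def copy_to_all(copy_to, targets):
    """Run copy_to once per target in a pool sized from the current load.

    copy_to must be a module-level function so that multiprocessing
    can pickle it; it stands in for the framework's real copy_to.
    """
    load_1min = os.getloadavg()[0]  # first value: one-minute average
    workers = workers_for_load(load_1min, os.cpu_count() or 1, len(targets))
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(copy_to, targets)
```

Keeping the sizing rule in a pure function makes it easy to test and to replace once the right policy has been agreed with the mentors.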
- Possible Mentors: Jaime Crespo aka #jynus [@jcrespo], Manuel Arostegui [@Marostegui]
- Have you contacted your mentors already: Yes
- **Proposal on Wikimedia Phabricator:** https://phabricator.wikimedia.org/T248256
===Deliverables===
I have gone through the MariaDB general overview, transfer.py, RemoteExecution.py, LocalExecution.py and CuminExecution.py, and I understand the flow of the code. In the coming weeks I will learn more about Gerrit, cumin, rake, etc.
Schedule of deliverables: Starting from May 4th:
Week 1 - Week 2: Get to know the community. Discuss the challenges and implementation steps with the mentor. Design the structure of changes.
Week 3 - Week 5: Automatic port detection.
Week 6 - Week 7: Test creation and documentation. (Milestone for Phase 1 evaluation)
Week 8: Parallel checksumming.
Week 9 - Week 11: Parallel sanity checking and testing. (Milestone for Phase 2 evaluation)
Week 12: Continue the above milestone with thorough testing and debugging.
Week 13 - Week 14: Run copy_to function in multiprocessing.
Week 15: Test creation and documentation. (Milestone for Final evaluation)
I expect to complete the work earlier than scheduled. After the steps above, the outcome should be a stable, fast framework for transferring files. If time allows, I would then like to continue researching and working on the following issues:
- Transform the current method into an abstract factory driven by configuration. <Open question: which parameters need to be configurable, and would it be enough to add a read_file method that reads the values line by line?>
- Add a progress bar to the framework. <Possibly using the pv utility and job['pipe'].>
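As a rough idea of how pv could feed a progress bar, the sketch below builds a sender pipeline and parses pv's numeric output. With `-n`, pv writes integer percentages to standard error, one per line, and `-s` tells it the expected total size. The tar and nc stages, the function names and the example host are assumptions for illustration; the real pipeline in transfer.py may differ.

```python
import shlex


def build_send_command(path, host, port, size_bytes):
    """Return a shell pipeline that streams `path` through pv into nc."""
    tar = "tar cf - {}".format(shlex.quote(path))
    pv = "pv -n -s {}".format(int(size_bytes))
    nc = "nc {} {}".format(shlex.quote(host), int(port))
    return " | ".join([tar, pv, nc])


def parse_progress(stderr_line):
    """Return the percentage from one line of `pv -n` output, or None.

    Other tools in the pipeline may also write to stderr, so lines
    that are not a plain integer are ignored.
    """
    text = stderr_line.strip()
    return int(text) if text.isdigit() else None
```

Because the tar stream is slightly larger than the raw data, a percentage derived from the on-disk size is only approximate.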
===Participation===
I will maintain the source code in the Wikimedia Git repository.
I will use IRC during my working hours to collaborate with the mentors.
I will use Phabricator for tracking issues.
I will use Gmail for communication outside working hours.
===About Me===
I am pursuing an M.Tech (Master of Technology, final semester) in Computer Science and Engineering at the National Institute of Technology Karnataka, Surathkal, India. The institute has taught me the spirit of innovation and creativity, which has helped me understand problems and address the needs of the community.
I appreciate Wikimedia for its mission to serve every human being. As a free software enthusiast, I admire the way Wikimedia promotes open source by creating quality products. I enjoy coding and doing research on challenging problems. I honestly believe that the experience, certification and knowledge I gathered during my bachelor's and master's degrees can be applied to this project.
I don’t have any commitments that are likely to interrupt my work during GSoC. I can put in the effort required to deliver the expected output.
===Past Experience===
I worked at Tata Consultancy Services on some of their research projects, where I developed strong skills in Python. My responsibilities included developing and maintaining applications (in Flask and Django), writing SQL scripts and maintaining Postgres-XL databases. I was certified by the Linux Foundation as a Linux Foundation System Administrator for the period 2016-2018.
I was fortunate to work with Mozilla for GSoC 2019. The following links will lead you to some of the contributions I have made to the open-source community.
**I proposed a patch relevant to the Wikimedia SRE/DBA team; it received feedback and approval and was deployed to production.**
* https://phabricator.wikimedia.org/T204110
**Other contributions related to web development and networking:**
* GSoC 2019 with Mozilla (Collaborator): https://github.com/mozilla/TUID/blob/dev/docs/GSoC_Final_Report_2019.md
* GCI mentor for ns-3: https://www.nsnam.org/wiki/GCI2019Details
- https://github.com/mozilla/addons-server/pull/10194
- https://github.com/aqm-eval-suite/ns-3-dev-git/pull/3
**Relevant discussion participated:**
- https://phabricator.wikimedia.org/T246435