GSoC 2020 Proposal: Improve the framework to transfer files over the LAN
Closed, Resolved · Public

Authored By

	Privacybatm
	Mar 22 2020, 10:54 AM

Description

Profile Information

Name: Ajumal P A
IRC nickname on Freenode: batm
Web Profile: https://github.com/ajupazhamayil
Location (country): India
Typical working hours: 05:00 PM to 01:00 AM - GMT +05:30

Synopsis

The transfer method in transfer.py copies data from one host to another and is heavily used in database management. It currently runs serially, relying mainly on the cumin library and the nc command. It can open firewall ports and compress and encrypt the data in transit. My project aims to improve this framework.
The project consists of the following changes:

Automatic port detection:

The framework currently requires a port to be passed as an argument. This can be automated by selecting a free port on the destination machine: starting from a default port number, we can look for a free port using the ss utility. We also need an error handler for the race condition in which another process grabs the port after the check but before nc starts listening on it.
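
A minimal sketch of this idea follows. It is illustrative only and is not the framework's code: the function names, the 4400-4500 port range and the sample text in the style of `ss -ltn` output are assumptions, and `start_listener` is a hypothetical callable that stands in for starting nc on the destination and reporting whether it managed to bind.

```python
import re


def listening_ports(ss_output):
    """Return the set of local TCP ports found in `ss -ltn` style output."""
    ports = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Data rows look like: LISTEN 0 128 0.0.0.0:22 0.0.0.0:*
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue
        match = re.search(r":(\d+)$", fields[3])
        if match:
            ports.add(int(match.group(1)))
    return ports


def first_free_port(ss_output, start=4400, end=4500):
    """Pick the lowest port in [start, end] that is not listed as listening."""
    used = listening_ports(ss_output)
    for port in range(start, end + 1):
        if port not in used:
            return port
    raise RuntimeError("no free port between %d and %d" % (start, end))


def listen_with_retry(get_ss_output, start_listener, start=4400, end=4500):
    """Handle the race: if the chosen port was taken meanwhile, try the next one.

    get_ss_output and start_listener are hypothetical callables supplied by
    the caller; start_listener(port) must return True when binding succeeded.
    """
    port = start
    while port <= end:
        port = first_free_port(get_ss_output(), port, end)
        if start_listener(port):
            return port
        port += 1
    raise RuntimeError("could not bind any port between %d and %d" % (start, end))


# Assumed sample output, used instead of running ss on a real host.
SAMPLE = """\
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       128     0.0.0.0:22          0.0.0.0:*
LISTEN  0       1       0.0.0.0:4400        0.0.0.0:*
LISTEN  0       1       [::]:4401           [::]:*
"""

print(first_free_port(SAMPLE))
```

Re-reading the port list on every retry keeps the check and the bind attempt close together, which narrows the race window but does not remove it. That is why the result of the bind attempt itself decides whether a port is accepted.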

Parallel checksum and sanity checking:

The framework performs sanity checks such as has_available_disk_space and file_exists for all destination hosts serially, and then calls calculate_checksum for the source host. The first two checks can be parallelized across multiple hosts using cumin's asynchronous multi-host execution, which suits running the same independent commands on several hosts. We can also move calculate_checksum from the sanity_checks function to the transfer function so that it runs in parallel with the actual transmission. The same approach can be applied to after_transfer_checks. How to structure this needs a little more discussion and investigation.
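
The idea of computing the checksum alongside the transmission, rather than as a separate pass beforehand, can be sketched locally as below. This is only an illustration under assumptions: the real framework runs commands on remote hosts, whereas here `copy_with_checksum` (a hypothetical name) works on in-memory streams, and SHA-256 is chosen arbitrarily for the example.

```python
import hashlib
import io


def copy_with_checksum(src, dst, chunk_size=64 * 1024):
    """Copy src to dst chunk by chunk, hashing each chunk as it passes through.

    Returns the hex digest of everything that was copied, so no second read
    of the source is needed to obtain the checksum.
    """
    digest = hashlib.sha256()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        dst.write(chunk)
    return digest.hexdigest()


# In-memory stand-ins for the source data and the network stream.
payload = b"example row data\n" * 10000
source = io.BytesIO(payload)
destination = io.BytesIO()

checksum = copy_with_checksum(source, destination)
print(checksum == hashlib.sha256(payload).hexdigest())
```

The receiving side can apply the same function to its incoming stream, and the two digests can then be compared in after_transfer_checks.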

Run the copy_to function in multiple processes when there are multiple destinations:

The framework can transfer only one file or directory from a source to a destination at a time. nc assigns its local port automatically when it is not in listening mode, so we can spawn several subprocesses to run the copy_to function, with the number depending on the load of the machine on which the program runs. The load can be read with the uptime utility, and the one-minute load average can be used to fix the number of copy_to instances to run on that machine.
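
One possible way to turn the one-minute load average into a number of copy_to instances is sketched below. The rule used here (one worker per idle CPU, at least one, and no more than the number of destinations) is an assumption made for illustration and not a decided design. The function names are hypothetical. `os.getloadavg()` returns the same load averages that uptime prints and is available on Unix-like systems only.

```python
import os


def copy_worker_count(load_1min, cpu_count, requested):
    """Return how many copy_to processes to start.

    load_1min: one-minute load average of the machine running the program.
    cpu_count: number of CPUs on that machine.
    requested: number of destinations, which is the most workers ever useful.
    """
    spare = int(cpu_count - load_1min)  # roughly, the CPUs that are idle
    return max(1, min(requested, spare))


def current_worker_count(requested):
    """Same rule, using the live load of the local machine (Unix only)."""
    load_1min, _, _ = os.getloadavg()
    return copy_worker_count(load_1min, os.cpu_count() or 1, requested)


print(copy_worker_count(0.5, 8, 4))  # idle machine: one worker per destination
print(copy_worker_count(5.0, 8, 4))  # busy machine: limited by spare CPUs
print(copy_worker_count(7.2, 8, 4))  # saturated machine: falls back to serial
```

The returned number could then be used as the size of a `multiprocessing.Pool` that runs copy_to once per destination.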

Possible Mentors: Jaime Crespo aka #jynus [@jcrespo], Manuel Arostegui [@Marostegui]
Have you contacted your mentors already: Yes
Proposal on Wikimedia Phabricator: https://phabricator.wikimedia.org/T248256

Deliverables

I have gone through the MariaDB general overview, transfer.py, RemoteExecution.py, LocalExecution.py and CuminExecution.py, and have an idea of the flow of the code. In the coming weeks I will learn more about Gerrit, cumin, rake, etc.

Schedule of deliverables, starting from May 4th:
Week 1 - Week 2: Get to know the community. Discuss the challenges and implementation steps with the mentor. Design the structure of changes.
Week 3 - Week 5: Automatic port detection.
Week 6 - Week 7: Test creation and documentation. (Milestone for Phase 1 evaluation)
Week 8: Parallel checksumming.
Week 9 - Week 11: Parallel sanity checking and testing. (Milestone for Phase 2 evaluation)
Week 12: Continue the above milestone with thorough testing and debugging.
Week 13 - Week 14: Run copy_to function in multiprocessing.
Week 15: Test creation and documentation. (Milestone for Final evaluation)
I expect to complete my work earlier than scheduled. After the above steps, I expect the outcome to be a stable, fast framework for transferring files. Then, if time allows, I would like to continue researching and working on the following issues:

Transform the current method into an abstract factory driven by configuration.
Add progress bar to the framework.

Participation

I will maintain the source code in the Wikimedia Git repository.
I will use IRC in my working hours to collaborate with the mentors.
I will use Phabricator for tracking issues.
I will use Gmail for communication in non-working hours.

About Me

I am in the final semester of an M.Tech (Master of Technology) in Computer Science and Engineering at the National Institute of Technology Karnataka, Surathkal, India. The institute has taught me the spirit of innovation and creativity, which has helped me to understand problems and address the needs of the community.
I appreciate Wikimedia for its mission to serve every human being. As a free software enthusiast, I admire the way Wikimedia promotes open source by creating quality products. I enjoy coding and doing research on challenging problems. I honestly believe that my experience, my certification and the knowledge I gathered during my bachelor's and master's degrees can be applied to this project.
I don’t have any commitments that are likely to interrupt my work during GSoC. I can put in the effort needed to deliver the required output.

Past Experience

I worked at Tata Consultancy Services on some of their research projects, where I developed strong skills in Python. My responsibilities included developing and maintaining applications (in Flask and Django), writing SQL scripts and maintaining the Postgresql XL databases. I was certified by the Linux Foundation as a Linux Foundation System Administrator for the period 2016-2018.
I was fortunate to work with Mozilla for GSoC 2019. The following links will lead you to some of the contributions I have made to the open-source community.

I proposed a patch relevant to the Wikimedia SRE/DBA team, received feedback and approval, and had it deployed to production.

https://phabricator.wikimedia.org/T204110 (Patch: https://gerrit.wikimedia.org/r/583203)

Other contributions related to web development and networking:

GSoC 2019 with Mozilla (Collaborator): https://summerofcode.withgoogle.com/archive/2019/projects/5849522285051904/
GCI Mentor for ns-3: https://www.nsnam.org/wiki/GCI2019Details
https://phabricator.wikimedia.org/T204110
https://github.com/mozilla/addons-server/pull/10194
https://github.com/aqm-eval-suite/ns-3-dev-git/pull/3

Relevant discussions I participated in:

https://phabricator.wikimedia.org/T246435

Details

Subject	Repo	Branch	Lines +/-
transfer.py: It is a test code for multiprocess transferpy	operations/software/transferpy	master	+19 -3
transfer.py: Refactor split_target function	operations/software/transferpy	master	+44 -9
mariadb-backups: Move transferpy deployment to debian package	operations/puppet	production	+2 -662
wmfmariadbpy: Remove transferpy package	operations/software/wmfmariadbpy	master	+3 -1 K
transferpy: Remove wmfmariadbpy package	operations/software/transferpy	master	+21 -5 K
setup.py: Add RemoteExecution module to setup.py	operations/software/transferpy	master	+1 -0
Add transferpy new repo to tox-based automated CI	integration/config	master	+4 -0
zuul: Add Privacybatm to the list of users that can trigger CI	integration/config	master	+1 -0

Related Objects

Status	Subtype	Assigned	Task
Resolved		• jcrespo	T246435 Create or improve a tool for monitoring or automating tasks for Wikimedia databases
Resolved		Privacybatm	T248256 GSoC 2020 Proposal: Improve the framework to transfer files over the LAN
Resolved		Privacybatm	T252171 Automate the detection of netcat listen port in transfer.py
Resolved		Privacybatm	T252950 kill_job function in remote execution module of transfer framework does not close the ports instantly
Resolved		Privacybatm	T252172 Refactor transfer.py
Resolved	BUG REPORT	Privacybatm	T252175 transfer.py fails to run 2 commands
Resolved		Privacybatm	T252802 Improve output message readabiliy of transfer.py
Resolved		Privacybatm	T253219 Add more information to --help option of transfer.py
Resolved		Privacybatm	T253560 Exception raised when setting trivial, but incorrect parameters to transfer.py
Resolved		Privacybatm	T253736 Package transferpy framework
Resolved		• jcrespo	T256725 Execution error after moving to debian package
Resolved		Privacybatm	T254979 Make checksum parallel to the data transfer in transferpy package
Resolved		Privacybatm	T255999 Use logging package instead of print statements in transferpy package
Resolved		Privacybatm	T256450 Solve transferpy concurrency issue with auto port detection and checksum file names
Resolved		Privacybatm	T256951 Choosing a wrong host with transfer.py produces an "ERROR: The specified source path X doesn't exist on Y"
Resolved		Privacybatm	T257599 Create temp and config directories at the installation time of transferpy deb package
Resolved		Privacybatm	T257600 Create more tests for transferpy package
Resolved		Privacybatm	T257601 transferpy 1.0 release
Resolved		Privacybatm	T257602 Make transferpy configurable using a configuration file

Event Timeline

Privacybatm created this task.Mar 22 2020, 10:54 AM

Restricted Application added a subscriber: Aklapper.Mar 22 2020, 10:54 AM

Privacybatm renamed this task from GSoC 2020 draft proposal: Feedback request: Improve the framework to transfer files over the LAN to Feedback request: GSoC 2020 draft proposal: Improve the framework to transfer files over the LAN.Mar 22 2020, 11:00 AM

RhinosF1 added a project: Google-Summer-of-Code (2020).Mar 22 2020, 11:04 AM

QEDK moved this task from Backlog to Accepted Proposals on the Google-Summer-of-Code (2020) board.Mar 22 2020, 11:41 AM

Do not post personal details here, as those will be public! You can include personal information on the Google Summer of Code form submission, as that will be private. Please use the format proposed on: https://www.mediawiki.org/wiki/Google_Summer_of_Code/Participants#Application_process_steps

===Profile Information
Name
IRC nickname on Freenode
Web Profile
Resume (optional)
Location (country or state)
Typical working hours (include your timezone)

===Synopsis
- Short summary describing your project and how it will benefit Wikimedia projects
- Possible Mentor(s)
- Have you contacted your mentors already?
===Deliverables
Describe the timeline of your work with deadlines and milestones, broken down week by week. Make sure to include time you are planning to allocate for investigation, coding, deploying, testing and documentation
===Participation
Describe how you plan to communicate progress and ask for help, where you plan to publish your source code, etc
===About Me
Tell us about a few:
- Your education (completed or in progress)
- How did you hear about this program?
- Will you have any other time commitments, such as school work, another job, planned vacation, etc, during the duration of the program?
- We advise all candidates eligible for Google Summer of Code and Outreachy to apply for both programs. Are you planning to apply to both programs and, if so, with what organization(s)?
- What does making this project happen mean to you?
===Past Experience
Describe any relevant projects that you've worked on previously and what knowledge you gained from working on them. Describe any open source projects you have contributed to as a user and contributor (include links). If you have already written a feature or bugfix for a Wikimedia technology such as MediaWiki, link to it here; we will give strong preference to candidates who have done so
===Any Other Info
Add any other relevant information such as UI mockups, references to related projects, a link to your proof of concept code, etc

One thing you did was write down all the ideas I gave and propose to do them all. I would like you to limit the scope to a concrete proposal of work you think you will be able to do, and leave extra ideas as a "stretch goal". In other words, your proposal is too broad but lacks detail. For example, what does "cumin integration" mean?

I think you should reduce the scope and propose a smaller number of improvements, with more details on what you would do for each and why, and commit to adding tests and documentation for every change. Underpromise, over deliver :-D. You may have to take a deeper look at the existing code and think about what it would take for each improvement not only to be coded, but to be reviewed and merged into production.

You don't need to refer to covid19 in the schedule; this is an understood issue, which also affects this organization. That is also why we should "focus" the proposal on just 1 or 2 improvements, leaving room for schedule changes, whether things get better or worse.

So my suggestion is to change the proposal format to the one above, and maybe list the changes as bullet points, with more details on each, in order of priority, knowing that some of them may not be fully done.

I like the:

Week 1 - Week 3: Get to know the community. Discuss the challenges and implementation steps with the mentor. Design the structure of changes.

Although 3 weeks of no coding may be too much, I would reduce it to 1-2 weeks. And we can do small code changes then, so you do some coding.

Cumin integration

Unclear what this means.

I suggest we start with the "automatic port detection", which is a small-scope change as the first task, including tests and documentation, and make that the first milestone, then start working on a larger project once you are more familiar with the process.

Thank you for the feedback. I have edited the description in this ticket with your feedback resolved.
Earlier, I was looking at an older version of the transfer.py :-/ (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/326155/8/modules/role/files/mariadb/transfer.py) In that, RemoteExecution was LocalExecution and there was no class concept in transfer.py! My bad.

Please give me feedback. Also can you please point me to a small issue to solve?

Let me read carefully again the proposal and I will assign you soon one of the small tasks you could do in a few hours' time.

• jcrespo updated the task description. (Show Details)Mar 24 2020, 11:14 AM

Thanks, all the changes you made make sense and bring the project to a reasonable scope.

For a small issue to solve, please try T204110. Only tendril is left without a favicon. Here is one I quickly did for the prototype (attachment: zarcillo.svg), so you do not lose time designing the icon; I only care about the procedure.

The ask is to create a patch for the tendril software https://gerrit.wikimedia.org/r/admin/projects/operations/software/tendril with a working favicon. This is mostly html, so no PHP experience needed. But you must demonstrate the ability to read code you didn't write, clone a repo, create a patch and follow the coding and Change-review style guides you will find on https://mediawiki.org. This is a very small task if you know what you should be doing, but if you accomplish it, it will demonstrate you will be able to do larger projects successfully!

Please ask if you get stuck or do not know how to proceed.

Thank you for your feedback! I have added a patch for T204110.

Privacybatm updated the task description. (Show Details)Mar 25 2020, 2:56 PM

Privacybatm updated the task description. (Show Details)Mar 25 2020, 3:35 PM

Thank you for merging the patch :-) Please let me know if I need to do any update to this proposal. (I will remove this < ?> section before the submission)

Thank you for your contribution, you can see it was applied to the code of our organization: https://github.com/wikimedia/operations-software-tendril/commit/265e17fab6295d67cd9d3bb815ced838b4f62236 and then deployed to production. @Marostegui was happy to get a favicon to spot tendril easily.

One last thing: this is not something you will be evaluated on. But regarding "I will make a new repo in Github and maintain the source code": the most common development place for Wikimedia-related code (and in particular the one Wikimedia Foundation DBAs use) is the Wikimedia Git repository (we informally call it Gerrit, as it lives in the same place as the Change Review tool). Accepted proposals will be given a Wikimedia repo to work with (which will automatically be mirrored on GitHub, too, with proper attribution, of course). We will also provide CI services.

You can choose to work on GitHub or keep a copy there outside of the Wikimedia mirror, but this was just an FYI regarding the development resources that we will provide.

One last piece of advice: make it more explicit that you were able to propose a patch, get feedback, and get a code change relevant to this team (Wikimedia SREs/DBAs) approved and deployed to production. You added the link, but make sure it is more prominent, as I think it will be evaluated positively.

Privacybatm updated the task description. (Show Details)Mar 26 2020, 6:03 AM

Thank you so much for your valuable feedback @Marostegui @jcrespo :-) I am happy to see the patch deployed. Working on the GitHub repository is not a concern for me. So I have made changes to my proposal. Please take a look at it.

Privacybatm updated the task description. (Show Details)Mar 26 2020, 6:10 AM

Privacybatm updated the task description. (Show Details)Mar 26 2020, 6:30 AM

file_exisists

Privacybatm updated the task description. (Show Details)Mar 26 2020, 8:59 AM

My bad, Thank you

Privacybatm updated the task description. (Show Details)Mar 26 2020, 9:38 AM

Privacybatm updated the task description. (Show Details)Mar 26 2020, 9:48 AM

Aklapper renamed this task from Feedback request: GSoC 2020 draft proposal: Improve the framework to transfer files over the LAN to GSoC 2020 draft proposal: Improve the framework to transfer files over the LAN.Mar 26 2020, 10:01 AM

Privacybatm updated the task description. (Show Details)Mar 27 2020, 4:49 AM

Privacybatm updated the task description. (Show Details)Mar 27 2020, 5:00 AM

Privacybatm renamed this task from GSoC 2020 draft proposal: Improve the framework to transfer files over the LAN to GSoC 2020 Proposal: Improve the framework to transfer files over the LAN.Mar 27 2020, 5:17 AM

Privacybatm updated the task description. (Show Details)

Privacybatm added a parent task: T246435: Create or improve a tool for monitoring or automating tasks for Wikimedia databases.Mar 27 2020, 5:25 AM

• jcrespo triaged this task as Medium priority.May 8 2020, 5:52 AM

Privacybatm closed subtask T252175: transfer.py fails to run 2 commands as Resolved.May 9 2020, 3:14 PM

Change 595968 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[integration/config@master] zuul: Add Privacybatm to the list of users that can trigger CI

https://gerrit.wikimedia.org/r/595968

gerritbot added a project: Patch-For-Review.May 12 2020, 3:48 PM

Change 595968 merged by jenkins-bot:
[integration/config@master] zuul: Add Privacybatm to the list of users that can trigger CI

https://gerrit.wikimedia.org/r/595968

Mentioned in SAL (#wikimedia-releng) [2020-05-12T15:54:35Z] <Reedy> Reloading Zuul to deploy https://gerrit.wikimedia.org/r/595968 T248256

Mentioned in SAL (#wikimedia-releng) [2020-05-12T16:02:20Z] <James_F> Zuul: Manually running fabric against contint1001 to add Privacybatm to the CI allow list T248256

Privacybatm closed subtask T252172: Refactor transfer.py as Resolved.May 15 2020, 1:39 PM

This doesn't have to be a task (or it can be, up to you), but I realized that there are very few comments in the main files. For example, there is not a single comment in https://github.com/wikimedia/operations-software-wmfmariadbpy/blob/master/transferpy/Firewall.py

While I prefer to keep methods small so they are obvious, I think at least every class and every main method should say what it does and clarify what the arguments are. Right now things are easy to follow, but as we add more functionality, it will become more and more difficult to follow and understand what does what. Remember that, as planned, documentation was part of the initial work, so we should work on that too :-D.

Yeah sure, I will do that.

I've added a couple of things to https://wikitech.wikimedia.org/wiki/Transfer.py#Wishlist_and_know_issues as ideas for later work (we don't have to do everything, these are just ideas for improvement)

• jcrespo removed a subtask: T252950: kill_job function in remote execution module of transfer framework does not close the ports instantly.May 20 2020, 4:06 PM

• jcrespo added a subtask: T253560: Exception raised when setting trivial, but incorrect parameters to transfer.py.May 25 2020, 1:45 PM

• jcrespo edited projects, added DBA; removed Patch-For-Review.May 25 2020, 1:49 PM

• jcrespo moved this task from Triage to GSOC2020 on the DBA board.

Hey,

I have created a specific column for tasks related to the GSOC, on the DBA project- I think we should use that to classify it under the DBA tag.

I don't think there is a need right now for a special workboard just for the project (using the DBA one is enough), but if you feel you need one to organize your own work, you can always create a user project: https://www.mediawiki.org/wiki/Phabricator/Project_management#Types_of_Projects If we end up with a lot of open tickets, we can reconsider reorganizing.

The DBA workboard is getting a bit complex, mixing backup and databases work, and transfer.py is a bit of both, so we have not yet decided how to organize in the future.

Okay, I will use GSOC column for the tickets. Thank you!

Privacybatm closed subtask T252802: Improve output message readabiliy of transfer.py as Resolved.May 26 2020, 2:57 AM

• jcrespo closed subtask T252171: Automate the detection of netcat listen port in transfer.py as Resolved.Jun 4 2020, 9:13 AM

• jcrespo closed subtask T253560: Exception raised when setting trivial, but incorrect parameters to transfer.py as Resolved.

Change 602333 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[integration/config@master] Add transferpy new repo to tox-based automated CI

https://gerrit.wikimedia.org/r/602333

gerritbot added a project: Patch-For-Review.Jun 4 2020, 10:48 AM

Change 602333 merged by jenkins-bot:
[integration/config@master] Add transferpy new repo to tox-based automated CI

https://gerrit.wikimedia.org/r/602333

Change 602595 had a related patch set uploaded (by Privacybatm; owner: Privacybatm):
[operations/software/wmfmariadbpy@master] wmfmariadbpy: Remove transferpy package

https://gerrit.wikimedia.org/r/602595

Change 602618 had a related patch set uploaded (by Privacybatm; owner: Privacybatm):
[operations/software/transferpy@master] transferpy: Remove wmfmariadbpy package

https://gerrit.wikimedia.org/r/602618

Change 602879 had a related patch set uploaded (by Privacybatm; owner: Privacybatm):
[operations/software/transferpy@master] setup.py: Add RemoteExecution module to setup.py

https://gerrit.wikimedia.org/r/602879

Change 602879 merged by Jcrespo:
[operations/software/transferpy@master] setup.py: Add RemoteExecution module to setup.py

https://gerrit.wikimedia.org/r/602879

Change 602618 merged by Jcrespo:
[operations/software/transferpy@master] transferpy: Remove wmfmariadbpy package

https://gerrit.wikimedia.org/r/602618

Change 602595 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] wmfmariadbpy: Remove transferpy package

https://gerrit.wikimedia.org/r/602595

• jcrespo mentioned this in rOSMD99c40bd6b4bb: wmfmariadbpy: Remove transferpy package.Jun 18 2020, 12:14 PM

• jcrespo closed subtask T253219: Add more information to --help option of transfer.py as Resolved.Jun 19 2020, 10:48 AM

Privacybatm closed subtask T253736: Package transferpy framework as Resolved.Jun 24 2020, 10:54 AM

Change 608053 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Move transferpy deployment to debian package

https://gerrit.wikimedia.org/r/608053

Change 608053 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Move transferpy deployment to debian package

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608053

• jcrespo added a subtask: T256951: Choosing a wrong host with transfer.py produces an "ERROR: The specified source path X doesn't exist on Y".Jul 2 2020, 9:58 AM

Change 609778 had a related patch set uploaded (by Privacybatm; owner: Privacybatm):
[operations/software/transferpy@master] tranasfer.py: Refactor split_target function

https://gerrit.wikimedia.org/r/609778

• jcrespo closed subtask T256951: Choosing a wrong host with transfer.py produces an "ERROR: The specified source path X doesn't exist on Y" as Resolved.Jul 9 2020, 8:09 AM

Change 610750 had a related patch set uploaded (by Privacybatm; owner: Privacybatm):
[operations/software/transferpy@master] transfer.py: It is a test code for multiprocess transferpy

https://gerrit.wikimedia.org/r/610750

Change 609778 merged by Jcrespo:
[operations/software/transferpy@master] transfer.py: Refactor split_target function

https://gerrit.wikimedia.org/r/609778

Change 610750 abandoned by Jcrespo:
[operations/software/transferpy@master] transfer.py: It is a test code for multiprocess transferpy

Reason:
Proof of concept

https://gerrit.wikimedia.org/r/610750

I have updated a few things that were outdated at https://wikitech.wikimedia.org/wiki/Transfer.py. Don't worry too much about that page; I can handle it. Focus on the official documentation.

• jcrespo closed subtask T254979: Make checksum parallel to the data transfer in transferpy package as Resolved.Jul 31 2020, 7:26 AM

• jcrespo closed subtask T255999: Use logging package instead of print statements in transferpy package as Resolved.

• jcrespo closed subtask T256450: Solve transferpy concurrency issue with auto port detection and checksum file names as Resolved.

• jcrespo removed a subtask: T256755: transferpy --checksum wrongly output `checksums do not match` message.

• jcrespo closed subtask T257600: Create more tests for transferpy package as Resolved.

• jcrespo closed subtask T257599: Create temp and config directories at the installation time of transferpy deb package as Resolved.Jul 31 2020, 7:28 AM

• jcrespo closed subtask T257602: Make transferpy configurable using a configuration file as Resolved.

• jcrespo removed a subtask: T259327: transferpy: Multiprocess the transfers.

• jcrespo mentioned this in rOSWB99c40bd6b4bb: wmfmariadbpy: Remove transferpy package.Sep 1 2020, 8:36 AM

• jcrespo closed subtask T257601: transferpy 1.0 release as Resolved.Sep 9 2020, 2:50 PM