Rough integration of time machine and configuration scraper
Closed, Resolved (Public)

Description

We want to build a timeline of which production configurations were enabled for the Content Translation extension and its machine translation component. The groundwork for this is well established by prior work in T331202: Configuration evolution over time and T331201: Extract cxserver configuration and export to CSV, all thanks to the kind contributions from Outreachy participants listed there. You're wonderful, and I'm honored to be writing one more microtask! I hope some of you find this as much fun as I do.

The next step is to integrate our two scripts, running the configuration scraper on every git commit of the cxserver source repository. We expect to iterate and refine the output in later work, so this first pass isn't expected to fit any definition of "correct" or complete. Having the two tools minimally integrated is the end point of the current task, and the starting point for more work.

Again, we can use any programming language and choose whatever project or repository structure seems best.

Must haves:

  • Identify which source code you want to start with.
    • Please review the many excellent repositories linked in the subtasks above and decide on which two to integrate. You can use other participants' contributions, or write your own from scratch as you wish.
    • Fork the repositories you'll start with. (Unnecessary if using your own code.)
  • Point the time machine at the cxserver repository.
  • Adapt the two components so that the time machine calls the configuration scraper with each cxserver commit.
    • Suggestion: Extract output responsibilities away from both components and implement them in a new, third module. The scraper can return a data structure. Note that the writer module will need to know both the git timestamp and the config structure, but ideally shouldn't need any direct "knowledge" of git or of how to scrape config. The glue to run the whole integration and the writer can be in the same module. (A sketch of this split follows this list.)
  • Be robust about config files that appear or disappear. This can be accomplished by switching from an allow-list to a block-list style when visiting the config tree, so we read every potentially relevant file matching "config/*.yaml".
  • Output a single CSV or JSON file, into which we will write configuration from all git commits.
    • Write the parsed configuration for each git commit into this file.
    • Include a column with the git commit timestamp.
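
As a rough illustration of the suggested split, here is a minimal Node.js sketch. It assumes a local clone of cxserver and the js-yaml package, and shells out to git; all paths and function names are illustrative, not a required design.

```js
const { execSync } = require('child_process');
const fs = require('fs');
const yaml = require('js-yaml'); // assumed YAML parser dependency

const REPO = '/path/to/cxserver'; // illustrative path to a local clone
const git = (args) => execSync(`git -C ${REPO} ${args}`, { encoding: 'utf8' });

// Time machine: commits after the config/ directory appeared, oldest first.
const commits = git('rev-list --reverse 3474645b2..HEAD').trim().split('\n');

// Scraper: read every config/*.yaml present at a given commit (block-list
// style), so files that appear or disappear over time are handled naturally.
function scrapeConfig(sha) {
  const files = git(`ls-tree --name-only ${sha} config/`)
    .split('\n')
    .filter((name) => name.endsWith('.yaml'));
  const config = {};
  for (const name of files) {
    config[name] = yaml.load(git(`show ${sha}:${name}`));
  }
  return config;
}

// Glue and writer: the writer sees only a timestamp and a config structure,
// with no direct knowledge of git or of how the config was scraped.
const rows = commits.map((sha) => ({
  sha,
  timestamp: git(`show -s --format=%cI ${sha}`).trim(),
  config: scrapeConfig(sha),
}));
fs.writeFileSync('cxserver-config-history.json', JSON.stringify(rows));
```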

Nice to haves:

  • Filter to only commits that change files under the "config/" directory (see the example after this list). We could also filter for actual configuration changes, but that is a much smaller optimization and can be ignored.
  • Compare with another independently derived result, if the opportunity arises. This will strongly point either to correctness or to problems which can then be corrected. Output formats will probably differ, so you may want to coordinate with the other author and agree on some standard. Once the files are aligned, "diff" should be sufficient.
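
Git can do that filtering itself, for example (the revision range here matches the consideration below about commit 3474645b2):

```
git rev-list --reverse 3474645b2..HEAD -- config/
```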

Considerations:

  • There have been several major changes to how configuration files are represented in the repository; for example, the config directory only comes into existence in 2017, with commit 3474645b2. For now, only consider commits falling after that.
  • Software licenses from forked code continue to apply. Check that the contribution uses an open license; if it doesn't, you can request that the author add one. No other due diligence is necessary unless you choose to copy and paste source code from other repositories, in which case you need to preserve the license information by copying over the license file and including a one-line attribution in a comment.

Event Timeline

@awight

Hello, Adam. My contribution (https://github.com/ahn-nath/configuration-evolution-over-time.time-machine) to the Configuration evolution over time task uses JavaScript as the primary language, since the supported and official client library of the GitHub API, Octokit.js, uses it, and I felt it was more appropriate for the project requirements. Nevertheless, my contribution (https://github.com/ahn-nath/wikimedia-cxserver-config-parser) to the Extract cxserver configuration task employs Python as the primary language. At first, I did not feel this would have any negative impact on my final application, but now I am unsure whether to fork another participant's repository and merge it with one of my own, or whether it would be better to rewrite my own code in my preferred language for this contribution.

I would like to have your thoughts on this, thanks.

Hi @Ahn-nath, thank you for noting this problem, and apologies that I didn't anticipate it when leaving the choice of language free. It's fine that your linked repos are private, but I'll only be able to speak in general terms about a possible integration.

First of all, please feel free to disregard the paragraph about "Extract output responsibilities"; it was meant as a hint to help structure the code, not as a hard requirement. (I'll update the language in a minute, thanks again!)

I imagine your integration calling the config scraper as if it were a black box. The time machine could exec the scraper as an external command, using e.g. child_process.spawn; the resulting CSV would be sent over the command's stdout and then parsed back into data.
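
For example, a minimal sketch of that approach; scraper.py and its --commit flag are hypothetical stand-ins for the actual command-line interface of your Python scraper:

```js
const { spawn } = require('child_process');

// Run the Python scraper for one commit and collect its CSV from stdout.
function runScraper(sha) {
  return new Promise((resolve, reject) => {
    // Hypothetical invocation; adjust to the scraper's real interface.
    const child = spawn('python3', ['scraper.py', '--commit', sha]);
    let csv = '';
    child.stdout.on('data', (chunk) => { csv += chunk; });
    child.on('error', reject);
    child.on('close', (code) => {
      if (code !== 0) return reject(new Error(`scraper exited with code ${code}`));
      resolve(csv); // parse the CSV back into a data structure here
    });
  });
}
```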

@awight
My repositories will be private for the next couple of days or so, while I make some changes. Nevertheless, you have contributor access to both, just as Simulo does. As far as I can see, you have yet to accept my invitation as a contributor; Simulo already has access.

As for the rest, thanks; it brings clarity to the task description.

Thank you, sir, for the update on the task.

Hello, @awight. I am having a bit of a problem accessing the cxserver repository. When I visit the link, it shows "Not Found". I think for the same reason, executing my Python code gives me an error saying "fatal: path 'config/' exists on disk, but not in '43da799d1b35c4ace5704869e4031784c195c4ed'". Is there something wrong with the link, or is it on my local system only?

I think this is something weird about the software used to host this repo, Gerrit. The link I pasted is only good for operations with git, for example git clone https://gerrit.wikimedia.org/r/mediawiki/services/cxserver. If you want to browse the repo, you can visit Gitiles or the GitHub mirror.

@awight So I did the integration of the two programs and gave it a test run. The output file includes the timestamp of each commit. The only thing is that the output file I am getting is a bit too big (727 MB), which is not unexpected, I think, since the original CSV file had 28k lines and there have been quite a few commits on the repository since 2017 (starting from 16-01-2017). I wanted to ask how I should upload it, because there is a size limit for uploading files on both GitHub and Phabricator. Can you please suggest something for that?

Hello @awight and @Simulo, this is my work-in-progress contribution to this task. Could you please take a look and give me some feedback? GitHub Repository

Thanks!

Hello @awight and @Simulo, here is my submission for this task. I would be very happy to have your views on it.
GitHub Repository.
Here is the link to my file. It was too big to be uploaded to GitHub, so I am uploading it to Google Drive and sharing the link while I figure out how to use GitHub Large File Storage.
Google Drive Link

Hello @awight and @Simulo, here is my updated submission for this task. I have also uploaded the updated output file to the repository (thanks to GitHub LFS). I am looking forward to your views and reviews. Thanks in advance.
GitHub Repository.

That's elegant code! I only have some very minor observations:

  • In getAllCommits, iterating over pages could be generalized to support any number of commits (see the sketch after this list).
  • Great discovery, that the alternate cross-product algorithm now called transform.js was originally Yandex.js.
  • I would suggest making the output more readable by passing the "space" parameter to JSON.stringify(results, null, 2), or maybe even better switching to the "newline-delimited JSON" format so that each line is an independent object.
  • Impressive method of fetching raw file contents from the web and parsing directly!
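
To illustrate the first and third points, here is a rough sketch assuming the standard Octokit.js client and the GitHub mirror; the repository name and output fields are illustrative:

```js
const { Octokit } = require('@octokit/rest');
const fs = require('fs');

const octokit = new Octokit();

async function main() {
  // Let Octokit handle pagination, so any number of commits is supported.
  const commits = await octokit.paginate(octokit.rest.repos.listCommits, {
    owner: 'wikimedia',
    repo: 'mediawiki-services-cxserver', // assumed GitHub mirror name
    per_page: 100,
  });

  // Newline-delimited JSON: one independent object per line.
  const lines = commits.map((c) =>
    JSON.stringify({ sha: c.sha, timestamp: c.commit.committer.date })
  );
  fs.writeFileSync('commits.ndjson', lines.join('\n') + '\n');
}

main();
```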

Hi! Please consider resolving this task and moving any pending items to a new task, as GSoC/Outreachy rounds are now over, and this workboard will soon be archived.

As Outreachy Round 26 has concluded, closing this microtask. Feel free to reopen it for any pending matters.