We want to build a timeline of which production configurations were enabled for the Content Translation extension and its machine translation component. The groundwork for this is well established by prior work in T331202: Configuration evolution over time and T331201: Extract cxserver configuration and export to CSV, all thanks to the kind contributions from Outreachy participants listed there. You're wonderful, and I'm honored to be writing one more microtask! I hope some of you find this as much fun as I do.
The next step is to integrate our two scripts, running the configuration scraper on every git commit of the cxserver source repository. We expect to iterate and refine the output in later work, so this first pass isn't expected to fit any definition of "correct" or complete. Having the two tools minimally integrated is the end point of the current task, and the starting point for more work.
Again, we can use any programming language and choose whatever project or repository structure seems best.
Must haves:
- Identify which source code you want to start with.
- Please review the many excellent repositories linked in the subtasks above and decide on which two to integrate. You can use other participants' contributions, or write your own from scratch as you wish.
- Fork the repositories you'll start with. (Unnecessary if using your own code.)
- Point the time machine at the cxserver repository.
- Adapt the two components so that the time machine calls the configuration scraper with each cxserver commit.
- Suggestion: Extract output responsibilities away from both components and implement them in a new, third module. The scraper can return a data structure. Note that the writer module will need to know both the git timestamp and the config structure, but ideally shouldn't need any direct "knowledge" of git or of how to scrape config. The glue to run the whole integration and the writer can be in the same module.
- Be robust about config files that appear or disappear. This can be accomplished by switching from an allow-list to a block-list style when visiting the config tree, so we read every potentially relevant file matching "config/*.yaml".
- Output a single CSV or JSON file containing the configuration from all git commits.
- Write the parsed configuration for each git commit into this file.
- Include a column with the git commit timestamp.
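To make the suggested split concrete, here is a minimal Python sketch of the three roles: a time machine that lists commits, a reader that picks up every "config/*.yaml" file at a commit, and a writer that only sees timestamps and scraped rows. Everything named here is an assumption for illustration: the `Scraper` signature (a function taking `{path: file text}` and returning a list of flat dicts), the empty `BLOCKED` set, the use of the committer date as "the" timestamp, and CSV as the output format. Your chosen scraper will almost certainly look different.

```python
import csv
import subprocess
from fnmatch import fnmatch
from typing import Callable, Dict, Iterable, List, TextIO, Tuple

Row = Dict[str, str]
Scraper = Callable[[Dict[str, str]], List[Row]]  # hypothetical scraper signature

BLOCKED: set = set()  # block-list of paths to skip, e.g. {"config/foo.yaml"}


def git(repo: str, *args: str) -> str:
    """Run git inside `repo` and return its standard output."""
    result = subprocess.run(
        ["git", "-C", repo, *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def list_commits(repo: str) -> List[Tuple[str, str]]:
    """Time machine: (hash, ISO 8601 committer date) pairs, oldest first."""
    pairs = []
    for line in git(repo, "log", "--reverse", "--format=%H %cI").splitlines():
        commit, timestamp = line.split(" ", 1)
        pairs.append((commit, timestamp))
    return pairs


def read_config_files(repo: str, commit: str) -> Dict[str, str]:
    """Return {path: text} for every config/*.yaml present at `commit`."""
    # With a trailing slash, ls-tree lists the directory's direct children.
    listing = git(repo, "ls-tree", "--name-only", commit, "config/").splitlines()
    paths = [p for p in listing if fnmatch(p, "config/*.yaml") and p not in BLOCKED]
    return {p: git(repo, "show", f"{commit}:{p}") for p in paths}


def write_csv(records: Iterable[Tuple[str, List[Row]]], out: TextIO) -> None:
    """Writer: knows only timestamps and scraped rows, nothing about git."""
    records = list(records)
    columns = sorted({key for _, rows in records for row in rows for key in row})
    writer = csv.DictWriter(out, fieldnames=["timestamp", *columns], restval="")
    writer.writeheader()
    for timestamp, rows in records:
        for row in rows:
            writer.writerow({"timestamp": timestamp, **row})


def run(repo: str, scrape: Scraper, out: TextIO) -> None:
    """Glue: call the scraper once per commit and hand the results to the writer."""
    write_csv(
        ((ts, scrape(read_config_files(repo, commit))) for commit, ts in list_commits(repo)),
        out,
    )
```

Reading files with `git show <commit>:<path>` avoids checking out each commit, which is one possible design among several; a scraper that expects files on disk would need a checkout or a temporary directory instead. Columns missing from a given row are written as empty cells, so config keys that appear or disappear over time don't break the output.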
Nice to haves:
- Filter to only commits changing files under the "config/" directory. We could also filter for commits that actually change the parsed configuration, but this is a much smaller optimization and can be ignored.
- Compare with another independently derived result, if the opportunity arises. Agreement strongly suggests that both results are correct, and any disagreement points to problems which can be fixed. Output formats will probably differ, so you may want to coordinate with the other author and agree on some standard. Once files are aligned, "diff" should be sufficient.
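Both nice-to-haves can be sketched in a few lines of Python. The first function only builds the arguments for a `git log` call that is limited to one path, and the second rewrites a CSV into a canonical form (columns and rows sorted) so that two independently produced files can be compared line by line. The function names and the sorting convention are assumptions for illustration, not an agreed standard; coordinate with the other author on the real one.

```python
import csv
import difflib
import io
from typing import List


def config_log_args(path: str = "config/") -> List[str]:
    """Arguments for `git log` that keep only commits touching `path`."""
    return ["log", "--reverse", "--format=%H %cI", "--", path]


def canonical(csv_text: str) -> str:
    """Rewrite CSV text with sorted columns and sorted rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = sorted(reader.fieldnames or [])
    # Missing cells become empty strings so every row has the same shape.
    body = sorted([row.get(col) or "" for col in columns] for row in reader)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(body)
    return out.getvalue()


def differences(ours: str, theirs: str) -> List[str]:
    """Unified diff between the canonical forms; empty when they agree."""
    return list(
        difflib.unified_diff(
            canonical(ours).splitlines(),
            canonical(theirs).splitlines(),
            fromfile="ours",
            tofile="theirs",
            lineterm="",
        )
    )
```

This only aligns ordering. If the two outputs use different column names or value spellings, those still need to be agreed on before the comparison means anything.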
Considerations:
- The repository has seen several major changes to how configuration files are represented. For example, the config directory only came into existence in 2017, with commit 3474645b2. For now, only consider commits falling after that one.
- Software licenses from forked code continue to apply. Check that the contribution uses an open license and if it doesn't, you can request the author add one. No other due diligence is necessary unless you choose to copy and paste source code from other repositories, in which case you need to preserve the license information by copying over the license file and including a one-line attribution in a comment.
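For the commit cut-off in the first consideration, one option is a git revision range. The sketch below only builds the arguments for `git rev-list`; the function name and defaults are made up for illustration, and whether the cut-off commit itself should be included is left as a choice for you.

```python
from typing import List

FIRST_CONFIG_COMMIT = "3474645b2"  # commit named in the task description


def commits_after_args(start: str = FIRST_CONFIG_COMMIT, tip: str = "HEAD") -> List[str]:
    """Arguments for `git rev-list` covering commits after `start`, oldest first."""
    # "start..tip" means: reachable from `tip` but not from `start`,
    # so `start` itself is excluded.
    return ["rev-list", "--reverse", f"{start}..{tip}"]
```

Because the range is defined by reachability rather than by date, it can include commits from side branches that forked before the cut-off and were merged later, so the scraper should still tolerate a commit where "config/" is missing.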