
[timebox 16hrs] Explore the applicability of monorepo approach for organizing multiple software "packages" in the Wikibase git repository
Closed, Resolved · Public

Description

As a Wikibase maintainer I'd like to have an evaluation of making Wikibase a monorepo, so that we can make an informed decision about pursuing the idea further.

The Wikibase repository contains multiple potential software "packages" ("libraries"). These could be identified in the PHP code but also in the Wikibase front-end code, both the legacy and the new Vue.js code. There is also shared logic stored in i18n messages and their translations, and in the definitions of ResourceLoader modules used in Repo and Client. This includes the "packages" that are expected to be discovered when reorganizing the shared logic (e.g. in "Lib"). The current process of releasing and updating packages is fairly costly. It is possible that organizing the code of those "packages" in a single VCS repository would make the maintenance effort significantly lower.

It seems plausible to use the monorepo strategy to maintain multiple software packages in a single VCS repository. One of the benefits foreseen is simplifying the process of releasing new versions of those packages and updating the versions used in the actual Repo and Client applications.

Scope constraints: It would be strongly preferred to have a monorepo structure for both the backend (PHP) and frontend code bases. If possible, including the shared i18n messages and ResourceLoader modules would be beneficial, but it is not considered mandatory given the no-code nature of these.

Approach: Perform two spikes to try out certain technical solutions and compile a list of findings:

  1. composer-based solution experimented with by @Addshore [ link 1 ], [ link 2 ]
  2. some standard tool, e.g. split or any sensible tool listed in the awesome-monorepo list.

Note: the goal of this task is not to start migrating Wikibase towards a monorepo model. It is about short-lived experimentation, with the intention of gathering knowledge that allows a decision on applying some implementation of the monorepo strategy to Wikibase.

Note: Consider how to enforce clear boundaries between separate "packages" (i.e. that libraries do not madly depend on each other etc)

Timebox: 16 person hours (for the whole team to distribute)

Acceptance criteria:

  • There is an evaluation provided (linked in this task) of the feasibility and applicability of using the monorepo strategy for the software "packages" in Wikibase

Evaluation should:

  • highlight benefits and drawbacks of the solution(s) tested
  • list identified challenges (solving challenges is not expected)
  • provide some rough estimate of effort required for introduction and maintenance of the evaluated solution.
  • In case multiple solutions are evaluated, there is a comparison of the evaluated options provided (linked in this task)

Event Timeline

WMDE-leszek renamed this task from Explore the applicability of monorepo approach for organizing multiple software "packages" in the Wikibase git repository to [timebox 16hrs] Explore the applicability of monorepo approach for organizing multiple software "packages" in the Wikibase git repository. Jun 10 2020, 2:17 PM
WMDE-leszek updated the task description.

I view git subtree more as a one-time operation… if we have a library that’s currently maintained in a separate Git repository and we want to add it to the monorepo, we can use git subtree add to preserve its history rather than just copy+paste the code; conversely, if we want to split a Git repository out of the main Wikibase repository (e. g. to split it into one Git repository per extension, with Client and Repo in separate repositories), we can use git subtree split. (But I haven’t used git subtree myself yet.)

Symfony uses it in every commit to replicate to separate read-only repositories: https://www.youtube.com/watch?v=4w3-f6Xhvu8

I'm trying to get my head around it and see how it works but it seems a little complicated. Stay tuned.

I haven't finished this ticket but I wanted to share what I've got so far:

  • Monorepos are more prevalent than I thought. We all know about Google's monstrous repo, but lots of other places use the approach too, like Symfony, Babel, and many more.
  • Especially for Node.js packages it makes a lot of sense, given how granular npm packages are (is-thirteen, is-odd, is-even, etc.). These are joke packages, but the broader issue that npm packages are often extremely granular holds: for example, the tainted-references microfrontend has around 3000 dependencies on a fresh install. So having a monorepo for npm packages makes sense, and there is already a tool for that called Lerna. There are two presentations that talk about monorepos for npm packages: https://www.youtube.com/watch?v=rdeBtjBNcDI and https://www.youtube.com/watch?v=7Lr8xYPKG5w. But since the coupling between our components is between PHP code and RL modules, I'm not sure we should use it, especially given how our asset manager (ResourceLoader) works. At least for now; maybe we can revisit this once the build step has been implemented and is widely used.
  • Facebook uses something called FBShipIt to rewrite (git) history. It's written in Hack and unrunnable in our infra AFAIK. Not that useful anyway.
  • Symfony (https://www.youtube.com/watch?v=4w3-f6Xhvu8) does an interesting job of running a monorepo of PHP packages. The point is the distinction between where development and where distribution happen. There's a large monorepo that has all the code in directories; after each pull request is merged, the updater (named "splitsh", which was mentioned earlier) does a "git split" and copies the content to read-only git repos on GitHub, and those become composer packages and get released and tagged, etc. This is not hard to implement on our side, especially since we already mirror the code to GitHub; we could do the same.
    • Here's an example: this commit gets merged to symfony/symfony, and it is automatically reflected in symfony/console in this commit. The name, author, commit message: nothing changes except the git hash.
  • In contrast, addwiki makes a new "mono-repo sync update" commit that doesn't have much info in it; here's an example: https://github.com/addwiki/wikibase-api/commit/65c4db0db4b4112f260cfe46a9813d8fc9eba9d8 (the bash file that generates the commit). To be honest, I don't like losing this much git history (with the git shallow clone and the auto-generated commit), especially when you compare it with what Symfony does with git split.
  • Automating syncs the way Symfony does is not that hard: we can build a sync.sh file and wire it into GitHub Actions on the GitHub mirror to sync to several read-only repos in the wmde org on GitHub (if GitHub Actions doesn't let us run arbitrary bash files, we can simply run it automatically with a cronjob in the cloud).
  • Using "git split" is useful when you want to preserve the history, we need to move files and change them so much that usually thinks they are two different files, we lose the history anyway. You can argue that it's not useful for now but once we migrated then it'll be useful which I totally agree with that.
  • A really monstrous monorepo has its own downsides too. Besides the very special tooling Google had to implement, this presentation briefly mentions that Google had lots of trouble upgrading JUnit because it had to be done in one commit (for whatever reason, running two versions side by side wasn't possible), and doing that in one big commit across terabytes of code wasn't easy. I assume we won't be so big that doing everything in one commit causes problems, but it's worth mentioning.
  • My biggest worry is testing. Lots of GitHub repos can define their testing for themselves, but in our infra it might get complicated. Also, how should tests work? The current way is that Jenkins spins up a fully fledged MediaWiki instance and then runs a maintenance script called phpunit.php. If we let everything be tested like this, we lose the whole point of modularity: people will (unintentionally) start depending on classes from other packages, the tests will still pass, and the packages will basically turn into directories instead of packages. On the other extreme, if we don't autoload the classes, rely on vendor instead, and run tests separately for each commit, we defeat the whole point of having a monorepo in the first place. The most feasible solution I could think of is this:
    • Introduce a new Jenkins/Quibble job that goes through each package and runs "composer install && composer test" (a rough sketch follows after this list). That doesn't autoload anything else, meaning that if we add a forbidden dependency, these tests fail. I need to emphasize that these have to be pure unit tests (or at least, the integration tests have to be isolated to the package only).
    • Make sure the packages' classes are autoloaded in the "lib" extension (or the client extension, depending on the outcome of T254922; I don't know how yet) and that we don't use vendor in phpunit.php tests. That way we can make sure the whole extension works with the code, and we don't need to go through the pain of "release -> patch Wikibase to pick it up -> upgrade vendor" every time.
  • The other problem is coupling to MediaWiki. Even the term store code depends on MediaWiki in places, meaning we can't take it out yet. One idea is to actually upstream this monorepo solution and have core publish composer packages too (like includes/libs/rdbms); then we could safely couple to a very small part of core (published on composer) in our modules. The problem is also not just one class having a dependency, it's how we respond to that. Can we move all term store code into one package except the naughty class that depends on core? Well, what if other classes in the term store depend on that naughty class? If we separate the term store into "naughty classes" and "nice classes" like Santa, and move only the nice classes, we are basically breaking a coherent module into incoherent pieces.
  • Also, one big problem that is not specific to this case is that lib/ is pretty muddy and there's no easy way to take things out of it. We should build a clear map of lib and the neighbourhoods we can split out.
  • Last point: I love the idea, but it seems to be much bigger than a six-week project.
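
A minimal sketch of the per-package test job mentioned in the list above; the package directories below are hypothetical placeholders, and it assumes each package's composer.json defines a "test" script:

import subprocess

# Hypothetical package directories, for illustration only; a real job would
# enumerate the actual packages in the monorepo.
PACKAGES = [
    'lib/packages/wikibase/changes',
    'lib/packages/wikibase/data-access',
]

for package in PACKAGES:
    # Each package only installs its own composer.json dependencies into its
    # own vendor/, so a class from another package that is not a declared
    # dependency will not autoload and the tests will fail.
    subprocess.run(['composer', 'install'], cwd=package, check=True)
    subprocess.run(['composer', 'test'], cwd=package, check=True)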

it's much bigger than a six-week project

to remind us of the goal of this "six-week project": a long-term strategy outlined for organizing the "shared" logic

Would you say drafting the strategy, including the monorepo, is still not feasible in the time allocated (e.g. due to the estimated time required to look into addressing the challenges listed)?

I think it'll be feasible to outline the strategy and start on it, but I'm not sure if we can move a significant part of lib out to the new system.

Okay, I built a POC of building read-only minion repos out of our monorepo. I started with tainted refs: https://github.com/Ladsgroup/tainted-refs (as you can see, the history works, the commit diffs are fine, etc.). It doesn't contain the Jenkins merge commits, for reasons unknown to me (yet). By adding GitHub Actions to the minion read-only repos, you can automate npm package releases, etc. So we merge a change in Gerrit and a couple of hours later it's released on Packagist/npm.

This Python script automates syncing history from the monorepo to the minion repos:

import tempfile
import subprocess
import time
import os


def run(args):
    """Run a command, print its output and timing, and fail loudly on error."""
    print('Running:', ' '.join(args))
    start_time = time.time()
    res = subprocess.run(args, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
    print('Duration (seconds):', time.time() - start_time)
    print('stdout:')
    print(res.stdout.decode('utf-8'))
    print('stderr:')
    print(res.stderr.decode('utf-8'))
    if res.returncode != 0:
        raise Exception('Command failed: ' + ' '.join(args))

cases = [
    {
        'monorepo_url': 'https://github.com/wikimedia/mediawiki-extensions-Wikibase',
        'name': 'tainted-ref',
        'path': 'view/lib/wikibase-tainted-ref/',
        'target_url': 'https://github.com/Ladsgroup/tainted-refs'
    }
]
for case in cases:
    with tempfile.TemporaryDirectory() as tmpdirname:
        original_cwd = os.getcwd()
        # Clone the monorepo, extract the package's history into its own
        # branch, and force-push that branch to the read-only minion repo.
        run(['git', 'clone', case['monorepo_url'], tmpdirname])
        os.chdir(tmpdirname)
        run(['git', 'subtree', 'split', '-P', case['path'], '-b', case['name']])
        run(['git', 'remote', 'add', case['name'], case['target_url']])
        run(['git', 'push', '-f', case['name'], case['name'] + ':master'])
        # Leave the temporary clone before it is cleaned up, so the next case
        # doesn't run from a deleted working directory.
        os.chdir(original_cwd)

It's pretty straightforward (and fully automated if a git credential store is set up). Here's an example run:

amsa@amsa-Latitude-7480:~/workspace$ python monorepo_updater.py 
Running: git clone https://github.com/wikimedia/mediawiki-extensions-Wikibase /tmp/tmpctrn4ykh
Duration (seconds): 56.4809193611145
stdout:

stderr:
Cloning into '/tmp/tmpctrn4ykh'...

Running: git subtree split -P view/lib/wikibase-tainted-ref/ -b tainted-ref
Duration (seconds): 106.29542660713196
stdout:
69607d8c5b6ff96eff2288f78cd364ebaa642cca

stderr:
Created branch 'tainted-ref'

Running: git remote add tainted-ref https://github.com/Ladsgroup/tainted-refs
Duration (seconds): 0.0031194686889648438
stdout:

stderr:

Running: git push -f tainted-ref tainted-ref:master
Username for 'https://github.com': Ladsgroup
Password for 'https://Ladsgroup@github.com': 
Duration (seconds): 25.103217840194702
stdout:

stderr:
Everything up-to-date

One idea to consider is to actually bring the packages we maintain (data model and data model services) back into the Wikibase monorepo and have them automatically synced using this tool ^. Doing so (and only for the initial merge of the repos) requires push rights (with the ability to forge the author) in Gerrit, which we don't have, but it's not hard to get temporarily. I would like to talk to RelEng about this in general. So we could start by bringing back a package instead of ripping apart lib (I assume the former would be easier than the latter).
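
As a rough sketch of that first step, reusing the run() helper from the script above; the target directory and upstream URL here are illustrative assumptions, not a decided layout:

# Sketch only: prefix and upstream repository are guesses for illustration.
run(['git', 'subtree', 'add',
     '-P', 'lib/packages/wikibase/data-model',
     'https://github.com/wmde/WikibaseDataModel.git', 'master'])
# This keeps the library's full history in the monorepo; pushing the result to
# Gerrit is the step that needs the elevated (author-forging) rights mentioned
# above.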

I think I'm done investigating this. Any review, comments, profanity, etc. is welcome.

Interesting, thanks. Yes, bringing back some of our libraries into the monorepo was one of the things I assumed we would do as part of this. But I’m not sure if we need this continuous re-export into a separate repository at all? Can’t we publish npm or composer packages directly from a subdirectory of the Wikibase repository?

In contrast, addwiki makes a new "mono-repo sync update" commit that doesn't have much info in it; here's an example: https://github.com/addwiki/wikibase-api/commit/65c4db0db4b4112f260cfe46a9813d8fc9eba9d8 (the bash file that generates the commit). To be honest, I don't like losing this much git history (with the git shallow clone and the auto-generated commit), especially when you compare it with what Symfony does with git split.

Yup, I'm not happy with how it is there yet either.

One idea to consider is to actually bring the packages we maintain (data model and data model services) back into the Wikibase monorepo and have them automatically synced using this tool ^

IMO this would reduce development friction while not losing any of the benefits we get from having separate packages, and this is one of the main reasons I started looking into monorepos (especially as we probably want more libs, not fewer).

This script could probably pretty easily be run on Jenkins post-merge too, rather than running on the GitHub mirror, but it probably doesn't need to.

Another thing to consider here is mediawiki-vendor, and how package dependencies between extensions would work: they need to be defined in a composer.json file, but that code also has to be included in one of the git repos, etc.

My biggest worry is testing. Lots of GitHub repos can define their testing for themselves, but in our infra it might get complicated.

This is also one of the parts I ended up getting stuck on and would have had to spend more time thinking about.

Examples:

  • If you are fixing a phpdoc comment in the Wikibase extension, you don't really want CI for the other 8 libraries to needlessly run?

Ultimately this ends up needing a dependency graph, and knowing which tests need to run when certain things change?
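
A naive first cut at that could simply map changed paths to package prefixes; the prefixes below are examples taken from this ticket, and a real solution would also need the reverse dependency graph to re-test packages that depend on the changed ones:

import subprocess

# Example package prefixes from this ticket; not an authoritative list.
PACKAGE_PREFIXES = ['client/data-bridge/', 'view/lib/wikibase-tainted-ref/']

# Files touched by the commit under test.
diff = subprocess.run(
    ['git', 'diff', '--name-only', 'HEAD~1', 'HEAD'],
    capture_output=True, text=True, check=True,
)
changed_files = diff.stdout.splitlines()

# A package is affected if any changed file lives under its prefix.
affected = sorted(
    prefix for prefix in PACKAGE_PREFIXES
    if any(path.startswith(prefix) for path in changed_files)
)
print('Run package CI for:', affected or 'nothing (no package touched)')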

Interesting, thanks. Yes, bringing back some of our libraries into the monorepo was one of the things I assumed we would do as part of this. But I’m not sure if we need this continuous re-export into a separate repository at all? Can’t we publish npm or composer packages directly from a subdirectory of the Wikibase repository?

npm packages yes, composer packages no.
A composer package must be its own git repo with composer.json at the root of that repo

https://github.com/composer/composer/issues/2588#issuecomment-32260091

Having said that, there are obviously many explorations around getting around this requirement, e.g. https://github.com/andkirby/multi-repo-composer

Interesting, thanks. Yes, bringing back some of our libraries into the monorepo was one of the things I assumed we would do as part of this. But I’m not sure if we need this continuous re-export into a separate repository at all? Can’t we publish npm or composer packages directly from a subdirectory of the Wikibase repository?

Plus what Adam said, the reason would be easier re-use and discoverability. It's hard to find "data bridge" in Wikibase unless you know the repo very well. Most importantly, if I want to run, use, or just study the code on my system, I can just pull it down from the minion repo instead of cloning a rather large repo that puts a lot of burden on IDEs.

Examples:

  • If you are fixing a phpdoc comment in the Wikibase extension, you don't really want CI for the other 8 libraries to needlessly run?

Ultimately this ends up needing a dependency graph, and knowing which tests need to run when certain things change?

I think doing what we already do for our microfrontends would be fine. Running tests needlessly happens all the time (if you change a doc block, ALL database, composer, and browser tests are run). The important thing to note is that composer tests (unless they deal with the DB, Elastic, or the Scribunto engine, which I don't think any of the packages will do) are pretty fast and won't take much time to run.

From @gabriel-wmde: JavaScript monorepo tools (from a recent talk he attended). I haven't looked at these at all but figured I should include them in this ticket.

I’m trying to understand how we would work with these packages. For Data Bridge, I use terminals with …/client/data-bridge/ as the working directory, so I can directly run npm test etc. I hope that would also be possible for composer packages, and that the restriction…

A composer package must be its own git repo with composer.json at the root of that repo

https://github.com/composer/composer/issues/2588#issuecomment-32260091

…only applies when you want to publish a package? (And presumably, when I work in a subproject like that, it only has access to classes in its own vendor/ directory, not adjacent classes, so that the packages are still reasonably separate even though they’re versioned in the same monorepo.)

Copied over from the meeting notes:

Summary: Seems like a good idea. The Symfony-like approach (git subtree split) seems most promising.
Risks, threats, challenges identified: Code developed in a monorepo (e.g. Bridge, TR) has longer CI times than code developed in a separate repository (e.g. Termbox).
Opportunities noticed: We can have the development convenience of keeping the code in one Git repository, while making the subcomponents more discoverable via their extracted subtrees and publishing packages from there.
Other remarks: