Page MenuHomePhabricator

[timebox 12hrs] Outline a possible way of storing shared Repo-Client logic inside WikibaseClient extension
Closed, ResolvedPublic

Description

As a Wikibase maintainer, I would like to see a proof of concept for storing shared Repo-Client logic inside WikibaseClient extension, as well as know the benefits and risks/challenges of this approach, so that I can determine and judge whether this is an appropriate mid-term solution or not.

There has been an idea suggested to move all logic shared between Repo and Client (currently and Lib and "data access") to Client.

This is not meant to be a permanent solution, but a pragmatic trick to simplify the complexity. Overall goal is to make Repo and Client completely independent from each other.
The shared logic would be put in the a dedicated section of Client, and there will be infrastructure (e.g. "structure test") disallowing Repo using any logic from client apart from the "shared" package.

The idea is pretty rough draft so far. In order for the team to be able to elaborate and reason about this idea in more detail, it would be good to have some more specific draft of the suggestion. This way the team could weigh the benefit and challenges/risks, and define a more detailed path forward

Acceptance criteria:

  • Draft (free form - not the complete code implementation) of the proposed approach exists and is linked from this task
  • List of benefits and challenges/risks for the proposed solution is created and linked from this task

Possible aproach

  • One possible form of the first AC would be a Gerrit change, though a proof-of-concept one that wouldn’t be merged. In order to keep that change small, we would not move code outside of the lib/ directory or the \Wikibase\Lib namespace yet, only move the registration of that code from WikibaseLib to WikibaseClient.
  • There could also be a draft of an ADR describing the intended solution (possibly coming together with the POC code change described above)

Timebox: 12 person hours for the team to distribute

Event Timeline

WMDE-leszek renamed this task from Outline a possible way of storing shared Repo-Client logic inside WikibaseClient extension to [timebox 12hrs] Outline a possible way of storing shared Repo-Client logic inside WikibaseClient extension.Jun 10 2020, 2:31 PM
WMDE-leszek updated the task description. (Show Details)

I just want to add there are two approaches doable:

  • Move all of lib code to client and then make repo depend on client
  • Dehydrate lib though taking things out to composer package and move the rest (i18n messages, RL modules) to client

Change 606271 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] [POC] Load WikibaseClient extension in WikibaseRepo

https://gerrit.wikimedia.org/r/606271

I will try to turn this to a draft ADR:

Reasoning

The idea at first seems counter-intuitive given that the data flow is actually the other way around (clients depend on repo's data), this is actually not 100% true. The repo depends on client to update itself when there's a page deletion or page move (so the data flow goes both ways).
Most repos (and all repos in WMF infra) are client of themselves as well. Meaning we can see a wikibase setup in another way: We have clients and some of them can get "upgraded" to a repo.

Pros

  • It simplifies the technical schema and architecture of Wikibase. Right now, we have repo and client working independently but they talk to each other through lib/, data-access/ and several other composer libraries like data model and data model services. The boundary between these shared logic parts are not clear and lib/ has basically turned to a kitchen sink holding any useful or not useful logic that's shared between client and repo. Something like "just one dependency" improves the onboarding experience.
  • It avoids collision in logic in a "both repo and client" mediawiki instance (like Wikidata).
    • For example, if we want to register an API module in client and repo registers the same thing again (which happened before: T115117), can cause issues or at best confusion. Having repo depending on client makes these kind of collisions easier to detect and fix.
    • Another case is shared configs, as an example we had to define two configs for migrating to the new term store (one for clients and one for repos), they had typos, one got fixed and the other one was missed for months and caused two large-scale outages. This could have been avoided by storing all shared configs in one place.
  • In some cases, it makes the logic in the code simpler.
  • Some installations can be simpler, currently if you want to install a setup similar to wikidata.org, you need to install two different extensions that they separately install a couple more extensions like lib as well. Specially, since lib and view are not just php codes that we could move to composer libraries, they register RL modules and i18n messages.

Cons

  • It makes repo more tightly to client coupled. In software design, the aim is to have components that are "highly cohesive and loosely coupled" (Form "Growing Object-Oriented Software, Guided by Tests" book). While you can argue it would make the client more cohesive (and eliminates incohesive components like lib), it definitely couples repo and client tighter. Currently, client code has not been used directly by the repo and with this change, repo will start to depend on lots of parts of client.
  • It destroys the API between client and repo. The API (in broad sense of the term) between client and repo is not clear, it's lib/, the composer libraries, mediawiki core but it's at least to some degree well defined. This change completely destroys that notion and every part of client can be a place for repo to latch on and couple to.
  • It makes having a plain repo impossible. In order to install a repo, you need to install the client first.
  • It's the exact opposite direction of the data flow. While it's already mentioned that the data flow goes both ways but the most of the data flow is from repo to client. This makes this change counter-intuitive and unconventional. I can't find any example that does this.
  • It reduces modularity of Wikibase.

One idea to think about is dehydrating php code (aka the logic) out of lib and client through moving them to composer libraries and then making repo depend on the client (mostly to have shared configs, i18n messages and other couplings to mediawiki core). This to some degree addresses some cons mentioned above but it still doesn't reduce the coupling problem (unless we make sure repo only depends on several composer libraries and not all of client) and it doesn't address the installation issues too.

I am still not quite sure that the fact that most repos are also clients justifies this coupling. If this is true, then I'd like to maybe revisit the question of why they were split in the first place? What was the guiding architectural principle in keeping them separate from each other, and where, architecturally speaking, do they fail to be seperated (which is probably the subject of the coupling investigation)?

For example, if we want to register an API module in client and repo registers the same thing again (which happened before: T115117), can cause issues or at best confusion. Having repo depending on client makes these kind of collisions easier to detect and fix.

Can't this easily be mitigated using checks to see if the other extension is registered or, at a lower level, if the API module (or any other piece of logic that is global and verifiable) itself is registered?

Another case is shared configs, as an example we had to define two configs for migrating to the new term store (one for clients and one for repos), they had typos, one got fixed and the other one was missed for months and caused two large-scale outages. This could have been avoided by storing all shared configs in one place.

Generally and theoretically speaking (as I am not yet fully familiar with the circumstances), could this have also been avoided if it was only one of the extensions' (and not both) responsibility to directly interact with the db?

In some cases, it makes the logic in the code simpler.

In which cases? examples here would maybe clarify this need to me (personally)

Otherwise I agree with the pros presented here, registering one monolithic extension is definitely easier than having two separate extensions (and an additional two pseudo extensions). However, it is worth noting that, at least from my personal perspective - and without knowing the code base too deeply yet, a change like this is definitely counter - intuitive.

I am still not quite sure that the fact that most repos are also clients justifies this coupling. If this is true, then I'd like to maybe revisit the question of why they were split in the first place? What was the guiding architectural principle in keeping them separate from each other, and where, architecturally speaking, do they fail to be seperated (which is probably the subject of the coupling investigation)?

I think it was started with the obvious solution (Wikidata being the center of truth and Wikipedias depending on it) and then slowly the product evolved in interesting ways, for example community wanted to have sitelinks removed once the article is deleted in Wikipedia which means the exact opposite of the flow And then federation came and there's no one source of truth anymore. That being said, I'm not 100% sure this is the solution either. I'm mostly gathering information and can't' decide on it (specially since architecture is not my strongest suit).

For example, if we want to register an API module in client and repo registers the same thing again (which happened before: T115117), can cause issues or at best confusion. Having repo depending on client makes these kind of collisions easier to detect and fix.

Can't this easily be mitigated using checks to see if the other extension is registered or, at a lower level, if the API module (or any other piece of logic that is global and verifiable) itself is registered?

Apparently there is no mechanism in core to avoid this.

Another case is shared configs, as an example we had to define two configs for migrating to the new term store (one for clients and one for repos), they had typos, one got fixed and the other one was missed for months and caused two large-scale outages. This could have been avoided by storing all shared configs in one place.

Generally and theoretically speaking (as I am not yet fully familiar with the circumstances), could this have also been avoided if it was only one of the extensions' (and not both) responsibility to directly interact with the db?

Client and repo have their own data access logic and it's not shared in one extension (specially since a repo wiki has a whole lot of different table than a client and client has tables too). But the problem of shared configs is not limited to database, it can happen on frontend too (this was mostly an example of a case where this problem caused lots of issues).

In some cases, it makes the logic in the code simpler.

In which cases? examples here would maybe clarify this need to me (personally)

I couldn't come up with a great example, the only thing I remember is the case of change dispatching, the change dispatching from repo to core has the special logic when it needs to dispatch to itself (since a repo can be a client of itself so as a client it needs to changes to be dispatched to itself too) with the new world order, it would simply don't look if client is installed or not, it assumes it's installed.

Otherwise I agree with the pros presented here, registering one monolithic extension is definitely easier than having two separate extensions (and an additional two pseudo extensions). However, it is worth noting that, at least from my personal perspective - and without knowing the code base too deeply yet, a change like this is definitely counter - intuitive.

I’m confused by the last three comments, and I think this draft ADR needs a section on what the proposed change is. I assumed the proposed change was to merge Lib into Client and make Repo depend on Client, but parts of these last comments sound like Repo and Client should be completely merged into one extension?

Change 607282 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] DNM: Merge WikibaseLib extension into WikibaseClient

https://gerrit.wikimedia.org/r/607282

Change 607282 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] DNM: Merge WikibaseLib extension into WikibaseClient

https://gerrit.wikimedia.org/r/607282

I made a hackier version in:

Change 606271 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] [POC] Load WikibaseClient extension in WikibaseRepo
https://gerrit.wikimedia.org/r/606271

[snip]

I’m confused by the last three comments, and I think this draft ADR needs a section on what the proposed change is. I assumed the proposed change was to merge Lib into Client and make Repo depend on Client, but parts of these last comments sound like Repo and Client should be completely merged into one extension?

That is definitely not my intention. I don't think everything should be one extension.

It looks like this is sufficiently outlined now, and that’s all that was required for this task.

Change 607282 abandoned by Lucas Werkmeister (WMDE):
[mediawiki/extensions/Wikibase@master] DNM: Merge WikibaseLib extension into WikibaseClient

Reason:
superseded by I1e65288f1e

https://gerrit.wikimedia.org/r/607282

Change 606271 abandoned by Ladsgroup:
[mediawiki/extensions/Wikibase@master] [POC] Load WikibaseClient extension in WikibaseRepo

Reason:
We decided not to go with this approach. It was a POC anyway

https://gerrit.wikimedia.org/r/606271