Page MenuHomePhabricator

Investigate Doctrine DBAL usage possibility
Closed, DeclinedPublic

Description

Motivation
wb_terms is logically an isolated table from all other mediawiki or wikibase functionality.

Once we got to the point of refactoring our code to interact with the new schema, we thought we might as well try to decouple that logic into a separate library.

Since that library will have to interact with a database, and to avoid coupling it back to things that it does not need to depend on really, we want to investigate whether we can as well decouple on DB level interaction.

Doctrine DBAL is our choice for the benefits listed below.

This task is negatively concluded if any of the following fail:

  • (POC) we have experimented with the decoupled solution, using Doctrine DBAL, and tested it in isolation. [ Github Repo and related to that T220155]
  • (RFC) we have informed, collected feedback and got approval on using a DBAL other than MW one in production from all concerned parties at WMF.
  • (MVP) we have integrated a minimal production-ready implementation (limited to Property Terms) of the decoupled solution into Wikibase and tested it on beta and test environments. (follow on T219121)

If the task concludes positively, we go ahead and continue with the decoupled Doctrine DBAL based solution.
If the task concludes negatively, we continue with MW DBAL based solution.


Benefits of introducing Doctrine DBAL

Apart from the obvious improvements on the architectural level that we gain from building a decoupled solution, going with that to DBAL level and use Doctrine DBAL instead of MW one has the following benefits on the long term.

  • Doctrine is quite popular project, and many developers have worked or would be more willing to learn it than any custom homemade framework. This reduces both learning curve steepness and knowledge friction for new comers and volunteers.
  • Doctrine is a well design, well maintained and well documented open source DBAL solution.
  • Doctrine already supports multiple platforms, including but not limited to the ones supported by MW.
  • Much faster and fully isolated tests (~50 ms to run all tests)

Possible drawbacks of not using MW DBAL

  • extra maintenance to support specific and significant functionality implemented in MW DBAL? (please list if you know any)
    • <specific significant functionality we cannot live without #1>
    • ...

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 501119 had a related patch set uploaded (by Alaa Sarhan; owner: Alaa Sarhan):
[mediawiki/core@master] Experimental - DNM ever - Access mysqli instance to run an experiment

https://gerrit.wikimedia.org/r/501119

Change 500837 had a related patch set uploaded (by Alaa Sarhan; owner: Jeroen De Dauw):
[mediawiki/extensions/Wikibase@master] WIP - term store integration

https://gerrit.wikimedia.org/r/500837

Change 501119 abandoned by Alaa Sarhan:
Experimental - DNM ever - Access mysqli instance to run an experiment

Reason:
Risky .. just for sharing purposes

https://gerrit.wikimedia.org/r/501119

In the changes above, you can see what it took use to get a DBAL connection from a MW connection.

The mediawiki change is abandoned (to avoid any confusion or risks) and is there just for sharing purposes.

The current integration works locally against mw database, which means we can make it work for production too.

Just a heads up...

Ops/SRE may have some input, but you really need to be speaking to TechCom and possibly going through the RFC process if you want to actually take this into production (before you spend too much time on something that could be prevented from happening)

Then there's potential performance and security review too before deployment

@Marostegui @jcrespo

This one will also benefit a lot from your feedback :) The two open points in the checklist in the description.

Just a heads up...

Ops/SRE may have some input, but you really need to be speaking to TechCom and possibly going through the RFC process if you want to actually take this into production (before you spend too much time on something that could be prevented from happening)

Then there's potential performance and security review too before deployment

Thanks @Reedy for the heads up! I tagged TechCom on it and directly messaged Ops to get some input

Just a heads up...

Ops/SRE may have some input, but you really need to be speaking to TechCom and possibly going through the RFC process if you want to actually take this into production (before you spend too much time on something that could be prevented from happening)

Then there's potential performance and security review too before deployment

I just want to say this is not yet decided at the team level so if we want to actually implement it, then it needs to go through the process. For now, I would ask for TechCom, performance, security and SRE's input on why they think we should or should not use doctrine,

alaa_wmde updated the task description. (Show Details)

Security hi, would be great if someone can follow up with us on this regarding the points in the checklist marked by [needs-wmf-feedback]

Just to clarify, please tell me if this is correct: this ticket is about using DBAL to make queries during web requests involving user input. As opposed to the use of DBAL in the context of T191231: RFC: Abstract schemas and schema changes], where it is used only to generate SQL during maintenance.

Just to clarify, please tell me if this is correct: this ticket is about using DBAL to make queries during web requests involving user input. As opposed to the use of DBAL in the context of T191231: RFC: Abstract schemas and schema changes], where it is used only to generate SQL during maintenance.

Yes and also judging by the POC that @JeroenDeDauw put together at https://github.com/wmde/wikibase-term-store it also involves doctrine connections (like this)

So this raises the question if we want to have a) more than one methods of connecting to databases, and b) more than one method of abstracting queries in the code base. The latter seems ok at a glance, at least for extensions. I'm worried about the former. Using DBAL conenctions would need to integrate with the existing LoadBalancer code, so it behaves consistently. And even if it does that, this may lead to multiple master connections being open, with multiple concurrent transactions per request, with the potential for deadlocks.

We'll want to look at this very closely indeed. It seems like asking for trouble.

@daniel thanks for the quick response!

@Ladsgroup is this something you can help us understand and maybe test somehow? locally it is anyway tricky, but I wonder how then we would check some assumptions we might come up with. any ideas?

As an alternative to using Doctrine to conenct to the database, consider implementing Doctrine\DBAL\Driver\Connection based on a Wikimedia\Rdbms\Database instance. That way, you can still user doctrine's query builder, etc, without the need of duplicating connections, and still benefiting from MediaWiki's load balancing mechanism.

To be honest, I don't see how anything that makes its own database connections bypassing the LoadBalancer config would get approved for production use.

Another option would be to go the other way around and make a Wikimedia\Rdbms\Database implementation based on a Doctrine\DBAL\Driver\Connection. That could also work, but calling code would then need to check that it'S getting the right kind of Database object from the load balancer, and production config would need to be changed to use Doctrine style connection specs. That's not totally out of the question, but still seems unlikely to happen.

So, a MediaWiki DBAL driver is probably the way to go. Or maybe I gout confused and completely missed the point of it all. I don't have access to the benefits/drawbacks doc.

Oh, that's what https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/500837/5/lib/MediaWikiDoctrineDBALDriver.php does. Never mind me, then. As long as all connections are controlled by MediaWiki's load balancer, it's all good.

Edited to clarify: "all is good" was to mean "this addresses my concern about separate database connections", not "go right ahead with this". Other concerns need to be looked at, and other people need to chime in.

I'm super happy to hear that. Thanks for the feedback @daniel

I'm not sure that going through the mediawiki load-balancer is the only concern with starting to use a database abstraction layer.

While I understand very well why this can be attractive, I have had bad experiences with DBALs creating sub-optimal queries, and with the fact it's very hard to optimize them then for a production environment. I have no experience with Doctrine specifically though, so I'd like to take a look at your document about advantages/drawbacks, if you grant me access :)

I guess our DBA team might have other concerns too.

I'm not sure that going through the mediawiki load-balancer is the only concern with starting to use a database abstraction layer.

I didn't mean to say that it's the only concern - rather, not going through the load balancer would be an immediate show stopper. Other concerns should be investigated, of course.

I have had bad experiences with DBALs creating sub-optimal queries, and with the fact it's very hard to optimize them then for a production environment

Was that with the ORM layer, or something else? I would indeed recommend against using ORM for anything non-trivial.

I'm not sure that going through the mediawiki load-balancer is the only concern with starting to use a database abstraction layer.

While I understand very well why this can be attractive, I have had bad experiences with DBALs creating sub-optimal queries, and with the fact it's very hard to optimize them then for a production environment. I have no experience with Doctrine specifically though, so I'd like to take a look at your document about advantages/drawbacks, if you grant me access :)

Hey Joe, just to clarify, we are but all means necessary are not taking about ORM here in any shape or form.

I guess our DBA team might have other concerns too.

While I agree with Daniel and others that the use of the MediaWiki db connection/load balancing layer is an absolute minimum requirement, there are quite a few other potential problems that could affect the security/privacy, reliability or maintainability of our data and services, if Doctrine is to be used to access MediaWiki's existing databases in any way (it's definitely easier if done in separate, not connected database clusters). However this ticket so far is very sparse on details, and we don't have the information we need to make an informed decision. I've requested access to the linked document yesterday, but so far it wasn't granted yet. Alternatively, could this perhaps be replicated here on Phabricator so everyone involved can build an informed opinion? Thanks. :)

I don't think Daniel's comment should be regarded as the go-ahead from WMF's perspective on production use, and in general I believe there should be more information as well as time for people to be able to chime in ("at first glance..."). I will update the edit summary to reflect this.

I don't think Daniel's comment should be regarded as the go-ahead from WMF's perspective on production use, and in general I believe there should be more information as well as time for people to be able to chime in ("at first glance..."). I will update the edit summary to reflect this.

Indeed. Edited to clarify. I think the idea has merit, but we'll want to be really careful not to break our databases.

Put it back In Progress due to more concerns being raised

We will add more context and information to this task, addressing some of the concerns raised above to help with getting a proper feedback!

@daniel @mark @Joe @Reedy

Tagging you as I have updated task description with some context and our motivation. So far, we do not believe there's anything significant to gain or lose performance-wise when it comes to comparing Doctrine's DBAL with MW DBAL.

It would be very much appreciated if you add your functional concerns of not using MW DBAL below the headline in the task description so that we can experiment with them one by one and get back to you with our findings.

For non functional concerns, such as security, maintainability or reliability, please let us know if we can we help reach conclusions. We would be super happy to contribute to that and see it moving too!

alaa_wmde updated the task description. (Show Details)
alaa_wmde updated the task description. (Show Details)

For non functional concerns, such as security, maintainability or reliability, please let us know if we can we help reach conclusions. We would be super happy to contribute to that and see it moving too!

Please define "functional" vs "non functional" here...

Change 501119 restored by Alaa Sarhan:
Experimental - DNM ever - Access mysqli instance to run an experiment

https://gerrit.wikimedia.org/r/501119

I’m not really enthusiastic about using Doctrine in general. The task description lists four benefits, but I don’t find them particularly convincing:

  • Doctrine is quite popular project, and many developers have worked or would be more willing to learn it than any custom homemade framework. This reduces both learning curve steepness and knowledge friction for new comers and volunteers.

This needs to be balanced against the added cost where experienced MediaWiki developers will have to learn to work with this external framework. (Or perhaps I’m the only one not familiar with it? Who knows.) Also, I think it’s fairly common that new developers will work not only on Wikibase, but also on other MediaWiki/Wikimedia things, so they might still have to learn to work with MediaWiki’s database framework in that case (unless there are plans for a much larger migration on the horizon). Overall, I’m not convinced this is a net benefit when considering that the new term storage layer will not live in a vacuum.

  • Doctrine is a well design, well maintained and well documented open source DBAL solution.
  • Doctrine already supports multiple platforms, including but not limited to the ones supported by MW.
  • Much faster and fully isolated tests (~50 ms to run all tests)

All these sound to me like relevant points when deciding between DBAL providers, but not very relevant when deciding whether we should use it at all or not. How often do you expect to run the tests of this library?

I also want to point out, in response to this –

Apart from the obvious improvements on the architectural level that we gain from building a decoupled solution

– that the MediaWiki database framework is already fully decoupled from MediaWiki, except that it still lives in the same source code repository. As far as I know, you can totally use it without MediaWiki if you want. (And I hear it might be extracted into a separate repository in the future, too.)

More to the point, though, I am very puzzled why Doctrine is being considered specifically in the context of the wb_terms work. As far as I’m aware, Doctrine is mainly famous for its ORM system; however that’s not at all what we want: this is a problem that’s very close to the database layer, and there’s already been a lot of investigation into how to address it, most of which was discussed in SQL terms (though not necessarily in SQL syntax). As far as I understand, we don’t want an ORM to manage the schema for us, we want to stay in control of it. But is the DBAL on its own really that much more valuable compared to MediaWiki’s?

Skimming through the experiment repository, it looks like Doctrine is used fairly similarly to MediaWiki’s own database framework (direct calls to insert or select data), but I’ve also found this snippet:

private function findExistingTextId( Term $term ) {
	$record = $this->connection->executeQuery(
		'SELECT wbx_id FROM ' . $this->tableNames->text() . ' WHERE wbx_text = ?',
		[ $term->getText() ],
		[ \PDO::PARAM_STR ]
	)->fetch();
	return is_array( $record ) ? $record['wbx_id'] : false;
}

Quite frankly, I’m horrified at seeing a query string being concatenated from variable sources like this. I’m pretty sure this wouldn’t be necessary in MediaWiki’s framework – the table name is a separate argument to selectField and similar functions, and the implementation checks that the name is valid and/or quotes it if required. If this is not possible and encouraged in Doctrine, then that’s a pretty significant drawback in my opinion.

daniel moved this task from Inbox to Watching on the TechCom board.EditedApr 10 2019, 8:07 PM

Moving this to "watching" on the TechCom board. Feel free to tag this with TechCom-RFC when you feel you want to start this on an RFC process. Note that RFCs don't have to be concrete proposals looking for approval. They can also be requests to explore possible solutions to a specific problem.

Stalled until we are done with MW DBAL implementation

Addshore changed the task status from Open to Stalled.Oct 31 2019, 10:04 AM
Addshore claimed this task.
Addshore subscribed.

Marking as stalled for now.
Will revisit this after tech conf and any discussions and outcomes that happen there.