Page MenuHomePhabricator

Merge pagelinks, templatelinks and imagelinks tables
Closed, DeclinedPublic

Description

These 3 tables (pagelinks, templatelinks and imagelinks) do basically the same thing, and could be merged relatively easily[1]. The only difference between them (aside from the prefix on the column names) is that imagelinks doesn't store the destination namespace because it doesn't need to.

Reasons for merging:

  1. Simpler, cleaner DB schema.
  2. A large amount of the code that references these (e.g. in Special:Whatlinkshere or BacklinkCache.php) is waaay more complicated than it needs to be - this would be mitigated if a single table were used.
  3. It is more future-proof - we can't keep adding a new table each time there is a new link type we want to report on.
  4. It would open the door for extensions to extend this functionality, e.g. if there are new types of links that they create.

The schema would basically be the same as for pagelinks, but with an additional 'linktype' column. This column would need to be defined such that extensions can add new types, which probably means VARCHAR.

With regards to point #4, I have a use-case for this. My WikiDB extension [2] adds database functionality to the wiki, such as tables (defined in a table namespace), <data> tags (which define the data that goes in them) and <repeat> tags (which are, effectively, queries). When viewing 'what links here' for the table definition, users would expect to see included any pages which display the table's data (via a <repeat> tag). The way to implement this in the extension would be to add the relevant entries into this new 'internallinks' table with a linktype of 'wikidb_data'. This, plus the relevant i18n messages (keyed to the linktype), would mean that the Special:Whatlinkshere page could not only show these links and mark them as 'data', but (probably with no extra work) the option to show/hide this type of link could be included in the toggle links at the top.

I think this is probably a technically simple change, with the challenge being around the migration path for existing wikis (especially the big ones!).

Note that I haven't included the various other link tables in this list (e.g. categorylinks). There is some overlap, but they seem to have different data requirements and use-cases, so are probably not candidates for merging.


[1] Theoretically at least. I am aware that the implications of migrating e.g. en.wikipedia might be prohibitive to actually making this change, but I hope it can be discussed a bit more thoughtfully rather than simply being shut down on those grounds.
[2] http://www.kennel17.co.uk/testwiki/WikiDB

Event Timeline

EddieGP subscribed.

Before we go further into this I think some feedback with "Yes, this would be senseful, go ahead and discuss" or "No, the new table would be much too large to maintain, decline this" would be helpful. My guess is that this won't scale for large wikis (like WMF ones), as it is not only the migration to this new schema which takes up resources but also changes to the schema of that (then much bigger) table later on that would become much more difficult (an issue we already have on some tables and try to remove afaik). I don't know how big the tables are we're talking about and if merging those would create a problem, DBA should know best about if this is possible without scalability problems.

You added DBA, which means you want our perspective from the point of view of being in charge of a site with high scalability and performance requirements (e.g. enwiki, commons, wikidata). I have absolutely no saying on the functionality you mention for other uses or scenarios (I am not a developer, and I do not know mediawiki that well). So bear in mind when I give you my personal opinion on the topic (but I think that is what you want):

The current *links tables are horrible for performance in the current and future state of large wikis (see slides below for examples), but if we had to do something about them, we would make them smaller and more numerous (not the other way around). In the current state, the tables can take hundreds of GB in monolitical tables, which makes them really really painful to read and write, but also to backup and reimport. Creating a megatable would make the problem worse and I would definitely oppose to that without any doubt. I have rised my concern about the other mega-table (revision) and it is going to be split into smaller tables to not completely solve the problem, but at least delay it for several years.

The biggest problems with the *links tables are that they store titles in a very inefficient way. I have proposed several times that with a small change (not a large reachitecture), we could solve the issues with them, which is create on core a titles table (for both existing pages and potential/deleted pages in an append only table), and normalize not only page, but all *links tables. That would reduce the storage per wiki hugely, and make queries significalntly faster than repeating the same title in string form several times. I expresed that and other potential solutions to that starting at page 13

on the last mediawiki developer's conference, but most people were interested on discussing revision (which, makes sense, as it is more "core" and more important at the time).

Now, the idea you introduce is not bad per se, a generic "reference" service, where you can get list of things pointing/using a title would not be a bad idea, but if done on mysql or similar relational tables, we would split it in several tables through sharding (the same way we have external services for our parsercache or external storage, where there is an internal api that just see as key-value store, but they are internally sharded by a certain key). The problem is not that, the problem is converting current range searches into a more efficient key-value model (if you can do that, the storage model is trivial because we can shard or do whatever we want for performance optimization).

Just to say that I agree 100% with Jaime on this. Having another mega table (we already have revision and some others) would make our (DBA) life a lot worse instead of improving it (if such a refactor is going to happen), as operating and maintaining such tables is quite painful by now.
So the first approach of "merging" all of them into one can be very problematic. If we find a way to split it into several tables (with the same structure) as Jaime purposed, that would be a lot better and we can still achieve the goal of making it a bit cleaner for everyone!

This is probably going to be superseeded by decisions at T222224, so I am thinking of declining this.

  1. Simpler, cleaner DB schema.

The current tables are quite simple and straight-forward. I do not see a complexity argument here.

  1. A large amount of the code that references these (e.g. in Special:Whatlinkshere or BacklinkCache.php) is waaay more complicated than it needs to be - this would be mitigated if a single table were used.

There is indeed significant technical debt here. I also agree that a single table (with incrementing primary key) would make pagination much simpler. However, adding a primary key to the current tables would also get us most of the way there.

The cost of managing a query for N tables seems relatively low (happens once, and very rarely changes).

On the other hand, it seems quite expensive to migrate our current schema (compat breaking, extension queries, intermediary stages to be developed during the migration, awareness for third parties, upgrade scripts, the Toolforge community etc.). There is also to be considered the long-term impact of demanding each link table has the same schema, and the long-term cost of these being stored as a single table (such DBAs have already expressed would be undesirable).

  1. It is more future-proof - we can't keep adding a new table each time there is a new link type we want to report on.

I think we totally can. The mere existence of a table is relatively cheap, negligible even. If it gets to a thousand that would start causing some pain points (e.g. documenting them all, in-memory mapping of names starts within PHP.). But we're a long way from that.

  1. It would open the door for extensions to extend this functionality, e.g. if there are new types of links that they create.

I disagree that it would open doors. Extensions can already create their own tables. It would indeed make such endeavour a bit simpler. However, as it stands, there is not even in MediaWiki a generic concept that is shared between and abstracted over all these different kinds of links. Which means even if extensions can use this mega table, it would still be a huge amount of work to integrate it everywhere because MW doesn't make this concept itself extendable (new special pages, new API modules, template propagation, interface messages, purge handling, etc. etc.)

An abstraction for that I think would be very valuable and might interest you instead. Where each implementation then stores its entries, could remain separate.

This is probably going to be superseded by decisions at T222224, so I am thinking of declining this.

Agreed. I'll glad people are interested in this. And there is indeed a lot of technical debt and room for improvement in the link tables architecture. Feel free to continue discussing it here, and on related tickets.

For now I'll decline this as the proposed benefits appear unconvincing to both DBAs and to developers (myself). I believe this change would not enable new possibilities in our movement that aren't already equally possible/impossible today. As such, I hope this won't be too disappointing. If there are indeed new ideas that require a change like this, please do mention that. For example, by referencing this task in another where that new idea is discussed. We can always revisit this :)

  1. Simpler, cleaner DB schema.

The current tables are quite simple and straight-forward. I do not see a complexity argument here.

The complexity refers to the number of different tables doing the same thing. A single link table would, by definition, result in a cleaner schema.

However, whilst I don't agree with a number of your points in the abstract, I can understand the concrete issues relating to migration, backwards-compatibility and database administration. Therefore, I reluctantly accept that the task as stated is unlikely to be something that can be reasonably implemented in MediaWiki without a much stronger use-case, which at the moment I don't think exists.

I have therefore logged bug T228385 to explore alternative solutions to making the WhatLinksHere page extensible, which was my primary motivating factor for this ticket.

The complexity refers to the number of different tables doing the same thing.

The serve the same purpose, but they don't do the same thing.

A single link table would, by definition, result in a cleaner schema.

In general, the opposite tends to be true: a "clean" fully normalized schema will have more tables with a more narrow purpose each. Clean schemas tend to be more complex, simple schemas tend to be less clean, using optional or "polymorphic" fields instead of relationships.

I have therefore logged bug T228385 to explore alternative solutions to making the WhatLinksHere page extensible, which was my primary motivating factor for this ticket.

Ah, I see - the desire is to have a simple way for extensions to add relationships between pages which would "just work" with other parts of MediaWiki. That's an idea worth exploring. And one solution may be to have a generic "page_assoc" table that could serve this purpose - I'd not merge existing tables into that, though. Another approach would be to have a registry for such association tables, and let each extension define their own. That way, thinks like SpecialWhatLinksHere and LinksUpdate and RefreshLinks could include all such tables in their operation automatically.