Page MenuHomePhabricator

Commons database is growing way too fast
Open, Needs TriagePublic

Description

s4 is now the largest section by a wide margin:

s8 (noch eine Kopie).png (810×1 px, 157 KB)

It has basically doubled since 2019 (while many other sections got smaller with the optimizations in place). Many optimizations we have done improved the storage (e.g. the image metadata storage change in late 2021) but with this rate of growth no optimization can prevent major issues in a year or two.

It's just three tables that are growing really fast: categorylinks, templatelinks, externallinks (in total, they are responsible for 800GB). The rest don't seem to be too problematic:

table (Kopie).png (1×2 px, 452 KB)

By a quick look, I think we can do some easy fixes and it should drastically reduce the database growth:

For externallinks:

  • Use interwiki links/pagelinks instead of raw https links.

For templatelinks (Most used templates):

  • Merge some templates that are both heavily used and only used by the same users
  • Use redirect target in heavily used templates. e.g. Template:Location_dec is used 5 million times, using the redirect target removes 5m rows from templatelinks (I can give the list of heavily used redirect templates)
  • Migrate some functionalities to software or parser functions, etc. to avoid having a dedicated template used in basically every page (e.g. https://commons.wikimedia.org/wiki/Template:Dir)

For categorylinks (Most used categories):

Templates and categories added by https://commons.wikimedia.org/wiki/Module:SDC_tracking might be contributing to the issue.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

One new and easy low-hanging fruit:

Remove template styles for Template:Information and cc-by-sa license: https://commons.wikimedia.org/wiki/Module:Information/styles.css and https://commons.wikimedia.org/wiki/Template:Cc-by-sa-layout/styles.css
Move them to common.css instead.
Why?

Yeah, why not https://commons.wikimedia.org/wiki/MediaWiki:Filepage.css ?

Also, what will happen when viewing the file page on another project that has defined its own MediaWiki:Filepage.css?

That’s a good question. As far as I can tell from looking at a random file on wikidata.org and on an external wiki with InstantCommons, both the TemplateStyles and Commons’ MediaWiki:Filepage.css are loaded on other wikis (Filepage code here), whereas its common.css isn’t loaded. So if we want to keep the behavior consistent, I guess MediaWiki:Filepage.css makes more sense. WDYT @Ladsgroup?

I guess one issue would be that {{Information}} can be used outside the File namespace, and should still look okay there. So we wouldn’t want to delete the TemplateStyles, but rather… make Module:Information only add them outside the File namespace, and manually keep the two versions of the styles (FilePage.css, /styles.css) in sync?

Maybe it would be worth to modify the parser so that for specific templates it can disable dependency tracking, and then apply that to Template:Information? It's used on basically every single file page, there is not much value in adding its internals to the tracking tables.

Hm, maybe… I think we could do that somewhere in LinksTable::update(), drop the link when writing to the table (probably simpler than changing the tracking in both parsers), perhaps depending on the tl_from_namespace. And then have a version of RefreshLinksJob that refreshes all pages in a namespace rather than all pages with a certain template, so that edits to Template:Information still update all file pages?

One new and easy low-hanging fruit:

Remove template styles for Template:Information and cc-by-sa license: https://commons.wikimedia.org/wiki/Module:Information/styles.css and https://commons.wikimedia.org/wiki/Template:Cc-by-sa-layout/styles.css
Move them to common.css instead.
Why?

Yeah, why not https://commons.wikimedia.org/wiki/MediaWiki:Filepage.css ?

Small wikis tend to copy-paste templates from larger wikis. If the templates on larger wikis use TemplateStyles, their copies will display error messages, reminding the user to import the stylesheets as well. If they use Common.css or Filepage.css, the importing users will have no idea why they look broken, so they will remain broken.

Maybe it would be worth to modify the parser so that for specific templates it can disable dependency tracking, and then apply that to Template:Information? It's used on basically every single file page, there is not much value in adding its internals to the tracking tables.

The stress is on basically. While almost all files use it, there are some that don’t, and those need attention. Currently there’s a manually (bot-) maintained complement category of it at https://commons.wikimedia.org/wiki/Category:Media_missing_infobox_template. If the parser was modified to create database rows for pages that lack the template rather than those that have it, that would be fine. (Although this inversion should probably happen in file namespace only, i.e. write DB rows for NS_FILE pages that lack it and pages in other namespaces that have it.)

One new and easy low-hanging fruit:

Remove template styles for Template:Information and cc-by-sa license: https://commons.wikimedia.org/wiki/Module:Information/styles.css and https://commons.wikimedia.org/wiki/Template:Cc-by-sa-layout/styles.css
Move them to common.css instead.
Why?

Yeah, why not https://commons.wikimedia.org/wiki/MediaWiki:Filepage.css ?

Small wikis tend to copy-paste templates from larger wikis. If the templates on larger wikis use TemplateStyles, their copies will display error messages, reminding the user to import the stylesheets as well. If they use Common.css or Filepage.css, the importing users will have no idea why they look broken, so they will remain broken.

I don't think it's reasonable to keep >100M rows in the database so small wikis have it easier in copy-pasting. FWIW, copying from Common.css is one of the very first things a small wiki admin do as it also includes css for infobox and other important pieces.

On whether it should be in Common.css or Filepage.css, It's fine either way IMO. The pros and cons of either of the choices are negligible.

I actually missed this:

That’s a good question. As far as I can tell from looking at a random file on wikidata.org and on an external wiki with InstantCommons, both the TemplateStyles and Commons’ MediaWiki:Filepage.css are loaded on other wikis (Filepage code here), whereas its common.css isn’t loaded. So if we want to keep the behavior consistent, I guess MediaWiki:Filepage.css makes more sense. WDYT @Ladsgroup?

Than yes, please move it to Filepage.css

Would it make sense to code MediaWiki to detect if a template has too many templatelinks somewhere (greater than $wgMaxTemplateLinks = 500,000 or something), and if so delete all templatelinks for that template and keep them deleted until it falls back below 500,000? Maybe a count field could be introduced somewhere to do the tracking, that stays incremented and decremented even after the templatelinks themselves are deleted. The count field could also be used to display a warning to the user on Special:WhatLinksHere and relevant pages that says "Too many transclusions. Unable to display." or something.

Move them to common.css instead.

A more permanent solution will be T200687: MCR support in TemplateStyles. Although after this copying (not import) templates between wikis will be no longer straightforward.

A solution to template (especially indirectly used ones): You can move them to MediaWiki namespace and call them via {{int:xxx}}. Note

FWIW, copying from Common.css is one of the very first things a small wiki admin do as it also includes css for infobox and other important pieces.

First, it’s not enwiki and not Common.css, so copying enwiki’s Common.css won’t import these styles. {{Information}} is probably imported once local uploads appear, not as one of the very first things. Maybe not even by an interface administrator, so even if they figure out that they should copy over Filepage.css from Commons, they won’t be able to do so.

Second, apparently not. A few true positives from https://global-search.toolforge.org/?q=fileinfotpl-type-information&regex=1:

The header cells have no blue background. The tables themselves used to have a gray background, but no longer have since the wikis have been switched to Vector 2022 (except for cawikisource, which is still Vector 2010).

A solution to template (especially indirectly used ones): You can move them to MediaWiki namespace and call them via {{int:xxx}}. Note

There are two more gotchas:

  • MediaWiki namespace pages are implicitly fully protected, and this protection cannot be lifted. This is probably not an issue for such basic templates like {{dir}}, but something to be considered.
  • Caching. On the one hand, due to the lack of tracking, pages using the template cannot be reparsed when the template is edited. On the other hand, using {{int:}} splits the parser cache by UI language, even if no language-specific subpage of that MediaWiki-namespace page exists. This may lead to a significant increase in parser cache storage usage.

MediaWiki namespace pages are implicitly fully protected

If there are consensus, we can give Commons templateeditors editinterface right, so that they can modify non-CSS/JS pages in MediaWiki namespace (other than those explicitly fully protected). Also see next point.

pages using the template cannot be reparsed when the template is edited

There are multiple types of Commons template, for example:

{{int:}} splits the parser cache by UI language

If you mean parser cache of file description pages, most of them are already split since they contains {{int:filedesc}}.

Some more specific templates (more to add):

I actually missed this:

That’s a good question. As far as I can tell from looking at a random file on wikidata.org and on an external wiki with InstantCommons, both the TemplateStyles and Commons’ MediaWiki:Filepage.css are loaded on other wikis (Filepage code here), whereas its common.css isn’t loaded. So if we want to keep the behavior consistent, I guess MediaWiki:Filepage.css makes more sense. WDYT @Ladsgroup?

Than yes, please move it to Filepage.css

I specced out this proposal in some more detail here: https://commons.wikimedia.org/wiki/Module_talk:Information#Proposal:_Move_template_styles_to_site-wide_CSS

There are templates like https://commons.wikimedia.org/wiki/Template:Geograph_from_structured_data that used structured data but used many templates inside, so a file with only such template used more than 40 templates and modules. They should be rewritten using one module.

Change 991921 had a related patch set uploaded (by Lucas Werkmeister; author: Lucas Werkmeister):

[mediawiki/extensions/WikimediaMessages@master] Use interwiki to link to Creative Commons

https://gerrit.wikimedia.org/r/991921

Regarding externallinks. Here are most linked domains: P49955

There are 135M links to creativecommons.org (I guess licenses). If they change to interwiki link (CreativeCommons exists in interwiki map maybe we should introduce cc). That would remove potentially one quarter of the whole extlinks. They are coming from Template:Cc-pd-mark-footer and stuff like MediaWiki:Wm-license-cc-pd-mark-link which are not too easy to fix but not too complicated either.

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/991921 should fix those, hopefully. (As far as I can tell, the “links” inside the templates themselves are only there for machine readable data, and are <nowiki>’d to not be actual links; the visible, clickable links and externallinks rows come from the interface messages.)

In T343131#9137830, @Sj wrote:

If switching from https to interwiki links gets a ~4:1 compression ratio; could both these and template links be further encoded down to a few-character redirect? maybe a companion table (or an R: namespace in WD ;)? Trading off space for query rewriting

It's a bit more complicated. e.g. for a link to wikidata.org/wiki/Q123, you get: https://org.wikidata. for el_to_domain_index and /wiki/Q123 for path. which are 21 and 10 characters respectively. But for interwiki links you will hold d (d is an already an interwiki shortcut for wikidata) for iwl_prefix and Q123 for iwl_title which are 1 and four chars, an eight times redaction. But you also need to take indexing into account.

I did a test for CC interwikis in particular (see previous comment) – I had a script (P55088) generate 1 million rows for links to https://creativecommons.org/licenses/by-sa/4.0/deed.en and its interwiki equivalent, then optimize the tables for good measure. The result:

$ sudo du -h /var/lib/mysql/T343131/{externallinks,iwlinks}.ibd
197M    /var/lib/mysql/T343131/externallinks.ibd
173M    /var/lib/mysql/T343131/iwlinks.ibd
$ sudo du -b /var/lib/mysql/T343131/{externallinks,iwlinks}.ibd
205520896       /var/lib/mysql/T343131/externallinks.ibd
180355072       /var/lib/mysql/T343131/iwlinks.ibd
$ mariadb --version
mariadb from 11.2.2-MariaDB, client 15.2 for Linux (x86_64) using readline 5.1

So in this case, iwlinks doesn’t save as much (probably in part because “creativecommons” is quite a long prefix – T343131#9061982 suggested introducing “cc”, though to me that looks like it could be confused with a language code :/), but it’s still an improvement. And in any case, Amir also pointed out that two medium sized tables are easier to maintain that one giant table.

So in this case, iwlinks doesn’t save as much (probably in part because “creativecommons” is quite a long prefix – T343131#9061982 suggested introducing “cc”, though to me that looks like it could be confused with a language code :/)

I suggest "cc-license" as a prefix. It's 5 characters shorter and IMO more readable.

but it’s still an improvement. And in any case, Amir also pointed out that two medium sized tables are easier to maintain that one giant table.

It can be even further improved, the indexes on iwlinks seems to be a bit of overkill. An obvious one is this:

INDEX iwl_prefix_from_title (iwl_prefix, iwl_from, iwl_title),

iwl_title at the end is not really needed, with iwl_prefix + iwl_from you get good enough row selection not to worry about the title. Of course, we should double check it before applying the change to production but generally speaking, I don't think that's useful.

Some more specific templates (more to add):

I am not a big fan of substituting complicated templates. Many templates used mostly by other templates are a bit like functions in in programing languages: they make code more readable and maintainable, and if an error is found or update is needed we can make a single change instead of tracking down every single use. But I am ok with substituting templates that are simple wrappers around lua modules or short templates

There was also question about Infobox_template_tag and License template tag: those are used for database queries or petscan queries for files missing those tags. I do not know if we could search the SQL database for files missing HASLISCENSETAG in file description. I suspect such query would timeout even for small groups of files.

https://commons.wikimedia.org/wiki/Template:Nowrap: a very very stable template. Proposal is use a bot to substitute all usages.

That will replace

{{nowrap|Foo bar}}

with

<span style="white-space:nowrap">Foo bar</span>

in wiki markup. This is far less user-friendly. New editors learning wiki markup have enough to concern themselves with, without having to also learn HTML and CSS.

Is this change really necessary?

I understand that use inside other templates is a different issue.

For License template tag, also note files without license tags are automatically added to https://commons.wikimedia.org/wiki/Category:Files_with_no_machine-readable_license by Extension:CommonsMetadata.

Similarly files without information template are in https://commons.wikimedia.org/wiki/Category:Files_with_no_machine-readable_source.

https://commons.wikimedia.org/wiki/Template:Nowrap: a very very stable template. Proposal is use a bot to substitute all usages.

That will replace

{{nowrap|Foo bar}}

with

<span style="white-space:nowrap">Foo bar</span>

in wiki markup. This is far less user-friendly. New editors learning wiki markup have enough to concern themselves with, without having to also learn HTML and CSS.

Is this change really necessary?

I understand that use inside other templates is a different issue.

Note VisualEditor can edit the second one, but not the first one (T52355: VisualEditor: Add support for editing templates' parameters as DOM elements ("visually"), including supporting nested templates).

For License template tag, also note files without license tags are automatically added to https://commons.wikimedia.org/wiki/Category:Files_with_no_machine-readable_license by Extension:CommonsMetadata.

Similarly files without information template are in https://commons.wikimedia.org/wiki/Category:Files_with_no_machine-readable_source.

Yes that extension does seem to overlap a lot with the purpose of License template tag. It is quite possible that this tag is no longer needed. I created it, before Extension:CommonsMetadata existed but I have not used it in last several years. However I do not know if others use it or not, since once you create a tool to accomplish something it is unclear if anybody uses it or not. I proposed to retire this template.

Some more specific templates (more to add):

I don’t like putting random {{#invoke:}} calls in templates. Wrapper templates usually have better documentation, they better integrate with tools like TemplateWizard and easier to handle to someone who doesn’t know Lua.

If I see {{#time:Y|now -95 years}}, I have no idea what it means. If I see {{Not-PD-US-expired-min-year}}, I know what it means. And the backlinks in non-file namespaces (mainly template and Commons namespaces) help updating relevant pages when we reach another row in https://commons.wikimedia.org/wiki/Commons:Hirtle_chart.

Some more specific templates (more to add): [...]

Those need an function that accepts arguments, not frame. For example, https://commons.wikimedia.org/wiki/Module:SDC_tracking needs "function p._SDC_statement_exist(args)", "function p.SDC_statement_exist(frame)" will not work.

[...]
I don’t like putting random {{#invoke:}} calls in templates. Wrapper templates usually have better documentation, they better integrate with tools like TemplateWizard and easier to handle to someone who doesn’t know Lua.

If you open up the "Edit template data" button above the editor in an template that only has an module wrapper, you will not have the option to let it import the arguments automatically. If you do the same with an wikicode template, that option will be there. (assuming neither templates have templdatedata) (not a bug request, btw.) Also, you can put templatedata on the modules doc subpage and it will work. Obviously any other documentation can go to the module doc subpage too. Keep in mind that keeping the wrapper is at the cost of millions of transclusions, that cost is just too great.

There are templates like https://commons.wikimedia.org/wiki/Template:Geograph_from_structured_data that used structured data but used many templates inside, so a file with only such template used more than 40 templates and modules. They should be rewritten using one module.

As one of 2 authors of that template, yes I totally agree that it would be better style to write this as a single lua code and I wrote similar lua modules, like Module:Neuchâtel fossil cast. If I was starting this project now I would have done it in the cleaner style. however, such rewrite, which does not add any new capabilities, but should not break any of 4.2 millions affected pages, is more in would-be-nice-to-have, than must-have, or unbreak-now category, and it might be hard to find volunteers to do it.

Note VisualEditor can edit the second one, but not the first one

Yes, but that's a fault in VE, which the template (and templates in general) predates by years. My point stands.

Those need an function that accepts arguments, not frame. For example, https://commons.wikimedia.org/wiki/Module:SDC_tracking needs "function p._SDC_statement_exist(args)", "function p.SDC_statement_exist(frame)" will not work.

In lua code you use "frame" to pass arguments from wikitext to lua and regular function arguments to pass data between lua functions. If we want to call SDC_tracking from other lua codes we would add "function p._SDC_statement_exist(args)" function to the module to allow that; however here we are talking about calling {{#invoke:SDC_tracking|SDC_statement_exist|...}} from templates instead of {{SDC_statement_exist|...}}, and that can be done with the code as is. The only purpose of Template:SDC_statement_exist is to provide an page with clear interface documentation for people that often do not know much about Lua. However as this template if mostly meant to be used by other templates we can probably assume that those users are familiar with both.

There are multiple types of Commons template, for example:

At some point Wikimedia developers created https://translatewiki.net/ and most of the localizations were moved from commons MediaWiki namespace to translatewiki, with some unknown mechanism linking MediaWiki namespace to translatewiki. Unfortunately after the big move was done there was no documentation on how to create new translatewiki pages and link to them from Commons's MediaWiki namespace or initialize transitions of additional pages. I wrote Template:PD-old-text in 2015. At a time I was trying to add the translations to translatewiki but never figured out how. If someone knows how please do it. There are many more translation templates which could be replaced with links to MediaWiki namespace and translatewiki.

There are multiple types of Commons template, for example:

At some point Wikimedia developers created https://translatewiki.net/ and most of the localizations were moved from commons MediaWiki namespace to translatewiki, with some unknown mechanism linking MediaWiki namespace to translatewiki. Unfortunately after the big move was done there was no documentation on how to create new translatewiki pages and link to them from Commons's MediaWiki namespace or initialize transitions of additional pages. I wrote Template:PD-old-text in 2015. At a time I was trying to add the translations to translatewiki but never figured out how. If someone knows how please do it. There are many more translation templates which could be replaced with links to MediaWiki namespace and translatewiki.

There are multiple options:

  • Simply move the template and localizations to MediaWiki namespace and do nothing more. You can now use int: to call the template in MediaWiki namespace and a translation based on user interface language will be automatically chosen. The translation can only be managed by admins or other users with editinterface right.
  • Similar to above but mark the page in MediaWiki namespace for translation. It can now be translated by any users (even without editinterface).
  • If you want to make it translatable in translatewiki, you need to add a message to WikimediaMessage extension.

Some more specific templates (more to add):

I don’t like putting random {{#invoke:}} calls in templates. Wrapper templates usually have better documentation, they better integrate with tools like TemplateWizard and easier to handle to someone who doesn’t know Lua.

If I see {{#time:Y|now -95 years}}, I have no idea what it means. If I see {{Not-PD-US-expired-min-year}}, I know what it means. And the backlinks in non-file namespaces (mainly template and Commons namespaces) help updating relevant pages when we reach another row in https://commons.wikimedia.org/wiki/Commons:Hirtle_chart.

This can be achieved by hidden comment such as <!--Not-PD-US-expired-min-year-->.

Something to consider in the trade-off between readability and performance is that for small usages or templates with 100 transclusions, it's fine to be as readable as possible but for a template with 10M transclusions, performance takes precedence over readability (and it's not editable by non-admins anyway). You can reduce the issue by using comments as mentioned above.

If there are many usages added by a bot (mass-uploads, etc.), that could turn into a template which adds a templatelinks row but you could turn the net change as negative by reducing template calls.

If I see {{#time:Y|now -95 years}}, I have no idea what it means. If I see {{Not-PD-US-expired-min-year}}, I know what it means. And the backlinks in non-file namespaces (mainly template and Commons namespaces) help updating relevant pages when we reach another row in https://commons.wikimedia.org/wiki/Commons:Hirtle_chart.

This can be achieved by hidden comment such as <!--Not-PD-US-expired-min-year-->.

I replaced Not-PD-US-expired-min-year with {{#time:Y|now -95 years}} in 2 templates, which I assume will have the most transclusions. I guess in a couple months we will see if transclusion level drops from the current 9.1M

There are multiple options:

  • Simply move the template and localizations to MediaWiki namespace and do nothing more. You can now use int: to call the template in MediaWiki namespace and a translation based on user interface language will be automatically chosen. The translation can only be managed by admins or other users with editinterface right.

In 2010 several hundred messages were moved from MediaWiki namespace to translatewiki, because of how hard it was for the translators to work with MediaWiki namespace. I also do not know how to "simply" move 54 translations from Template:PD-old-text , another 50+ from Template:PD-US-expired-text and probably bunch of other similar templates. The manual approach seems quite laborious and potentially error-prone.

  • Similar to above but mark the page in MediaWiki namespace for translation. It can now be translated by any users (even without editinterface).

I have never run into any pages in MediaWiki namespace managed by Translate extension, are there any pages using this feature I can look at?

  • If you want to make it translatable in translatewiki, you need to add a message to WikimediaMessage extension.

I do not know what that means . File a ticket at https://phabricator.wikimedia.org/tag/wikimediamessages/ or something. If anybody is familiar with this process please contact me.

If you open up the "Edit template data" button above the editor in an template that only has an module wrapper, you will not have the option to let it import the arguments automatically. If you do the same with an wikicode template, that option will be there. (assuming neither templates have templdatedata) (not a bug request, btw.)

And how does this come here? The question is not whether the template is implemented in wikitext or Lua, but whether the Lua implementation is called directly or through a wrapper template. Yes, it may not be possible to generate TemplateData for a wrapper template automatically. But it’s always possible to generate it manually.

Also, you can put templatedata on the modules doc subpage and it will work.

It depends on what you call “work”. It will be rendered without errors, and will be available through the API, and TemplateWizard will happily use it to generate {{Module:…}} transclusions. The only problem is that those transclusions won’t work, as it’s not how modules should be transcluded.

This can be achieved by hidden comment such as <!--Not-PD-US-expired-min-year-->.

Until the comment is removed or gets moved away. This is what clean code is for. On the other hand, this may be a good candidate for a MediaWiki message – we could solve both the problem of too many template links and T244663: Replace hard coded 1923 logic in UploadWizard with more than 95 years ago with the same message.

Something to consider in the trade-off between readability and performance is that for small usages or templates with 100 transclusions, it's fine to be as readable as possible but for a template with 10M transclusions, performance takes precedence over readability (and it's not editable by non-admins anyway). You can reduce the issue by using comments as mentioned above.

The problem is that this is not the first time in the past week that performance took precedence over wiki editors’ needs. Just yesterday you confirmed that you’ll practically break the linkscount gadget temporarily, now you want to break the readability of template source code permanently.

(and it's not editable by non-admins anyway)

Most of the high-used wrapper templates are not editable by non-admins. But we’re talking about the readability of templates using these wrappers, not the wrappers themselves. Out of the first 500 templates directly transcluding {{Description}}, only 77 are protected – 3 fully protected, 68 template editor protected, 6 semi-protected. (There are some false positives due to the regex finding both {{description (template) and {{{description (parameter), but most results are real.) Other templates probably have similar protection percentages.

I also do not know how to "simply" move 54 translations from Template:PD-old-text , another 50+ from Template:PD-US-expired-text and probably bunch of other similar templates. The manual approach seems quite laborious and potentially error-prone.

Translatewiki is also a wiki, it can be edited using bots (even though using bots is uncommon there). Or they can be imported together with the English text through Gerrit.

I do not know what that means . File a ticket at https://phabricator.wikimedia.org/tag/wikimediamessages/ or something. If anybody is familiar with this process please contact me.

Yes, license translations are handled in WikimediaMessages. If you’re unfamiliar with the process, creating a task is probably the best solution. (The other solution would be directly sending a patch to Gerrit.)


To get rid of a few tens of millions rows, I started working on T224810: Magic word to embed the translation language outside of translation units, which would allow replacing the hack Module:Caller title (22,415,811 transclusions) with an easier-to-understand and more performant solution that doesn’t involve template transclusion.

On finding templates that could be merged together. Since mediawiki doesn't record dependency graph of template usages (to figure out for example, template A is almost exclusively called by template B and they are transcluded in 100M pages so we can merge them and reduce a lot of rows from templatelinks) so we have to get creative and sacrifice a bit of accuracy. I wrote a script to find intersection of template transclusions (on heavily called templates, more than 10M pages). I split them into two groups.

First group is more of subgroup/supergroup, if the size of intersection is >95% of the size of the smaller set (meaning a child template could be merged to the parent). The second group is set equality, the size of intersection is >95% of size of both sets (meaning two templates used almost at the same time all the time and probably should be merged together).

Of course, this is not perfect and makes suggestions that are not correct (e.g. suggests merging Module:Information and Template:Information which technically is true but you know, not possible). But maybe a couple of them can be useful?

First set (potential child-parent):

Template 1Transclusion count of template 1Template 2Transclusion count of template 2Size of intersection
828:ISOdate106,000,637828:DateI18n106,206,960106,141,711
10:Dir119,472,929828:DateI18n106,206,960104,091,148
10:Dir119,472,929828:ISOdate106,000,637104,057,010
828:DateI18n106,206,960828:Core106,155,612103,993,662
828:ISOdate106,000,637828:Core106,155,612103,930,099
10:Dir119,472,929828:Core106,155,612103,797,643
10:License_template_tag102,569,71810:Dir119,472,929102,697,538
10:License_template_tag102,569,718828:Core106,155,612102,373,359
10:Infobox_template_tag102,219,623828:DateI18n106,206,960102,346,298
10:Infobox_template_tag102,219,623828:ISOdate106,000,637102,346,298
10:Infobox_template_tag102,219,623828:Core106,155,612102,345,774
10:Infobox_template_tag102,219,62310:Dir119,472,929102,340,207
10:License_template_tag102,569,718828:DateI18n106,206,960102,339,449
10:License_template_tag102,569,718828:ISOdate106,000,637102,338,503
10:Infobox_template_tag102,219,62310:License_template_tag102,569,718102,325,693
828:DateI18n106,206,960828:Information/styles.css85,410,42385,577,540
828:Core106,155,612828:Information/styles.css85,410,42385,577,540
828:ISOdate106,000,637828:Information/styles.css85,410,42385,577,540
10:Infobox_template_tag102,219,623828:Information/styles.css85,410,42385,577,538
10:Dir119,472,929828:Information/styles.css85,410,42385,570,874
10:License_template_tag102,569,718828:Information/styles.css85,410,42385,556,393
828:Information85,074,443828:DateI18n106,206,96085,151,461
828:Information85,074,443828:Core106,155,61285,151,461
828:Information85,074,443828:ISOdate106,000,63785,151,461
10:Infobox_template_tag102,219,623828:Information85,074,44385,151,447
10:Dir119,472,929828:Information85,074,44385,150,526
10:License_template_tag102,569,718828:Information85,074,44385,149,738
10:Information84,836,880828:DateI18n106,206,96084,913,581
10:Information84,836,880828:Core106,155,61284,913,581
10:Information84,836,880828:ISOdate106,000,63784,913,581

Second set (potential sibling/merger)

Template 1Transclusion count of template 1Template 2Transclusion count of template 2Size of intersection
828:ISOdate106,000,637828:DateI18n106,206,960106,141,711
828:DateI18n106,206,960828:Core106,155,612103,993,662
828:ISOdate106,000,637828:Core106,155,612103,930,099
10:License_template_tag102,569,718828:Core106,155,612102,373,359
10:Infobox_template_tag102,219,623828:DateI18n106,206,960102,346,298
10:Infobox_template_tag102,219,623828:ISOdate106,000,637102,346,298
10:Infobox_template_tag102,219,623828:Core106,155,612102,345,774
10:License_template_tag102,569,718828:DateI18n106,206,960102,339,449
10:License_template_tag102,569,718828:ISOdate106,000,637102,338,503
10:Infobox_template_tag102,219,62310:License_template_tag102,569,718102,325,693
10:Information84,836,880828:Information85,074,44384,913,581
10:SDC_statement_exist74,783,132828:SDC_tracking79,969,84174,485,361
10:SDC_statement_has_value73,238,850828:SDC_tracking79,969,84173,309,424
10:Autotranslate68,536,728828:Autotranslate71,302,22168,637,872
828:Description63,233,74310:Description67,344,36963,310,219
828:Coordinates48,407,895828:I18n/coordinates48,407,92548,490,147
10:Self36,611,44510:Self/is-pd-expired36,611,44836,634,602
828:Complex_date24,850,257828:I18n/complex_date24,850,26324,918,341
828:Authority_control19,982,815828:Authority_control/conf19,982,82020,035,618
828:Wikidata_date19,981,221828:Calendar19,981,23120,034,024
828:Authority_control/conf19,982,820828:City20,906,78720,031,955
828:Authority_control19,982,815828:City20,906,78720,031,955
828:Calendar19,981,231828:City20,906,78720,031,908
828:Calendar19,981,231828:Authority_control/conf19,982,82020,031,908
828:Calendar19,981,231828:Authority_control19,982,81520,031,908
828:Wikidata_date19,981,221828:City20,906,78720,031,908
828:Wikidata_date19,981,221828:Authority_control/conf19,982,82020,031,907
828:Wikidata_date19,981,221828:Authority_control19,982,81520,031,907
828:Message_box19,419,097828:Yesno20,086,59319,471,999
828:City20,906,787828:Linguistic20,235,36419,307,304
828:Authority_control/conf19,982,820828:Linguistic20,235,36419,304,483
828:Authority_control19,982,815828:Linguistic20,235,36419,304,483
828:Calendar19,981,231828:Linguistic20,235,36419,304,430
828:Wikidata_date19,981,221828:Linguistic20,235,36419,304,429
828:Linguistic20,235,364828:NationAndOccupation/nationalityLUT19,230,88419,283,529
828:Linguistic20,235,364828:NationAndOccupation/CountryAdjective2iso19,230,88419,283,529
828:NationAndOccupation/nationalityLUT19,230,884828:NationAndOccupation/CountryAdjective2iso19,230,88419,283,529
828:NationAndOccupation19,230,883828:Linguistic20,235,36419,283,528
828:NationAndOccupation19,230,883828:NationAndOccupation/nationalityLUT19,230,88419,283,528
828:NationAndOccupation19,230,883828:NationAndOccupation/CountryAdjective2iso19,230,88419,283,528
828:Name19,229,141828:I18n/name19,229,14119,281,785
828:Name19,229,141828:Linguistic20,235,36419,280,963
828:I18n/name19,229,141828:Linguistic20,235,36419,280,963
828:Name19,229,141828:City20,906,78719,280,766
828:I18n/name19,229,141828:City20,906,78719,280,766
828:Name19,229,141828:NationAndOccupation/nationalityLUT19,230,88419,280,759
828:Name19,229,141828:NationAndOccupation/CountryAdjective2iso19,230,88419,280,759
828:Name19,229,141828:NationAndOccupation19,230,88319,280,759
828:I18n/name19,229,141828:NationAndOccupation/nationalityLUT19,230,88419,280,759
828:I18n/name19,229,141828:NationAndOccupation/CountryAdjective2iso19,230,88419,280,759

Hope that'd be useful. Not that numbers can be weird (e.g. intersection being bigger than one of the sets) and that's fine, they are from different sources and some level of inaccuracy is baked in.

Some nerdy details on how to reproduce this (not related to the ticket itself): It was P55507 running stat1005 machine. It takes a LOT of memory for large wikis, peaked at 250GB but thankfully that host had 512GB memory. I wrote a version with bloom filter which sacrifices a bit of accuracy in favor of less memory consumption for running in case stat machines memory wouldn't be enough: P55508. But it's really really slow. Took a day for my home wiki which isn't that big.

I have never run into any pages in MediaWiki namespace managed by Translate extension, are there any pages using this feature I can look at?

Commons previously have pages in MediaWiki namespace marked for translation (such as https://commons.wikimedia.org/wiki/MediaWiki:Linkshere, which was able to be translated by non-admins), but no longer currently. See also T214741.
Wikidata currently has such page.

On finding templates that could be merged together ...

Landsgroup, Thank you for your table. I was just about to ask about something similar but was still thinking about what is needed. It just happen that I wrote or maintain most templates and modules in the first table and most in the second.

Looking at the first table, I think most overlaps are accidental, since every file uploded with Upload wizard has some common templates and modules (see for example today's upload):

  • Template:Information -> Module:Information + Module:Information/styles.css - default infobox
  • Template:Infobox template tag - do-nothing template used to find files with no infobox (like files missing Template:Information or with broken infobox). Such do-nothing tag templates were introduced in 2011 to replace maintenance categories used previously for similar tracking, as less visible and to reduce size of database table tracking categories. Perhaps some other way of tagging and searching for files missing this or that tag can be devised.
  • Template:Dir - used by most Commons templates to figure out language direction of the user
  • Module:DateI18n - internationalization of the date string
  • Module:ISOdate - parsing of the date string (usually YYYY-MM-DD but also few other supported formats)
  • Template:SDC_statement_exist, Template:SDC_statement_has_value -> Module:SDC_tracking - adds maintenance categories to find files missing SDC statements
  • Module:Core - collection of stripped down versions of most used lua functions used by Module:Information and many other Lua modules
  • Template:License template tag - do-nothing template used by bots to find files missing license. Mostly replicated by Extension:CommonsMetadata but still prefered as it gives more reliable results.
  • Template:En -> Template:Description -> Module:Description - mark language of a text

Some pairs definitely have child-parent or sibling relationship:

  • Template:En -> Template:Description -> Module:Description chain can be reduced to Template:En -> Module:Description, for a few high use templates.
  • Template:License template tag retire and use Extension:CommonsMetadata - started discussion about it. The difference between files found using the missing tag template vs. extension might be due to badly formatted license templates that need to be fixed.
  • Module:Information/styles.css is not a lua code but css file used by Extension:TemplateStyles - I am not sure why is it in the transclusion table
  • Module:DateI18n and Module:ISOdate - those 2 mostly go together as ISOdate calls DateI18n. Originally when I wrote them, they were in a single module but I split them in 2014, to allow better unit testing and reduce number of affected files that need to be updated each time I change DateI18n which is also used by other templates/modules. This code is used in few hundreds other projects.
  • Above 2 and Module:Wikidata_date, Module:Calendar, Module:Complex_date - are all codes related to dates. They call each other but they are quite separate.
  • Template:Infobox_template_tag and Module:Information - the tag could be removed from Module:Information and searches for files with missing/broken infobox can be done by a search for files without the tag or Module:Information
  • Template:Self and Template:Self/is-pd-expired - Template:Self is one of the high use templates not written in Lua yet and it calls many other templates. Template:Self/is-pd-expired for example is called 5 times. It can be optimized by a rewrite which will add Module:Self dependency.
  • use of Module:city by other modules is and anachronism and can be probably eliminated. I will look into it.

There are some inefficiencies with most modules being called from wikitext through template shell that does not do anything other than provide convenient place for documentation. Those are minimized when lua codes call each other directly. Also many Commons modules keep translations in a separate "module:I18n/..." to clearly separate the code from the translations. Some of those translations can be moved to translatewiki if there was some easy way to do it.
Most high-use lua codes try to be small and modular with minimum number of dependencies. Some submodules can be moved from always-load to load-as-needed schema, but most are always needed. For years most proposals to make Module:Information or other high-use modules more complicated were rejected based on efficiency arguments, to avoid loading bunch of rarely used code for each file, in order to add some feature or check affecting small percentage of files. We were not thinking about database efficiency, but how much code is loaded for each file, but those 2 optimization criteria go together. Some modules like Message_box, YesNo, etc. are transplants from other wikis maintained by different groups of people and unlikely to be combined. Many of the above modules are transplanted to great many other wikis and changes to Commons might affect them too.

Some ways to minimize amount of templates/modules loaded:

e.g. One case I see is that merging https://commons.wikimedia.org/wiki/Template:Self/is-pd-expired into https://commons.wikimedia.org/wiki/Template:Self can remove 36M rows from templatelinks.

This would, again, hurt readability a lot. However, merging https://commons.wikimedia.org/wiki/Template:Self/is-pd-expired, https://commons.wikimedia.org/wiki/Template:License_migration_is_redundant and https://commons.wikimedia.org/wiki/Template:License_migration_is_redundant_multiple in one Lua module could save some 8.5M links (assuming all pages using {{License migration is redundant}} also use {{License migration is redundant multiple}} and all pages using {{License migration is redundant multiple}} also use {{Self/is-pd-expired}}) while not hurting readability if not even improving it.

Another way to reduce size of transclude table used in the past was to run a bot job of substituting template redirects. Does anybody have a good SQL query to see if we have any templates with often used redirects? I run many such bot jobs in the past but not in the last decade or so.

Another way to reduce size of transclude table used in the past was to run a bot job of substituting template redirects. Does anybody have a good SQL query to see if we have any templates with often used redirects? I run many such bot jobs in the past but not in the last decade or so.

https://quarry.wmcloud.org/query/12286 First field is the redirect, second field is redirect target, third field is transclusions of redirect.

Another way to reduce size of transclude table used in the past was to run a bot job of substituting template redirects. Does anybody have a good SQL query to see if we have any templates with often used redirects? I run many such bot jobs in the past but not in the last decade or so.

https://quarry.wmcloud.org/query/12286 First field is the redirect, second field is redirect target, third field is transclusions of redirect.

Thanks, so I see that we can eliminate 41M transclusions by replacing all template redirects or 31M by replacing redirects of top 50 templates. Most used redirecting template is Template:Location_dec, which I merged with Template:Location a decade ago. I guess I never finished the cleanup afterwards.

Maybe someone knows how to set up a bot scheduled to run with some frequency finding and replacing names of substituted templates. My approach of AWB task with string replace is not the most efficient way to do it.

I eliminated the use of Template:Description in language templates, like Template:En in favor of calling Module:Description directly. It actually makes much more readable wikicode as Module:Description uses named variables as opposed to numbered variables. That should (eventually) eliminate 66.6M templatelinks entries.

I also started replacing some template redirects, but those savings in DB size come at the cost of millions of file edits. I am using pywikibot on https://hub-paws.wmcloud.org and at current rate of editing, with throttle set to 0, I might be done in 9 months...

Thanks. I don't recommend doing 9m edits to remove 9m rows from templatelinks, it just adds the same number but larger rows to another table. Maybe if it could be batched with something else, sure but not on its own.

But generally, thank you so much for cleaning these up and improve the database stability!

I actually missed this:

That’s a good question. As far as I can tell from looking at a random file on wikidata.org and on an external wiki with InstantCommons, both the TemplateStyles and Commons’ MediaWiki:Filepage.css are loaded on other wikis (Filepage code here), whereas its common.css isn’t loaded. So if we want to keep the behavior consistent, I guess MediaWiki:Filepage.css makes more sense. WDYT @Ladsgroup?

Than yes, please move it to Filepage.css

I specced out this proposal in some more detail here: https://commons.wikimedia.org/wiki/Module_talk:Information#Proposal:_Move_template_styles_to_site-wide_CSS

I just made this change to Module:Information, so most of the 87 million templatelinks to Module:Information/styles.css‏‎ should start to disappear now.

Thanks. I don't recommend doing 9m edits to remove 9m rows from templatelinks, it just adds the same number but larger rows to another table.

Good point. I was trying to create SQL query to find template redirects transcluded in other templates but failed so far. There a single edit will fix great many pages. Can anybody with better SQL skills help?

I also wrote Module:Self which implements 7 templates called from Template:Self including already discussed Template:Self/is-pd-expired. Template:Self is used on 36M pages.

Now looking at Module:Autotranslate called from Template:Autotranslate, which in turn is only called from other templates. I think we could call Module:Autotranslate directly from the other templates, without Template:Autotranslate for high volume templates. This would have a nice side-effect of avoiding issues described at T196464: Incorrect template loop detection

Thanks. I don't recommend doing 9m edits to remove 9m rows from templatelinks, it just adds the same number but larger rows to another table.

Good point. I was trying to create SQL query to find template redirects transcluded in other templates but failed so far. There a single edit will fix great many pages. Can anybody with better SQL skills help?

I'll give it a try.

I also wrote Module:Self which implements 7 templates called from Template:Self including already discussed Template:Self/is-pd-expired. Template:Self is used on 36M pages.

Thank you so much! it looks super nice. Once the updates trickle in, I'll do an optimize table of templatelinks to reclaim some space.

Now looking at Module:Autotranslate called from Template:Autotranslate, which in turn is only called from other templates. I think we could call Module:Autotranslate directly from the other templates, without Template:Autotranslate for high volume templates. This would have a nice side-effect of avoiding issues described at T196464: Incorrect template loop detection

Let me know how it goes.

Change 997425 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Schema: Drop iwl_title from one of indexes

https://gerrit.wikimedia.org/r/997425

Now looking at Module:Autotranslate called from Template:Autotranslate, which in turn is only called from other templates. I think we could call Module:Autotranslate directly from the other templates, without Template:Autotranslate for high volume templates. This would have a nice side-effect of avoiding issues described at T196464: Incorrect template loop detection

Let me know how it goes.

I think I replaced most calls from templates to Template:Autotranslate, (also Template:Autotranslate/clone 1 and few more clones needed due to T196464 issue) with calls directly to Module:Autotranslate. In a month or two when the DB catches up with the change we will see if I missed any.

one way you can reduce future usage of redirects is make TemplateWizard and visualeditor insert the target whenever a redirect is selected.

Change 997425 merged by jenkins-bot:

[mediawiki/core@master] Schema: Drop iwl_prefix_from_title from iwlinks

https://gerrit.wikimedia.org/r/997425

Note: once T61245: Review the PageNotice extension for deployment is resolved, we can include some common wikitext (though rendered differently in different pages) in all file pages, whose content will not be treated as template link, so it will not be updated when the common message is changed.

Potentially we can automatically include an license template, or even infobox, but such will be confusing when there are still manual infobox and license template in wikitext. One solution is hide the automatic template (display:none) by default, and show them via TemplateStyles once the manual ones are removed; I am not sure whether it works.

After some discussion we retired Template:License_template_tag which was used to track files without licenses and was redundant to Extension:CommonsMetadata which was introduced latter. That should eventually remove 100M template links.

Unfortunately I had to revert the removal of Template:License_template_tag. It appears that the UploadWizard extension tests for the presence of {{License template tag}} when a custom license is specified. See mw.UploadWizardLicenseInput.js and CommonSettings.php.

After some discussion we retired Template:License_template_tag which was used to track files without licenses and was redundant to Extension:CommonsMetadata which was introduced latter. That should eventually remove 100M template links.

That template is used for many different things that you probably just broke. I see uploadwizard and no license bot already. Please don't do this in such an uncontrolled way. Please check what is using it and switch it before trying again.

After some discussion we retired Template:License_template_tag which was used to track files without licenses and was redundant to Extension:CommonsMetadata which was introduced latter. That should eventually remove 100M template links.

That template is used for many different things that you probably just broke. I see uploadwizard and no license bot already. Please don't do this in such an uncontrolled way. Please check what is using it and switch it before trying again.

We did identify some tools which were using it in the past, but it seems like no all of them. It was unexpected that Commons:UploadWizard was relying on it, especially since there are other ways of tracking presence of license template.

Change 991921 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMessages@master] Use interwiki to link to Creative Commons

https://gerrit.wikimedia.org/r/991921

Change 991921 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMessages@master] Use interwiki to link to Creative Commons

https://gerrit.wikimedia.org/r/991921

Once this rolls out with the train next week, the number of external links to https://creativecommons.org (currently 146759278, or ~146.8 million) should start to go down gradually, as pages are re-parsed for various reasons and use the new version of the message with an interwiki link instead of an external link. (But we might have to actively mass-purge pages to get all of them to switch to interwiki links.)

I have a question for MediaWiki developers.

Currently Module:Autotranslate, with 72M uses, is used to transclude the best language subtemplate based on user language. It uses Lua's mw.language.getFallbacksFor function to get fallback language codes and mw.title.exists to check which language subpages exist and which do not. Lua function mw.title.exists is considered an expensive function.

User @Tacsipacsi proposed a rewrite which replaces calls to mw.title.exists with calls directly to frame:expandTemplate with pcall function wrapper used to catch errors, like non-existing page. The rewrite seems to work exactly like the current version, but is not using any "expensive" functions and no rows are created in the pagelinks table. So the question is, is the rewrite a good idea? We can write new Lua function mimicking mw.title.exists using frame:expandTemplate with pcall wrapper, which according to current metrics would be less expensive and have smaller BD footprint. However I find it hard to believe that expanding a template is less expensive than checking if it exists, and I am weary that maybe the issue is with the metrics we currently use for tracking expensive calls.

Once this rolls out with the train next week, the number of external links to https://creativecommons.org (currently 146759278, or ~146.8 million) should start to go down gradually, as pages are re-parsed for various reasons and use the new version of the message with an interwiki link instead of an external link. (But we might have to actively mass-purge pages to get all of them to switch to interwiki links.)

I’ve started cleaning up Module:Caller title usage (now that T224810 is ready), which causes a large number of reparses. To help cleaning up the externallinks table, I halt this cleanup on Commons until the train arrives, so that those reparses also remove the external links as a side effect.

However I find it hard to believe that expanding a template is less expensive than checking if it exists, and I am weary that maybe the issue is with the metrics we currently use for tracking expensive calls.

Expanding a template is not less expensive than checking if it exists; it’s less expensive than checking if it exists and expanding it. In this particular module, there are two cases:

  • The given language subpage exists. In this case, currently we manually check if it exists, then expand the template, which consists of a check whether the template exists (MediaWiki cannot know that we’ve already done that check) and actually expanding the template. With my proposed code, the manual check is skipped, which is obviously a net win.
  • The given language doesn’t exist. In this case, currently we manually check if it exists, and finding that it doesn’t, we don’t go further. With my proposed code, the manual check is skipped, so it’s the MediaWiki transclusion code that does the existence check and finds that it fails, reporting the error. It’s the same amount of existence checks, so it shouldn’t have worse performance.

Rewrite of Module:Autotranslate discussed went live last week.