Page MenuHomePhabricator

Normalize globalimagelinks table
Open, MediumPublic

Description

This table is in commonswiki only and it's one of the fastest growing ones there + one of the biggest ones (~80GB). Looking at the structure:

MariaDB [commonswiki_p]> select * from globalimagelinks limit 4;
+--------------+----------+-----------------------+--------------------+---------------------------------------------+------------+
| gil_wiki     | gil_page | gil_page_namespace_id | gil_page_namespace | gil_page_title                              | gil_to     |
+--------------+----------+-----------------------+--------------------+---------------------------------------------+------------+
| fawiki       |  3373809 |                     2 | کاربر              | سیدانور_امید                                |            |
| wikidatawiki | 12741005 |                     2 | User               | Sun0769                                     | !          |
| enwiki       | 38395207 |                     2 | User               | Ethan_casals/sandbox                        | !!         |
| enwiki       |  6353628 |                     4 | Wikipedia          | Media_copyright_questions/Archive/Archive_3 | !!!!!!.jpg |
+--------------+----------+-----------------------+--------------------+---------------------------------------------+------------+
4 rows in set (0.00 sec)

Ignoring the fact that it doesn't have PK, the gil_wiki can be normalized, gil_page_namespace, gil_page_title and gil_to can also be normalized. “User_talk” has repeated 21M times, “Talk” 20M , “Discussioni_utente” 11M , I can go on for hours.

Fixing these should be waaay easier than doing things like T222224: RFC: Normalize MediaWiki link tables since it's just an extension

Event Timeline

Marostegui triaged this task as Medium priority.Dec 19 2019, 6:59 AM
Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board.

+1 to normalize it, specially given how big it is and the fact that it doesn't have a PK, which makes operations even harder.
Do you have an estimation on how much dev time it would take to get it normalized?

dev time it should not be much as it's contained in one extension.

I wonder (and this is just an idea), if we could create a titles table as part of this extension (or a different name, if people don't like that first gu_titles), and make it a proof of concept for a later potential migration of that to core and other extensions? This would be a nice extension, too, because it would have to solve, not only local titles, but also references of local titles and namespaces (which was one of the things that some were skeptical about).

I am not saying that should happen, maybe your scope was less ambitious, but it would be a win to have something working on production to "sell it" to other extensions & groups and eventually core. It would also present similar problems (like indexing split now into several joins) that would have to be solved also for other code locations.

In summary, I would support strongly this, even if it would take some iterations to get it right, with a pretty, relatively generic and reusable api, as a pilot for deeper refactoring later down the line. Even if it didn't go to core, I know several extensions that work frequently with titles, even on wmde, that later could benefit from this, and maybe they could also contribute (needs some research).

I add @daniel and @Fjalapeno from CPT to take a look at this. Maybe it can be prioritized and done easily.

Tagging Platform Engineering for triage.

This is not a quick job, it's a technical infrastructure project. Not a big one, but a project in its own right.

the gil_wiki can be normalized, gil_page_namespace, gil_page_title and gil_to can also be normalized.

gil_wiki can use a small table managed NameTableStore. gil_page_namespace could too maybe, or it could be the namespace ID - the actual namespace name would have to be looked up per wiki to generate a link. gil_to could reference a file ID (does that exist now?). gil_page_title would need the same kind of title normalization as the links tables. What's the status of that?

Also, we should probably investigate whether a two-tiered system would work; we do this to scale entity usage tracking.

Anomie added a subscriber: Anomie.

gil_page_namespace could too maybe, or it could be the namespace ID - the actual namespace name would have to be looked up per wiki to generate a link.

That's the trick: currently as far as I know there's no straightforward way to map namespace names to IDs across wikis. For core namespaces the canonical name (e.g. "Project") could be used, but that doesn't work for custom namespaces or namespaces from extensions not also installed locally.

gil_to could reference a file ID (does that exist now?).

There isn't currently a "file ID". T28741 is probably the relevant task, although its scope is wider than just adding a PK. Most of the time you could use the page_id of the corresponding file description page as a proxy, but it's possible for a file to exist without a description page.

But that may not satisfy the need if we also need to track attempted links to images that don't exist. Do we?

gil_page_title would need the same kind of title normalization as the links tables. What's the status of that?

Note that gil_page_title refers to the foreign wiki, while T222224 is talking about a table for the local wiki. Mixing the two may not make much sense.


In the current schema, gil_wiki and gil_page suffice to identify the foreign page, while gil_to identifies the local file being used. gil_page_namespace_id, gil_page_namespace, and gil_page_title are all denormalized from gil_wiki+gil_page, most likely so the extension doesn't have to connect to every wiki's database to produce a listing like the one at https://commons.wikimedia.org/wiki/File:Example.svg#globalusage.

Naively, it seems like we could normalize this into three tables:

  • Something mapping an ID to the local file, replacing gil_to. That could be a PK on the images table or it could be the titles table from T222224, mostly depending on whether we need to track usages of files that don't exist.
    • Note neither of these tables belong to the extension, they belong to core.
  • A table mapping ID → gil_wiki + gil_page, plus the denormalized gil_page_names=pace_id, gil_page_namespace, and gil_page_title if they're still needed (which they likely are).
    • We could go further and have a table mapping an ID to gil_wiki, and use that ID in this table. That could be a simple NameTableStore just for this extension, or we could tie it into a whole nest of tasks about "site IDs" (see T224020#5317911 for a starting point).
    • We could possibly also go further and have a table assigning an ID to gil_page_namespace, or just mapping gil_wiki(_id)+gil_page_namespace_id to gil_page_namespace.
  • A table with a bidirectional mapping between the IDs from the other two tables.
WDoranWMF added a subscriber: WDoranWMF.

Moving to feature request review but may be destined for future initiatives to be prioritised

LSobanski added a subscriber: LSobanski.

Removing the DBA tag and subscribing myself instead. Once there are specific actions for DBA please re-add us and/or @mention me.