Include linktarget data in public dumps
Closed, Resolved | Public | Feature

Description

Feature summary (what you would like to be able to do and where):

It should be possible to download a sanitized database dump of the linktarget table (see T305064) from dumps.wikimedia.org or a mirror site. Alternatively, the lt_namespace and lt_title columns of the linktarget table should be added to the templatelinks dumps, and in the future the other links table dumps. (The exact dump format doesn't matter too much; SQL, CSV, TSV, XML, and JSON probably would all be OK.)

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):

The tl_namespace and tl_title columns of the templatelinks table are being dropped. In a message posted to wikitech-l ("Changes in schema of MediaWiki links tables"), @Ladsgroup said:

So if you: [...] … rely on dumps of these tables [the links tables], you will need to change your scripts.

Currently, the contents of the linktarget table are not part of the publicly available SQL dumps (there are no references to "linktarget" in the dumps generation code), so it may not be possible to fix existing scripts that process dumps of the links tables. (While I do not have any such scripts, others might.)

Benefits (why should this be implemented?):

This would allow scripts to continue to extract usable data from the publicly available templatelinks SQL dumps. Without the target titles, the templatelinks data are largely meaningless.
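As a rough illustration of the dependency described above, here is a minimal sketch of resolving templatelinks rows to their target titles once a linktarget dump is available. It assumes both dumps have been loaded into a local database, and that templatelinks refers to linktarget through a `tl_target_id` column matching `lt_id` (the join column is not named in this task, so treat it as an assumption). The in-memory SQLite database, the sample rows, and the function names are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory stand-in for the two dumped tables (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE linktarget (
            lt_id INTEGER PRIMARY KEY,
            lt_namespace INTEGER,
            lt_title TEXT
        );
        -- Assumption: templatelinks points at linktarget via tl_target_id.
        CREATE TABLE templatelinks (
            tl_from INTEGER,
            tl_target_id INTEGER
        );
        INSERT INTO linktarget VALUES (1, 10, 'Infobox'), (2, 10, 'Citation_needed');
        INSERT INTO templatelinks VALUES (100, 1), (100, 2), (101, 1);
        """
    )
    return conn


def resolve_template_links(conn):
    """Return (tl_from, lt_namespace, lt_title) for every templatelinks row."""
    return conn.execute(
        """
        SELECT tl.tl_from, lt.lt_namespace, lt.lt_title
        FROM templatelinks AS tl
        JOIN linktarget AS lt ON lt.lt_id = tl.tl_target_id
        ORDER BY tl.tl_from, lt.lt_title
        """
    ).fetchall()


if __name__ == "__main__":
    for row in resolve_template_links(build_demo_db()):
        print(row)
```

Without the linktarget rows, the join has nothing to resolve against, and the templatelinks rows carry only numeric ids.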

Event Timeline

Change 822631 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] snapshot: Add linktarget

https://gerrit.wikimedia.org/r/822631

I can't generate my wanted templates list anymore. The list generator (written with this Rust library) relies on page.sql and templatelinks.sql and now requires linktarget.sql in order to work. The dumps should have been part of the roadmap for this database change, but I guess not many people write database queries against the dump files as I do, so it's a bit of an afterthought.
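For readers unfamiliar with this kind of report, here is a minimal sketch of how a wanted-templates list could be derived from the three dumps. It is not the generator mentioned above (that one is written in Rust and its logic is not shown here). The sketch assumes the dumps have already been parsed into plain tuples, that templatelinks references linktarget through a `tl_target_id` column, and that "wanted" means a transcluded target with no matching row in the page table. The function name and sample data are hypothetical.

```python
from collections import Counter


def wanted_templates(pages, templatelinks, linktargets):
    """Count transclusions whose target has no corresponding page.

    pages:         iterable of (page_namespace, page_title)
    templatelinks: iterable of (tl_from, tl_target_id)   # tl_target_id is assumed
    linktargets:   iterable of (lt_id, lt_namespace, lt_title)

    Returns a list of ((namespace, title), count), most-transcluded first.
    """
    existing = set(pages)
    targets = {lt_id: (ns, title) for lt_id, ns, title in linktargets}

    counts = Counter()
    for _tl_from, target_id in templatelinks:
        target = targets.get(target_id)
        if target is None:
            continue  # id not present in the linktarget dump; skip it
        if target not in existing:
            counts[target] += 1
    return counts.most_common()


if __name__ == "__main__":
    pages = [(10, "Infobox")]
    linktargets = [(1, 10, "Infobox"), (2, 10, "Missing_a"), (3, 10, "Missing_b")]
    templatelinks = [(100, 1), (100, 2), (101, 2), (102, 3)]
    for target, count in wanted_templates(pages, templatelinks, linktargets):
        print(target, count)
```

The lookup table built from linktarget is the step that cannot be done without linktarget.sql, which is why the report stopped working once the title columns left templatelinks.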

The cleanup of linktarget, which would unblock this, is pretty high priority on my to-do list. I hope to get to it very soon.

The maint script that cleans it is already merged and deployed. I'm planning to do a run today/tomorrow and then it should be fine to start making public dumps. cc @ArielGlenn

Ran it on all of group0 and group1, will move forward to group2 tomorrow.

How often would this be run going forward? I'm assuming pruning would have to be done periodically.

Yes, it will be done weekly or biweekly.

It's being run on the rest of the wikis as well. Can we merge the patch so users can have linktarget dumps in the next run?

Change 822631 merged by ArielGlenn:

[operations/puppet@production] snapshot: Add linktarget

https://gerrit.wikimedia.org/r/822631

@Ladsgroup @ArielGlenn Thank you for adding linktarget.sql! I'm able to generate my wanted templates list once again.

Ladsgroup claimed this task.

Awesome.