Get cleanupTitles.php into a good enough state that we could run it in production
Closed, ResolvedPublic

Description

From T195546#4237521:

I think the results are unacceptably bad in some cases, and we should not run the script before it is improved:

frwiki:  DRY RUN: would rename 1547147 (3,'195.175.037.8') to (3,'Broken/195.175.37.8')

Should rename to (0,'Broken/User:195.175.37.8') (or something), in main namespace, so that we can easily produce a list of affected pages (i.e. Special:PrefixIndex/Broken/).

test2wiki:  DRY RUN: would rename 2987 (447,'Columbia_University') to (0,'Columbia_University')

Should rename to (0,'Broken/NS447:Columbia_University') (or something), so that someone can find out the original namespace (in this case, EducationProgram).

bnwikisource:  DRY RUN: would rename 1251 (0,'WS:COPY') to (0,'Broken/COPY')

Where did the WS: prefix go?

avwiki:  DRY RUN: would rename 7541 (1,'Википедия:Requests_for_adminship') to (1,'Broken/\xd0\x92\xd0\xb8\xd0\xba\xd0\xb8\xd0\xbf\xd0\xb5\xd0\xb4\xd0\xb8\xd1\x8f\x3aRequests_for_adminship')

It's not acceptable to mangle perfectly normal (but non-ASCII) characters like this.
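
For reference, the escaped run in the avwiki line above is just the UTF-8 encoding of the namespace prefix, written out byte by byte. A small Python check (purely illustrative; the helper name `unescape_hex_bytes` is made up here and is not part of cleanupTitles.php):

```python
# Escaped portion of the proposed avwiki title, copied from the dry-run line.
escaped = (
    r"\xd0\x92\xd0\xb8\xd0\xba\xd0\xb8\xd0\xbf\xd0\xb5"
    r"\xd0\xb4\xd0\xb8\xd1\x8f\x3a"
)


def unescape_hex_bytes(text: str) -> str:
    """Turn a run of \\xNN escapes back into the UTF-8 string it encodes."""
    # Splitting on the two-character sequence "\x" leaves the hex pairs.
    raw = bytes(int(pair, 16) for pair in text.split(r"\x") if pair)
    return raw.decode("utf-8")


decoded = unescape_hex_bytes(escaped)
print(decoded)  # the original namespace prefix, including the colon
```

The bytes decode cleanly, so nothing in the original title was invalid; the escaping only made it unreadable.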


Also, I think the prefix Broken/ is unhelpful (and not translatable). The script should take an option to specify a prefix, so that we can set it to e.g. T195546/ or 2018-06-29/, or some other vaguely useful value that shows where the page suddenly appeared from.
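
A rough sketch of the naming scheme being asked for, written in Python rather than the script's PHP. Everything here (`fallback_title`, `canonical_names`, the `NAMES` table) is hypothetical and only illustrates the idea: always land in the main namespace, keep the original namespace as text in the title, and make the prefix configurable.

```python
from typing import Mapping, Tuple


def fallback_title(
    namespace_id: int,
    dbkey: str,
    canonical_names: Mapping[int, str],
    prefix: str = "Broken/",
) -> Tuple[int, str]:
    """Return (namespace, title) in the main namespace for a broken page.

    The original namespace is preserved inside the title: by name when the
    namespace is known, otherwise as "NS<id>".
    """
    if namespace_id == 0:
        return 0, prefix + dbkey
    ns_name = canonical_names.get(namespace_id, f"NS{namespace_id}")
    return 0, f"{prefix}{ns_name}:{dbkey}"


# Hypothetical namespace table for the example (assumption, not real config).
NAMES = {2: "User", 3: "User_talk"}

print(fallback_title(447, "Columbia_University", NAMES))
print(fallback_title(3, "195.175.37.8", NAMES, prefix="T195546/"))
```

With everything under one prefix in namespace 0, Special:PrefixIndex on that prefix lists all affected pages.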

Event Timeline

Vvjjkkii renamed this task from Get cleanupTitles.php into a good enough state that we could run it in production to fwbaaaaaaa. Jul 1 2018, 1:06 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from fwbaaaaaaa to Get cleanupTitles.php into a good enough state that we could run it in production. Jul 2 2018, 2:52 AM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.

Another ping... (it's been a year, and fomafix has been regularly updating his patch...)

Copying what I said in the mailing list:

Invisible characters are heavily used in many languages, including Persian, where they are part of the official manual of style taught in schools. It is downright wrong to check for and "fix" them on many wikis in those languages.

Also, many wikis have titles in other languages, such as Wiktionary entries or redirects in a different language (for example: https://en.wikipedia.org/w/index.php?title=%D8%AA%D9%87%D8%B1%D8%A7%D9%86&redirect=no), which means removing ZWNJ or similar characters would be unacceptable on English Wiktionary or English Wikipedia as well.

There are some exceptions, though: two consecutive invisible characters are wrong, as is an invisible character at the beginning or end of a title. But these rules are specific to Persian; another language might allow even those.
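
To make the distinction concrete, here is a minimal Python sketch of a check that flags only the suspicious cases. It assumes "two invisible characters" means two in a row, looks at ZWNJ (U+200C) only, and encodes rules that, as noted, may hold for Persian but not for other languages. The name `suspicious_zwnj` is invented for this example.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def suspicious_zwnj(title: str) -> bool:
    """True if a ZWNJ is at the start, at the end, or doubled.

    A single ZWNJ in the middle of a title is treated as legitimate.
    """
    return (
        title.startswith(ZWNJ)
        or title.endswith(ZWNJ)
        or ZWNJ * 2 in title
    )


# A Persian word with one ZWNJ in the middle: legitimate, must be left alone.
persian_word = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"
print(suspicious_zwnj(persian_word))
print(suspicious_zwnj("Foo" + ZWNJ))
```

A cleanup pass built this way would leave ordinary Persian titles untouched and only report the edge cases for human review.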

Pppery subscribed.

See also the very old T18839 (as another test case to make sure my rewrite can handle it)

Change #1052196 had a related patch set uploaded (by Pppery; author: Pppery):

[mediawiki/core@master] Rewrite and add tests for cleanupTitles

https://gerrit.wikimedia.org/r/1052196

Change #1052196 merged by jenkins-bot:

[mediawiki/core@master] Rewrite and add tests for cleanupTitles

https://gerrit.wikimedia.org/r/1052196

The patch that fixes the specific issues reported in the task description is merged and has test cases. Would anyone like to do another dry run of the cleanupTitles script once it's deployed and see if they're happy with the results? Otherwise I'll close this, and actually running the script can be tracked in the parent task.

I would like to see results of the dry run in production before we close this. In the 5 years since this task was filed, we could easily have introduced new problems beyond the ones you fixed.

I'm running this now. I'll post the full log here once it's done, but one thing I'm noticing so far is that the progress indicator in the log goes a bit beyond 100% on some wikis (to about 103%).

This script finished on July 23rd, but I forgot to check back in on it for a few days -- I apologize. The full log file was 841MB, so I've filtered out the progress indicators, which reduces it to "only" 12MB. Phabricator's paste feature still struggled with that much data, so I've published the log on Gist: https://gist.github.com/catrope/b6a246a50f9756b0a31741f7db100bd7 . The vast majority of the log comes from hewikisource (85,294 lines of the 94,030 lines total).

That would be T314733 - hewikisource deleted a namespace without emptying any redirects from it first. 6 months ago they were willing to run an adminbot to delete all the remnants once they became accessible (T298430#9503312). Hopefully they still are.

Things like this are why I tagged User-notice on the parent task, as each wiki will want to review its broken titles once they are rescued.

There's at least one oddity I found while reviewing the output: on non-English wikis it renames broken pages to English namespace names, and it renames broken pages in the project/project talk namespace to "Broken/Project talk:Foo" rather than "Broken/Wikipedia talk:Foo". I'm not sure whether we care.

There are not any instances of "Broken/id:" titles (the ultimate failsafe when it can't find any vaguely related broken name to put it at). Nor do there appear to be any instances of non-ASCII titles getting mangled (which the script will do as a second-to-last resort in certain corner cases like invalid UTF8). Good.

It doesn't try to move "Talk:Project:X" to "Project talk:X" even if there's no conflict, instead moving it to "Broken/Talk:Project:X". I'm not sure whether we care.

On the other hand, that behavior might be right in some cases. For instance, suppose someone manually creates "WP:Foo" as a redirect to "Wikipedia:Foo" and gives it a talk page banner announcing that it's a redirect, "Wikipedia:Foo" doesn't have a talk page, and "WP:" later becomes a namespace alias. In that case you might not want the banner moved onto the talk page of "Wikipedia:Foo".

There are a lot of these. I would be interested in hearing what others think should happen to them.
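
For discussion purposes, the alternative behavior could look roughly like this in Python. This is only a sketch of the idea, not what the script does: `retarget_talk_page`, `TALK_NAMESPACE_FOR`, and the `existing` set are all hypothetical stand-ins for the real namespace configuration and page-existence lookup.

```python
from typing import Dict, Optional, Set

# Hypothetical table: subject namespace name -> its talk namespace name.
TALK_NAMESPACE_FOR: Dict[str, str] = {
    "Project": "Project talk",
    "Help": "Help talk",
}


def retarget_talk_page(title: str, existing: Set[str]) -> Optional[str]:
    """Map "Talk:<Ns>:<rest>" to "<Ns> talk:<rest>" when that is safe.

    Returns None when the title does not have that shape or the target
    already exists; the caller would then fall back to "Broken/...".
    """
    head, sep, remainder = title.partition(":")
    if head != "Talk" or not sep:
        return None
    inner_ns, sep, rest = remainder.partition(":")
    talk_ns = TALK_NAMESPACE_FOR.get(inner_ns)
    if not sep or talk_ns is None or not rest:
        return None
    target = f"{talk_ns}:{rest}"
    return None if target in existing else target


print(retarget_talk_page("Talk:Project:X", set()))
print(retarget_talk_page("Talk:Project:X", {"Project talk:X"}))
```

Returning None on conflict keeps the "Broken/" fallback for every case where a move could clobber an existing talk page.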

It renames "Wb:foo" -> "Broken/Project:foo" if "wb:" is a namespace alias for "Project:". Would "Broken/wb:foo" be more correct?

Likewise, for interwiki prefixes it first canonicalizes the prefix to all lowercase with no space after the colon.

I wonder if namespaceDupes.php, which is generally run when a new namespace is created, doesn't deal with talk pages properly.

That's everything I found from reviewing the output.

Probably the only thing worth caring about is the talk page case. The others are nice-to-haves, but since the page will be at "Broken/<something>" anyway, the wiki can figure out where to put it.

bnwikisource:  DRY RUN: would rename 1251 (0,'WS:COPY') to (0,'Broken/COPY')

from the task description was cleaned up by namespaceDupes.php in 2018 (T210472). The new fate of the other issues in the task description:

frwiki:  page 1547147 (Discussion_utilisateur:195.175.037.8) doesn't match self.
frwiki:  DRY RUN: would rename 1547147 (3,'195.175.037.8') to (0,'Broken/User_talk:195.175.37.8')

test2wiki:  page 2987 (Special:Badtitle/NS447:Columbia_University) doesn't match self.
test2wiki:  DRY RUN: would rename 2987 (447,'Columbia_University') to (0,'Broken/NS447:Columbia_University')

avwiki:  page 7541 (БахӀс:Википедия:Requests_for_adminship) is illegal.
avwiki:  DRY RUN: would rename 7541 (1,'Википедия:Requests_for_adminship') to (0,'Broken/Talk:Википедия:Requests_for_adminship')
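
For anyone else reviewing the log, the "DRY RUN" lines are regular enough to parse mechanically. A small Python sketch, assuming every line in the log has the same `wiki:  message` shape as the lines quoted in this task; the `Rename` tuple and `parse_rename` are names invented here:

```python
import re
from typing import NamedTuple, Optional


class Rename(NamedTuple):
    wiki: str
    page_id: int
    old_ns: int
    old_title: str
    new_ns: int
    new_title: str


# Pattern inferred from the lines quoted in this task (an assumption about
# the full log, which may contain shapes not seen here).
RENAME_RE = re.compile(
    r"^(?P<wiki>\w+):\s+DRY RUN: would rename (?P<page_id>\d+) "
    r"\((?P<old_ns>\d+),'(?P<old_title>.*)'\) to "
    r"\((?P<new_ns>\d+),'(?P<new_title>.*)'\)$"
)


def parse_rename(line: str) -> Optional[Rename]:
    """Parse one 'would rename' line; return None for any other line."""
    m = RENAME_RE.match(line.strip())
    if m is None:
        return None
    return Rename(
        m["wiki"], int(m["page_id"]),
        int(m["old_ns"]), m["old_title"],
        int(m["new_ns"]), m["new_title"],
    )


sample = ("frwiki:  DRY RUN: would rename 1547147 (3,'195.175.037.8') "
          "to (0,'Broken/User_talk:195.175.37.8')")
print(parse_rename(sample))
```

Filtering on `new_ns != 0` or on titles not starting with the expected prefix is then a quick way to spot renames that don't follow the intended scheme.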

Change #1058215 had a related patch set uploaded (by Pppery; author: Pppery):

[mediawiki/core@master] CleanupTitles: Turn "Talk:Project:Foo" into "Project talk:Foo"

https://gerrit.wikimedia.org/r/1058215

Change #1058215 merged by jenkins-bot:

[mediawiki/core@master] CleanupTitles: Turn "Talk:Project:Foo" into "Project talk:Foo"

https://gerrit.wikimedia.org/r/1058215

Can we close this now? Or do we want a second dry run to make sure that the patch above did what it should?

matmarex assigned this task to Pppery.

Yes, and thank you for reviewing that log!