Get cleanupTitles.php into a good enough state that we could run it in production
Open, Needs Triage, Public

Description

From T195546#4237521:

I think the results are unacceptably bad in some cases, and we should not run the script before it is improved:

frwiki:  DRY RUN: would rename 1547147 (3,'195.175.037.8') to (3,'Broken/195.175.37.8')

Should rename to (0,'Broken/User:195.175.37.8') (or something), in main namespace, so that we can easily produce a list of affected pages (i.e. Special:PrefixIndex/Broken/).

test2wiki:  DRY RUN: would rename 2987 (447,'Columbia_University') to (0,'Columbia_University')

Should rename to (0,'Broken/NS447:Columbia_University') (or something), so that someone can find out the original namespace (in this case, EducationProgram).

bnwikisource:  DRY RUN: would rename 1251 (0,'WS:COPY') to (0,'Broken/COPY')

Where did the WS: prefix go?

avwiki:  DRY RUN: would rename 7541 (1,'Википедия:Requests_for_adminship') to (1,'Broken/\xd0\x92\xd0\xb8\xd0\xba\xd0\xb8\xd0\xbf\xd0\xb5\xd0\xb4\xd0\xb8\xd1\x8f\x3aRequests_for_adminship')

It's not acceptable to mangle perfectly normal (albeit non-ASCII) characters like this.


Also, I think the prefix Broken/ is unhelpful (and not translatable). The script should take an option to specify a prefix, so that we can e.g. set it to T195546/ or maybe 2018-06-29/ or some other vaguely useful thing that can be used to find out where that page suddenly appeared from.
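The requests above can be summarised in a minimal sketch. This is Python for illustration only, not the actual cleanupTitles.php logic: the function name `fallback_title`, the `NS_NAMES` map and the default prefix are all hypothetical. It moves the page to the main namespace, keeps a record of the original namespace (by name where known, otherwise as `NS<id>`), takes the prefix as a parameter, and leaves non-ASCII characters untouched.

```python
# Hypothetical namespace map for illustration; a real wiki supplies its own names.
NS_NAMES = {1: "Talk", 2: "User", 3: "User_talk"}


def fallback_title(ns: int, dbkey: str, prefix: str = "T195546/") -> tuple[int, str]:
    """Return a main-namespace (0) title that records where the page came from."""
    if ns == 0:
        # Already in the main namespace: nothing to record, keep the key whole.
        label = ""
    else:
        # Use the namespace name if known, otherwise fall back to "NS<id>".
        label = NS_NAMES.get(ns, f"NS{ns}") + ":"
    # The dbkey is kept as-is: no byte escaping, no dropped pseudo-prefixes.
    return (0, f"{prefix}{label}{dbkey}")


print(fallback_title(447, "Columbia_University"))
print(fallback_title(0, "WS:COPY"))
print(fallback_title(1, "Википедия:Requests_for_adminship"))
```

With every target under one configurable prefix in namespace 0, the affected pages could be listed with Special:PrefixIndex. A real script would still have to check that the target title is valid and not already taken, which this sketch does not attempt.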

Event Timeline

Vvjjkkii renamed this task from Get cleanupTitles.php into a good enough state that we could run it in production to fwbaaaaaaa. Jul 1 2018, 1:06 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from fwbaaaaaaa to Get cleanupTitles.php into a good enough state that we could run it in production. Jul 2 2018, 2:52 AM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.

Another ping... (it's been a year, and fomafix has been regularly updating his patch...)

Copying what I said in the mailing list:

Invisible characters are actually heavily used in many languages, including Persian (where they are part of the official manual of style of the language as taught in schools). It is downright wrong to check for and "fix" them on the many wikis written in those languages.

Also, many wikis have titles in other languages, such as Wiktionaries, or redirects in a different language (for example: https://en.wikipedia.org/w/index.php?title=%D8%AA%D9%87%D8%B1%D8%A7%D9%86&redirect=no), which means removing ZWNJ or similar characters would be unacceptable on English Wiktionary or English Wikipedia as well.

There are some exceptions though: two invisible characters in a row are wrong, as is an invisible character at the beginning or end of a title. But these rules hold for Persian, and another language might actually allow those cases as well.
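As a rough sketch of the rule described above (Python for illustration, not code from cleanupTitles.php; the function name and the choice to cover only ZWNJ and ZWJ are assumptions), a check could flag only the cases called wrong for Persian and leave a single invisible character inside a word alone:

```python
import re

# Assumption: only ZWNJ (U+200C) and ZWJ (U+200D) are considered here.
ZW = "\u200c\u200d"

# Suspect patterns per the comment above: an invisible character at the
# start, at the end, or two or more in a row.
_SUSPECT = re.compile(f"^[{ZW}]|[{ZW}]$|[{ZW}]{{2,}}")


def has_suspect_invisibles(title: str) -> bool:
    """True if the title matches one of the patterns described as wrong for Persian."""
    return _SUSPECT.search(title) is not None


print(has_suspect_invisibles("می\u200cخواهم"))         # single ZWNJ inside a word
print(has_suspect_invisibles("\u200cتهران"))           # leading ZWNJ
print(has_suspect_invisibles("می\u200c\u200cخواهم"))   # doubled ZWNJ
```

Since other languages may legitimately allow these patterns, a check like this is better suited to reporting candidates for human review than to renaming pages automatically.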

Pppery subscribed.

See also the very old T18839 (as another test case to make sure my rewrite can handle it)

Change #1052196 had a related patch set uploaded (by Pppery; author: Pppery):

[mediawiki/core@master] Rewrite and add tests for cleanupTitles

https://gerrit.wikimedia.org/r/1052196

Change #1052196 merged by jenkins-bot:

[mediawiki/core@master] Rewrite and add tests for cleanupTitles

https://gerrit.wikimedia.org/r/1052196

The patch that fixes the specific issues reported in the task description is merged and has test cases. Does anyone want to do another dry run of the cleanupTitles script (once it's deployed) and see if they're happy with the results? Otherwise I'll close this, and actually running the script can be tracked in the parent task.

I would like to see the results of a dry run in production before we close this. In the 5 years since this task was filed, we could easily have introduced new problems other than the ones you fixed.

I'm running this now. I'll post the full log here once it's done, but one thing I'm noticing so far is that the progress indicator in the log goes a bit beyond 100% on some wikis (to about 103%).